Olmo 3: A Practical Overview of Open-Source AI Models, Training Data, and Tooling

Allen AI’s work is helping make advanced AI research more accessible. Lowering the barrier to entry lets university labs, independent researchers, and hobbyists contribute more easily to the next generation of AI systems. This article focuses on Allen AI’s open-source releases, especially Olmo 3. The Olmo 3 release provides broad access to models, datasets, source code, training logs, and live demos. In a field where transparency is often limited, this degree of openness is notable.

Prerequisites

This article assumes that you already have a basic understanding of LLM training concepts such as pretraining and post-training. For a broader introduction to LLM training, The Smol Training Playbook from Hugging Face is a useful reference.

The purpose of this article is to provide a compact overview of Olmo 3 so that you can quickly move toward practical implementation. Presenting the material in tables should offer a clear, high-level view of the release, which is also covered in greater detail in the following resources:

  • Olmo 3 and the Open LLM Renaissance by Cameron R. Wolfe
  • Olmo 3: Charting a path through the model flow to lead open-source AI, the Olmo 3 launch post from Ai2

You can also consult the Olmo 3 Technical Report together with this article to gain more context on the model specifications and training pipeline. Knowledge of Olmo 2 can be helpful, since Olmo 3 continues the work started in the previous version.

Key Takeaways

  • The Olmo 3 base model was pretrained on a broad text corpus called Dolma 3 Mix, then mid-trained with targeted, high-quality data from Dolma 3 Dolmino, and finally extended for longer context using Dolma 3 Longmino.
  • The post-trained model family includes Olmo 3 Instruct, Olmo 3 Think, and Olmo 3 RL-Zero.
  • Ai2 integrates Olmo 3 with OlmoTrace, a tool that helps connect model outputs back to specific examples from the training data.
  • The model was pretrained with Dolma 3 and post-trained using the Dolci suite.

Model Architecture

The following table summarizes important architectural characteristics of Olmo 3.

| Spec | Relevance |
| --- | --- |
| 7B and 32B Parameters | Olmo 3 is offered in two model sizes: 7B and 32B parameters. In the architecture figure, the 7B model uses the same number of Query (Q) and Key-Value (KV) heads, while the 32B model has many more Q heads than KV heads. This is because the 32B variant uses grouped query attention (GQA), whereas the 7B model uses multi-head attention (MHA). For a basic explanation of attention mechanisms, see the attention and variants section in an LLM inference optimization article. The 7B model is compact enough for high-end consumer GPUs, while the 32B model can run on a single research node. |
| Dense Transformer | Although many recent open-weight models, such as Kimi-K2 and gpt-oss, use Mixture of Experts architectures, Olmo 3 is built as a dense decoder-only transformer. |
| Sliding Window Attention (SWA) | The researchers use a sliding window attention pattern to support scalable pretraining with longer sequence lengths while keeping inference costs manageable. With this method, each token attends only to previous tokens within a 4096-token window. SWA is applied to three out of every four layers, while the final layer always uses full attention. |
| Rotary Position Embeddings (RoPE), θ = 5e5 | RoPE represents positional information by rotating query and key vectors according to the position of each token. Position encoding is essential because attention itself does not inherently understand token order. Figure 13 in the Olmo 3 paper shows how the RoPE theta value of 500K is the main factor contributing to performance on the RULER benchmark. |
| YaRN | YaRN, short for Yet another RoPE extensioN method, is a compute-efficient approach for extending the context length of transformer models. The researchers tested several techniques for extending RoPE beyond the original pretraining context length, as described in section 3.6.4 of the technical report. They found that applying YaRN only to full attention layers delivered the best results. |
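
The attention layout and RoPE rows above can be made concrete with a small sketch. It is a plain-Python illustration, not code from the Olmo 3 release. The function names are hypothetical, and several details are assumptions: that the full-attention layer is the fourth layer in each group of four, that the 4096-token window counts the current token, and that RoPE rotates adjacent pairs of dimensions.

```python
import math
from typing import List, Optional


def layer_attention_types(num_layers: int) -> List[str]:
    """Label layers so 3 of every 4 use sliding-window attention.

    Assumption: the 4th layer of each group is the full-attention one.
    The final layer is always full attention, as the article states.
    """
    types = []
    for i in range(num_layers):
        is_full = (i % 4 == 3) or (i == num_layers - 1)
        types.append("full" if is_full else "sliding")
    return types


def can_attend(query_pos: int, key_pos: int, window: Optional[int]) -> bool:
    """Causal mask; with a window, only the last `window` tokens are visible."""
    if key_pos > query_pos:
        return False  # never attend to future tokens
    if window is None:
        return True  # full attention layer
    return query_pos - key_pos < window  # assumption: window includes self


def rope_rotate(vec: List[float], position: int, theta: float = 5e5) -> List[float]:
    """Rotate adjacent pairs of `vec` by angles that grow with `position`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * theta ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


if __name__ == "__main__":
    print(layer_attention_types(8))
    print(can_attend(5000, 905, 4096), can_attend(5000, 904, 4096))
```

One property worth noticing is that the dot product of a rotated query and a rotated key depends only on the distance between their positions, which is why RoPE gives attention a sense of relative order.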

Data Curation

| Dataset Name | Size | Description & Purpose | Reference |
| --- | --- | --- | --- |
| Dolma 3 | ~9.3 trillion tokens | The complete corpus, collected from web pages, scientific PDFs, code repositories, math problems, and encyclopedic sources. | |
| Dolma 3 Mix | 5.9 trillion tokens (~6T) | A pretraining mixture derived from Dolma 3. It includes higher shares of code and math data and uses strong decontamination and deduplication. | allenai/olmo-3-pre-training |
| Dolma 3 Dolmino | 100 billion tokens | The mid-training dataset built from Dolma 3. It emphasizes high-quality math, science, code, and reading comprehension data to strengthen targeted skills before final tuning. | allenai/dolma3_dolmino_pool |
| Dolma 3 Longmino | ~50 billion tokens | The long-context dataset derived from Dolma 3. It combines long documents from a 639B-token pool with mid-training data so the model can follow information across long inputs of up to 65K tokens. | allenai/dolma3_longmino_pool |
| Dolci Suite | Variable, depending on the mix | The post-training data suite. It includes separate mixtures for SFT, reasoning and tool use, DPO, contrastive preference learning, and RLVR with verifiable rewards. | allenai/olmo-3-post-training |

The training stages that use these datasets are summarized in the following table.

| Function / Stage | Type | Description |
| --- | --- | --- |
| Pretraining | Pretraining | The first phase consists of three parts: broad capability learning, mid-training for skill refinement, and long-context extension. |
| SFT | Post-Training | Supervised Fine-Tuning. This stage shapes the model’s raw outputs into specific formats, such as chat responses or step-by-step reasoning. |
| DPO | Post-Training | Direct Preference Optimization. This tuning method teaches the model from preference data by learning to select better responses over weaker ones. |
| RLVR | Post-Training | Reinforcement Learning with Verifiable Rewards. This specialized reinforcement learning stage encourages high-quality reasoning traces by rewarding outcomes that can be verified, such as correct math or code results. |
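
To make the DPO and RLVR rows more tangible, here is a minimal sketch of the textbook DPO loss for a single preference pair, plus a toy verifiable reward. It is not the Olmo 3 training code. The function names, the `beta` default of 0.1, and the exact-match reward rule are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log(sigmoid(beta * margin)) for one pair.

    The margin compares how much more the policy prefers the chosen
    response over the rejected one, relative to a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Toy RLVR-style reward: 1.0 only if the final answer matches exactly."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


if __name__ == "__main__":
    # The policy already prefers the chosen response, so the loss is below log(2).
    print(dpo_loss(-12.0, -20.0, -15.0, -15.0))
    print(verifiable_reward(" 42 ", "42"))
```

The loss equals log(2) when the policy and the reference agree, and it shrinks as the policy moves probability toward the chosen response. The reward function shows why math and code suit RLVR: correctness can be checked mechanically, without a learned reward model.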

OlmoTrace

OlmoTrace allows users to highlight text and trace it back to the corresponding source in the training data. This makes it useful for auditing hallucinations, identifying contamination, separating reasoning from memorization, and studying scaling laws by observing how reasoning develops with more data and compute.
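
The core idea of tracing output text back to training documents can be illustrated with a toy verbatim-span matcher. This is only a sketch of the concept over an in-memory dictionary. It is not how OlmoTrace is implemented, and the function name, the `min_words` threshold, and the sample corpus are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_verbatim_spans(output: str, corpus: Dict[str, str],
                        min_words: int = 4) -> List[Tuple[str, str]]:
    """Return (span, doc_id) for maximal output spans found verbatim in a document."""
    # Normalize whitespace and pad with spaces so matches align to whole words.
    padded = {doc_id: " " + " ".join(text.split()) + " "
              for doc_id, text in corpus.items()}
    words = output.split()
    matches = []
    start = 0
    while start < len(words):
        found = None
        # Try the longest candidate first so each start yields its maximal span.
        for end in range(len(words), start + min_words - 1, -1):
            span = " ".join(words[start:end])
            doc_id = next((d for d, t in padded.items() if f" {span} " in t), None)
            if doc_id is not None:
                found = (span, doc_id, end)
                break
        if found:
            matches.append((found[0], found[1]))
            start = found[2]  # continue after the matched span
        else:
            start += 1
    return matches


if __name__ == "__main__":
    corpus = {
        "doc1": "the quick brown fox jumps over the lazy dog",
        "doc2": "attention is all you need for sequence models",
    }
    text = "he wrote that the quick brown fox jumps high because attention is all you need today"
    for span, doc_id in find_verbatim_spans(text, corpus):
        print(doc_id, "->", span)
```

A linear scan like this only works for tiny corpora. Matching against trillions of tokens requires a purpose-built index, which is the part a production tool has to solve.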

Running Olmo 3 on General Cloud GPU Infrastructure

General cloud GPU servers can be used to experiment with these models.
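
When sizing a GPU server, a rough estimate of weight memory is a useful first check. The sketch below assumes nominal parameter counts of 7e9 and 32e9 (the real counts differ slightly) and counts weights only, ignoring the KV cache, activations, and framework overhead. The helper name and the dtype table are illustrative.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in {"Olmo 3 7B": 7e9, "Olmo 3 32B": 32e9}.items():
        sizes = ", ".join(f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB"
                          for dtype in BYTES_PER_PARAM)
        print(f"{name} -> {sizes}")
```

Under these assumptions the 7B weights take about 14 GB in bf16 and the 32B weights about 64 GB, which is consistent with running the smaller model on a single high-end consumer GPU and the larger one on a multi-GPU node or a large-memory accelerator.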

The Olmo 3 blog post includes an interactive figure that shows the training stage together with the related datasets.

| Tool | Description |
| --- | --- |
| Olmo-core | A modern framework for distributed model training. It is a pretraining codebase designed for high efficiency. Docs: OLMo-core v2.4.0 |
| Open Instruct | A post-training pipeline. |
| datamap-rs | A pure-Rust toolkit for cleaning large-scale datasets. |
| duplodocus | A tool for highly efficient fuzzy deduplication. |
| OLMES | A toolkit for reproducible evaluations. It includes OlmoBaseEval, which was used during Olmo 3 base model development. |
| decon | A tool that removes test sets from training data. |

FAQ

Why did the researchers use hybrid sharded data parallel (HSDP)?

The Olmo 3 team used Hybrid Sharded Data Parallel, or HSDP, mainly to improve training efficiency and scalability. HSDP combines Fully Sharded Data Parallelism inside each node with standard Data Parallelism across nodes. This reduces communication overhead between nodes, which becomes especially important at larger scales, and enables more efficient synchronization of parameters and gradients during model updates. By keeping the most communication-heavy operations within each node, HSDP helps large models such as Olmo 3 Base scale more effectively and train faster.
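
The grouping behind HSDP can be sketched by listing which GPU ranks shard together and which replicate. The helper below is hypothetical and makes two assumptions: ranks are numbered node by node, and each node forms exactly one shard group, as described above. Real frameworks let you choose a different shard group size.

```python
from typing import List, Tuple


def hsdp_groups(num_nodes: int, gpus_per_node: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (shard_groups, replica_groups) of global ranks.

    shard_groups: ranks inside one node that split parameters between them
                  (FSDP-style all-gather and reduce-scatter stay on fast links).
    replica_groups: ranks holding the same shard on different nodes
                    (gradients for that shard are all-reduced across nodes).
    """
    shard_groups = [
        [node * gpus_per_node + local for local in range(gpus_per_node)]
        for node in range(num_nodes)
    ]
    replica_groups = [
        [node * gpus_per_node + local for node in range(num_nodes)]
        for local in range(gpus_per_node)
    ]
    return shard_groups, replica_groups


if __name__ == "__main__":
    shards, replicas = hsdp_groups(num_nodes=2, gpus_per_node=4)
    print("shard groups:  ", shards)
    print("replica groups:", replicas)
```

With 2 nodes of 4 GPUs, the heavy parameter traffic stays inside groups of 4 ranks on one node, and each cross-node group contains only 2 ranks exchanging one shard's gradients. That asymmetry is the source of the reduced inter-node communication.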

How was data curated for Olmo 3?

| Stage | Key Data Sources | Processing Highlights | Goal |
| --- | --- | --- | --- |
| Pretraining | Common Crawl [A.2.1], olmOCR PDFs, Stack-Edu code data, arXiv, FineMath, Wikipedia, and Wikibooks. | Deduplication using hash and MinHash methods [A.2.2], quality filtering with fastText, token-constrained mixing, and upsampling of high-quality data. | Create a diverse, high-quality foundation using 6T tokens. |
| Midtraining | Synthetic math data such as TinyMATH and CraneMath, code data from Stack-Edu and Nemotron, QA data from Reddit-to-Flashcards, and reasoning traces. | Microanneals for dataset testing, integration tests, decontamination, and deliberate inclusion of instruction and thinking data. | Improve math, code, reasoning, and QA abilities using 100B tokens. |
| Long-Context Extension | olmOCR PDFs containing long documents and synthetic aggregation tasks. | Document filtering with gzip, packing, intra-document masking, and YaRN for positional embeddings. | Enable a 65K-token context window with long-form documents using 50B to 100B tokens. |
| Post-Training (Think) | Reasoning traces from OpenThoughts3 and SYNTHETIC-2, math, code, and chat prompts, and DPO pairs [4.3.1] from Qwen3 models. | SFT, DPO, and RL stages, verifiable rewards, and delta-learning for contrastive pairs. | Optimize the model for reasoning in math, code, and chat, as well as accurate instruction-following. |
| Post-Training (Instruct) | Function-calling data, WildChat, precise instruction-following prompts, multi-turn DPO data, and length-controlled responses. | Emphasis on usability, tool use, and concise outputs, with RL for general chat and function calling. | Optimize the model for chat usability, tool integration, and shorter responses. |
| Post-Training (RL-Zero) | Filtered math data from DAPO and Omega, code, instruction-following and chat subsets, and decontaminated evaluations. | Reinforcement learning from scratch using verifiable rewards and simple prompt templates. | Benchmark reinforcement learning algorithms with transparent, contamination-free data. |
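
The MinHash deduplication mentioned in the pretraining row can be illustrated with a small standard-library sketch. It shows the general technique only. It is not the duplodocus implementation, and the 3-word shingles, the 64 hash functions, and the seeded SHA-1 hashing are illustrative assumptions.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Split text into overlapping n-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def _hash(shingle: str, seed: int) -> int:
    """Deterministic 64-bit hash of a shingle under a given seed."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str, num_hashes: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    return [min(_hash(s, seed) for s in doc_shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = "the olmo 3 base model was pretrained on a broad text corpus called dolma 3 mix"
    b = a + " today"
    c = "sliding window attention keeps inference costs manageable for longer sequences"
    sig_a, sig_b, sig_c = (minhash_signature(t) for t in (a, b, c))
    print("near duplicate:", estimated_jaccard(sig_a, sig_b))
    print("unrelated:     ", estimated_jaccard(sig_a, sig_c))
```

Because signatures are short and fixed-length, near-duplicate candidates can be found by comparing or bucketing signatures instead of comparing full documents, which is what makes fuzzy deduplication feasible at trillion-token scale.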

How did the researchers make RL training 4x more efficient?

The researchers improved reinforcement learning training efficiency by using in-flight weight updates, also known as pipeline RL, continuous batching with dynamic prompt replacement to reduce GPU idle time, and several threading improvements.
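
The effect of continuous batching with dynamic prompt replacement can be seen in a toy scheduling simulation. This sketch does not model in-flight weight updates or the actual Open Instruct implementation, and it does not reproduce the reported speedup. It only counts decoding steps under two policies, and all names and numbers are illustrative.

```python
import heapq
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch waits for its longest generation before the next batch starts."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """A finished slot is refilled immediately with the next queued prompt."""
    free_at = [0] * slots  # step at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0
    for length in lengths:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + length
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


if __name__ == "__main__":
    # Generation lengths in decoding steps: a few long rollouts among short ones.
    lengths = [10, 2, 2, 2, 10, 2, 2, 2]
    print("static:    ", static_batching_steps(lengths, slots=4))
    print("continuous:", continuous_batching_steps(lengths, slots=4))
```

In this example static batching needs 20 steps while continuous batching needs 12, because short generations no longer leave slots idle while a long one finishes. Reasoning rollouts vary widely in length, so this kind of idle time is significant in RL training.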

Final Thoughts

Olmo 3 is a remarkable release. Its three-stage base training pipeline, consisting of pretraining on Dolma 3 Mix, midtraining on Dolma 3 Dolmino, and long-context extension on Dolma 3 Longmino, produces the base model, which is then post-trained with the Dolci suite into a family of models. These include Instruct, Think, and RL-Zero, each optimized for different capabilities. The open access to models, datasets, code, and training logs makes it especially interesting to see how researchers and practitioners will use this release.

Source: digitalocean.com
