OpenAI gpt-oss: Architecture, Quantization, Tokenizer, and Resources

We’re amazed by the recent wave of open-source model launches, such as Kimi K2, Qwen3 Coder, and GLM-4.5, especially the impressive leaps these agentic models show in multi-step reasoning, programming ability, and tool use.

With gpt-oss, OpenAI has shipped its first major open-source model release since GPT-2 in 2019. The model family is released under the permissive Apache 2.0 license, which gives you wide latitude to use, modify, and redistribute it, including for commercial purposes, as long as you keep the original license and copyright notices and clearly note any modifications.

Model Variants and Hardware Requirements

The model is offered in two sizes: 120B and 20B. The 120B version includes 117 billion total parameters (with 5.1 billion active per token) across 36 layers. The 20B version contains 21 billion total parameters (with 3.6 billion active per token) and 24 layers. Both models use native 4-bit (MXFP4) quantization for their Mixture-of-Experts weights, which enables the 120B model to fit on a single 80 GB GPU and lets the 20B model run with roughly 16 GB of memory.
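
To make those memory figures concrete, here is a rough back-of-the-envelope estimate of the 120B model’s weight footprint. The roughly 90% MoE share and the 4.25 bits per parameter come up again in the quantization section below; keeping the remaining weights in bf16 is our assumption, so treat the result as an upper bound.

```python
# Back-of-the-envelope weight-memory estimate for gpt-oss-120b.
# Assumption (ours): non-MoE weights stay in bf16; the real checkpoint may
# quantize a larger share, so this is an upper bound, not an exact figure.
total_params = 117e9      # total parameters (120B variant)
moe_share = 0.90          # MoE weights are roughly 90% of the parameters
moe_bits = 4.25           # MXFP4: 4-bit values plus a shared scale per block
rest_bits = 16            # bf16 for embeddings, attention, norms (assumed)

weight_bytes = total_params * (moe_share * moe_bits + (1 - moe_share) * rest_bits) / 8
print(f"~{weight_bytes / 1e9:.0f} GB of weights")  # ~79 GB, under a single 80 GB GPU
```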

Key Takeaways

  • OpenAI launches gpt-oss, its first substantial open-source model release since GPT-2 (2019), under the Apache 2.0 license
  • Two variants: 120B and 20B, both using 4-bit (MXFP4) quantization for MoE weights; the 120B model fits on a single 80GB GPU
  • Core architecture elements: MoE, Gated SwiGLU, GQA, SWA, RoPE, expanded context via YaRN, Attention Sinks
  • Quantization: Native Microscaling FP4 (MXFP4), with MoE weights at 4.25 bits per parameter
  • Tokenizer: o200k_harmony (a BPE tokenizer), available via tiktoken
  • Post-training emphasis: Reasoning, tool use (browsing, Python, developer functions), and safety via CoT RL
  • Uses the Harmony Chat Format to keep training and deployment consistent

Model Architecture

We’ll begin by exploring the model’s architecture. We’ve summarized the key specifications, and why each one matters, in the table below to make the information easier to absorb.

Specifications and Relevance

  • Mixture of Experts (MoE): The MoE design in gpt-oss uses sparse feedforward neural network (FFN) layers called experts, plus a gating mechanism (router) that sends each token to its top-4 experts. Only a fraction of the model’s parameters is active per token, which makes MoE a compute-efficient alternative to a fully dense model. (A minimal routing sketch follows below.)
  • Gated SwiGLU activation function: Activation functions add non-linearity, allowing the network to learn complex patterns. The MoE blocks in gpt-oss use a gated SwiGLU activation, and SwiGLU is widely regarded as the standard choice in modern LLMs. The model card notes that the implementation here is unusual because it includes clamping and a residual connection; these changes likely support smoother optimization and quicker convergence in large transformers. Residual (skip) connections create shortcut paths so a layer’s input can be added directly to its output, bypassing intermediate transformations. (A sketch follows below.)
  • Grouped Query Attention (GQA) and Sliding Window Attention (SWA): The model card describes the attention as alternating between fully dense and banded-window blocks. In practice, the layers alternate between full attention and sliding window attention, while grouped query attention with 8 key-value heads is used throughout. Each attention head also includes a learned bias in the softmax denominator, resembling off-by-one attention. (A combined sketch of the banded mask and this sink-style bias follows below.)
  • Rotary Position Embeddings (RoPE): RoPE encodes position by rotating the query and key vectors according to each token’s position. Position encoding is crucial because the attention mechanism has no inherent notion of token order. (A small rotation sketch follows below.)
  • Context length of 131,072 tokens (dense layers): The dense attention layers support a 131,072-token context window, extended via YaRN. YaRN (Yet another RoPE extensioN method) is a compute-efficient technique for stretching the context range of RoPE-based transformers.
  • Attention Sinks: Attention sinks stabilize attention over long sequences. In the original formulation they are tokens kept at the start of the sequence; gpt-oss achieves the same effect with the learned per-head bias in the softmax denominator mentioned above, which is particularly helpful for long-context use cases.
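
Here is the top-4 routing sketch referenced in the MoE item. It illustrates the general technique only; the hidden sizes, expert count, and plain GELU experts are placeholders, not the gpt-oss configuration (which pairs the router with gated SwiGLU experts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse FFN block: a router picks the top-k experts for each token (illustrative sizes)."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=32, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)     # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        scores, idx = torch.topk(self.router(x), self.k, dim=-1)
        weights = F.softmax(scores, dim=-1)             # normalize over the selected k only
        out = torch.zeros_like(x)
        for slot in range(self.k):                      # only k experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

y = TopKMoE()(torch.randn(8, 512))                      # 8 tokens through the sparse block
```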
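
And a sketch of a gated SwiGLU with the clamping and "+1" skip behaviour the model card alludes to. The clamp limit and the sigmoid scaling factor below are illustrative defaults we chose, not values taken from the release.

```python
import torch

def gated_swiglu(x_glu: torch.Tensor, x_linear: torch.Tensor,
                 alpha: float = 1.702, limit: float = 7.0) -> torch.Tensor:
    """Gated SwiGLU with clamping and a '+1' skip on the linear branch.

    x_glu and x_linear are the two halves of an expert's up-projection.
    alpha and limit are illustrative defaults, not the released values.
    """
    x_glu = x_glu.clamp(max=limit)              # clamp the gated branch from above
    x_linear = x_linear.clamp(-limit, limit)    # clamp the linear branch on both sides
    swish = x_glu * torch.sigmoid(alpha * x_glu)
    return swish * (x_linear + 1)               # +1 lets the gated branch pass through
                                                # even when the linear branch is near zero

out = gated_swiglu(torch.randn(4, 64), torch.randn(4, 64))
```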
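
The rotation behind RoPE can be shown in a few lines as well; the pairing of features and the base frequency are the usual textbook choices rather than gpt-oss’s exact configuration.

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive feature pairs of x by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos * inv_freq
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                 # pair up the features
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

q = torch.randn(64)
print(torch.allclose(rope(q, 0), q))                    # position 0 leaves the vector unchanged
```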
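
Finally, a single-head sketch that combines the banded (sliding-window) causal mask with a learned sink logit added to the softmax denominator, which is how the attention-sink idea shows up here. GQA’s sharing of 8 key-value heads across query heads is left out for brevity, and the window size and scalar sink are purely illustrative.

```python
import torch
import torch.nn.functional as F

def windowed_attention_with_sink(q, k, v, window=4, sink_logit=None):
    """Single-head causal attention with a banded (sliding-window) mask and an
    optional learned 'sink' logit that only enlarges the softmax denominator."""
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5                                  # (T, T)
    pos = torch.arange(T)
    allowed = (pos[None, :] <= pos[:, None]) & (pos[:, None] - pos[None, :] < window)
    scores = scores.masked_fill(~allowed, float("-inf"))
    if sink_logit is not None:
        sink_col = sink_logit.expand(T, 1)                         # one extra logit per row
        probs = F.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
        return probs[:, :-1] @ v                                   # the sink absorbs probability
                                                                   # mass but contributes no value
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(8, 16)
out = windowed_attention_with_sink(q, k, v, sink_logit=torch.zeros(1))
```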

Quantization

The quantization approach here is especially interesting. The gpt-oss models are trained natively using Microscaling FP4 (MXFP4), where the MoE weights (about 90% of the total parameter count) are quantized to 4.25 bits per parameter. To understand microscaling more deeply, we recommend reading the OCP Microscaling Formats (MX) Specification Version 1.0, focusing on section 5.
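
The 4.25 bits-per-parameter figure follows directly from the MXFP4 block layout defined in that spec: 32 four-bit (E2M1) elements share one 8-bit (E8M0) scale.

```python
# MXFP4 block layout per the OCP MX spec: 32 FP4 (E2M1) elements + one E8M0 scale.
elements_per_block = 32
element_bits = 4
scale_bits = 8
bits_per_param = (elements_per_block * element_bits + scale_bits) / elements_per_block
print(bits_per_param)   # 4.25
```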

Tokenizer

The o200k_harmony tokenizer is used throughout all training stages. It’s a BPE tokenizer with a roughly 200k-token vocabulary. The tokenizer is open source, available in tiktoken, and builds on the o200k_base tokenizer used in other OpenAI models.
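
A quick way to inspect it is through tiktoken, assuming your installed tiktoken release already ships the o200k_harmony encoding.

```python
import tiktoken

# Assumes a tiktoken version that already includes the o200k_harmony encoding.
enc = tiktoken.get_encoding("o200k_harmony")
print(enc.n_vocab)                                       # roughly 200k entries
tokens = enc.encode("gpt-oss uses the o200k_harmony tokenizer.")
print(len(tokens), tokens[:5])
print(enc.decode(tokens))
```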

Post-training Focus

Post-training for gpt-oss centers on reasoning, tool use (browsing, Python, and developer functions), safety using CoT RL techniques, and the Harmony Chat Format. As far as we know, the datasets and RL environments used for this model have not been released.

OpenAI Harmony Chat Format

Chat templates matter for several reasons. Keeping the chat format consistent between training and deployment helps prevent performance drop-offs, and, like tokenizers, chat templates define how data is processed. OpenAI trained gpt-oss with a custom chat format, the Harmony chat format, which uses special tokens to mark message boundaries and role tags such as System, Developer, User, and Assistant.

The model resolves conflicts between instructions using a role hierarchy of System, Developer, User, Assistant, and Tool, and it uses channels (analysis, commentary, and final) to separate chain-of-thought and tool commentary from the user-facing output.

This design supports advanced agentic behavior, including interleaved tool calls.
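
In practice you rarely assemble Harmony messages by hand: the Hugging Face tokenizer for gpt-oss applies the format through its chat template. The sketch below assumes the openai/gpt-oss-20b checkpoint on the Hub and a recent transformers release.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is MXFP4?"},
]
# Renders the conversation with Harmony's special tokens and role/channel
# structure as the model expects, ready to be tokenized and generated from.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```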

Additional Resources

Conceptual Overviews

The Illustrated GPT-OSS – by Jay Alammar: The visuals in this piece are outstanding for building an intuitive understanding of both the gpt-oss architecture and the message format it relies on. We especially like its explanation of how the format serves different audiences (ChatGPT end users, LLM app builders such as Cursor, and people doing post-training) by shaping the model’s inputs and outputs, for example reasoning traces and tool interactions.

From GPT-2 to gpt-oss: Analyzing the Architectural Advances – by Sebastian Raschka: This article is excellent because it highlights how far things have progressed since GPT-2. It provides detailed, thorough explanations of concepts such as RoPE, SwiGLU, Mixture of Experts, GQA, SWA, and RMSNorm.

SwiGLU

What is SwiGLU? by jcarlosroldan: This piece gives useful background on why SwiGLU has become the go-to activation function in modern LLMs.

Chat Format

Chat Templates: An End to the Silent Performance Killer: Chat templates are Jinja-style templates embedded in tokenizers that automatically format conversations into the structure the model was trained on. This article explains that if you don’t format prompts exactly as expected, performance can degrade silently, not necessarily with errors, but with worse outputs.

Remember that gpt-oss uses the Harmony chat format.

Microscaling

OCP Microscaling Formats (MX) Specification Version 1.0: This resource expands on microscaling formats from the Open Compute Project (OCP). Section 2 explains how MX formats align with OCP’s core principles: they are open and jointly developed by major industry participants and based on prior open standards; efficient, enabling reduced precision and memory usage for lower cost and improved performance; impactful, backed broadly enough to likely become an industry standard; scalable, designed to be adopted on existing hardware; and sustainable, reducing energy use and carbon emissions in AI workloads.

GitHub – microsoft/microxcaling: PyTorch emulation library for Microscaling (MX)-compatible data formats: This GitHub project emulates MX-compatible formats and bfloat quantization in PyTorch. While computations run using float32/bfloat16/fp16, the library respects the representable ranges of MX or bfloat formats. It supports matrix multiplication (torch.matmul, torch.linear, torch.bmm) for MX tensors, as well as element-wise operations such as GELU, softmax, and layernorm, where basic ops (including add, sub, sqrt, exp) are executed with bfloat precision.

Microscaling Data Formats for Deep Learning: This is the paper that originally introduced microscaling data formats.
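
To make the microscaling idea concrete, here is a small emulation of MXFP4 quantization for a single block of 32 values. It follows the spec’s shared power-of-two scale plus FP4 (E2M1) elements, but it is our simplified sketch, not the reference algorithm: saturation and special-value handling from the spec are glossed over.

```python
import torch

FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # |E2M1| magnitudes

def mxfp4_quant_dequant(block: torch.Tensor) -> torch.Tensor:
    """Quantize one block of 32 floats to emulated MXFP4, then dequantize.

    Shared power-of-two scale per block plus 4-bit E2M1 elements; rounding is
    simple round-to-nearest onto the FP4 grid (a simplification of the spec).
    """
    assert block.numel() == 32
    max_abs = block.abs().max().clamp_min(1e-30)
    shared_exp = torch.floor(torch.log2(max_abs)) - 2      # emax of E2M1 is 2
    scale = 2.0 ** shared_exp                              # shared E8M0-style scale
    scaled = block / scale
    idx = (scaled.abs().unsqueeze(-1) - FP4_VALUES).abs().argmin(dim=-1)
    return scaled.sign() * FP4_VALUES[idx] * scale

x = torch.randn(32)
print((x - mxfp4_quant_dequant(x)).abs().max())            # worst-case error for this block
```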

1.5x Faster MoE Training on Blackwell with MXFP8 Kernels Built from Scratch | Cursor – The AI Code Editor: This article explains how Cursor achieved a 1.5x end-to-end speedup when training large language models on Blackwell GPUs. By rebuilding their Mixture-of-Experts layers around custom MXFP8 kernels, they cut training time and cost.

Note that gpt-oss uses MXFP4 rather than MXFP8 for the linear projection weights in the MoE layer.

Implementations

Fine-tuning with gpt-oss and Hugging Face Transformers: “On a H100 GPU, this takes about 18 minutes to train, but may take longer depending on your hardware.”
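
Before any fine-tuning, a minimal local inference check with transformers looks roughly like this; the device placement, generation settings, and output indexing are illustrative and may vary across transformers versions and hardware.

```python
from transformers import pipeline

# Minimal inference sketch; needs enough GPU memory for the 20B variant and
# a transformers version with chat-aware text-generation pipelines.
generator = pipeline("text-generation", model="openai/gpt-oss-20b",
                     torch_dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "Explain MXFP4 in one sentence."}]
result = generator(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1])   # the assistant turn (format may vary by version)
```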

Ollama (gpt-oss 20b, ~14GB of VRAM): “Ollama is supporting the MXFP4 format natively without additional quantizations or conversions. New kernels are developed for Ollama’s new engine to support the MXFP4 format. Ollama collaborated with OpenAI to benchmark against their reference implementations to ensure Ollama’s implementations have the same quality.”

Unsloth (gpt-oss 20b, ~14GB of VRAM): “We utilized OpenAI’s Triton Kernels library directly to allow MXFP4 inference. For finetuning / training however, the MXFP4 kernels do not yet support training, since the backwards pass is not yet implemented. We’re actively working on implementing it in Triton!”

More implementations are linked from the gpt-oss repository.

There are plenty of excellent resources out there, so feel free to leave a comment about anything that should be included.

Final Thoughts

While researching this article, we were impressed by just how much content appeared around gpt-oss—news coverage, YouTube videos, blog posts, community-built base models, and more. It’s obvious that the community is highly energized by OpenAI releasing open-source models, and we’re looking forward to seeing how these models are adopted, how they compare to similar alternatives, and what developers build with them.

Source: digitalocean.com
