Qwen3-Coder: An Agentic MoE Coding Model With 480B Parameters

There has been a wave of Qwen launches lately. One of the most notable is Qwen3-Coder, an agentic Mixture of Experts (MoE) model featuring 480B total parameters and 35B active parameters, built for high-end coding assistance and multi-turn tool use. The short gap (less than two weeks) between the release of Kimi K2 and the arrival of Qwen3-Coder highlights just how aggressively teams are delivering specialized open-weight, agentic coding models directly to developers. What helps this model stand out is its smaller overall size (compared to Kimi K2’s 1 trillion parameters) alongside strong benchmark results.

Qwen3 launched in May of this year, and in the closing section of its technical report, the Qwen team states: “we will work on improving model architecture and training methods for the purposes of effective compression, scaling to extremely long contexts, etc. In addition, we plan to increase computational resources for reinforcement learning, with a particular emphasis on agent-based RL systems that learn from environmental feedback.”

In July, the refreshed Qwen3 models introduced updated pretraining and reinforcement learning (RL) stages using a revised form of Group Relative Policy Optimization (GRPO) called Group Sequence Policy Optimization (GSPO), along with a scalable setup capable of running 20,000 independent environments in parallel. We’re very excited to learn more about the specifics when an updated technical report is released.

Key Takeaways

  • 480B-parameter Mixture of Experts model with 35B active parameters
  • 160 experts with 8 active per token
  • 256K token context length extendable to 1M with YaRN
  • High SWE-bench Verified score on long-horizon tasks (69.6% with 500 turns vs. Claude Sonnet 4 at 70.4% with 500 turns)
  • Trained with Group Sequence Policy Optimization
  • Smaller 30B A3B Instruct variant runs on a single H100 GPU
  • Qwen Code CLI open-sourced as a fork of Gemini CLI

Here’s a high level overview to get you up to speed with Qwen3-Coder’s internals.

Model Overview

Mixture of Experts (MoE): The MoE design enables higher model scale and quality while cutting compute requirements. It relies on sparse feedforward neural network (FFN) layers called experts, plus a gating mechanism that routes each token to the top-k experts, so only part of the model’s parameters are used per token (a routing sketch follows this overview).

480B total parameters, 35B active parameters: Because Qwen3-Coder uses MoE, it has both total and active parameter counts. “Total parameters” refers to the full sum of parameters across the entire model, including every expert, the router or gating network, and shared components, regardless of which experts are actually used during inference. “Active parameters” describes the subset engaged for a given input, typically the chosen experts plus shared components.

160 experts, 8 activated per token: Each token is routed to only 8 of the 160 experts, so a small fraction of the expert parameters does the work for any given token. This sparse activation is what keeps the active parameter count at 35B despite the 480B total.

Context length of 256K tokens natively, 1M with YaRN: YaRN (Yet another RoPE extensioN method) is a compute-efficient technique for extending the context window of transformer-based language models. In Qwen3-Coder, it pushes the context length up to one million tokens.

GSPO (Group Sequence Policy Optimization): In Qwen’s recent paper, they present GSPO with results suggesting better training efficiency and performance than GRPO (Group Relative Policy Optimization). GSPO stabilizes MoE RL training and may simplify RL infrastructure design.
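To make the routing idea above concrete, here is a minimal sketch of top-k expert routing in PyTorch. The layer sizes, the TinyExpert module, and the softmax-over-selected-experts detail are illustrative assumptions for building intuition, not Qwen3-Coder’s actual implementation; only the 160-expert / 8-active numbers are borrowed from the overview above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExpert(nn.Module):
    """A small feed-forward 'expert' (illustrative sizes, not Qwen3-Coder's)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class TopKMoE(nn.Module):
    """Sparse MoE layer: a gate scores every expert, but only the top-k run per token."""
    def __init__(self, d_model=64, d_ff=256, n_experts=160, k=8):
        super().__init__()
        self.experts = nn.ModuleList(TinyExpert(d_model, d_ff) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):                              # x: (n_tokens, d_model)
        logits = self.gate(x)                          # (n_tokens, n_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)       # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # combine the k selected experts per token
            for expert_id in topk_idx[:, slot].unique():
                mask = topk_idx[:, slot] == expert_id
                out[mask] += weights[mask, slot, None] * self.experts[int(expert_id)](x[mask])
        return out

tokens = torch.randn(4, 64)        # 4 tokens with d_model=64
print(TopKMoE()(tokens).shape)     # torch.Size([4, 64]); only 8 of 160 experts ran per token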

On benchmarks, Qwen3-Coder’s performance is impressive, with a score of 67.0% on SWE-bench Verified that rises to 69.6% with 500 turns. The 500-turn result simulates a more realistic coding workflow, where the model can read feedback (like test failures), modify code, rerun tests, and repeat until the solution works.
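As a mental model for that loop, here is a schematic sketch in Python. The propose_patch and apply_patch callables and the pytest invocation are hypothetical stand-ins for the model call and the execution harness, not Qwen’s actual evaluation setup.

import subprocess

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined output)."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agentic_fix(issue: str, propose_patch, apply_patch, max_turns: int = 500) -> bool:
    """Schematic multi-turn loop: propose a patch, test it, feed failures back.
    propose_patch(issue, feedback) and apply_patch(patch) are hypothetical
    stand-ins for the model call and the file-editing harness."""
    feedback = ""
    for turn in range(max_turns):
        patch = propose_patch(issue, feedback)   # model reads the issue plus prior test output
        apply_patch(patch)                       # write the proposed changes into the repo
        passed, feedback = run_tests()           # run tests; failures become next turn's context
        if passed:
            print(f"solved in {turn + 1} turn(s)")
            return True
    return False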

Implementation

This article will include implementation details for a smaller variant, Qwen3-Coder-30B-A3B-Instruct. For those curious about the name: there are roughly 30 billion total parameters and 3 billion active parameters, and “Instruct” indicates it’s an instruction-tuned variant of the base model.

Implementation Specs

  • Number of Parameters: 30.5B total, 3.3B activated
  • Number of Layers: 48
  • Number of Attention Heads (GQA): 32 for Q and 4 for KV
  • Number of Experts and Activated Experts: 128 experts, 8 activated experts
  • Context Length: 262,144 tokens of native context (without YaRN)

As we can see, this variant has slightly different specs, but it can still run on a single H100 GPU.
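Before moving on to the setup steps, one note on context extension: if you want to go beyond the 262,144-token native window, the usual route in Transformers is to attach a YaRN rope_scaling entry to the model config before loading the weights. The sketch below only illustrates the mechanism; the factor of 4.0 (262,144 tokens times four, approaching 1M) and the max_position_embeddings value are assumptions on our part, so confirm the exact values against the model card before relying on them. Longer contexts also require far more memory than a single GPU provides.

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-30B-A3B-Instruct"

# Attach a YaRN scaling block to the config before loading weights.
# factor=4.0 stretches the 262,144-token native window toward ~1M tokens;
# treat these exact values as illustrative and check them against the model card.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}
config.max_position_embeddings = 1048576  # advertise the longer window (assumption)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)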

Step 1: Set up a GPU Virtual Machine

Step 2: Web Console

After your GPU Virtual Machine is created, you can open the Web Console.

Step 3: Install Dependencies

apt install python3-pip
# accelerate is required for device_map="auto" in the next step; torch is the backend
pip3 install "transformers>=4.51.0" torch accelerate

Step 4: Run the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-30B-A3B-Instruct"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Write a quick sort algorithm."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)

Qwen Code: Open-Source CLI

Qwen Code is an open-source command-line interface that enables developers to work with the Qwen3-Coder model on agentic coding tasks. It is a fork of the Gemini CLI, adapted to integrate smoothly with Qwen3’s capabilities.

We’ve included the steps to install the CLI, set it up, and run it with the Qwen3-Coder model.

Step 1: Install Node.js (Version 20 or Later)

Before you begin, make sure you have Node.js 20+ installed on your device. In your terminal:
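node -v   # should print v20.x or later

If it doesn’t, one option for installing Node.js 20 on Ubuntu/Debian (an illustrative choice on our part; any method that provides Node.js 20+ works) is the NodeSource setup script:

curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt install -y nodejs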

Step 2: Install Qwen Code CLI

Once Node.js is ready, install Qwen Code globally:
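The package name below is taken from the Qwen Code repository; double-check it against the project README in case it has changed:

npm install -g @qwen-code/qwen-code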

This makes the qwen command available from anywhere on your system.

Step 3: Get an API Key

Get an API key from Alibaba Cloud Model Studio (DashScope). The Qwen endpoint is OpenAI-compatible, which is why the environment variables below use the OPENAI_ prefix. Then export your credentials:

export OPENAI_API_KEY="your_api_key_here"
export OPENAI_BASE_URL="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
export OPENAI_MODEL="qwen3-coder-plus"

Step 4: Vibe Code

Type qwen in your terminal and you’ll be able to vibe code.

For alternate ways to use Qwen3-Coder, check out the Qwen Coder blog post.

Qwen3 From Scratch

Here’s a notebook that may be of interest to those who want to improve their intuition around Qwen3’s underlying architecture.

Implement Qwen3 Mixture-of-Experts From Scratch by Sebastian Raschka: “this notebook runs Qwen3-Coder-30B-A3B-Instruct (aka Qwen3 Coder Flash) and requires 80 GB of VRAM (e.g., a single A100 or H100).”

Final Thoughts

We’re excited to see the community experiment with open-weight agentic coding models such as Qwen3-Coder, Kimi K2, and Devstral, and integrate them into their workflows. What impresses us most about Qwen3-Coder is its context window: at 256K tokens, extendable to a million, we’re eager to see how effective this model is in real-world software engineering use cases compared with alternative open-weight models. With its impressive context window, accessible smaller variants such as Qwen3-Coder-30B-A3B-Instruct, and the introduction of the Qwen Code CLI, this model is poised to give developers powerful, agentic coding assistance.

Source: digitalocean.com
