Kimi K2.5: Architecture, Training, Performance, and Deployment Guide

At the start of the year, Moonshot AI unveiled another remarkable release: Kimi K2.5. This visual agentic intelligence model ranks strongly in popularity on OpenRouter, which points to broad adoption, and it surpasses closed-source models across multiple benchmarks, suggesting meaningful research progress. From the perspectives of architecture, training methodology, and implementation, this model is clearly worth examining.

Beyond shipping standout models, Moonshot AI also publishes highly detailed technical reports packed with useful information. Alongside this article, the Kimi-K2.5 technical report is essential reading.

The Kimi-K2.5 release includes post-trained checkpoints and is made available under a Modified MIT license.

This article focuses on the aspects we found most compelling. More specifically, it looks at what the Kimi K2 team actually did to produce such strong results. It also explains how to run the model on a cloud GPU instance.

Key Takeaways

  • Kimi K2.5, like Kimi K2, uses a Mixture-of-Experts architecture with 1 trillion total parameters and 32 billion active parameters. It is likely called K2.5 rather than K3 because it extends K2 through large-scale joint pre-training on 15 trillion visual and textual tokens.
  • The biggest difference between Kimi K2 and K2.5 is the stronger focus on joint vision training, especially during pre-training and the reinforcement learning stage of post-training. Supervised fine-tuning remains text-only.
  • The model is released under a Modified MIT license and includes post-trained checkpoints. It is offered in three operating modes: instant mode, thinking mode, and agent mode.
  • Agent Swarm and PARL, or Parallel Agent Reinforcement Learning, are introduced to address the limited capacity of a single agent when dealing with complex scenarios.
  • The Toggle heuristic improves token efficiency during reinforcement learning by alternating between inference-time scaling and budget-constrained optimization.
  • The Decoupled Encoder Process, or DEP, manages the load imbalance and memory fluctuations that arise when visual inputs of different sizes, such as images and videos, are processed together with text.
  • For more advanced tasks, Kimi K2.5 can coordinate an agent swarm with as many as 100 sub-agents, enabling parallel workflows across up to 1,500 tool calls. These sub-agents are specialized for roles such as AI Researcher, Physics Researcher, and Fact Checker.

Model Overview

Architecture: Transformer, Mixture-of-Experts (MoE)

The Mixture-of-Experts architecture makes it possible to increase model scale and quality while keeping compute costs lower than they would otherwise be. It relies on sparse Feedforward Neural Network layers, known as experts, together with a gate network, or router, that selectively sends tokens to the top-k experts. Because only a subset of parameters is activated for each token, the architecture can grow much larger without causing compute costs to rise proportionally.

Parameters: 1 Trillion Total Parameters, 32 Billion Active Parameters

Since K2 uses a MoE architecture, it has both total and active parameter counts. Total parameters refer to the sum of all parameters across the full model, including every expert network, the routing or gating network, and all shared components, regardless of whether they are used during inference. Active parameters, by contrast, refer only to the subset used for a particular input, which usually includes the activated experts along with the shared components.

Attention Mechanism: MLA (Multi-head Latent Attention)

MLA was introduced in DeepSeek V2, specifically in Section 2.1, as an attention mechanism designed to improve inference efficiency. deepseek v2MLA works by compressing the attention input into a low-dimensional latent vector, from which the keys and values can later be reconstructed. Because K2 uses MLA, QK-Norm, a normalization method typically applied to query-key matrices, cannot be used when scaling Muon training, since the key matrices in MLA are not fully materialized during inference. To address this, the K2 researchers added QK-Clip, a weight-clipping method that constrains the attention logits that appear during large-scale Muon-optimized training.

Optimizer: MuonClip

Muon is a token-efficient optimizer, but it needs adjustments for large-scale training. MuonClip, introduced in Section 2.1 of the Kimi K2 technical report, extends Muon by integrating weight decay, consistent RMS matching, and QK-Clip.

Number of Experts: 384 ; Selected Experts per Token: 8 ; Number of Shared Experts: 1

To better understand this design choice, it is helpful to revisit the sparsity discussion from the earlier Kimi K2 analysis, particularly how increasing the total number of experts leads to greater sparsity.

Number of Layers: 61 (Including 1 Dense Layer)

“layers” describes how many transformer blocks the model contains. These blocks progressively process the input and help the model build more abstract internal representations. A dense layer, by contrast, connects every input unit with every output unit.

Number of Attention Heads: 64 ; Attention Hidden Dimension: 7168

Attention heads enable the model to focus on different parts of the input at the same time. Each head learns to capture different types of relationships within the data.

MoE Hidden Dimension (per Expert): 2048

Each expert processes a 2048-dimensional representation.

Activation Function: SwiGLU

This is not particularly surprising. SwiGLU has become the standard activation choice in modern large language models. ex: gpt-oss

Vision Encoder: MoonViT-3D (400M Parameters)

This is a new addition compared with Kimi K2. Anyone familiar with Kimi-VL will likely recognize MoonViT. Kimi K2.5 uses MoonViT-3D, which is a continual pre-train of SigLIP on image-text and video-text pairs. In this design, consecutive frames are grouped in sets of four, passed through the shared MoonViT encoder, and then temporally averaged at the patch level. This allows K2.5 to process videos that are four times longer within the same context window.

Key topics covered in the paper

The paper explores three tightly connected themes:

  • Vision-language integration through joint optimization methods that let text and vision improve each other. Both the pre-training and reinforcement learning stages are multimodal.
  • Scalable parallelism through Agent Swarm, which supports the concurrent execution of heterogeneous subtasks by specialized agents.
  • Reinforcement learning, which is used in several different ways throughout the model. These are examined in more detail later in the article:
    • Joint multimodal RL
    • Outcome-based visual RL
    • PARL (Parallel Agent Reinforcement Learning)

The paper also describes inference optimization techniques that reduce latency by as much as 4.5× while simultaneously improving task performance. Thanks to these parallelization-based inference gains, Kimi K2.5 can handle videos up to four times longer in the same context window while still maintaining full weight sharing between image and video encoders.

Agent Swarm

With Agent Swarm, the system includes:

  • dynamic task decomposition
  • subagent instantiation
  • parallel subtask scheduling

On the Kimi website, K2.5 Agent Swarms can be tested directly.

Section 5.2 of the K2.5 technical report shows how the design is reflected in benchmark results. The Agent Swarm framework is assessed using three benchmarks: BrowseComp, which focuses on difficult web browsing and deep reasoning; WideSearch, which targets large-scale retrieval; and an internal Swarm Bench designed around real-world complexity. This internal benchmark evaluates orchestration, scalability, and coordination across four different domains. A notable point is its focus on scaling tasks such as information collection, downloading, interpretation, and writing.

In-House Swarm Bench Tasks

WildSearch:

  • Unrestricted collection of information from the entire internet without limitations.

Batch Download:

  • Large-scale acquisition of many different file types and materials.

WideRead:

  • Processing and understanding substantial volumes of text across more than 100 documents.

Long-Form Writing:

  • Producing lengthy, well-organized written content extending beyond 100,000 words.

PARL

In K2.5, Parallel Agent Reinforcement Learning (PARL) refers to an approach in which the system learns how to distribute work in parallel by using feedback from the environment and reinforcement learning exploration. This process is handled by a trainable orchestrator agent. Efficiency is improved by training this orchestrator together with smaller sub-agents and by adapting the ratio of inference instances dynamically.

The researchers also describe a failure case called serial collapse. In this situation, the orchestrator falls back to using only one agent even though parallel resources are available. PARL counters this with staged rewards: early training rewards encourage parallel execution, while later stages place more weight on completing the task successfully.

Post-Training

Supervised Fine-Tuning

It may seem surprising that this stage is text-only. The researchers found that including human-designed visual trajectories during supervised fine-tuning harms generalization. By contrast, text-only SFT delivers better performance, which the researchers believe is because joint pre-training already establishes vision-text alignment in a way that promotes generalization.

The synthetic data generation pipeline creates high-quality candidate text responses using K2, K2 thinking, and a set of proprietary internal expert models. These internal models are especially intriguing. The resulting instruction-tuning dataset contains diverse prompts and emphasizes reasoning and tool-calling capabilities.

Reinforcement Learning

What makes this reinforcement learning setup different from more traditional approaches is that the RL domains are not organized by input modality, such as image or text, but by capability, such as knowledge, reasoning, coding, or agentic behavior.

Unified Agent Reinforcement Learning Environment

To reduce the overhead involved in customizing and implementing environments, the system uses a standardized Gym-like interface with pluggable components such as the toolset, judge, and prompt and instruction enhancement modules.

  • Toolset: Supports a variety of tools together with sandboxes.
  • Judge: Provides multi-dimensional reward signals.
  • Prompt Diversification and Instruction-Following Enhancement: Improves instruction following while diversifying prompts.

Performance

Section 5 of the Kimi K2.5 technical report examines the model’s performance in detail. Based on those results, K2.5 appears especially strong in the following areas:

  • Reasoning and general capability
  • Complex coding and software engineering
  • Agentic capabilities
  • Vision, reasoning, knowledge, and perception
  • Video understanding
  • Computer-use capability

Running K2.5 on a Cloud GPU Instance

There are several ways to run different versions of Kimi K2.5, including vLLM, SGLang, and Unsloth. Keep the memory requirements in mind: the 1T parameter hybrid reasoning model needs 600GB of disk space, while the quantized Unsloth Dynamic 1.8-bit version reduces that to 240GB, which is a 60% reduction in size: Kimi-K2.5-GGUF

Start by provisioning a cloud GPU instance and connecting to it over SSH. Be sure to plan for the number of GPUs required by your deployment approach.

vLLM Implementation

This setup follows the referenced usage guide.

uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly

In this example, -tp is set to 1 so the model’s individual layers and mathematical operations are split into shards across a single GPU. The original documentation uses -tp 8 in order to distribute the model across 8 GPUs through 8-way tensor parallelism.

vllm serve $MODEL_PATH -tp 1 --mm-encoder-tp-mode data --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2

SGLang Implementation

This setup follows the SGLang implementation described in the Kimi-K2.5 deployment guide.

pip install "sglang @ git+https://github.com/sgl-project/sglang.git#subdirectory=python"
pip install nvidia-cudnn-cu12==9.16.0.29

sglang serve --model-path $MODEL_PATH --tp 8 --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2

Key parameter notes:

  • --tool-call-parser kimi_k2: Required when tool usage is enabled.
  • --reasoning-parser kimi_k2: Required for correctly handling reasoning content.

FAQ

Why is the model called K2.5 and not K3?

K2.5 is built directly on top of the K2 base and extended through large-scale joint pre-training on 15 trillion visual and text tokens. Since the core architecture, including the MoE design, parameter counts, and MuonClip optimizer, remains the same, the team presents it as an evolution of K2 rather than a completely new generation.

Why does early vision fusion with a lower vision ratio outperform aggressive late-stage vision injection?

The paper’s ablation studies show that introducing vision data early at a modest 10:90 vision-to-text ratio consistently performs better than late fusion at a 50:50 ratio. Late fusion produces a dip-and-recover pattern in which text performance initially declines because of modality domain shift. Early fusion avoids that disruption and allows both modalities to build unified representations from the beginning.

Why does visual RL improve text performance?

The paper reports that outcome-based visual RL improves scores on MMLU-Pro, GPQA-Diamond, and LongBench v2. A plausible explanation is that visual tasks involving counting, OCR, and structured extraction improve calibration and reduce uncertainty in related text-based reasoning patterns.

Why is SFT text-only if K2.5 is a multimodal model?

Adding human-designed visual trajectories during the SFT stage was found to reduce generalization. Because the joint pre-training stage already creates strong vision-text alignment, text-only SFT is enough to activate visual reasoning without increasing the risk of overfitting to low-diversity visual demonstrations. The paper refers to this as “Zero-Vision SFT.”

How does Toggle prevent models from becoming too token-efficient at the cost of reasoning quality?

Toggle switches between two training phases after every m iterations. In one phase, the model is encouraged to reason concisely under a token budget; in the other, it can use the full token allowance. This design reduces the risk that the model becomes too dependent on short outputs and then cannot benefit from additional compute on harder tasks. In practice, Toggle lowers token usage by around 25 to 30 percent while keeping performance nearly unchanged.

How does Agent Swarm differ from simply calling tools in parallel?

Training the orchestrator and the sub-agents in parallel makes it difficult to determine which component deserves credit for the result. A correct answer may still include weak contributions from individual sub-agents, while an incorrect answer does not automatically mean that every sub-agent failed. To avoid this uncertainty, the team kept the sub-agents fixed and used their outputs as observations from the environment. This allowed the orchestrator to be trained more reliably while keeping coordination decisions separate from execution at the sub-agent level.

Why are sub-agents frozen during PARL training?

Training both the orchestrator and the sub-agents at the same time introduces ambiguity in credit assignment. A correct final answer does not necessarily mean every sub-agent performed well, and the reverse is also true. By freezing the sub-agents and treating their outputs as environmental observations, the team could train only the orchestrator in a stable way, separating high-level coordination from low-level execution.

What is serial collapse and how is it addressed?

Serial collapse occurs when the orchestrator learns to default to single-agent execution even though parallel capacity is available. In other words, it chooses the path of least resistance. PARL addresses this with an instantiation reward, rparallel, that explicitly encourages sub-agent creation early in training. This auxiliary reward is then gradually annealed to zero so that the model ultimately optimizes for successful task completion rather than parallelism for its own sake.

What does it mean for hyperparameters to be annealed to 0? (See Section 3 where it covers the PARL reward)

In the context of Kimi K2.5 Agent Swarm training, annealing hyperparameters to 0 means gradually reducing the weights of auxiliary rewards throughout the reinforcement learning process.

  • Initial Phase: The weights λ1 and λ2 are set above zero so the model receives “training wheels” that encourage exploration of parallel execution through rparallel and ensure that sub-tasks are actually completed through rfinish.
  • Transition: These values are lowered over time so that the model does not learn to reward-hack or prioritize concurrency over output quality.
  • Final Phase: Once the weights reach 0, the model is optimized purely for the main objective, which is successfully solving the task through rperf.

What is spurious parallelism and how is it prevented? (See Section 3 where it covers the PARL reward)

Spurious parallelism describes a form of reward hacking in which the orchestrator launches many sub-agents even though the task has not been meaningfully divided. The goal is simply to make the parallelization metrics look better.

This is addressed through three mechanisms:

  • The rfinish-reward promotes successful completion of the assigned subtasks, helping ensure that any decomposition is practical and valid.
  • The Critical Steps metric focuses on the longest execution path instead of the total number of steps. As a result, creating many unnecessary subtasks offers no benefit if it does not shorten latency.
  • Hyperparameter annealing gradually removes auxiliary rewards for parallel execution, so the model eventually focuses on the main task result.

What are the GPU memory requirements for running K2.5?

The complete 1T-parameter model takes up roughly 600GB of storage. By using the quantized Unsloth Dynamic 1.8-bit GGUF variant, the required disk space drops to about 240GB. When running the model in full precision with vLLM or SGLang, it has to be split across several GPUs with tensor parallelism; the documentation suggests using -tp 8 for an eight-way setup.

What is the Decoupled Encoder Process (DEP) and why does it matter for training efficiency?

In conventional pipeline parallelism, the vision encoder is placed in Stage-0 together with the text embeddings. This can lead to major workload imbalance, since image inputs differ strongly in resolution and number. DEP addresses this by splitting the vision forward pass, backbone training, and vision recomputation into three separate phases during each training step. As a result, the workload is distributed more evenly without the need for custom pipeline setups, enabling K2.5 to achieve 90% of the efficiency of text-only training despite the additional multimodal workload.

Final Thoughts

The most notable aspect is the structured way Moonshot AI designed K2.5. The team first used joint multimodal pre-training to build a solid vision-text base, then applied text-only supervised fine-tuning to maintain generalization, and finally organized reinforcement learning around capabilities instead of input modalities. This sequence suggests a clear understanding of which skills the model should acquire at each stage. PARL is especially forward-looking because it treats parallelization as behavior the system should learn, rather than as something fixed in advance. Its handling of serial collapse through staged rewards also shows a strong focus on reliable agent behavior at scale. Toggle follows the same logic by balancing inference-time scaling with budget optimization instead of framing them as opposing goals.

For users and developers, K2.5 is unusually approachable: it is a 1T-parameter MoE model with 32B active parameters, released under a Modified MIT license and usable with vLLM or SGLang. Unsloth’s quantized GGUF versions reduce the entry barrier further. For anyone testing multimodal reasoning systems or developing agents that coordinate parallel workflows, K2.5 is worth serious attention.

Moonshot AI also continues to support the open-model ecosystem by publishing capable models together with detailed technical reports. We are interested to see future releases from this team and from other open-model projects that pair strong documentation with broad community adoption.

Source: digitalocean.com

Create a Free Account

Register now and get access to our Cloud Services.

Posts you might be interested in: