Kimi K2 Post-Training: Tool Use, Data Synthesis, and Reinforcement Learning

In an earlier piece, we covered Kimi K2, including its MoE design, the MuonClip optimizer, and various performance-focused improvements. One major topic we didn’t explore in sufficient detail was post-training—and it may be the most compelling part of the whole story.

Kimi K2 is particularly notable because it is an agentic model trained with tool usage as a core capability. In what many describe as the “Era of Experience,” post-training becomes central. As the Kimi team states in their Kimi K2 launch post, “LLMs learn from their own self-generated interactions – receiving rewards that free them from the limits of human data and surpass human capabilities.” This reflects a broader shift in the conversation—from achieving human-level task competence (often framed as AGI) toward performance that exceeds it.

This article examines Kimi K2’s approach to post-training: the way it creates agentic synthetic data, aligns behavior using verifiable rewards and self-critic signals, and scales reinforcement learning infrastructure.

For deeper context, we suggest reading the Kimi K2 and K1.5 technical reports as well as the Muon paper on scaling LLM training alongside this article.

Feel free to skip any sections that aren’t relevant to your needs.

Key Takeaways

Post-training plays a defining role for agentic models such as Kimi K2: it sharpens the model’s behavior so it becomes both helpful and safe, particularly in the “Era of Experience,” where LLMs learn through self-produced interactions in ways that can exceed human capabilities.

Kimi K2’s post-training blends synthetic data production for SFT and RL: it relies on large-scale tool-use synthetic datasets for Supervised Fine-Tuning (SFT), then applies a Reinforcement Learning (RL) framework that incorporates both verifiable and non-verifiable reward signals.

Synthesizing tool-use data follows three stages: first, building a library of tool specifications (covering real-world tools and synthetic ones), then producing a wide variety of agents and tasks, and finally generating effective multi-turn trajectories inside simulated environments.

Verifiable Rewards Gym is central to K2’s RL approach: it applies straightforward rule-based functions with binary rewards (1 for correct outputs, 0 for incorrect ones) across categories such as Math, STEM, Logic, Complex Instruction Following, Faithfulness, Coding, and Safety.

Non-verifiable rewards rely on a self-critic method: for subjective work like creative writing, K2 performs pairwise comparisons of its own responses, guided by rubrics that include core values (clarity, conversational fluency, objective interaction) and prescriptive constraints (no initial praise, no justification).

On-policy rollouts strengthen the critic’s evaluation on complex tasks that do not have clear reward functions: verifiable rewards are used during rollouts to continuously refine the critic. By transferring learning from verifiable domains, performance improves on tasks judged via non-verifiable rewards.

Before going further, let’s clarify the difference between pre-training and post-training for readers who may not be familiar with the distinction.

Pre-training vs Post-training

Pre-training describes the first stage of building an LLM, where the model is trained on vast datasets—typically gathered from the internet, books, and other sources. In this phase, the model learns next-token prediction through self-supervised learning, forming linguistic or multimodal understanding, factual knowledge, and reasoning skill. This stage demands massive compute and produces a base model that can generate text, but it may still struggle with instruction-following or aligning with human preferences.
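
To make the objective concrete, here is a minimal sketch of next-token prediction in PyTorch. The `model` and the tokenized `token_ids` are placeholders for whatever architecture and data pipeline are actually used; this illustrates the self-supervised loss, not Kimi K2's training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Self-supervised pre-training objective: predict token t+1 from tokens <= t.

    `model` is any causal LM returning logits of shape
    (batch, seq_len, vocab_size); `token_ids` is (batch, seq_len).
    """
    inputs = token_ids[:, :-1]     # every token except the last
    targets = token_ids[:, 1:]     # every token except the first
    logits = model(inputs)         # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```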

Post-training refers to the set of methods applied after pre-training to shape the model’s behavior so it becomes more useful and safe. This includes supervised fine-tuning (SFT) on high-quality instruction-following datasets and reinforcement learning from human feedback (RLHF) to bring outputs closer to human values. Post-training converts a raw pre-trained model into one that can reliably follow instructions, hold conversations, and behave in ways people expect.

As mentioned earlier, the focus here is Kimi K2’s post-training pipeline, which merges large-scale synthetic tool-use data for SFT with a unified RL framework that uses both verifiable rewards and self-critic signals.

We’ll start with supervised fine-tuning and then move into reinforcement learning.

Supervised Fine-Tuning

Supervised fine-tuning (SFT) adapts pre-trained models to particular use cases by training on labeled data. This improves performance on tasks like question answering, summarization, and dialogue.

From the prior Kimi K2 article on the token-efficient Muon optimizer, recall that Muon is not only part of K2's pre-training process—it is also applied during SFT. The researchers also recommend the optimizer to anyone planning to fine-tune the model further.
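
As a rough illustration, the sketch below shows what one SFT pass looks like: supervised next-token training on instruction-response pairs with the prompt tokens masked out of the loss. It assumes a Hugging Face-style model interface (returning `.loss` when `labels` are supplied) and uses AdamW as a stand-in; the K2 team applied Muon, whose implementation is not shown here.

```python
import torch

def run_sft(model, dataloader, lr=1e-5, steps=1000):
    """Minimal supervised fine-tuning loop over instruction-response pairs.

    Each batch holds `input_ids` (prompt + response tokens) and `labels`
    (prompt positions masked with -100 so only response tokens are scored).
    AdamW stands in for the optimizer; the K2 report applies Muon during
    SFT as well, so a Muon implementation could be swapped in here.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _, batch in zip(range(steps), dataloader):
        loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```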

The researchers also bootstrapped K2’s critic capability during the SFT phase (K2, section 3.2.2), enabling it to evaluate non-verifiable reward settings.

In the next section, we’ll outline how the SFT dataset was created.

Data Synthesis for Tool Use

Here, the researchers describe three stages:

1) Build a Repository of Tool Specifications

The first step is constructing a repository of tool specs covering both real-world tools and synthetic, LLM-generated tools. Two approaches were used to collect them: (1) the researchers gathered 3,000+ real MCP (Model Context Protocol) tools from GitHub repositories, and (2) they applied methods from WizardLM—described as “creating large amounts of instruction data with varying levels of complexity using LLM instead of humans”—to “evolve” synthetic tools.
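
The report does not publish its exact spec schema, but a tool spec in such a repository might look like the hypothetical MCP-style example below, and synthetic tools can be “evolved” by prompting an LLM to mutate existing specs. Both the field names and the evolution prompt are illustrative, not the ones used by the Kimi team.

```python
# A hypothetical tool specification in the spirit of MCP tool schemas.
# Field names are illustrative; the K2 report does not publish its exact schema.
weather_tool = {
    "name": "get_weather",
    "description": "Return current weather conditions for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}

# WizardLM-style evolution: ask an LLM to mutate an existing spec into a more
# complex or more specialized variant, growing the synthetic tool library.
EVOLVE_PROMPT = """You are given a tool specification in JSON.
Rewrite it into a new, plausible tool that is either more specialized
or requires additional parameters. Return only valid JSON.

Tool: {tool_json}"""
```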

2) Produce Agents and Tasks from Tool Sets

Next, an agent is generated for each tool-set sampled from the tool repository. The researchers created thousands of varied agents by combining system prompts with different tool bundles. For each agent configuration, they also produced tasks and evaluation rubrics.
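
A minimal sketch of this step, with hypothetical names (`AgentConfig`, `make_agents`) and a trivial system-prompt template standing in for the LLM-generated personas, tasks, and rubrics described in the report:

```python
import random
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    """One synthetic agent: a system prompt plus the tools it may call."""
    system_prompt: str
    tools: list
    tasks: list = field(default_factory=list)   # (task, rubric) pairs added later

def make_agents(tool_repository, n_agents, tools_per_agent=3):
    """Sample tool bundles and pair each with a generated system prompt."""
    agents = []
    for _ in range(n_agents):
        bundle = random.sample(tool_repository, k=tools_per_agent)
        names = ", ".join(t["name"] for t in bundle)
        prompt = f"You are an assistant that solves tasks using these tools: {names}."
        agents.append(AgentConfig(system_prompt=prompt, tools=bundle))
    return agents
```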

3) Generate Multi-turn Trajectories

Finally, trajectories are generated for each agent and task. The researchers built simulated environments where tool calls were executed and persistent state was maintained. They recorded interactions between synthetic user agents and tool-using agents as multi-turn trajectories, keeping only those interactions that were successful according to predefined rubrics.
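
The sketch below outlines this generate-then-filter loop under assumed interfaces: `user_sim`, `agent`, `env`, and `judge` are hypothetical objects standing in for the synthetic user, the tool-using agent, the simulated stateful environment, and the rubric-based quality filter.

```python
def generate_trajectories(agent, user_sim, env, judge, tasks, max_turns=8):
    """Roll out a synthetic user against a tool-using agent in a simulated
    environment, keeping only trajectories the rubric judge marks successful."""
    kept = []
    for task, rubric in tasks:
        trajectory, state = [], env.reset(task)
        for _ in range(max_turns):
            user_msg = user_sim.next_message(state, trajectory)
            agent_msg, tool_calls = agent.respond(user_msg, trajectory)
            results = [env.execute(call) for call in tool_calls]   # persistent state
            trajectory.append({"user": user_msg, "agent": agent_msg, "tools": results})
            if env.done(state):
                break
        if judge.passes(trajectory, rubric):     # rubric-based quality filter
            kept.append(trajectory)
    return kept
```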

Reinforcement Learning

Kimi K1.5: Scaling Reinforcement Learning with LLMs showed how novel techniques can make RL effective at scale. RL is often viewed as more token-efficient and better at generalization than SFT, making it an important area for optimization. In this section, we’ll examine K2’s Verifiable Rewards Gym (Section 3.2.1 of the Tech Report) and the rubrics used for non-verifiable rewards.

Verifiable Rewards

Reinforcement Learning with Verifiable Rewards (RLVR) relies on straightforward, rule-driven functions to check whether a model’s response is correct. The reward signal is binary: a 1 is given for correct results and a 0 for incorrect ones. In Kimi K2’s case, the evaluation criteria can be as simple as whether a coding solution passes predefined test cases.
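
For intuition, here is a minimal coding-domain verifier that returns a binary reward based on whether a candidate solution passes its unit tests. It runs the code in a local subprocess for simplicity; K2's actual evaluation executes code in Kubernetes-based sandboxes, so treat this as an assumption-laden sketch rather than their harness.

```python
import subprocess
import tempfile
import textwrap

def code_reward(solution: str, test_code: str, timeout: int = 10) -> int:
    """Binary verifiable reward: 1 if the candidate solution passes the tests, else 0.

    The solution and its unit tests are written to a temp file and executed in a
    subprocess; any failure, error, or timeout yields reward 0. A real setup would
    run this inside an isolated sandbox rather than the local interpreter.
    """
    program = textwrap.dedent(solution) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return 1 if result.returncode == 0 else 0
    except subprocess.TimeoutExpired:
        return 0
```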

Moonshot expanded this concept into a Verifiable Rewards Gym—an extensible library of task templates with clear evaluation logic—built from datasets spanning the domains outlined below (a minimal sketch of such a registry follows the table):

| Domain | Techniques / Data Sources | Focus Areas | Evaluation Methods |
| --- | --- | --- | --- |
| Math, STEM, and Logic | Expert annotations, internal QA extraction pipelines, open datasets (e.g., NuminaMath, AIMO-2) | Multi-hop tabular reasoning; logic puzzles (24-game, Sudoku, riddles, cryptarithms, Morse code decoding), all of moderate task difficulty | Tagging to increase coverage of under-covered domains; difficulty filtering using the SFT model's pass@k accuracy |
| Complex Instruction Following | Two verification mechanisms: (1) a code interpreter for instructions with verifiable outputs (e.g., length or style constraints) and (2) LLM-as-judge for more nuanced evaluation; training data from expert-crafted prompts, automated instruction augmentation (inspired by AutoIF), and a model fine-tuned to generate edge cases | Instruction following, edge-case robustness, consistency over dialogues | Rubric-based scoring; a "hack-check" layer to catch completions that merely claim to have followed instructions |
| Faithfulness | Sentence-level faithfulness judge trained with the FACTS Grounding framework; verification of factual grounding for self-generated reasoning chains; automated detection of unsupported claims in outputs | Factual accuracy, grounding verification, claim validation | Automated faithfulness scoring, unsupported-claim detection |
| Coding & Software Engineering | Open-source coding datasets (e.g., OpenCoder, KodCode), human-written unit tests from pre-training data, GitHub PRs and issues | Competitive programming, pull request generation, multi-file reasoning | Unit test pass rates; execution in real Kubernetes-based sandboxes (K1.5, section 2.6.4) |
| Safety | Human-curated seed prompts; a prompt-evolution pipeline with an attack model, a target model, and a judge model | Jailbreak detection, toxic or harmful outputs | The attack model crafts adversarial prompts to probe the target model, while the judge model assesses the response and awards a binary reward (success/failure) based on a task-specific rubric |
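
Below is a minimal sketch of what such an extensible registry of verifiable task templates might look like. `VerifiableTask`, `register`, and the example entry are hypothetical; they only illustrate the pattern of pairing a prompt with rule-based, binary evaluation logic.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerifiableTask:
    domain: str                      # e.g. "math", "coding", "logic", "safety"
    prompt: str
    verify: Callable[[str], int]     # model output -> binary reward (0 or 1)

REGISTRY: list[VerifiableTask] = []

def register(task: VerifiableTask) -> None:
    """Add a task template with its rule-based verifier to the gym."""
    REGISTRY.append(task)

# Example entry: an exact-match math task with a binary reward.
register(VerifiableTask(
    domain="math",
    prompt="What is 17 * 24? Answer with the number only.",
    verify=lambda output: int(output.strip() == "408"),
))
```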

Non-verifiable Rewards

For tasks driven by subjective preferences—such as creative writing and open-ended question answering—a self-critic reward is applied. In these cases, K2 conducts pairwise comparisons between its own candidate outputs, guided by the rubric categories in the table below; a sketch of this pairwise comparison follows the table.

| Category | Rubric | Description |
| --- | --- | --- |
| Core: encompasses Kimi's fundamental values as a helpful AI assistant | Clarity & Relevance | Be concise, stay on-topic, avoid unnecessary details |
| Core | Conversational Fluency | Natural dialogue, appropriate engagement, judicious follow-ups |
| Core | Objective Interaction | Stay grounded, avoid metacommentary and excessive praise |
| Prescriptive: aims to eliminate reward hacking | No Initial Praise | Don't start with "Great question!" or similar compliments |
| Prescriptive | No Justification | Don't explain why your response is good or successful |
| Human-annotated: for specific instructional contexts | Varies | Varies |
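
Here is a sketch of the pairwise self-critic comparison with the rubric folded into a judging prompt. The prompt wording and the `model.generate` call are assumptions standing in for K2's actual critic setup.

```python
CRITIC_PROMPT = """You are judging two candidate responses to the same prompt.
Apply the rubric below and answer with exactly "A" or "B".

Rubric:
- Clarity & Relevance: concise, on-topic, no unnecessary detail.
- Conversational Fluency: natural dialogue, judicious follow-ups.
- Objective Interaction: grounded, no metacommentary or excessive praise.
- No Initial Praise: must not open with compliments such as "Great question!".
- No Justification: must not argue for its own quality.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}"""

def self_critic_preference(model, prompt, response_a, response_b):
    """Pairwise self-critique: the model judges two of its own candidates.

    Returns +1 if A is preferred, -1 if B is preferred; `model.generate`
    is a stand-in for whatever inference API is actually in use.
    """
    verdict = model.generate(CRITIC_PROMPT.format(
        prompt=prompt, response_a=response_a, response_b=response_b)).strip()
    return 1 if verdict.upper().startswith("A") else -1
```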

Rollouts

In reinforcement learning and agent development, rollouts describe running an agent through episodes—or sequences of interactions with an environment—to gather experience data. During a rollout, the agent follows its current policy, takes actions, receives observations and rewards, and continues until the episode ends, either naturally or after a maximum number of steps. The result is a trajectory, meaning a sequence of state-action-reward tuples that can be used for learning.

In this setup, on-policy rollouts with verifiable rewards were used to repeatedly refine the critic, increasing its evaluation accuracy under the newest policy. Put differently, verifiable reward signals were used to improve how non-verifiable rewards are estimated.
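
A rough sketch of this idea, reusing the `VerifiableTask` interface from the registry sketch above: the policy's own samples are scored by the rule-based verifier, and pairs where the verifier distinguishes a better response become training signal for the critic. `policy.generate` and `critic.update` are hypothetical APIs, not K2's actual training code.

```python
def refine_critic_on_rollouts(policy, critic, verifiable_tasks):
    """Sketch of critic refinement with verifiable rewards during on-policy rollouts.

    For each verifiable task, sample a pair of responses from the current policy,
    score them with the rule-based verifier, and update the critic so its pairwise
    preference agrees with the verifier. The better-calibrated critic is then used
    to rank responses on non-verifiable tasks.
    """
    for task in verifiable_tasks:
        responses = [policy.generate(task.prompt) for _ in range(2)]
        rewards = [task.verify(r) for r in responses]          # ground-truth 0/1 signal
        if rewards[0] != rewards[1]:                            # only informative pairs
            preferred, rejected = (
                (responses[0], responses[1]) if rewards[0] > rewards[1]
                else (responses[1], responses[0])
            )
            critic.update(task.prompt, preferred, rejected)     # hypothetical API
    return critic
```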

For readers looking to strengthen their intuition around Reinforcement Learning, we recommend the Hugging Face Deep Reinforcement Learning Course.

Note

We didn’t cover RL infrastructure in this article (we plan to add it in a future update); in the meantime, we encourage interested readers to consult the Kimi papers (K1.5, section 2.6 and K2, section 3.3 and Appendix G).

Conclusion

By pairing large-scale synthetic tool-use data for SFT with both verifiable and self-critic rewards in RL, Kimi K2 presents a strong approach to aligning model behavior. With its emphasis on post-training for agentic models—especially in the “Era of Experience”—Kimi K2 stands out as a noteworthy model in the push toward more intelligent and adaptable AI systems.

Source: digitalocean.com
