Kimi K2 Post-Training: Tool Use, Data Synthesis, and Reinforcement Learning

In an earlier piece, we covered Kimi K2, including its MoE design, the MuonClip optimizer, and various performance-focused improvements. One major topic we didn’t explore in sufficient detail was post-training—and it may be the most compelling part of the whole story.

Kimi K2 is particularly notable because it is an agentic model trained with tool usage as a core capability. In what many describe as the “Era of Experience,” post-training becomes central. As the Kimi team states in their Kimi K2 launch post, “LLMs learn from their own self-generated interactions – receiving rewards that free them from the limits of human data and surpass human capabilities.” This reflects a broader shift in the conversation—from achieving human-level task competence (often framed as AGI) toward performance that exceeds it.

This article examines Kimi K2’s approach to post-training: the way it creates agentic synthetic data, aligns behavior using verifiable rewards and self-critic signals, and scales reinforcement learning infrastructure.

For deeper context, we suggest reading the Kimi K2 and K1.5 technical reports as well as the Muon paper on scaling LLM training alongside this article.

Feel free to skip any sections that aren’t relevant to your needs.

Key Takeaways

Post-training plays a defining role for agentic models such as Kimi K2: it sharpens the model’s behavior so it becomes both helpful and safe, particularly in the “Era of Experience,” where LLMs learn through self-produced interactions in ways that can exceed human capabilities.

Kimi K2’s post-training blends synthetic data production for SFT and RL: it relies on large-scale tool-use synthetic datasets for Supervised Fine-Tuning (SFT), then applies a Reinforcement Learning (RL) framework that incorporates both verifiable and non-verifiable reward signals.

Synthesizing tool-use data follows three stages: first, building a library of tool specifications (covering real-world tools and synthetic ones), then producing a wide variety of agents and tasks, and finally generating effective multi-turn trajectories inside simulated environments.

Verifiable Rewards Gym is central to K2’s RL approach: it applies straightforward rule-based functions with binary rewards (1 for correct outputs, 0 for incorrect ones) across categories such as Math, STEM, Logic, Complex Instruction Following, Faithfulness, Coding, and Safety.

Non-verifiable rewards rely on a self-critic method: for subjective work like creative writing, K2 performs pairwise comparisons of its own responses, guided by rubrics that include core values (clarity, conversational fluency, objective interaction) and prescriptive constraints (no initial praise, no justification).

On-policy rollouts strengthen the critic’s evaluation on complex tasks that do not have clear reward functions: verifiable rewards are used during rollouts to continuously refine the critic. By transferring learning from verifiable domains, performance improves on tasks judged via non-verifiable rewards.

Before going further, let’s clarify the difference between pre-training and post-training for readers who may not be familiar with the distinction.

Pre-training vs Post-training

Pre-training describes the first stage of building an LLM, where the model is trained on vast datasets—typically gathered from the internet, books, and other sources. In this phase, the model learns next-token prediction through self-supervised learning, forming linguistic or multimodal understanding, factual knowledge, and reasoning skill. This stage demands massive compute and produces a base model that can generate text, but it may still struggle with instruction-following or aligning with human preferences.
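
To make the objective concrete, here is a minimal sketch of next-token prediction in PyTorch. The `model` and the tokenized `token_ids` are placeholders for whatever architecture and data pipeline are actually used; this illustrates the self-supervised loss, not Kimi K2's training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Self-supervised pre-training objective: predict token t+1 from tokens <= t.

    `model` is any causal LM returning logits of shape
    (batch, seq_len, vocab_size); `token_ids` is (batch, seq_len).
    """
    inputs = token_ids[:, :-1]     # every token except the last
    targets = token_ids[:, 1:]     # every token except the first
    logits = model(inputs)         # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```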

Post-training refers to the set of methods applied after pre-training to shape the model’s behavior so it becomes more useful and safe. This includes supervised fine-tuning (SFT) on high-quality instruction-following datasets and reinforcement learning from human feedback (RLHF) to bring outputs closer to human values. Post-training converts a raw pre-trained model into one that can reliably follow instructions, hold conversations, and behave in ways people expect.

As mentioned earlier, the focus here is Kimi K2’s post-training pipeline, which merges large-scale synthetic tool-use data for SFT with a unified RL framework that uses both verifiable rewards and self-critic signals.

We’ll start with supervised fine-tuning and then move into reinforcement learning.

Supervised Fine-Tuning

Supervised fine-tuning (SFT) adapts pre-trained models to particular use cases by training on labeled data. This improves performance on tasks like question answering, summarization, and dialogue.

From the prior Kimi K2 article on the token-efficient Muon optimizer, recall that Muon is not only part of K2's pre-training process—it is also applied during SFT. The researchers also recommend the optimizer to anyone planning to fine-tune the model further.
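
As a rough illustration, the sketch below shows what one SFT pass looks like: supervised next-token training on instruction-response pairs with the prompt tokens masked out of the loss. It assumes a Hugging Face-style model interface (returning `.loss` when `labels` are supplied) and uses AdamW as a stand-in; the K2 team applied Muon, whose implementation is not shown here.

```python
import torch

def run_sft(model, dataloader, lr=1e-5, steps=1000):
    """Minimal supervised fine-tuning loop over instruction-response pairs.

    Each batch holds `input_ids` (prompt + response tokens) and `labels`
    (prompt positions masked with -100 so only response tokens are scored).
    AdamW stands in for the optimizer; the K2 report applies Muon during
    SFT as well, so a Muon implementation could be swapped in here.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _, batch in zip(range(steps), dataloader):
        loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```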

The researchers also bootstrapped K2’s critic capability during the SFT phase (K2, section 3.2.2), enabling it to evaluate non-verifiable reward settings.

In the next section, we’ll outline how the SFT dataset was created.

Data Synthesis for Tool Use

Here, the researchers describe three stages:

1) Build a Repository of Tool Specifications

The first step is constructing a repository of tool specs covering both real-world tools and synthetic, LLM-generated tools. Two approaches were used to collect them: (1) the researchers gathered 3,000+ real MCP (Model Context Protocol) tools from GitHub repositories, and (2) they applied methods from WizardLM—described as “creating large amounts of instruction data with varying levels of complexity using LLM instead of humans”—to “evolve” synthetic tools.
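
The report does not publish its exact spec schema, but a tool spec in such a repository might look like the hypothetical MCP-style example below, and synthetic tools can be “evolved” by prompting an LLM to mutate existing specs. Both the field names and the evolution prompt are illustrative, not the ones used by the Kimi team.

```python
# A hypothetical tool specification in the spirit of MCP tool schemas.
# Field names are illustrative; the K2 report does not publish its exact schema.
weather_tool = {
    "name": "get_weather",
    "description": "Return current weather conditions for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}

# WizardLM-style evolution: ask an LLM to mutate an existing spec into a more
# complex or more specialized variant, growing the synthetic tool library.
EVOLVE_PROMPT = """You are given a tool specification in JSON.
Rewrite it into a new, plausible tool that is either more specialized
or requires additional parameters. Return only valid JSON.

Tool: {tool_json}"""
```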

2) Produce Agents and Tasks from Tool Sets

Next, an agent is generated for each tool-set sampled from the tool repository. The researchers created thousands of varied agents by combining system prompts with different tool bundles. For each agent configuration, they also produced tasks and evaluation rubrics.
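
A minimal sketch of this step, with hypothetical names (`AgentConfig`, `make_agents`) and a trivial system-prompt template standing in for the LLM-generated personas, tasks, and rubrics described in the report:

```python
import random
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    """One synthetic agent: a system prompt plus the tools it may call."""
    system_prompt: str
    tools: list
    tasks: list = field(default_factory=list)   # (task, rubric) pairs added later

def make_agents(tool_repository, n_agents, tools_per_agent=3):
    """Sample tool bundles and pair each with a generated system prompt."""
    agents = []
    for _ in range(n_agents):
        bundle = random.sample(tool_repository, k=tools_per_agent)
        names = ", ".join(t["name"] for t in bundle)
        prompt = f"You are an assistant that solves tasks using these tools: {names}."
        agents.append(AgentConfig(system_prompt=prompt, tools=bundle))
    return agents
```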

3) Generate Multi-turn Trajectories

Finally, trajectories are generated for each agent and task. The researchers built simulated environments where tool calls were executed and persistent state was maintained. They recorded interactions between synthetic user agents and tool-using agents as multi-turn trajectories, keeping only those interactions that were successful according to predefined rubrics.
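
The sketch below outlines this generate-then-filter loop under assumed interfaces: `user_sim`, `agent`, `env`, and `judge` are hypothetical objects standing in for the synthetic user, the tool-using agent, the simulated stateful environment, and the rubric-based quality filter.

```python
def generate_trajectories(agent, user_sim, env, judge, tasks, max_turns=8):
    """Roll out a synthetic user against a tool-using agent in a simulated
    environment, keeping only trajectories the rubric judge marks successful."""
    kept = []
    for task, rubric in tasks:
        trajectory, state = [], env.reset(task)
        for _ in range(max_turns):
            user_msg = user_sim.next_message(state, trajectory)
            agent_msg, tool_calls = agent.respond(user_msg, trajectory)
            results = [env.execute(call) for call in tool_calls]   # persistent state
            trajectory.append({"user": user_msg, "agent": agent_msg, "tools": results})
            if env.done(state):
                break
        if judge.passes(trajectory, rubric):     # rubric-based quality filter
            kept.append(trajectory)
    return kept
```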

Reinforcement Learning

Kimi K1.5: Scaling Reinforcement Learning with LLMs showed how novel techniques can make RL effective at scale. RL is often viewed as more token-efficient and better at generalization than SFT, making it an important area for optimization. In this section, we’ll examine K2’s Verifiable Rewards Gym (Section 3.2.1 of the Tech Report) and the rubrics used for non-verifiable rewards.

Verifiable Rewards

Reinforcement Learning with Verifiable Rewards (RLVR) relies on straightforward, rule-driven functions to check whether a model’s response is correct. The reward signal is binary: a 1 is given for correct results and a 0 for incorrect ones. In Kimi K2’s case, the evaluation criteria can be as simple as whether a coding solution passes predefined test cases.
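
For intuition, here is a minimal coding-domain verifier that returns a binary reward based on whether a candidate solution passes its unit tests. It runs the code in a local subprocess for simplicity; K2's actual evaluation executes code in Kubernetes-based sandboxes, so treat this as an assumption-laden sketch rather than their harness.

```python
import subprocess
import tempfile
import textwrap

def code_reward(solution: str, test_code: str, timeout: int = 10) -> int:
    """Binary verifiable reward: 1 if the candidate solution passes the tests, else 0.

    The solution and its unit tests are written to a temp file and executed in a
    subprocess; any failure, error, or timeout yields reward 0. A real setup would
    run this inside an isolated sandbox rather than the local interpreter.
    """
    program = textwrap.dedent(solution) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return 1 if result.returncode == 0 else 0
    except subprocess.TimeoutExpired:
        return 0
```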

Moonshot expanded this concept into a Verifiable Rewards Gym—an extensible library of task templates with clear evaluation logic—built from datasets spanning the domains outlined below (a minimal sketch of such a registry follows the table):

| Domain | Techniques / Data Sources | Focus Areas | Evaluation Methods |
| --- | --- | --- | --- |
| Math, STEM, and Logic | Expert annotations, internal QA extraction pipelines, open datasets (e.g., NuminaMath, AIMO-2) | Multi-hop tabular reasoning; logic puzzles (24-game, Sudoku, riddles, cryptarithms, Morse code decoding), all of moderate task difficulty | Tagging to increase coverage of under-covered domains; difficulty filtering using the SFT model's pass@k accuracy |
| Complex Instruction Following | Two verification mechanisms: (1) a code interpreter for instructions with verifiable outputs (e.g., length or style constraints) and (2) LLM-as-judge for more nuanced evaluation; training data from expert-crafted prompts, automated instruction augmentation (inspired by AutoIF), and a model fine-tuned to generate edge cases | Instruction following, edge-case robustness, consistency over dialogues | Rubric-based scoring; a "hack-check" layer to catch completions that merely claim to have followed instructions |
| Faithfulness | Sentence-level faithfulness judge trained with the FACTS Grounding framework; verification of factual grounding for self-generated reasoning chains; automated detection of unsupported claims in outputs | Factual accuracy, grounding verification, claim validation | Automated faithfulness scoring, unsupported-claim detection |
| Coding & Software Engineering | Open-source coding datasets (e.g., OpenCoder, KodCode), human-written unit tests from pre-training data, GitHub PRs and issues | Competitive programming, pull request generation, multi-file reasoning | Unit test pass rates; execution in real Kubernetes-based sandboxes (K1.5, section 2.6.4) |
| Safety | Human-curated seed prompts; a prompt-evolution pipeline with an attack model, a target model, and a judge model | Jailbreak detection, toxic or harmful outputs | The attack model crafts adversarial prompts to probe the target model, while the judge model assesses the response and awards a binary reward (success/failure) based on a task-specific rubric |
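
Below is a minimal sketch of what such an extensible registry of verifiable task templates might look like. `VerifiableTask`, `register`, and the example entry are hypothetical; they only illustrate the pattern of pairing a prompt with rule-based, binary evaluation logic.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerifiableTask:
    domain: str                      # e.g. "math", "coding", "logic", "safety"
    prompt: str
    verify: Callable[[str], int]     # model output -> binary reward (0 or 1)

REGISTRY: list[VerifiableTask] = []

def register(task: VerifiableTask) -> None:
    """Add a task template with its rule-based verifier to the gym."""
    REGISTRY.append(task)

# Example entry: an exact-match math task with a binary reward.
register(VerifiableTask(
    domain="math",
    prompt="What is 17 * 24? Answer with the number only.",
    verify=lambda output: int(output.strip() == "408"),
))
```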

Non-verifiable Rewards

For tasks driven by subjective preferences—such as creative writing and open-ended question answering—a self-critic reward is applied. In these cases, K2 conducts pairwise comparisons between its own candidate outputs, guided by the rubric categories in the table below; a sketch of this pairwise comparison follows the table.

| Category | Rubric | Description |
| --- | --- | --- |
| Core: encompasses Kimi's fundamental values as a helpful AI assistant | Clarity & Relevance | Be concise, stay on-topic, avoid unnecessary details |
| Core | Conversational Fluency | Natural dialogue, appropriate engagement, judicious follow-ups |
| Core | Objective Interaction | Stay grounded, avoid metacommentary and excessive praise |
| Prescriptive: aims to eliminate reward hacking | No Initial Praise | Don't start with "Great question!" or similar compliments |
| Prescriptive | No Justification | Don't explain why your response is good or successful |
| Human-annotated: for specific instructional contexts | Varies | Varies |
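
Here is a sketch of the pairwise self-critic comparison with the rubric folded into a judging prompt. The prompt wording and the `model.generate` call are assumptions standing in for K2's actual critic setup.

```python
CRITIC_PROMPT = """You are judging two candidate responses to the same prompt.
Apply the rubric below and answer with exactly "A" or "B".

Rubric:
- Clarity & Relevance: concise, on-topic, no unnecessary detail.
- Conversational Fluency: natural dialogue, judicious follow-ups.
- Objective Interaction: grounded, no metacommentary or excessive praise.
- No Initial Praise: must not open with compliments such as "Great question!".
- No Justification: must not argue for its own quality.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}"""

def self_critic_preference(model, prompt, response_a, response_b):
    """Pairwise self-critique: the model judges two of its own candidates.

    Returns +1 if A is preferred, -1 if B is preferred; `model.generate`
    is a stand-in for whatever inference API is actually in use.
    """
    verdict = model.generate(CRITIC_PROMPT.format(
        prompt=prompt, response_a=response_a, response_b=response_b)).strip()
    return 1 if verdict.upper().startswith("A") else -1
```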

Rollouts

In reinforcement learning and agent development, rollouts describe running an agent through episodes—or sequences of interactions with an environment—to gather experience data. During a rollout, the agent follows its current policy, takes actions, receives observations and rewards, and continues until the episode ends, either naturally or after a maximum number of steps. The result is a trajectory, meaning a sequence of state-action-reward tuples that can be used for learning.

In this setup, on-policy rollouts with verifiable rewards were used to repeatedly refine the critic, increasing its evaluation accuracy under the newest policy. Put differently, verifiable reward signals were used to improve how non-verifiable rewards are estimated.
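
A rough sketch of this idea, reusing the `VerifiableTask` interface from the registry sketch above: the policy's own samples are scored by the rule-based verifier, and pairs where the verifier distinguishes a better response become training signal for the critic. `policy.generate` and `critic.update` are hypothetical APIs, not K2's actual training code.

```python
def refine_critic_on_rollouts(policy, critic, verifiable_tasks):
    """Sketch of critic refinement with verifiable rewards during on-policy rollouts.

    For each verifiable task, sample a pair of responses from the current policy,
    score them with the rule-based verifier, and update the critic so its pairwise
    preference agrees with the verifier. The better-calibrated critic is then used
    to rank responses on non-verifiable tasks.
    """
    for task in verifiable_tasks:
        responses = [policy.generate(task.prompt) for _ in range(2)]
        rewards = [task.verify(r) for r in responses]          # ground-truth 0/1 signal
        if rewards[0] != rewards[1]:                            # only informative pairs
            preferred, rejected = (
                (responses[0], responses[1]) if rewards[0] > rewards[1]
                else (responses[1], responses[0])
            )
            critic.update(task.prompt, preferred, rejected)     # hypothetical API
    return critic
```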

For readers looking to strengthen their intuition around Reinforcement Learning, we recommend the Hugging Face Deep Reinforcement Learning Course.

Note

We didn’t cover RL infrastructure in this article (we plan to add it in a future update); in the meantime, we encourage interested readers to consult the Kimi papers (K1.5, section 2.6 and K2, section 3.3 and Appendix G).

Conclusion

By pairing large-scale synthetic tool-use data for SFT with both verifiable and self-critic rewards in RL, Kimi K2 presents a strong approach to aligning model behavior. With its emphasis on post-training for agentic models—especially in the “Era of Experience”—Kimi K2 stands out as a noteworthy model in the push toward more intelligent and adaptable AI systems.

Source: digitalocean.com
