RL Environments for LLMs and Autonomous AI Systems
A continuing area of strong interest for AI researchers and engineers is the use of LLMs in end-to-end autonomous systems built with multi-agent architectures. Although LLMs are powerful on their own, the industry is increasingly looking toward Reinforcement Learning (RL) environments to unlock more practical value from them.
RL environments are not a new concept; they existed before LLMs became widely used. In fact, it is difficult to discuss agents without also discussing environments. In a typical RL setting, an environment gives an agent a reward or a penalty based on the action the agent takes within that environment. The agent then has to adjust its behaviour to increase its total reward over time. This process of adapting to maximize reward is the core idea behind reinforcement learning.
With the rise of LLMs, the agent is often a model. The model’s weights are updated based on how its attempts at different tasks are scored, allowing it to improve over time. Computer use, meaning an AI system that can operate and navigate a computer, is an especially interesting multi-agent task. This topic has also been explored in relation to research on scaling computer-use data with multi-agent pipelines for models such as Fara-7B.
Key Takeaways
- The industry is increasingly using Reinforcement Learning (RL) environments to gain more practical value from Large Language Models (LLMs).
- An RL environment gives a reward or penalty for an action taken by an agent, often an LLM or model, which encourages the agent to adapt its weights and maximize cumulative reward.
- Unlike subjective rewards used in RLHF, RLVR relies on objective and verifiable rewards, such as those found in math and coding tasks. These rewards are difficult to exploit and help ensure that the model develops the intended reasoning abilities.
- Organizations are using RL environments, often referred to as harnesses or UI gyms, to train models for specific use inside their own software products, such as code assistants, development environments, and agent-based workflows.
Reinforcement Learning from Verifiable Rewards
The recent rise in discussion and interest around RL environments can likely be linked to the success of RLVR, or Reinforcement Learning from Verifiable Rewards. In this approach, tasks can be checked objectively, such as with mathematics or code. The key reason RLVR is effective is that verifiable rewards are difficult to game. A non-gameable reward function is connected directly to a successful and measurable task outcome, such as solving a problem or passing a test case. This makes it difficult for an LLM to receive a high reward without actually learning the reasoning and problem-solving strategies required to complete the task, reducing the risk of reward hacking.
RL Environments for Products
Models can be trained for a specific product by placing them inside a harness, which is essentially an RL environment that represents the product. Examples include AI coding workflows, code generation tools, and interactive development agents. In a similar direction, companies are beginning to create environments around their own software. These environments are often described as UI gyms.
Creating an RL Environment
There are many possible ways to build RL environments. The first step is to define the goal. What should the model be able to accomplish? After that, a framework needs to be selected. Depending on the framework, the environment will be described and implemented in different ways.
Possible frameworks include Prime Intellect’s environments hub, SkyRL with reusable tools, PyTorch’s OpenEnv, and OpenAI’s Gymnasium. Thinking Machines also provides documentation and a cookbook for working with RL environments.
No matter which framework is used, the main components of the RL environment usually need to be defined clearly.
State Space
The state space is the information the agent can observe. This may include pixels from a game screen, numerical sensor data, screenshots, or other representations of the surrounding world.
Action Space
The action space includes every possible action the agent can perform. These actions may be discrete, such as pressing buttons, or continuous, such as controlling motors.
Reward Function
The reward function influences which behaviours the agent learns. Sparse rewards, which are only given when a task is completed, can be difficult to learn from. Dense rewards, which provide frequent feedback, can sometimes encourage unintended behaviours.
Episode Termination Conditions
Episode termination conditions define when a trial ends. This could happen when a goal is reached, a time limit is exceeded, or a failure state is entered.
Once these elements are defined, the next step is to implement the environment dynamics, meaning the rules that determine how states change in response to actions.
Consider What Is Needed
Start by setting up a GPU-enabled virtual machine or cloud server. Pay attention to how many GPUs are required. In this example, 4 H100 GPUs are used. A Weights and Biases account is also required.
Clone the Repository and Set Up the Environment
git clone https://github.com/NovaSky-AI/SkyRL.git
cd SkyRL
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[vllm]" ##or ".[sglang]" for alternative inference backends
uv pip install -r requirements.txt
# may need to:
snap install astral-uv
Prepare a Dataset for RL Training
SkyRL expects data in Parquet format with a schema designed for instruction and RL workflows, including prompts, completions, rewards, and similar fields.
A built-in example can be used, such as GSM8K for math reasoning. This is a useful starting point before moving toward SWE-Bench-style tasks. A custom dataset can also be prepared for a custom environment.
In this example, GSM8K data is generated in a reasoning and tool-use style with gsm8k_dataset.py.
cd skyrl-train
uv run examples/gsm8k/gsm8k_dataset.py --output_dir ~/data/gsm8k
This creates train.parquet and validation.parquet with fields such as:
promptcompletionor trajectoriesrewardfor offline or hybrid setups, while online RL computes rewards during execution
For more agentic tasks, SWE-Bench or a similar benchmark focused on verifiable tasks could be used. SkyRL also integrates the OpenHands runtime through SkyRL-OpenHands for code-editing environments.
Create a YAML Config to Configure a Training Run
Create a YAML file to configure a GRPO training run: examples/gsm8k/gsm8k-grpo.yaml.
data:
train_data: ["~/data/gsm8k/train.parquet"]
val_data: ["~/data/gsm8k/validation.parquet"]
trainer:
algorithm:
name: grpo
advantage_estimator: grpo
policy:
model:
path: Qwen/Qwen2.5-1.5B-Instruct # Start small; scale to 7B–32B
epochs: 2 # Increase for real training
strategy: fsdp2 # Or ddp for single node
placement:
colocate_all: true
policy_num_gpus_per_node: 4 # Adjust to your hardware
inference:
backend: vllm # Fast inference for rollouts
logger: wandb
Train the Model
For a single node, use the following command:
uv run -m skyrl_train.entrypoints.main_base \
--config examples/gsm8k/gsm8k-grpo.yaml \
trainer.epochs=5 \
data.train_data='["~/data/gsm8k/train.parquet"]'
For distributed training across multiple GPUs or nodes using Ray and SkyPilot, use:
sky launch skyrl_train/examples/gsm8k/gsm8k-grpo-skypilot.yaml \
--secret WANDB_API_KEY=your_key_here
Evaluate and Iterate
from skyrl.agent import SkyRLAgent
agent = SkyRLAgent.from_checkpoint("path/to/checkpoint")
result = agent.run_task(
prompt="Fix this bug in repo X: ...",
runtime="openhands", # Stateful code env
max_turns=30
)
print(result.success_rate, result.trajectory)
FAQ
What Are RL Environments?
In Reinforcement Learning (RL), an environment gives an agent, often an LLM or model, a reward or penalty for an action it performs. The agent then adapts its behaviour to maximize cumulative reward.
Why Is the Industry Using RL Environments for LLMs?
The industry is adopting RL environments to bring LLMs into end-to-end autonomous systems based on multi-agent architectures. This is increasingly seen as a practical way to gain real value from these models.
What Is Reinforcement Learning from Verifiable Rewards (RLVR)?
RLVR is a method that uses objective and verifiable rewards, such as those found in math and code tasks. These non-gameable rewards help ensure that the model develops the intended reasoning and problem-solving strategies, unlike more subjective rewards used in RLHF.
How Are RL Environments Used for Commercial Products?
Companies are building environments around their own software, often called harnesses or UI gyms, to train models for specific use inside their products. Examples include coding assistants, development tools, and agent-based software workflows.
How Can RL Environments Be Used for Synthetic Data Generation?
Environments naturally contain ground truth, such as passing unit tests, correct spreadsheet results, or accurate terminal outputs. This makes them useful for producing high-quality synthetic data.
Final Thoughts
RL environments are expected to help integrate AI models more closely into real-world use cases. They make it possible to train models specifically for particular applications. Modern cloud infrastructures can support these AI initiatives by providing the resources required for model training, inference, and agent development.


