Content

1 Key Takeaways
2 Reinforcement Learning from Verifiable Rewards
3 RL Environments for Products
4 Creating an RL Environment
5 Consider What Is Needed
6 Clone the Repository and Set Up the Environment
7 Prepare a Dataset for RL Training
8 Create a YAML Config to Configure a Training Run
9 Train the Model
10 Evaluate and Iterate
11 FAQ
12 Final Thoughts

Vijona

1 hour ago

RL Environments for LLMs and Autonomous AI Systems

A continuing area of strong interest for AI researchers and engineers is the use of LLMs in end-to-end autonomous systems built with multi-agent architectures. Although LLMs are powerful on their own, the industry is increasingly looking toward Reinforcement Learning (RL) environments to unlock more practical value from them.

RL environments are not a new concept; they existed before LLMs became widely used. In fact, it is difficult to discuss agents without also discussing environments. In a typical RL setting, an environment gives an agent a reward or a penalty based on the action the agent takes within that environment. The agent then has to adjust its behaviour to increase its total reward over time. This process of adapting to maximize reward is the core idea behind reinforcement learning.

With the rise of LLMs, the agent is often a model. The model’s weights are updated based on how its attempts at different tasks are scored, allowing it to improve over time. Computer use, meaning an AI system that can operate and navigate a computer, is an especially interesting multi-agent task. This topic has also been explored in relation to research on scaling computer-use data with multi-agent pipelines for models such as Fara-7B.

Key Takeaways

The industry is increasingly using Reinforcement Learning (RL) environments to gain more practical value from Large Language Models (LLMs).
An RL environment gives a reward or penalty for an action taken by an agent, often an LLM or model, which encourages the agent to adapt its weights and maximize cumulative reward.
Unlike subjective rewards used in RLHF, RLVR relies on objective and verifiable rewards, such as those found in math and coding tasks. These rewards are difficult to exploit and help ensure that the model develops the intended reasoning abilities.
Organizations are using RL environments, often referred to as harnesses or UI gyms, to train models for specific use inside their own software products, such as code assistants, development environments, and agent-based workflows.

Reinforcement Learning from Verifiable Rewards

The recent rise in discussion and interest around RL environments can likely be linked to the success of RLVR, or Reinforcement Learning from Verifiable Rewards. In this approach, tasks can be checked objectively, such as with mathematics or code. The key reason RLVR is effective is that verifiable rewards are difficult to game. A non-gameable reward function is connected directly to a successful and measurable task outcome, such as solving a problem or passing a test case. This makes it difficult for an LLM to receive a high reward without actually learning the reasoning and problem-solving strategies required to complete the task, reducing the risk of reward hacking.

RL Environments for Products

Models can be trained for a specific product by placing them inside a harness, which is essentially an RL environment that represents the product. Examples include AI coding workflows, code generation tools, and interactive development agents. In a similar direction, companies are beginning to create environments around their own software. These environments are often described as UI gyms.

Creating an RL Environment

There are many possible ways to build RL environments. The first step is to define the goal. What should the model be able to accomplish? After that, a framework needs to be selected. Depending on the framework, the environment will be described and implemented in different ways.

Possible frameworks include Prime Intellect’s environments hub, SkyRL with reusable tools, PyTorch’s OpenEnv, and OpenAI’s Gymnasium. Thinking Machines also provides documentation and a cookbook for working with RL environments.

No matter which framework is used, the main components of the RL environment usually need to be defined clearly.

State Space

The state space is the information the agent can observe. This may include pixels from a game screen, numerical sensor data, screenshots, or other representations of the surrounding world.

Action Space

The action space includes every possible action the agent can perform. These actions may be discrete, such as pressing buttons, or continuous, such as controlling motors.

Reward Function

The reward function influences which behaviours the agent learns. Sparse rewards, which are only given when a task is completed, can be difficult to learn from. Dense rewards, which provide frequent feedback, can sometimes encourage unintended behaviours.

Episode Termination Conditions

Episode termination conditions define when a trial ends. This could happen when a goal is reached, a time limit is exceeded, or a failure state is entered.

Once these elements are defined, the next step is to implement the environment dynamics, meaning the rules that determine how states change in response to actions.

Consider What Is Needed

Start by setting up a GPU-enabled virtual machine or cloud server. Pay attention to how many GPUs are required. In this example, 4 H100 GPUs are used. A Weights and Biases account is also required.

Clone the Repository and Set Up the Environment

Copy Code

git clone https://github.com/NovaSky-AI/SkyRL.git cd SkyRL uv venv .venv source .venv/bin/activate uv pip install -e ".[vllm]" ##or ".[sglang]" for alternative inference backends uv pip install -r requirements.txt # may need to: snap install astral-uv

Prepare a Dataset for RL Training

SkyRL expects data in Parquet format with a schema designed for instruction and RL workflows, including prompts, completions, rewards, and similar fields.

A built-in example can be used, such as GSM8K for math reasoning. This is a useful starting point before moving toward SWE-Bench-style tasks. A custom dataset can also be prepared for a custom environment.

In this example, GSM8K data is generated in a reasoning and tool-use style with gsm8k_dataset.py.

Copy Code

cd skyrl-train uv run examples/gsm8k/gsm8k_dataset.py --output_dir ~/data/gsm8k

This creates train.parquet and validation.parquet with fields such as:

prompt
completion or trajectories
reward for offline or hybrid setups, while online RL computes rewards during execution

For more agentic tasks, SWE-Bench or a similar benchmark focused on verifiable tasks could be used. SkyRL also integrates the OpenHands runtime through SkyRL-OpenHands for code-editing environments.

Create a YAML Config to Configure a Training Run

Create a YAML file to configure a GRPO training run: examples/gsm8k/gsm8k-grpo.yaml.

Copy Code

data: train_data: ["~/data/gsm8k/train.parquet"] val_data: ["~/data/gsm8k/validation.parquet"] trainer: algorithm: name: grpo advantage_estimator: grpo policy: model: path: Qwen/Qwen2.5-1.5B-Instruct # Start small; scale to 7B–32B epochs: 2 # Increase for real training strategy: fsdp2 # Or ddp for single node placement: colocate_all: true policy_num_gpus_per_node: 4 # Adjust to your hardware inference: backend: vllm # Fast inference for rollouts logger: wandb

Train the Model

For a single node, use the following command:

Copy Code


uv run -m skyrl_train.entrypoints.main_base \
  --config examples/gsm8k/gsm8k-grpo.yaml \
  trainer.epochs=5 \
  data.train_data='["~/data/gsm8k/train.parquet"]'

For distributed training across multiple GPUs or nodes using Ray and SkyPilot, use:

Copy Code

sky launch skyrl_train/examples/gsm8k/gsm8k-grpo-skypilot.yaml \ --secret WANDB_API_KEY=your_key_here

Evaluate and Iterate

Copy Code


from skyrl.agent import SkyRLAgent

agent = SkyRLAgent.from_checkpoint("path/to/checkpoint")
result = agent.run_task(
    prompt="Fix this bug in repo X: ...",
    runtime="openhands",   # Stateful code env
    max_turns=30
)
print(result.success_rate, result.trajectory)

FAQ

What Are RL Environments?

In Reinforcement Learning (RL), an environment gives an agent, often an LLM or model, a reward or penalty for an action it performs. The agent then adapts its behaviour to maximize cumulative reward.

Why Is the Industry Using RL Environments for LLMs?

The industry is adopting RL environments to bring LLMs into end-to-end autonomous systems based on multi-agent architectures. This is increasingly seen as a practical way to gain real value from these models.

What Is Reinforcement Learning from Verifiable Rewards (RLVR)?

RLVR is a method that uses objective and verifiable rewards, such as those found in math and code tasks. These non-gameable rewards help ensure that the model develops the intended reasoning and problem-solving strategies, unlike more subjective rewards used in RLHF.

How Are RL Environments Used for Commercial Products?

Companies are building environments around their own software, often called harnesses or UI gyms, to train models for specific use inside their products. Examples include coding assistants, development tools, and agent-based software workflows.

How Can RL Environments Be Used for Synthetic Data Generation?

Environments naturally contain ground truth, such as passing unit tests, correct spreadsheet results, or accurate terminal outputs. This makes them useful for producing high-quality synthetic data.

Final Thoughts

RL environments are expected to help integrate AI models more closely into real-world use cases. They make it possible to train models specifically for particular applications. Modern cloud infrastructures can support these AI initiatives by providing the resources required for model training, inference, and agent development.

Source: digitalocean.com

Create a Free Account

Try now

Posts you might be interested in:

Moderne Hosting Services mit Cloud Server, Managed Server und skalierbarem Cloud Hosting für professionelle IT-Infrastrukturen

vLLM GPU Sizing Guide for LLM Inference

AI/ML, Tutorial

20 minutes ago

Vijona20 minutes ago How to Size and Configure GPUs for vLLM Inference Effective GPU sizing and configuration for vLLM inference begins with a solid understanding of the two main phases…

Boltz-2 AI Model: Breakthrough in Drug Discovery & Binding Affinity Prediction

AI/ML, Tutorial

5 days ago

Vijona18 Jun at 11:07 Boltz-2: Open-Source AI for Biomolecular Structure Prediction and Drug Discovery Drug discovery often requires 10 to 15 years and can cost billions of dollars, while failure…

How to Create Agent Skills for LLMs: Build Modular AI Workflows

AI/ML, Tutorial

5 days ago

Vijona18 Jun at 10:38 How to Create and Use Agent Skills Agent Skills are folders that contain instructions, scripts, and supporting resources that a Large Language Model (LLM) can load…

FEATURED PRODUCTS

Kubernetes

ccloud³

Managed Server

Cloud GPU

S3 Object Storage

COMPUTE

MANAGED

STORAGE

NETWORKING

MANAGEMENT TOOLS

BACKUPS & SNAPSHOTS

WEBSITE HOSTING

HOUSING

FEATURED INDUSTRIES

Enterprise

Saas-Hosting

Startup

INDUSTRIES

MORE INDUSTRIES

FEATURED USE CASES

Linux-Hosting

VMware Migration

Docker Hosting

USE CASES

MORE USE CASES

RESSOURCES

Help Center

Trust Center

Glossar

Tutorials

MORE CENTRON

MORE INFOS

FEATURED PRODUCTS

Kubernetes

ccloud³

Managed Server

Cloud GPU

S3 Object Storage

COMPUTE

MANAGED

STORAGE

NETWORKING

MANAGEMENT TOOLS

BACKUPS & SNAPSHOTS

WEBSITE HOSTING

HOUSING

FEATURED INDUSTRIES

Enterprise

Saas-Hosting

Startup

INDUSTRIES

MORE INDUSTRIES

FEATURED USE CASES

Linux-Hosting

VMware Migration

Docker Hosting

USE CASES

MORE USE CASES

RESSOURCES

Help Center

Trust Center

Glossar

Tutorials

MORE CENTRON

MORE INFOS

RL Environments for LLMs and Autonomous AI Systems

Key Takeaways

Reinforcement Learning from Verifiable Rewards

RL Environments for Products

Creating an RL Environment

State Space

Action Space

Reward Function

Episode Termination Conditions

Consider What Is Needed

Clone the Repository and Set Up the Environment

Prepare a Dataset for RL Training

Create a YAML Config to Configure a Training Run

Train the Model