Content

1 What This Article Covers
2 Key Takeaways
3 FaraGen: Data Generation for Computer Use Agents
4 Fara-7B
5 Evaluating Fara-7B
6 Running Fara-7B to Automate Computer Tasks
7 Final Thoughts

Vijona

Yesterday at 16:06

How Fara-7B Advances Computer Use Agent Models with Synthetic Web Task Data

Training computer use agent (CUA) models has traditionally been challenging. The main reason is the data bottleneck: there is no large-scale collection of real-world human-computer interaction data that researchers can simply use. When you consider how much text data was required to achieve the LLM performance we see today, and how the shortage of high-quality text data was addressed, the next step for improving CUA models may already seem clear.

If you are thinking of synthetic data generation, you are very close.

Because there have not been many existing CUA models available, generating synthetic data directly is difficult. However, one practical alternative is to build a scalable synthetic data generation engine for multi-step web tasks. This is exactly what researchers at Microsoft developed and described in detail in the paper introducing Fara-7B.

What This Article Covers

In this article, we explain how the researchers trained Fara-7B by first looking at how they addressed the shortage of computer-use data through a data engine called FaraGen. We also show how you can test the model yourself and observe Fara-7B completing computer-based tasks in practice.

Key Takeaways

FaraGen is a scalable synthetic data engine for web-based tasks. It uses a multi-agent system to:

Propose tasks based on real websites,
Solve tasks through collaborative agents and user feedback to create realistic trajectories,
Verify trajectories with LLM-based quality checks to produce high-fidelity data.

Fara-7B is a 7-billion-parameter Computer Use Agent model trained with data from FaraGen. It processes screenshots as input and performs complex, multi-step web tasks.

Fara-7B performs better than other CUA models of a similar size on benchmarks such as WebVoyager, Online-Mind2Web, and WebTailBench.

FaraGen: Data Generation for Computer Use Agents

Microsoft researchers introduced FaraGen, a data generation engine designed to create training data for CUA models. In this context, data refers to verified multi-step web trajectories, which FaraGen can generate at roughly one dollar per task. The FaraGen pipeline consists of three main phases: task proposal, task solving, and task verification.

Task Proposal

The task proposal stage focuses on creating realistic tasks. It asks what users would want a CUA to accomplish and what a CUA can realistically do. FaraGen uses high-value URLs from datasets such as ClueWeb22 and Tranco. ClueWeb22 was used more heavily because the researchers considered it to offer broader coverage of useful websites and fewer low-value corporate landing pages.

Some tasks were created from targeted URLs, which made up around 28% of the training data. These tasks were derived from raw URLs and refined into specific user intents that were both achievable and verifiable.

Most tasks came from agentic URL exploration, which accounted for around 67% of the training data. In this process, a multimodal LLM agent explored websites by processing screenshots and accessibility trees. The agent performed iterative actions to complete a task and refined the task based on what had already been done and the current page state.

The remaining tasks, around 5% of the training data, were generated with LLMs by mutating existing tasks into several similar variations.

Task Solving

FaraGen uses a multi-agent framework based on Magnetic-One to solve synthetic web tasks. It generates trajectories that include a complete sequence of observations, actions, and thoughts. These trajectories are then used for supervised fine-tuning to train Fara-7B. Later in this article, we describe how to run this model on a GPU-enabled server environment.

The task-solving process includes two main agents: the Orchestrator and the WebSurfer. A third agent, the UserSimulator, is activated when user input is needed. This enables multi-turn task completion. The system also includes Critical Points, which cause the model to stop and continue only after receiving user instructions.

Orchestrator Agent

The Orchestrator agent coordinates the overall process. Its primary role is to guide the WebSurfer, prevent common failure patterns, enforce Critical Points, and involve the UserSimulator agent when necessary.

The Orchestrator does this by maintaining a ledger. Based on previous and expected future actions from the WebSurfer, the Orchestrator predicts values for the ledger fields.

The is_in_loop and last_action_successful features are especially important because one of the WebSurfer agent’s most common failure modes is becoming stuck in repeated action loops.

Both the Orchestrator and WebSurfer agents can decide to stop at any point, which can create logic conflicts. If the task is not yet complete, the Orchestrator can override the WebSurfer’s stop decision.

Table 3 from the paper describes the decision hierarchy. Critical Points have the strongest authority and override all other flags, while WebSurfer stop decisions have the weakest authority. When the system forces the WebSurfer to stop, other actions are disabled instead of stopping the agent programmatically. This allows the WebSurfer to reason about why it was forced to stop, which helps Fara-7B generalize to new Critical Point scenarios.

After a task is completed, the Orchestrator identifies the URLs of task targets from the history so that verifiers can confirm whether the correct targets were reached.

WebSurfer Agent

The WebSurfer receives its instructions from the next_steps field in the ledger maintained by the Orchestrator. It performs actions such as clicking, typing, and scrolling in the browser through Playwright. The researchers used a managed browser environment to provide stable execution, ensuring that the WebSurfer’s actions on dynamic websites were completed consistently without crashes or timeouts.

UserSimulator

The UserSimulator agent is activated when the pipeline reaches a Critical Point and requires user input. It simulates human responses, such as giving consent or providing personal details, so the data generation process can continue.

Trajectory Verification

Besides the task completion flags used during the task-solving phase, FaraGen also applies several verifiers, which act as LLM judges, to check quality and correctness. Multiple verifiers are needed because different task types require different forms of evidence. Action-oriented tasks benefit from multimodal evidence checks, while information-seeking tasks require rubric-based scoring to assess quality.

Function	Target Failure Mode	Failure Example
Alignment verifier	Checks whether the final action history matches the user’s intent.	Logic errors, such as purchasing the wrong item.
Rubric verifier	Scores the trajectory against a predefined checklist of criteria.	Partial failures, such as finding a hotel but using the wrong dates.
Multimodal verifier	Reviews the final screenshot to confirm visible evidence of success.	Hallucinations, such as claiming the task is complete while the screen shows an error.

To demonstrate how effective FaraGen is, the researchers used its generated data to train Fara-7B. Fara-7B is a CUA model that understands computer interfaces only through screenshots, performs actions through predicted coordinates, and is compact enough to run on local devices.

Fara-7B

Fara-7B can be understood as a proof-of-concept distillation of the multi-agent solving system. It is based on the Qwen2.5-VL-7B vision-language model and was trained on 145,000 high-quality trajectories generated by the FaraGen pipeline. These trajectories distill multi-agent interactions into diverse task demonstrations.

The model uses supervised fine-tuning to learn from these trajectories. It also includes tasks such as grounding, refusal training, and UI question-answering to improve element localization, prevent harmful actions, and reduce hallucinations.

Task Type	Purpose	Method	Impact on Fara-7B
Grounding	Improves localization of UI elements such as buttons and links in screenshots.	More than 500,000 samples were generated to map natural language queries to screen coordinates. Omniparser and DOM annotations were used to label elements.	Improves precision for clicking and typing actions. Reduces hallucinations of interactive elements.
Refusal Training	Teaches the model to reject harmful or unsafe tasks.	Synthetic harmful tasks, such as illegal activities and phishing, were used together with public datasets such as WildGuard.	Achieves a 94.2% refusal rate on harmful tasks. Improves safety and compliance.
UI Q&A and Captioning	Strengthens understanding of webpage content and context.	Question-answer pairs and captions were generated from webpage screenshots, with a focus on extracting factual information.	Reduces hallucinations in responses. Improves accuracy when answering user questions about web content.

Fara-7B interprets browser interactions through screenshots, while its internal reasoning and state history are stored as text. Using the latest screenshots together with a full record of previous actions, Fara-7B determines the next action and the required arguments, such as click-location coordinates.

With only 7 billion parameters, Fara-7B achieves state-of-the-art performance for its size. It outperforms comparable models such as UI-TARS-1.5-7B on benchmarks including WebVoyager, where it reaches 73.5% accuracy, and WebTailBench, where it reaches 38.4%. It also remains competitive with much larger models such as GPT-4o.

Evaluating Fara-7B

The researchers evaluated Fara-7B on WebVoyager, Online-Mind2Web, and DeepShop. They also created their own benchmark called WebTailBench.

WebTailBench

WebTailBench includes 609 hand-verified tasks across 11 categories. These categories include shopping, flights, hotels, real estate, job applications, and multi-item shopping lists. The benchmark emphasizes realism by using high-traffic webpages and improves task diversity by covering underrepresented or missing scenarios in existing benchmarks, such as comparison shopping.

WebTailBench supports objective evaluation through goal-oriented tasks and a verification system aligned with human judgment. It also addresses task complexity through multi-step and cross-site challenges. The benchmark is designed for reproducible evaluations and is released together with its verification tools.

Running Fara-7B to Automate Computer Tasks

Training this 7-billion-parameter model required 64 H100 GPUs and 2.5 days. However, this article does not explain how to train Fara-7B from scratch. Instead, it shows how to run the model with Magnetic-UI, where a single H100 GPU server is sufficient.

Start by setting up a GPU-enabled server.

After the server is ready, copy the public IPv4 credentials and connect to it through SSH from your preferred code editor.

In the terminal, run the following commands:

Copy Code

python3 -m venv .venv source .venv/bin/activate pip install magentic-ui[fara] vllm serve "microsoft/Fara-7B" --port 5001 --dtype auto

In your code editor, create a file named fara_config.yaml and paste the following configuration:

Copy Code

model_config_local_surfer: &client_surfer provider: OpenAIChatCompletionClient config: model: "microsoft/Fara-7B" base_url: http://localhost:5001/v1 api_key: not-needed model_info: vision: true function_calling: true json_output: false family: "unknown" structured_output: false multiple_system_messages: false orchestrator_client: *client_surfer coder_client: *client_surfer web_surfer_client: *client_surfer file_surfer_client: *client_surfer action_guard_client: *client_surfer model_client: *client_surfer

Then start Magnetic-UI with the Fara agent:

Copy Code

magentic-ui --fara --port 8081 --config fara_config.yaml

Fara-7B produced the exact correct answer.

Final Thoughts

Fara-7B is an impressive model. High-quality synthetic data at scale can effectively address the data scarcity problem that has slowed the development of Computer Use Agents. This 7-billion-parameter model understands the world through screenshots and can complete complex, multi-step web tasks with state-of-the-art accuracy, giving it significant potential.

It will be interesting to see how progress in related areas of AI research, including computer use, code generation, inference optimization, and other active fields, influences adoption across scalable and high-impact products and use cases.

Source: digitalocean.com

Create a Free Account

Try now

Posts you might be interested in:

Moderne Hosting Services mit Cloud Server, Managed Server und skalierbarem Cloud Hosting für professionelle IT-Infrastrukturen

Kimi Linear: Efficient Long-Context AI Inference

AI/ML, Tutorial

5 hours ago

VijonaToday at 14:12 Kimi Linear: A Hardware-Aware Architecture for Efficient Long-Context AI Inference Moonshot AI has introduced another notable release. After the strong impression created by Kimi-K2 and its post-training…

Apache Airflow: Workflow Orchestration Guide

Python, Tutorial

6 hours ago

VijonaToday at 13:48 Apache Airflow: Workflow Orchestration for Data Pipelines Modern organizations that work with data depend on pipelines that collect, transform, enhance, and transfer information from one place to…

Build Faster Agentic LLM Workflows with Python

AI/ML, Tutorial

6 hours ago

VijonaToday at 13:20 Build Faster Agentic LLM Workflows with Asynchronous Python Calls Large language models can be difficult to run reliably in production because they may introduce inaccurate answers, inconsistent…

FEATURED PRODUCTS

Kubernetes

ccloud³

Managed Server

Cloud GPU

S3 Object Storage

COMPUTE

MANAGED

STORAGE

NETWORKING

MANAGEMENT TOOLS

BACKUPS & SNAPSHOTS

WEBSITE HOSTING

HOUSING

FEATURED INDUSTRIES

Enterprise

Saas-Hosting

Startup

INDUSTRIES

MORE INDUSTRIES

FEATURED USE CASES

Linux-Hosting

VMware Migration

Docker Hosting

USE CASES

MORE USE CASES

RESSOURCES

Help Center

Trust Center

Glossar

Tutorials

MORE CENTRON

MORE INFOS

FEATURED PRODUCTS

Kubernetes

ccloud³

Managed Server

Cloud GPU

S3 Object Storage

COMPUTE

MANAGED

STORAGE

NETWORKING

MANAGEMENT TOOLS

BACKUPS & SNAPSHOTS

WEBSITE HOSTING

HOUSING

FEATURED INDUSTRIES

Enterprise

Saas-Hosting

Startup

INDUSTRIES

MORE INDUSTRIES

FEATURED USE CASES

Linux-Hosting

VMware Migration

Docker Hosting

USE CASES

MORE USE CASES

RESSOURCES

Help Center

Trust Center

Glossar

Tutorials

MORE CENTRON

MORE INFOS

How Fara-7B Advances Computer Use Agent Models with Synthetic Web Task Data

What This Article Covers

Key Takeaways

FaraGen: Data Generation for Computer Use Agents

Task Proposal

Task Solving

Orchestrator Agent

WebSurfer Agent

UserSimulator

Trajectory Verification

Fara-7B

Evaluating Fara-7B

WebTailBench

Running Fara-7B to Automate Computer Tasks