How Fara-7B Advances Computer Use Agent Models with Synthetic Web Task Data
Training computer use agent (CUA) models has traditionally been challenging. The main reason is the data bottleneck: there is no large-scale collection of real-world human-computer interaction data that researchers can simply use. When you consider how much text data was required to achieve the LLM performance we see today, and how the shortage of high-quality text data was addressed, the next step for improving CUA models may already seem clear.
If you are thinking of synthetic data generation, you are very close.
Because few capable CUA models exist, generating synthetic data directly from one is difficult. A practical alternative is to build a scalable synthetic data generation engine for multi-step web tasks. This is what researchers at Microsoft developed and describe in detail in the paper introducing Fara-7B.
What This Article Covers
In this article, we explain how the researchers trained Fara-7B by first looking at how they addressed the shortage of computer-use data through a data engine called FaraGen. We also show how you can test the model yourself and observe Fara-7B completing computer-based tasks in practice.
Key Takeaways
FaraGen is a scalable synthetic data engine for web-based tasks. It uses a multi-agent system to:
- Propose tasks based on real websites,
- Solve tasks through collaborative agents and user feedback to create realistic trajectories,
- Verify trajectories with LLM-based quality checks to produce high-fidelity data.
Fara-7B is a 7-billion-parameter Computer Use Agent model trained with data from FaraGen. It processes screenshots as input and performs complex, multi-step web tasks.
Fara-7B performs better than other CUA models of a similar size on benchmarks such as WebVoyager, Online-Mind2Web, and WebTailBench.
FaraGen: Data Generation for Computer Use Agents
Microsoft researchers introduced FaraGen, a data generation engine designed to create training data for CUA models. In this context, data refers to verified multi-step web trajectories, which FaraGen can generate at roughly one dollar per task. The FaraGen pipeline consists of three main phases: task proposal, task solving, and task verification.
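To make the three phases concrete, here is a minimal sketch of how such a pipeline could be wired together. It is not FaraGen's code: the Trajectory class, the run_pipeline function, and the stub callables are hypothetical names chosen for illustration, and the real phases are driven by LLM agents rather than lambdas.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trajectory:
    # Hypothetical container: the task plus the recorded steps.
    task: str
    steps: list[str] = field(default_factory=list)


def run_pipeline(
    seed_urls: list[str],
    propose: Callable[[str], str],
    solve: Callable[[str], Optional[Trajectory]],
    verify: Callable[[Trajectory], bool],
) -> list[Trajectory]:
    """Propose a task per URL, attempt it, and keep only verified trajectories."""
    kept = []
    for url in seed_urls:
        task = propose(url)        # phase 1: task proposal
        trajectory = solve(task)   # phase 2: task solving
        if trajectory is not None and verify(trajectory):  # phase 3: verification
            kept.append(trajectory)
    return kept


# Stub phases, standing in for the LLM-driven agents.
demo = run_pipeline(
    ["https://example.com/a", "https://example.com/b"],
    propose=lambda url: f"Find the contact page on {url}",
    solve=lambda task: Trajectory(task, ["open page", "click Contact"]),
    verify=lambda t: t.task.endswith("/a"),
)
print(len(demo))  # 1
```

The point of the structure is that unverified trajectories never reach the training set, so the verifier acts as the quality gate for everything upstream.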
Task Proposal
The task proposal stage focuses on creating realistic tasks. It asks what users would want a CUA to accomplish and what a CUA can realistically do. FaraGen uses high-value URLs from datasets such as ClueWeb22 and Tranco. ClueWeb22 was used more heavily because the researchers considered it to offer broader coverage of useful websites and fewer low-value corporate landing pages.
Some tasks were created from targeted URLs, which made up around 28% of the training data. These tasks were derived from raw URLs and refined into specific user intents that were both achievable and verifiable.
Most tasks came from agentic URL exploration, which accounted for around 67% of the training data. In this process, a multimodal LLM agent explored websites by processing screenshots and accessibility trees. The agent took actions step by step and, after each one, refined the task description based on the actions taken so far and the current page state.
The remaining tasks, around 5% of the training data, were generated with LLMs by mutating existing tasks into several similar variations.
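As a quick way to read these proportions, the sketch below splits a task budget across the three sources. The percentages are the approximate figures quoted above; the source labels and the allocate_tasks helper are hypothetical, and the rounding rule is our own choice.

```python
# Approximate shares quoted in this article, in percent.
TASK_SOURCE_SHARES = {
    "targeted_url": 28,
    "agentic_exploration": 67,
    "task_mutation": 5,
}


def allocate_tasks(total, shares=TASK_SOURCE_SHARES):
    """Split `total` tasks across sources in proportion to integer percent shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    counts = {name: total * pct // 100 for name, pct in shares.items()}
    # Give any rounding leftovers to the largest source.
    leftover = total - sum(counts.values())
    largest = max(shares, key=shares.get)
    counts[largest] += leftover
    return counts


print(allocate_tasks(1000))
# {'targeted_url': 280, 'agentic_exploration': 670, 'task_mutation': 50}
```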
Task Solving
FaraGen uses a multi-agent framework based on Magentic-One to solve synthetic web tasks. It generates trajectories that include a complete sequence of observations, actions, and thoughts. These trajectories are then used for supervised fine-tuning to train Fara-7B. Later in this article, we describe how to run this model on a GPU-enabled server environment.
The task-solving process includes two main agents: the Orchestrator and the WebSurfer. A third agent, the UserSimulator, is activated when user input is needed. This enables multi-turn task completion. The system also defines Critical Points: moments, such as giving consent or entering personal details, at which the agent must stop and continue only after receiving user instructions.
Orchestrator Agent
The Orchestrator agent coordinates the overall process. Its main responsibilities are to guide the WebSurfer, prevent common failure patterns, enforce Critical Points, and involve the UserSimulator agent when necessary.
The Orchestrator does this by maintaining a ledger. Based on previous and expected future actions from the WebSurfer, the Orchestrator predicts values for the ledger fields.
The is_in_loop and last_action_successful fields are especially important because one of the WebSurfer agent’s most common failure modes is becoming stuck in repeated action loops.
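A ledger of this kind can be pictured as a small record that is rewritten after every step. The sketch below is only an illustration: is_in_loop, last_action_successful, and next_steps are the field names mentioned in this article, while task_complete, at_critical_point, and the looks_stuck heuristic are hypothetical additions. In FaraGen the Orchestrator predicts these values with an LLM rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Ledger:
    # Field names mentioned in the article.
    is_in_loop: bool
    last_action_successful: bool
    next_steps: str
    # Hypothetical fields, added for illustration only.
    task_complete: bool = False
    at_critical_point: bool = False


def looks_stuck(recent_actions, window=3):
    """Rule-based stand-in for is_in_loop: the last `window` actions are identical."""
    if len(recent_actions) < window:
        return False
    return len(set(recent_actions[-window:])) == 1


actions = ["click(Next)", "click(Next)", "click(Next)"]
ledger = Ledger(
    is_in_loop=looks_stuck(actions),
    last_action_successful=False,
    next_steps="Use the search box instead of the Next button.",
)
print(ledger.is_in_loop)  # True
```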
Both the Orchestrator and WebSurfer agents can decide to stop at any point, which can create logic conflicts. If the task is not yet complete, the Orchestrator can override the WebSurfer’s stop decision.
Table 3 from the paper describes the decision hierarchy. Critical Points have the strongest authority and override all other flags, while WebSurfer stop decisions have the weakest authority. When the system forces the WebSurfer to stop, other actions are disabled instead of stopping the agent programmatically. This allows the WebSurfer to reason about why it was forced to stop, which helps Fara-7B generalize to new Critical Point scenarios.
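The priority order described here can be sketched as a small resolver. This is a simplification under our own assumptions, not a transcription of Table 3: the function name, its three inputs, and the returned labels are hypothetical, and it does not model the detail that a forced stop works by disabling the WebSurfer's other actions.

```python
from typing import Optional


def resolve_next_move(
    at_critical_point: bool,
    orchestrator_task_complete: Optional[bool],
    websurfer_wants_stop: bool,
) -> str:
    """Return 'pause_for_user', 'stop', or 'continue' using a fixed priority order."""
    # 1. Critical Points override every other flag.
    if at_critical_point:
        return "pause_for_user"
    # 2. The Orchestrator's verdict outranks the WebSurfer's.
    if orchestrator_task_complete is not None:
        return "stop" if orchestrator_task_complete else "continue"
    # 3. The WebSurfer's own decision applies only when nothing overrides it.
    return "stop" if websurfer_wants_stop else "continue"


# The WebSurfer wants to stop, but the Orchestrator judges the task unfinished.
print(resolve_next_move(False, False, True))  # continue
```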
After a task is completed, the Orchestrator identifies the URLs of task targets from the history so that verifiers can confirm whether the correct targets were reached.
WebSurfer Agent
The WebSurfer receives its instructions from the next_steps field in the ledger maintained by the Orchestrator. It performs actions such as clicking, typing, and scrolling in the browser through Playwright. The researchers used a managed browser environment for stable execution, so that the WebSurfer’s actions on dynamic websites were less likely to fail because of crashes or timeouts.
UserSimulator
The UserSimulator agent is activated when the pipeline reaches a Critical Point and requires user input. It simulates human responses, such as giving consent or providing personal details, so the data generation process can continue.
Trajectory Verification
Besides the task completion flags used during the task-solving phase, FaraGen also applies several verifiers, which act as LLM judges, to check quality and correctness. Multiple verifiers are needed because different task types require different forms of evidence. Action-oriented tasks benefit from multimodal evidence checks, while information-seeking tasks require rubric-based scoring to assess quality.
| Verifier | Function | Target Failure Mode (Example) |
|---|---|---|
| Alignment verifier | Checks whether the final action history matches the user’s intent. | Logic errors, such as purchasing the wrong item. |
| Rubric verifier | Scores the trajectory against a predefined checklist of criteria. | Partial failures, such as finding a hotel but using the wrong dates. |
| Multimodal verifier | Reviews the final screenshot to confirm visible evidence of success. | Hallucinations, such as claiming the task is complete while the screen shows an error. |
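One way the three verdicts could be combined into a keep-or-discard decision is sketched below. The combination policy is an assumption on our part: requiring all three verifiers to pass, the Verdicts record, and the 0.8 rubric threshold are hypothetical choices for illustration, and each verdict would come from an LLM judge in practice.

```python
from dataclasses import dataclass


@dataclass
class Verdicts:
    alignment_ok: bool   # action history matches the user's intent
    rubric_hits: int     # rubric criteria satisfied
    rubric_total: int    # rubric criteria checked
    multimodal_ok: bool  # screenshot evidence supports success


def accept_trajectory(v, rubric_threshold=0.8):
    """Keep a trajectory only if every verifier passes (an assumed policy)."""
    if v.rubric_total <= 0:
        raise ValueError("rubric_total must be positive")
    rubric_score = v.rubric_hits / v.rubric_total
    return v.alignment_ok and v.multimodal_ok and rubric_score >= rubric_threshold


# A hotel booking that used the wrong dates satisfies only part of the rubric.
print(accept_trajectory(Verdicts(True, 3, 5, True)))  # False
```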
To demonstrate how effective FaraGen is, the researchers used its generated data to train Fara-7B. Fara-7B is a CUA model that understands computer interfaces only through screenshots, performs actions through predicted coordinates, and is compact enough to run on local devices.
Fara-7B
Fara-7B can be understood as a proof-of-concept distillation of the multi-agent solving system. It is based on the Qwen2.5-VL-7B vision-language model and was trained on 145,000 high-quality trajectories generated by the FaraGen pipeline. These trajectories distill multi-agent interactions into diverse task demonstrations.
The model learns from these trajectories through supervised fine-tuning. The training mix also includes auxiliary tasks such as grounding, refusal training, and UI question-answering to improve element localization, prevent harmful actions, and reduce hallucinations.
| Task Type | Purpose | Method | Impact on Fara-7B |
|---|---|---|---|
| Grounding | Improves localization of UI elements such as buttons and links in screenshots. | More than 500,000 samples were generated to map natural language queries to screen coordinates. Omniparser and DOM annotations were used to label elements. | Improves precision for clicking and typing actions. Reduces hallucinations of interactive elements. |
| Refusal Training | Teaches the model to reject harmful or unsafe tasks. | Synthetic harmful tasks, such as illegal activities and phishing, were used together with public datasets such as WildGuard. | Achieves a 94.2% refusal rate on harmful tasks. Improves safety and compliance. |
| UI Q&A and Captioning | Strengthens understanding of webpage content and context. | Question-answer pairs and captions were generated from webpage screenshots, with a focus on extracting factual information. | Reduces hallucinations in responses. Improves accuracy when answering user questions about web content. |
Fara-7B interprets browser interactions through screenshots, while its internal reasoning and state history are stored as text. Using the latest screenshots together with a full record of previous actions, Fara-7B determines the next action and the required arguments, such as click-location coordinates.
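The split between image and text context can be illustrated with a short sketch. Everything here is hypothetical (the PastStep record, the build_context helper, and the choice of keeping three screenshots); it only shows the idea of carrying the full action history as text while passing just the newest screenshots as images.

```python
from dataclasses import dataclass


@dataclass
class PastStep:
    thought: str     # reasoning, kept as text
    action: str      # e.g. "click" or "type"
    arguments: dict  # e.g. {"x": 412, "y": 230}


def build_context(task, history, screenshots, keep_screenshots=3):
    """Return (text_history, recent_screenshots) for the next prediction."""
    lines = [f"Task: {task}"]
    for i, step in enumerate(history, start=1):
        lines.append(f"{i}. {step.thought} -> {step.action}({step.arguments})")
    # Keep every past step as text, but only the newest screenshots as images.
    recent = screenshots[-keep_screenshots:] if keep_screenshots > 0 else []
    return "\n".join(lines), recent


history = [
    PastStep("Open the search box", "click", {"x": 412, "y": 230}),
    PastStep("Enter the query", "type", {"text": "wireless mouse"}),
]
text, images = build_context("Find a wireless mouse", history, [b"s1", b"s2", b"s3", b"s4"])
print(len(images))  # 3
```

Keeping older observations as text rather than images bounds the number of screenshots in the prompt while preserving the full record of what has been tried.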
With only 7 billion parameters, Fara-7B achieves state-of-the-art performance for its size. It outperforms comparable models such as UI-TARS-1.5-7B on benchmarks including WebVoyager, where it reaches 73.5% accuracy, and WebTailBench, where it reaches 38.4%. It also remains competitive with much larger models such as GPT-4o.
Evaluating Fara-7B
The researchers evaluated Fara-7B on WebVoyager, Online-Mind2Web, and DeepShop. They also created their own benchmark called WebTailBench.
WebTailBench
WebTailBench includes 609 hand-verified tasks across 11 categories. These categories include shopping, flights, hotels, real estate, job applications, and multi-item shopping lists. The benchmark emphasizes realism by using high-traffic webpages and improves task diversity by covering underrepresented or missing scenarios in existing benchmarks, such as comparison shopping.
WebTailBench supports objective evaluation through goal-oriented tasks and a verification system aligned with human judgment. It also addresses task complexity through multi-step and cross-site challenges. The benchmark is designed for reproducible evaluations and is released together with its verification tools.
Running Fara-7B to Automate Computer Tasks
Training this 7-billion-parameter model required 64 H100 GPUs and 2.5 days. However, this article does not explain how to train Fara-7B from scratch. Instead, it shows how to run the model with Magentic-UI, where a single H100 GPU server is sufficient.
Start by setting up a GPU-enabled server.
After the server is ready, copy its public IPv4 address and connect to it through SSH from your preferred code editor.
In the terminal, run the following commands:
python3 -m venv .venv
source .venv/bin/activate
pip install "magentic-ui[fara]"
vllm serve "microsoft/Fara-7B" --port 5001 --dtype auto
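Before moving on, it can help to confirm that the model server is answering. The sketch below queries the OpenAI-compatible model listing that vLLM exposes under /v1/models, using only the Python standard library. The helper names are our own, and the base URL assumes the --port 5001 value from the command above.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:5001/v1"  # assumes --port 5001 from the vllm command


def models_url(base_url=BASE_URL):
    """Build the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload):
    """Extract model IDs from an OpenAI-style model-listing response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url=BASE_URL, timeout=5.0):
    """Ask the running server which models it serves (needs the server to be up)."""
    with urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Once vLLM has finished loading the weights, calling list_served_models() from a second terminal should return a list that includes "microsoft/Fara-7B". A connection error usually means the server is still starting or is listening on a different port.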
In your code editor, create a file named fara_config.yaml and paste the following configuration:
model_config_local_surfer: &client_surfer
  provider: OpenAIChatCompletionClient
  config:
    model: "microsoft/Fara-7B"
    base_url: http://localhost:5001/v1
    api_key: not-needed
    model_info:
      vision: true
      function_calling: true
      json_output: false
      family: "unknown"
      structured_output: false
      multiple_system_messages: false
orchestrator_client: *client_surfer
coder_client: *client_surfer
web_surfer_client: *client_surfer
file_surfer_client: *client_surfer
action_guard_client: *client_surfer
model_client: *client_surfer
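Then, in a second terminal with the virtual environment activated (vllm serve keeps running in the first), start Magentic-UI with the Fara agent: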
magentic-ui --fara --port 8081 --config fara_config.yaml
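Once Magentic-UI is running, open its web interface on port 8081 and give the agent a task. In our test, Fara-7B produced the exact correct answer.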
Final Thoughts
Fara-7B shows that high-quality synthetic data at scale can address the data scarcity problem that has slowed the development of Computer Use Agents. This 7-billion-parameter model perceives web pages only through screenshots and completes complex, multi-step web tasks with state-of-the-art accuracy for its size.
It will be interesting to see how progress in related areas of AI research, including computer use, code generation, inference optimization, and other active fields, influences adoption across scalable and high-impact products and use cases.


