Building More Reliable AI Agents with LangSmith
Introduction
Building autonomous agents with large language models can feel unpredictable and hard to control. Prompts may return different results across runs, tools can fail silently, and models may hallucinate, repeat themselves, or get stuck in loops. LangSmith is a platform designed to instrument, debug, evaluate, and monitor LLM applications so these problems become easier to identify and resolve. It gives developers end-to-end visibility into how an agent behaves.
In this tutorial, you will follow a typical reliability workflow for agents: adding tracing, debugging runs both automatically and manually, evaluating performance, improving prompts, and monitoring the agent after deployment. By the end, you will understand what LangSmith is, how it supports agent development, and how it can help you build more dependable AI agents.
Key Takeaways
Observability Is the Foundation of Reliable Agents
LangSmith tracing makes every LLM call, tool execution, and intermediate reasoning step visible. This helps you understand why an agent made a specific decision instead of trying to reconstruct the process from basic logs.
Traces Make Debugging Analytical
Structured traces allow you to inspect failures in detail and identify whether the issue came from a prompt, a tool, retrieval, or orchestration logic. This makes it possible to fix the exact point where the process broke down.
Evaluation Supports Safer Iteration
LangSmith provides dataset-based evaluation workflows for both offline testing and online monitoring. These workflows help you measure quality, compare prompt or model versions, and detect regressions before changes are released to production.
Prompt Engineering Becomes Systematic
The Prompt Playground and prompt versioning features make prompts trackable assets. Teams can collaborate on prompt improvements, reproduce previous behavior, and control how changes are promoted.
Human Feedback Completes the Process
Annotation queues, including single-run and pairwise reviews, help teams use human judgment at scale. This is especially useful when automated metrics do not fully capture quality.
What Is LangSmith and When Is It Useful?
LangSmith is an end-to-end toolkit for building, debugging, testing, evaluating, and deploying LLM-powered applications. With LangSmith, you can:
- Trace each request and capture every step of an agent’s reasoning.
- Evaluate outputs to check the quality of generated responses.
- Iterate on prompts with version control.
- Manage agent deployments.
Unlike some observability tools that are closely tied to a specific LLM application framework, LangSmith is framework-agnostic. It can be used with LangChain, LangGraph, or custom code. This means your agent does not need to be built on LangChain to benefit from LangSmith’s tracing and evaluation features.
LangSmith is useful whenever prompts interact with external tools or APIs, when agents perform multi-step reasoning, or when reproducibility and debugging matter. If simple log statements are no longer enough and you need to understand why an agent behaved in a certain way, LangSmith is worth considering.
The Agent Debugging Problem LangSmith Solves
Developing autonomous agents is challenging because failures are often subtle and difficult to interpret. A small prompt change or a non-deterministic tool response can significantly change the reasoning path. Standard logs and metrics are usually not enough to explain an agent’s actions. You may see an incorrect answer or a timeout, but the root cause may not be obvious.
For example, the issue could be that the agent selected the wrong tool, lacked important context in its prompt, or introduced an error during an LLM call that only became visible later. LangSmith traces are designed to answer these kinds of questions.
Without detailed traces, debugging agents can become guesswork. Agents may fail silently by hallucinating or getting stuck in loops, leaving you to reproduce the problem manually. LangSmith addresses this by recording every step of the agent’s path, including LLM calls, tool invocations, intermediate prompts, and related details, in a structured timeline called a trace.
For example, imagine an agent calls a search tool to retrieve information. If the final answer is wrong, the cause might be one of several things: the search tool may have returned incorrect or empty results, the LLM may have ignored the retrieved information and hallucinated, or the agent logic may have failed to call the tool when it should have.
A LangSmith trace would show each step: the search query, the result returned by the search API, the prompt created from that result, the LLM response, and the following steps. This allows you to quickly identify where the chain stopped following the expected path.
Quickstart: Trace an Agent Run with Python
This section provides a short example of how to use LangSmith to trace an agent run. The example uses Python. LangSmith also provides JavaScript and TypeScript SDKs for similar tracing workflows. In this example, you will instrument a simple LLM call and view the resulting trace in the LangSmith UI.
Prerequisites
Create a free LangSmith account if you do not already have one, and copy your API key. You will also need an API key from your LLM provider, such as OpenAI. Install the required packages:
pip install -U langsmith openai # for this example
1. Enable LangSmith Tracing
LangSmith works with LangChain out of the box. If you already use LangChain, you can enable tracing by setting an environment variable. In your environment or .env file, define the following values:
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
These variables make sure LangSmith tracing is enabled and authenticated. If you use more than one LangSmith workspace, also set LANGSMITH_WORKSPACE_ID accordingly. You can also define these variables in Python using os.environ before initializing any LLM.
Important: The LANGSMITH_TRACING=true flag must be set before your code performs LLM calls, so the SDK can capture them.
2. Wrap or Use an LLM Client
If you use LangChain’s LLM classes, you do not need special code to record traces. You only need to ensure that the environment variable is active. For example, with LangChain’s OpenAI wrapper:
import os
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-api-key>"

# Requires the separate integration package: pip install -U langchain-openai
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")  # LangChain's chat model wrapper for OpenAI
response = llm.invoke("Hello, world!")  # returns a message object
print(response.content)
When this runs, LangChain detects that tracing is enabled and logs the call to LangSmith in the background. After executing the code, open the LangSmith web UI and view the trace in the default project’s trace list.
If you are not using LangChain, you can manually instrument LLM calls with the LangSmith SDK. For example, to trace OpenAI API calls directly, LangSmith provides a wrap_openai helper:
from openai import OpenAI
from langsmith import wrappers

# OpenAI() reads OPENAI_API_KEY from the environment by default
openai_client = wrappers.wrap_openai(OpenAI())  # wrap the OpenAI client for tracing

# Now any call through openai_client will be traced:
result = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello, world?"}],
)
print(result.choices[0].message.content)
In this example, langsmith.wrappers.wrap_openai creates a traced client. LangSmith offers similar wrappers for other providers, including Anthropic. After running the code, the complete trace appears in LangSmith, including the model request and response.
3. View the Trace
Open the LangSmith web app and go to Observability / Traces. You should see a new entry for the run. Click it to open a structured view. Because this example is very simple, the trace may contain only one LLM call node with the prompt and output.
For a more advanced chain or agent, the trace would appear as a tree containing every sub-step, such as the agent’s reasoning flow and each tool call. You can inspect timings, inputs, outputs, annotations, and shareable trace details. This visibility helps you confirm that the agent worked as expected. If something failed, the trace shows where and why it happened.
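To make the tree view more concrete, here is a minimal sketch of how nested steps in custom code might be instrumented with the SDK's traceable decorator. The functions search_tool, build_prompt, and answer_question, along with their stubbed return values, are hypothetical stand-ins for a real tool and a real LLM call. The try/except fallback is only there so the sketch runs where the SDK is not installed.

```python
try:
    from langsmith import traceable
except ImportError:
    # Fallback so this sketch still runs without the SDK installed.
    def traceable(*_args, **_kwargs):
        def decorator(fn):
            return fn
        return decorator


@traceable(run_type="tool", name="search_tool")
def search_tool(query: str) -> list[str]:
    # Hypothetical stub: a real agent would call a search API here.
    return [f"Stub result for: {query}"]


@traceable(run_type="chain", name="build_prompt")
def build_prompt(question: str, results: list[str]) -> str:
    context = "\n".join(results)
    return f"Context:\n{context}\n\nQuestion: {question}"


@traceable(run_type="chain", name="answer_question")
def answer_question(question: str) -> dict:
    # Each decorated call below is recorded as a child step of this function.
    results = search_tool(question)
    prompt = build_prompt(question, results)
    # Hypothetical stub in place of a real LLM call.
    return {"answer": f"(model output for a {len(prompt)}-character prompt)"}


if __name__ == "__main__":
    print(answer_question("What is LangSmith?"))
```

With tracing enabled, a call to answer_question should appear as one parent run with the search and prompt-building steps nested beneath it.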
Tip: To control costs in production, you may not want to trace every run. LangSmith supports sampling traces and enabling or disabling tracing in code. For example, you can trace a fraction of requests by setting the LANGSMITH_TRACING_SAMPLING_RATE environment variable to a value between 0 and 1, or trace only specific sessions by toggling tracing in code. During development and debugging, enabling tracing globally is usually the simplest option.
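One way to trace only a share of sessions is to make the decision deterministic per session, so that every request in a sampled session is captured together. The sketch below is plain Python; should_trace and its parameters are hypothetical names and not part of the LangSmith SDK. The resulting boolean could gate whichever tracing switch your application uses.

```python
import hashlib


def should_trace(session_id: str, sample_rate: float) -> bool:
    """Return True for a stable fraction of sessions (hypothetical helper)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # Hash the session ID into a number in [0, 1) so the decision is repeatable.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


if __name__ == "__main__":
    sessions = [f"session-{i}" for i in range(1000)]
    traced = sum(should_trace(s, 0.1) for s in sessions)
    print(f"Traced {traced} of {len(sessions)} sessions")
```

Hashing rather than calling a random number generator keeps the choice consistent across processes and restarts, which makes a sampled session easier to follow end to end.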
Reading Traces
After traces are available, the next step is diagnosing issues. A LangSmith trace works like an experiment log. It helps categorize failures and identify recurring patterns. The following table shows common failure types and how traces help resolve them.
| Failure Type | What It Looks Like | What the Trace Helps You Do or Fix |
|---|---|---|
| Prompt Failures | The final LLM output is incorrect or malformed, even though the tool outputs are correct and sufficient. | Inspect the exact prompt and raw model response to improve instructions, add examples, strengthen formatting rules, add constraints, or adjust the model and settings. |
| Tool Failures | A tool call returns an error, times out, or produces empty or invalid output. | Find the failing tool step and add retries, timeouts, fallbacks, validation, and guardrails. |
| Retrieval or Knowledge Failures | The retrieval step in a RAG workflow returns irrelevant documents or misses key facts, leading to an incorrect answer. | Review the retrieval query and retrieved documents to improve the vector store, chunking, indexing, reranking, or retrieval prompt strategy. |
| Orchestration Failures | The agent or chain logic behaves incorrectly, loops, chooses the wrong branch, orders steps poorly, takes too many steps, or runs very slowly. | Use the run tree and metadata to find loops and bottlenecks, then improve stop conditions, branching logic, tool selection rules, and step structure. |
Think of a trace as a research record. Categorize each issue, such as prompt, tool, retrieval, or orchestration. Then measure the impact, for example whether the issue caused a hard error or only reduced answer quality. LangSmith makes inputs, prompts, tool order, memory or state changes, latency, cost, and error paths explicit and auditable.
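The triage described above can be sketched as a small function that walks a flattened trace and returns the first matching category. The step fields used here (type, output, error, malformed) and the MAX_STEPS budget are assumptions made for illustration; they are not the LangSmith run schema.

```python
from typing import Optional

MAX_STEPS = 10  # Hypothetical budget; more steps than this suggests a loop.


def categorize_failure(steps: list[dict]) -> Optional[str]:
    """Return a rough failure category for a flattened trace, or None."""
    if len(steps) > MAX_STEPS:
        return "orchestration"
    for step in steps:
        kind = step.get("type")
        output = step.get("output")
        # A tool that errored or returned nothing is a tool failure.
        if kind == "tool" and (step.get("error") or not output):
            return "tool"
        # A retriever that returned no documents is a retrieval failure.
        if kind == "retriever" and not output:
            return "retrieval"
    # Tools and retrieval looked healthy, so a malformed final LLM output
    # points at the prompt.
    if steps and steps[-1].get("type") == "llm" and steps[-1].get("malformed"):
        return "prompt"
    return None


if __name__ == "__main__":
    trace = [
        {"type": "tool", "output": "", "error": "timeout"},
        {"type": "llm", "output": "I could not find anything."},
    ]
    print(categorize_failure(trace))
```

A rule-based pass like this will only catch the obvious cases, but it can help sort a backlog of failed runs before a closer manual review.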
Evaluation Workflow: Datasets, Evaluators, and Experiments
Tracing explains what happened during one run. Evaluation shows how good the outcomes are across many runs or when comparing different versions of an agent. LangSmith includes a complete evaluation workflow for offline testing and online quality monitoring.
Offline vs Online Evaluation
LangSmith supports both offline and online evaluation. Offline evaluation happens before releasing changes. You prepare a dataset of example inputs, optionally with expected outputs, and run the agent against all examples to measure performance. This is useful for catching regressions and comparing prompt or model versions in a controlled way.
Online evaluation runs in production. It continuously evaluates real user interactions as they occur, helping you monitor quality on live data. Both approaches are useful: offline evaluation acts like unit or integration testing for agents, while online evaluation works more like live monitoring or canary testing.
1. Create a Dataset
A dataset is a collection of test cases for your agent. These examples can come from previous failures, synthetic cases that are expected to be challenging, or user questions that must be answered reliably. In LangSmith, a dataset is a set of input-output pairs, or inputs only if there is no ground-truth output.
You can create and manage datasets through the UI or SDK. For example, in Python:
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment
dataset = client.create_dataset("MathQA set", description="Simple Q&A tests")
# `examples=` assumes a recent SDK release; older releases take parallel `inputs=` and `outputs=` lists
client.create_examples(
    dataset_id=dataset.id,
    examples=[
        {"inputs": {"question": "What is 2+2?"}, "outputs": {"answer": "4"}},
        {"inputs": {"question": "What is the capital of France?"}, "outputs": {"answer": "Paris"}},
    ],
)
This creates a dataset with two Q&A examples. You can also upload datasets from CSV or JSON files or create them directly in the UI.
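When the test cases start life in a spreadsheet, it can help to convert them into the input-output shape shown above before uploading. The following is a minimal sketch using only the standard library. The column names question and answer and the helper rows_to_examples are assumptions for illustration, not something LangSmith requires.

```python
import csv
import io


def rows_to_examples(csv_text: str) -> list[dict]:
    """Convert CSV rows with 'question' and optional 'answer' columns into examples."""
    examples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        example = {"inputs": {"question": row["question"].strip()}}
        answer = (row.get("answer") or "").strip()
        if answer:
            # Only attach a reference output when ground truth exists.
            example["outputs"] = {"answer": answer}
        examples.append(example)
    return examples


if __name__ == "__main__":
    sample = "question,answer\nWhat is 2+2?,4\nSummarize this ticket,\n"
    for example in rows_to_examples(sample):
        print(example)
```

Rows without an answer become inputs-only examples, which matches the case where no ground-truth output is available.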
2. Define Evaluators
Evaluators are functions or models that score your agent’s output for each example. LangSmith includes built-in evaluators and supports custom ones. Common evaluator types include:
- Correctness or truthfulness: Checks whether the output matches a known correct reference answer. This requires reference outputs in the dataset.
- Rubric scoring: Uses a custom Python function or heuristic, such as checking whether an answer includes a citation.
- LLM-as-judge: Uses another LLM to assess output quality. For example, a model can compare the agent output with the reference answer and assign a score.
- Pairwise comparison: Compares two outputs, such as model A versus model B, and determines which is better.
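The first two evaluator types can be sketched as plain Python functions. The argument names (inputs, outputs, reference_outputs) and the returned dictionary with key and score are an assumed convention here and should be checked against the SDK version in use. The answer field and the bracketed-citation pattern are hypothetical.

```python
import re


def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Correctness check: compare with the reference, ignoring case and whitespace."""
    predicted = str(outputs.get("answer", "")).strip().lower()
    expected = str(reference_outputs.get("answer", "")).strip().lower()
    return {"key": "exact_match", "score": int(predicted == expected)}


def has_citation(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Heuristic check: does the answer contain a bracketed citation such as [1]?"""
    answer = str(outputs.get("answer", ""))
    found = re.search(r"\[\d+\]", answer) is not None
    return {"key": "has_citation", "score": int(found)}


if __name__ == "__main__":
    print(exact_match({}, {"answer": " paris "}, {"answer": "Paris"}))
    print(has_citation({}, {"answer": "Paris is the capital [1]."}, {}))
```

Cheap deterministic checks like these are often worth running alongside an LLM judge, since they are fast, free, and easy to reason about when a score changes.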
You can attach one or more evaluators to an experiment. LangSmith uses the OpenEvals library for open-source evaluators. For example, to use an LLM-as-judge evaluator for correctness, you can import a prompt such as CORRECTNESS_PROMPT and create a judge:
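from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT
# The judge needs a model, given as "provider:model"; substitute one you have access to
judge = create_llm_as_judge(prompt=CORRECTNESS_PROMPT, model="openai:o3-mini", feedback_key="correctness")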
This judge can score each output by asking an LLM whether the answer is correct based on the reference and returning a score.
3. Run an Experiment
The next step is to run your agent on the dataset and collect results. In the LangSmith UI, you can create a new experiment, choose the dataset, select the target function such as your agent or chain, and attach evaluators.
If you use code, you can run something similar to the following:
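from langsmith import Client

client = Client()
# Assuming `target_fn` is your function that takes an input dict and returns an output dict
# Thin wrapper so the judge receives its arguments by name
def correctness(inputs: dict, outputs: dict, reference_outputs: dict):
    return judge(inputs=inputs, outputs=outputs, reference_outputs=reference_outputs)

experiment = client.evaluate(
    target_fn,
    data="MathQA set",
    evaluators=[correctness],  # the evaluator we defined
)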
This runs target_fn for every dataset example, logs the outputs, and has the evaluators score each result. LangSmith records the results as an experiment that you can inspect.
In the UI, an experiment shows a live-updating result table. You can view each example, the model output, and evaluator scores. This helps identify patterns, such as an agent failing on all math questions involving division.
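Conceptually, an experiment run amounts to a loop over the dataset. The sketch below mimics that loop locally without calling LangSmith, which can be handy for a quick sanity check of a target function and evaluator before a real run. The names run_local_experiment, toy_agent, and correct are hypothetical, and the evaluators here return plain numbers for simplicity.

```python
from statistics import mean
from typing import Callable


def run_local_experiment(
    target_fn: Callable[[dict], dict],
    examples: list[dict],
    evaluators: list[Callable[[dict, dict, dict], float]],
) -> dict:
    """Run target_fn on every example and average each evaluator's scores."""
    rows = []
    for example in examples:
        outputs = target_fn(example["inputs"])
        reference = example.get("outputs", {})
        scores = {ev.__name__: ev(example["inputs"], outputs, reference) for ev in evaluators}
        rows.append({"inputs": example["inputs"], "outputs": outputs, "scores": scores})
    summary = {}
    for ev in evaluators:
        # Average per evaluator; None when the dataset is empty.
        summary[ev.__name__] = mean(r["scores"][ev.__name__] for r in rows) if rows else None
    return {"rows": rows, "summary": summary}


def toy_agent(inputs: dict) -> dict:
    # Hypothetical target: knows one answer and returns "unknown" otherwise.
    known = {"What is 2+2?": "4"}
    return {"answer": known.get(inputs["question"], "unknown")}


def correct(inputs: dict, outputs: dict, reference: dict) -> float:
    return float(outputs.get("answer") == reference.get("answer"))


if __name__ == "__main__":
    examples = [
        {"inputs": {"question": "What is 2+2?"}, "outputs": {"answer": "4"}},
        {"inputs": {"question": "What is the capital of France?"}, "outputs": {"answer": "Paris"}},
    ]
    result = run_local_experiment(toy_agent, examples, [correct])
    print(result["summary"])
```

The per-row scores are what make failure patterns visible; the summary alone would hide which examples the agent got wrong.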
4. Analyze Results
After the experiment finishes, you can compare and analyze the runs in LangSmith. You can sort and filter by score, show only examples where correctness was low, or compare two experiments such as Agent v1 against Agent v2 on the same dataset.
This is helpful for benchmarking prompt versions and models. It also supports regression testing, because you can maintain a golden dataset and verify that future versions do not reduce scores on those examples.
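A regression check on a golden dataset can be as simple as comparing per-example scores from two experiments. The sketch below assumes the scores have already been exported into dictionaries keyed by example ID; find_regressions and its tolerance parameter are hypothetical names.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return example IDs whose score dropped by more than `tolerance`."""
    regressions = []
    for example_id, old_score in baseline.items():
        new_score = candidate.get(example_id)
        # A missing score counts as a regression so gaps are not silently ignored.
        if new_score is None or old_score - new_score > tolerance:
            regressions.append(example_id)
    return sorted(regressions)


if __name__ == "__main__":
    v1 = {"q1": 1.0, "q2": 1.0, "q3": 0.5}
    v2 = {"q1": 1.0, "q2": 0.0}
    print(find_regressions(v1, v2))
```

A check like this could run in CI and block a release when the returned list is not empty, with the tolerance set to absorb judge noise.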
5. Iterate Improvements
If evaluation results are not good enough, refine the agent by improving prompts, adding examples, changing how tools are used, or fixing orchestration logic. Then run another experiment.
A particularly useful tool for this loop is the Prompt Playground in LangSmith. This no-code interface lets you quickly evaluate different prompt variants. You can load a dataset, test different prompt templates or model parameters, and immediately view evaluation scores.
Evaluation does not always need to be fully automated. Automated metrics and LLM judges may miss important quality aspects. In those cases, human review is useful. LangSmith supports this through annotation queues, described in the next section.
Human-in-the-Loop: Annotation Queues for Quality at Scale
Automated evaluators are useful, but human evaluation is often needed for the most accurate quality assessment. LangSmith annotation queues provide a scalable way to review agent outputs. Model outputs can be placed into a queue, where human reviewers such as developers, colleagues, or professional labelers score them according to a predefined rubric.
LangSmith supports two annotation queue styles:
Single-Run Annotation Queues
Single-run queues show one output at a time to the reviewer, along with a rubric or questions. The question might be as simple as “Is this answer factually correct?” or “Rate the answer’s clarity from 1 to 5.” The reviewer sees the relevant context, such as the prompt, output, and possibly a reference answer, then submits feedback for that run.
Pairwise Annotation Queues
Pairwise annotation queues show two outputs side by side and ask the reviewer to choose which one is better or whether they are equal. This is useful for A/B testing, such as comparing Agent Version A and Version B on the same question. Pairwise comparisons are also helpful when comparing a new model or prompt against an older version, which is common in fine-tuning and reinforcement learning from human feedback workflows.
When to Use Each Queue Type
Single-run queues are useful when you need absolute ratings or categorical labels for outputs. They are helpful for building a dataset of human scores to train an evaluator or for monitoring quality.
Pairwise queues are useful when relative quality matters more. For example, you may want to confirm that a new version is better than the previous one and understand how strongly people prefer it. Many teams use pairwise review for preference modeling or reinforcement learning from human feedback because humans often find it easier to choose between two outputs than to assign an absolute score.
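To see how strongly reviewers prefer one version, pairwise votes can be reduced to a win rate. The sketch below assumes each vote has been exported as the label "A", "B", or "tie"; those labels and the helper summarize_preferences are hypothetical and not a LangSmith export format.

```python
from collections import Counter
from typing import Optional


def summarize_preferences(votes: list[str]) -> dict:
    """Summarize pairwise votes, where each vote is 'A', 'B', or 'tie'."""
    counts = Counter(votes)
    unknown = set(counts) - {"A", "B", "tie"}
    if unknown:
        raise ValueError(f"Unexpected vote labels: {sorted(unknown)}")
    decided = counts["A"] + counts["B"]
    # Win rate ignores ties; it is None when no reviewer expressed a preference.
    win_rate_b: Optional[float] = counts["B"] / decided if decided else None
    return {"A": counts["A"], "B": counts["B"], "tie": counts["tie"], "win_rate_b": win_rate_b}


if __name__ == "__main__":
    print(summarize_preferences(["A", "B", "B", "tie", "B"]))
```

Reporting ties separately matters: a high win rate built on a handful of decided votes is much weaker evidence than the same rate across most of the queue.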
LangSmith vs Alternatives
The LLM observability and feedback ecosystem is developing quickly. Several tools offer features that overlap with LangSmith. The following comparison summarizes LangSmith and several common alternatives.
| Tool or Option | What It Is | Best Fit and Trade-Offs |
|---|---|---|
| Langfuse | An open-source platform for LLM tracing, analytics, evaluations, and annotation. It supports self-hosting through a community edition and offers broad integrations beyond LangChain, including multiple SDKs and frameworks such as the OpenAI SDK. | Choose it if full data control, self-hosting, or a custom stack is important. The trade-off is that it may require more operational effort than a hosted service. LangSmith may feel more plug-and-play for LangChain and LangGraph and can be stronger for prompt and version workflows. |
| Helicone | An open-source observability tool that works as an LLM API proxy. Calls are routed through Helicone, often by changing the API base URL, so requests and responses can be logged. It focuses on cost tracking, latency metrics, dashboards, caching, and session tracing for multi-step flows. | Choose it when fast setup for logging and spend monitoring is the priority. The trade-off is that it is usually less deep for hierarchical agent traces and is generally not evaluation-first like LangSmith or Langfuse. It is often used together with other tools, with Helicone handling cost monitoring and another platform handling debugging and evaluation. |
| OpenTelemetry and OpenLLMetry | OpenTelemetry is an open standard for traces, logs, and metrics. GenAI semantic conventions aim to standardize how LLM operations and agent steps are represented. OpenLLMetry provides SDKs to instrument LLM applications and export data to OpenTelemetry backends such as Datadog or Jaeger. | Choose it if vendor-neutral portability is important, if an observability stack already exists, or if LLM telemetry must be integrated into existing APM systems. The trade-off is more setup, including instrumentation and backend configuration, and more generic interfaces unless custom views are built for prompts, tools, and agents. |
| Others such as Phoenix, Arize, HoneyHive, and similar tools | Specialized tools often focused on one area, such as dataset-based evaluation, analytics, or human feedback workflows. | Choose them when a specific capability is the main priority, such as evaluation analytics or human feedback. The trade-off is that multiple tools may be needed to match LangSmith’s combined workflow for tracing, evaluation, prompt operations, and monitoring. |
FAQ
Do I Need LangChain to Use LangSmith?
No. LangSmith is framework-agnostic. You can use it with LangChain, LangGraph, or without a framework. If you already use LangChain, LangSmith integrates with minimal setup by using environment variables that automatically log chains and agents. If you do not use a framework, you can manually trace LLM calls or agent logic with the LangSmith SDK.
Can I Evaluate Prompts Without Writing Code?
Yes. LangSmith provides a Prompt Playground and evaluation UI that allow many workflows without coding. For example, you can create a dataset and run an experiment in the web interface by adding examples, selecting an evaluator, and starting the run.
What Is the Difference Between Single-Run and Pairwise Annotations in LangSmith?
In single-run annotation, a human reviewer sees one output at a time and provides feedback based on a rubric, such as rating correctness or another quality. In pairwise annotation, the reviewer sees two outputs side by side, usually from two different model versions for the same input, and chooses which is better or whether they are equal. Single-run annotation evaluates an output by itself, while pairwise annotation compares outputs.
What Must Be Set So LLM Calls Are Traced by LangSmith?
The most important requirement is enabling tracing through environment variables or configuration. The simplest approach is setting LANGSMITH_TRACING=true before running the application. You should also set LANGSMITH_API_KEY so data is connected to your account, along with any model API keys such as OPENAI_API_KEY.
With LANGSMITH_TRACING=true, supported integrations such as LangChain begin logging automatically. If environment variables are not possible, tracing can also be switched on in code. For example, the Python SDK provides a tracing_context context manager:
from langsmith import tracing_context

with tracing_context(enabled=True):
    run_agent()  # placeholder for your own chain or agent call
Here run_agent stands in for your own code, and the exact options may vary between SDK versions. The key idea is that tracing can be enabled in code. After running the chain or agent, traces should appear in the LangSmith UI. If they do not, check that the environment variables are defined in the same runtime context as the application and that the correct workspace is being used.
Is LangSmith Only for Tracing, or Does It Also Handle Deployments?
LangSmith was designed around observability and evaluation, but it also includes a deployment module. Agents can be deployed as managed endpoints through LangSmith, referred to as Agent Servers, with scaling, uptime, and monitoring included.
Conclusion
LangSmith helps turn agent development from trial and error into a structured engineering workflow. With tracing, evaluation, prompt versioning, and human review, teams can diagnose failures, measure improvements, and prevent regressions from reaching users.
The core habit is the agent reliability loop: trace every run, turn real failures into datasets, run repeatable experiments with evaluators and human review when needed, and promote only the best prompt or model changes through versioned releases.
Tracking and monitoring also help detect drift, tool failures, latency changes, and unexpected cost increases early. For developers building tool-using or multi-step agents, LangSmith provides a structured way to improve reproducibility and disciplined iteration from prototype to production.


