Offline Evaluation
Test before you ship. Run evaluations on curated datasets during development to compare versions, benchmark performance, and catch regressions.
Online Evaluation
Monitor in production. Evaluate real user interactions in real time to detect issues and measure quality on live traffic.
Evaluation workflow
- Offline evaluation flow
- Online evaluation flow
Create a dataset
Create a dataset from manually curated test cases, historical production traces, or synthetic data generation.
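Conceptually, a dataset is a collection of examples, each pairing an input with a reference output that evaluators can score against. A minimal sketch in plain Python (the field names here are illustrative, not a required schema):

```python
# Each example pairs the inputs your application receives with the
# reference outputs you expect it to produce.
dataset = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": {"answer": "Paris"},
    },
    {
        "inputs": {"question": "Who wrote Moby-Dick?"},
        "outputs": {"answer": "Herman Melville"},
    },
]
```

In practice you would upload examples like these through the UI or SDK rather than hard-coding them.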
Define evaluators
Create evaluators to score performance:
- Human review
- Code rules
- LLM-as-judge
- Pairwise comparison
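The simplest of these are code-rule evaluators: deterministic functions that compare a run's output against the reference. A minimal sketch (function and field names are illustrative):

```python
def exact_match(run_output: dict, reference_output: dict) -> dict:
    """Code-rule evaluator: score 1 if the answer exactly matches
    the reference, 0 otherwise."""
    score = int(run_output.get("answer") == reference_output.get("answer"))
    return {"key": "exact_match", "score": score}
```

LLM-as-judge and pairwise evaluators follow the same shape, but delegate the scoring decision to a model instead of a rule.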
Run an experiment
Execute your application on the dataset to create an experiment. Configure repetitions, concurrency, and caching to optimize runs.
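The core loop of an experiment can be sketched in plain Python: run the target over every example, optionally several times and in parallel, and score each run with the evaluators. This is an illustrative stand-in for the SDK's experiment runner, not its actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(target, dataset, evaluators, repetitions=1, max_concurrency=4):
    """Run `target` over every example `repetitions` times, scoring
    each run with every evaluator. Returns one list of scores per run."""
    def run_one(example):
        output = target(example["inputs"])
        return [ev(output, example["outputs"]) for ev in evaluators]

    # Repetitions reduce noise from non-deterministic targets.
    jobs = [ex for ex in dataset for _ in range(repetitions)]
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(run_one, jobs))
```

Caching (not shown) would memoize `target` on its inputs so repeated runs over unchanged examples are free.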
Analyze results
Compare experiments for benchmarking, unit tests, regression tests, or backtesting.
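A regression test boils down to comparing aggregate scores between a baseline experiment and a candidate. A minimal sketch, assuming scores are already collected into flat lists (threshold and names are illustrative):

```python
def detect_regression(baseline_scores, candidate_scores, threshold=0.05):
    """Flag a regression if the candidate's mean score drops by more
    than `threshold` relative to the baseline experiment."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    return (baseline - candidate) > threshold
```

Benchmarking and backtesting use the same comparison, just with different choices of baseline (a reference model, or historical production runs).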
Get started
Evaluation quickstart
Get started with offline evaluation.
Manage datasets
Create and manage datasets for evaluation through the UI or SDK.
Run offline evaluations
Explore evaluation types, techniques, and frameworks for comprehensive testing.
Analyze results
View and analyze evaluation results, compare experiments, filter data, and export findings.
Run online evaluations
Monitor production quality in real-time from the Observability tab.
Follow tutorials
Learn by following step-by-step tutorials, from simple chatbots to complex agent evaluations.
To set up a LangSmith instance, visit the platform setup section to choose a cloud, hybrid, or self-hosted option. All options include observability, evaluation, prompt engineering, and deployment.

