Agentic applications let an LLM decide its own next steps to solve a problem. That flexibility is powerful, but the black-box nature of models makes it hard to predict how a change to one part of an agent will affect the rest. Building production-ready agents requires thorough testing.
There are two main approaches to testing agents:
Unit tests: exercise small, deterministic parts of the agent in isolation, using in-memory mocks to verify exact behavior quickly and deterministically.
Integration tests: exercise the agent with real network calls to confirm that components work together, that credentials and schemas line up, and that latency stays within acceptable bounds.
Because agentic applications chain many components together and must cope with the flakiness introduced by LLM non-determinism, integration tests tend to be especially important for them.
Unit tests
Mocking chat models
For logic that doesn't require API calls, you can simulate responses with in-memory stubs.
LangChain provides GenericFakeChatModel for mocking text responses. It accepts an iterator of responses (AIMessage objects or strings) and returns the next one on each call, supporting both regular invocation and streaming.
from langchain_core.language_models.fake_chat_models import GenericFakeChatModel
from langchain_core.messages import AIMessage, ToolCall

model = GenericFakeChatModel(messages=iter([
    AIMessage(content="", tool_calls=[ToolCall(name="foo", args={"bar": "baz"}, id="call_1")]),
    "bar",
]))
model.invoke("hello")
# AIMessage(content='', ..., tool_calls=[{'name': 'foo', 'args': {'bar': 'baz'}, 'id': 'call_1', 'type': 'tool_call'}])
Invoking the model again returns the next item from the iterator:
model.invoke("hello, again!")
# AIMessage(content='bar', ...)
InMemorySaver checkpointer
To enable persistence in tests, use the InMemorySaver checkpointer. It lets you simulate multi-turn conversations so you can test state-dependent behavior:
from langchain.agents import create_agent
from langchain.messages import HumanMessage
from langgraph.checkpoint.memory import InMemorySaver

agent = create_agent(
    model,
    tools=[],
    checkpointer=InMemorySaver(),
)

# First invocation
agent.invoke(
    {"messages": [HumanMessage(content="I live in Sydney, Australia")]},
    config={"configurable": {"thread_id": "session-1"}},
)

# Second invocation: the first message is persisted (Sydney location), so the model returns GMT+10 time
agent.invoke(
    {"messages": [HumanMessage(content="What's my local time?")]},
    config={"configurable": {"thread_id": "session-1"}},
)
Integration tests
Many agent behaviors only emerge with a real LLM: which tool the agent decides to call, how it formats responses, whether a prompt change affects the entire execution trajectory. LangChain's agentevals package provides evaluators purpose-built for testing agent trajectories with live models.
AgentEvals makes it easy to evaluate an agent's trajectory (the exact sequence of messages, including tool calls) via trajectory matching or an LLM judge:
Trajectory match: hard-code a reference trajectory for a given input and validate the run with a step-by-step comparison. Use this for testing well-defined workflows where you know which tools should be called and in what order. The approach is deterministic, fast, and cheap, since no extra LLM calls are required.
LLM-as-judge: use an LLM to qualitatively evaluate the agent's execution trajectory. The "judge" LLM reviews the agent's decisions against a prompt rubric, which may include a reference trajectory. This is more flexible and can assess nuanced aspects such as efficiency and appropriateness, but it requires LLM calls and is less deterministic. Use it when you care about the overall quality and reasonableness of a trajectory rather than a strict tool-call order.
Install AgentEvals
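Assuming you install from PyPI (the package is published as agentevals):

```shell
pip install agentevals
```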
Alternatively, clone the AgentEvals repository directly.
Trajectory match evaluator
AgentEvals provides the create_trajectory_match_evaluator function for matching an agent's trajectory against a reference trajectory. Four modes are available:

| Mode | Description | Use when |
| --- | --- | --- |
| strict | Messages and tool calls match exactly, in the same order | Testing a specific sequence (e.g. check policy before authorizing) |
| unordered | The same tool calls may appear in any order | Verifying information retrieval when order doesn't matter |
| subset | The agent calls only tools that appear in the reference (no extras) | Ensuring the agent stays within expected bounds |
| superset | The agent calls at least the tools in the reference (extras allowed) | Verifying that the minimum required actions were taken |
The strict mode ensures the trajectory contains identical messages and tool calls in the same order, though it allows differences in message content. It is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before an authorization action.

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator

@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("gpt-4.1", tools=[get_weather])

evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

def test_weather_tool_called_strict():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in San Francisco?")]
    })
    reference_trajectory = [
        HumanMessage(content="What's the weather in San Francisco?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "San Francisco"}}
        ]),
        ToolMessage(content="It's 75 degrees and sunny in San Francisco.", tool_call_id="call_1"),
        AIMessage(content="The weather in San Francisco is 75 degrees and sunny."),
    ]
    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_strict_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True
The unordered mode allows the same tool calls to appear in any order, which is useful when you want to verify that certain information was retrieved but don't care about sequencing. For example, the agent might need to look up both the weather and the events for a city, in either order.

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator

@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_events(city: str):
    """Get events happening in a city."""
    return f"Concert at the park in {city} tonight."

agent = create_agent("gpt-4.1", tools=[get_weather, get_events])

evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="unordered",
)

def test_multiple_tools_any_order():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's happening in SF today?")]
    })
    # Reference shows tools called in a different order than the actual execution
    reference_trajectory = [
        HumanMessage(content="What's happening in SF today?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_events", "args": {"city": "SF"}},
            {"id": "call_2", "name": "get_weather", "args": {"city": "SF"}},
        ]),
        ToolMessage(content="Concert at the park in SF tonight.", tool_call_id="call_1"),
        ToolMessage(content="It's 75 degrees and sunny in SF.", tool_call_id="call_2"),
        AIMessage(content="Today in SF: 75 degrees and sunny with a concert at the park tonight."),
    ]
    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_unordered_match',
    #     'score': True,
    # }
    assert evaluation["score"] is True
The superset and subset modes match partial trajectories. The superset mode verifies that the agent called at least the tools in the reference trajectory, allowing extra calls; the subset mode ensures the agent called no tools beyond those in the reference.

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator

@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_detailed_forecast(city: str):
    """Get detailed weather forecast for a city."""
    return f"Detailed forecast for {city}: sunny all week."

agent = create_agent("gpt-4.1", tools=[get_weather, get_detailed_forecast])

evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="superset",
)

def test_agent_calls_required_tools_plus_extra():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Boston?")]
    })
    # Reference only requires get_weather, but the agent may call additional tools
    reference_trajectory = [
        HumanMessage(content="What's the weather in Boston?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "Boston"}},
        ]),
        ToolMessage(content="It's 75 degrees and sunny in Boston.", tool_call_id="call_1"),
        AIMessage(content="The weather in Boston is 75 degrees and sunny."),
    ]
    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_superset_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True
You can also set the tool_args_match_mode property and/or tool_args_match_overrides to customize how the evaluator decides that tool calls in the actual trajectory are equal to those in the reference. By default, tool calls are considered equal only when they call the same tool with identical arguments. See the repository for details.
LLM-as-judge evaluator
You can also use the create_trajectory_llm_as_judge function to have an LLM evaluate the agent's execution path. Unlike the trajectory match evaluators, it doesn't require a reference trajectory, though you can provide one if available.
from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT

@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("gpt-4.1", tools=[get_weather])

evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

def test_trajectory_quality():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Seattle?")]
    })
    evaluation = evaluator(
        outputs=result["messages"],
    )
    # {
    #     'key': 'trajectory_accuracy',
    #     'score': True,
    #     'comment': 'The provided agent trajectory is reasonable...'
    # }
    assert evaluation["score"] is True
If you have a reference trajectory, you can add an extra variable to the prompt and pass the reference in. Below, the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt wires up the reference_outputs variable:

evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
)
evaluation = evaluator(
    outputs=result["messages"],
    reference_outputs=reference_trajectory,
)
For finer-grained control over how the LLM evaluates trajectories, see the repository.
Async support
All agentevals evaluators support Python asyncio. For evaluators created with factory functions, the async version is available by inserting async_ after create_ in the function name.
from agentevals.trajectory.llm import create_async_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT
from agentevals.trajectory.match import create_async_trajectory_match_evaluator

async_judge = create_async_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

async_evaluator = create_async_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

async def test_async_evaluation():
    result = await agent.ainvoke({
        "messages": [HumanMessage(content="What's the weather?")]
    })
    evaluation = await async_judge(outputs=result["messages"])
    assert evaluation["score"] is True
LangSmith integration
To track experiment results over time, you can log evaluator results to LangSmith, a platform for building production-grade LLM applications that includes tracing, evaluation, and experimentation tooling.
First, set the required environment variables to configure LangSmith:
export LANGSMITH_API_KEY="your_langsmith_api_key"
export LANGSMITH_TRACING="true"
LangSmith offers two main approaches for running evaluations: the pytest integration and the evaluate function.
import pytest
from langsmith import testing as t
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT

trajectory_evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

@pytest.mark.langsmith
def test_trajectory_accuracy():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in SF?")]
    })
    reference_trajectory = [
        HumanMessage(content="What's the weather in SF?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "SF"}},
        ]),
        ToolMessage(content="It's 75 degrees and sunny in SF.", tool_call_id="call_1"),
        AIMessage(content="The weather in SF is 75 degrees and sunny."),
    ]
    # Log inputs, outputs, and reference outputs to LangSmith
    t.log_inputs({})
    t.log_outputs({"messages": result["messages"]})
    t.log_reference_outputs({"messages": reference_trajectory})
    trajectory_evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
Run the evaluation with pytest:
pytest test_trajectory.py --langsmith-output
Results are automatically logged to LangSmith.
Alternatively, you can create a dataset in LangSmith and use the evaluate function:

from langsmith import Client
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT

client = Client()

trajectory_evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

def run_agent(inputs):
    """Your agent function that returns trajectory messages."""
    return agent.invoke(inputs)["messages"]

experiment_results = client.evaluate(
    run_agent,
    data="your_dataset_name",
    evaluators=[trajectory_evaluator],
)

Results are automatically logged to LangSmith.
Recording and replaying HTTP calls
Integration tests that hit real LLM APIs can be slow and expensive, especially when run frequently in CI/CD pipelines. We recommend using a library that records HTTP requests and responses, then replays them on subsequent runs without making real network calls.
You can use vcrpy for this. If you use pytest, the pytest-recording plugin offers a minimal-configuration setup. Requests and responses are recorded to cassette files and replayed on later runs in place of real network calls.
Set up a conftest.py file that filters sensitive information out of the cassettes:

import pytest

@pytest.fixture(scope="session")
def vcr_config():
    return {
        "filter_headers": [
            ("authorization", "XXXX"),
            ("x-api-key", "XXXX"),
            # ... other headers you want to mask
        ],
        "filter_query_parameters": [
            ("api_key", "XXXX"),
            ("key", "XXXX"),
        ],
    }
Then configure your project to recognize the vcr marker:

[pytest]
markers =
    vcr: record/replay HTTP via VCR
addopts = --record-mode=once

The --record-mode=once option records HTTP interactions on the first run and replays them on subsequent runs.
Now, just add the vcr marker to your tests:

@pytest.mark.vcr()
def test_agent_trajectory():
    # ...

The first time this test runs, the agent makes real network calls and pytest generates a cassette file, test_agent_trajectory.yaml, under the tests/cassettes directory. Subsequent runs use that cassette to simulate the network calls, provided the agent issues the same requests as before. If the requests change, the test fails, and you need to delete the cassette file and rerun the test to record fresh interactions.
When you modify prompts, add new tools, or change expected trajectories, saved cassettes become stale and existing tests will fail. Delete the corresponding cassette files and rerun the tests to record new interactions.
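When cassettes need refreshing, pytest-recording also lets you control the record mode from the command line instead of deleting files by hand. A sketch; the mode names below come from vcrpy and pytest-recording, so verify them against your installed versions:

```shell
# Re-record all interactions (e.g. after a prompt change), overwriting stale cassettes
pytest tests/ --record-mode=rewrite

# In CI: replay only, and fail if a test attempts an unrecorded network request
pytest tests/ --record-mode=none
```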