如何使用轨迹评估来评估您的智能体

许多智能体行为只有在使用真实LLM时才会显现，例如智能体决定调用哪个工具、如何格式化响应，或者提示词修改是否会影响整个执行轨迹。LangChain的 agentevals 包提供了专门设计用于使用实时模型测试智能体轨迹的评估器。

本指南涵盖开源的 LangChain agentevals 包，它与LangSmith集成用于轨迹评估。

AgentEvals允许您通过执行 轨迹匹配 或使用 LLM评判 来评估智能体的轨迹（确切的消息序列，包括工具调用）：

轨迹匹配

为给定输入硬编码一个参考轨迹，并通过逐步比较来验证运行。适用于测试定义明确的工作流，您知道预期行为。当您对应该调用哪些工具以及调用顺序有具体期望时使用。这种方法是确定性的、快速的且具有成本效益，因为它不需要额外的LLM调用。

LLM作为评判者

使用LLM定性地验证智能体的执行轨迹。“评判”LLM根据提示词标准（可以包括参考轨迹）审查智能体的决策。更灵活，可以评估效率和适当性等细微方面，但需要LLM调用且确定性较低。当您想评估智能体轨迹的整体质量和合理性，而没有严格的工具调用或顺序要求时使用。

安装AgentEvals

pip install agentevals

npm install agentevals @langchain/core

或者，直接克隆 AgentEvals仓库。

轨迹匹配评估器

AgentEvals在Python中提供 create_trajectory_match_evaluator 函数，在TypeScript中提供 createTrajectoryMatchEvaluator 函数，用于将您的智能体轨迹与参考轨迹进行匹配。您可以使用以下模式：

模式	描述	使用场景
`strict`	消息和工具调用的精确匹配，顺序相同	测试特定序列（例如，授权前先进行策略查询）
`unordered`	允许相同的工具调用以任何顺序出现	验证信息检索，当顺序无关紧要时
`subset`	智能体仅调用参考中的工具（无额外工具）	确保智能体不超过预期范围
`superset`	智能体至少调用参考中的工具（允许额外工具）	验证采取了最低要求的操作

严格匹配

strict 模式确保轨迹包含相同顺序的相同消息和相同的工具调用，但允许消息内容存在差异。当您需要强制执行特定操作序列时（例如，要求在授权操作前进行策略查询），这很有用。

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """获取城市的天气信息。"""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("gpt-5.4", tools=[get_weather])

evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

def test_weather_tool_called_strict():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in San Francisco?")]
    })

    reference_trajectory = [
        HumanMessage(content="What's the weather in San Francisco?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "San Francisco"}}
        ]),
        ToolMessage(content="It's 75 degrees and sunny in San Francisco.", tool_call_id="call_1"),
        AIMessage(content="The weather in San Francisco is 75 degrees and sunny."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory
    )
    # {
    #     'key': 'trajectory_strict_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True

import { createAgent, tool, HumanMessage, AIMessage, ToolMessage } from "langchain"
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({
      city: z.string(),
    }),
  }
);

const agent = createAgent({
  model: "gpt-5.4",
  tools: [getWeather]
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "strict",
});

async function testWeatherToolCalledStrict() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in San Francisco?")]
  });

  const referenceTrajectory = [
    new HumanMessage("What's the weather in San Francisco?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "San Francisco" } }
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in San Francisco.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in San Francisco is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory
  });
  // {
  //     'key': 'trajectory_strict_match',
  //     'score': true,
  //     'comment': null,
  // }
  expect(evaluation.score).toBe(true);
}

无序匹配

unordered 模式允许相同的工具调用以任何顺序出现，当您想验证是否调用了正确的工具集但不关心顺序时，这很有帮助。例如，智能体可能需要同时检查城市的天气和活动，但顺序无关紧要。

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """获取城市的天气信息。"""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_events(city: str):
    """获取城市中发生的活动。"""
    return f"Concert at the park in {city} tonight."

agent = create_agent("gpt-5.4", tools=[get_weather, get_events])

evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="unordered",
)

def test_multiple_tools_any_order():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's happening in SF today?")]
    })

    # 参考轨迹显示的工具调用顺序与实际执行不同
    reference_trajectory = [
        HumanMessage(content="What's happening in SF today?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_events", "args": {"city": "SF"}},
            {"id": "call_2", "name": "get_weather", "args": {"city": "SF"}},
        ]),
        ToolMessage(content="Concert at the park in SF tonight.", tool_call_id="call_1"),
        ToolMessage(content="It's 75 degrees and sunny in SF.", tool_call_id="call_2"),
        AIMessage(content="Today in SF: 75 degrees and sunny with a concert at the park tonight."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_unordered_match',
    #     'score': True,
    # }
    assert evaluation["score"] is True

import { createAgent, tool, HumanMessage, AIMessage, ToolMessage } from "langchain"
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const getEvents = tool(
  async ({ city }: { city: string }) => {
    return `Concert at the park in ${city} tonight.`;
  },
  {
    name: "get_events",
    description: "Get events happening in a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "gpt-5.4",
  tools: [getWeather, getEvents]
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "unordered",
});

async function testMultipleToolsAnyOrder() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's happening in SF today?")]
  });

  // 参考轨迹显示的工具调用顺序与实际执行不同
  const referenceTrajectory = [
    new HumanMessage("What's happening in SF today?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_events", args: { city: "SF" } },
        { id: "call_2", name: "get_weather", args: { city: "SF" } },
      ]
    }),
    new ToolMessage({
      content: "Concert at the park in SF tonight.",
      tool_call_id: "call_1"
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in SF.",
      tool_call_id: "call_2"
    }),
    new AIMessage("Today in SF: 75 degrees and sunny with a concert at the park tonight."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  // {
  //     'key': 'trajectory_unordered_match',
  //     'score': true,
  // }
  expect(evaluation.score).toBe(true);
}

子集和超集匹配

superset 和 subset 模式关注调用了哪些工具，而不是工具调用的顺序，允许您控制智能体的工具调用必须与参考轨迹对齐的严格程度。

当您想验证执行中调用了几个关键工具，但允许智能体调用额外工具时，使用 superset 模式。智能体的轨迹必须至少包含参考轨迹中的所有工具调用，并且可以包含超出参考轨迹的额外工具调用。
使用 subset 模式通过验证智能体没有调用参考轨迹之外的任何无关或不必要的工具来确保智能体效率。智能体的轨迹必须仅包含参考轨迹中出现的工具调用。

以下示例演示了 superset 模式，其中参考轨迹仅要求 get_weather 工具，但智能体可以调用额外工具：

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """获取城市的天气信息。"""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_detailed_forecast(city: str):
    """获取城市的详细天气预报。"""
    return f"Detailed forecast for {city}: sunny all week."

agent = create_agent("gpt-5.4", tools=[get_weather, get_detailed_forecast])

evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="superset",
)

def test_agent_calls_required_tools_plus_extra():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Boston?")]
    })

    # 参考轨迹仅要求 get_weather，但智能体可以调用额外工具
    reference_trajectory = [
        HumanMessage(content="What's the weather in Boston?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "Boston"}},
        ]),
        ToolMessage(content="It's 75 degrees and sunny in Boston.", tool_call_id="call_1"),
        AIMessage(content="The weather in Boston is 75 degrees and sunny."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_superset_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True

import { createAgent } from "langchain"
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const getDetailedForecast = tool(
  async ({ city }: { city: string }) => {
    return `Detailed forecast for ${city}: sunny all week.`;
  },
  {
    name: "get_detailed_forecast",
    description: "Get detailed weather forecast for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "gpt-5.4",
  tools: [getWeather, getDetailedForecast]
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "superset",
});

async function testAgentCallsRequiredToolsPlusExtra() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Boston?")]
  });

  // 参考轨迹仅要求 getWeather，但智能体可以调用额外工具
  const referenceTrajectory = [
    new HumanMessage("What's the weather in Boston?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "Boston" } },
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in Boston.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in Boston is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  // {
  //     'key': 'trajectory_superset_match',
  //     'score': true,
  //     'comment': null,
  // }
  expect(evaluation.score).toBe(true);
}

您还可以通过设置 tool_args_match_mode（Python）或 toolArgsMatchMode（TypeScript）属性，以及 tool_args_match_overrides（Python）或 toolArgsMatchOverrides（TypeScript）属性来自定义评估器如何考虑实际轨迹与参考轨迹中工具调用的相等性。默认情况下，只有对相同工具具有相同参数的工具调用才被视为相等。访问仓库了解更多详情。

LLM作为评判者评估器

本节涵盖 agentevals 包中特定于轨迹的LLM作为评判者评估器。有关LangSmith中通用的LLM作为评判者评估器，请参阅 LLM作为评判者评估器。

您也可以使用LLM来评估智能体的执行路径。与轨迹匹配评估器不同，它不需要参考轨迹，但如果有的话可以提供。

无参考轨迹

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT


@tool
def get_weather(city: str):
    """获取城市的天气信息。"""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("gpt-5.4", tools=[get_weather])

evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

def test_trajectory_quality():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Seattle?")]
    })

    evaluation = evaluator(
        outputs=result["messages"],
    )
    # {
    #     'key': 'trajectory_accuracy',
    #     'score': True,
    #     'comment': 'The provided agent trajectory is reasonable...'
    # }
    assert evaluation["score"] is True

import { createAgent } from "langchain"
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "gpt-5.4",
  tools: [getWeather]
});

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

async function testTrajectoryQuality() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Seattle?")]
  });

  const evaluation = await evaluator({
    outputs: result.messages,
  });
  // {
  //     'key': 'trajectory_accuracy',
  //     'score': true,
  //     'comment': 'The provided agent trajectory is reasonable...'
  // }
  expect(evaluation.score).toBe(true);
}

有参考轨迹

如果您有参考轨迹，可以在提示词中添加一个额外变量并传入参考轨迹。下面，我们使用预构建的 TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE 提示词并配置 reference_outputs 变量：

evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
)
evaluation = evaluator(
    outputs=result["messages"],
    reference_outputs=reference_trajectory,
)

import { TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE } from "agentevals";

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
});

const evaluation = await evaluator({
  outputs: result.messages,
  referenceOutputs: referenceTrajectory,
});

有关如何配置LLM评估轨迹的更多选项，请访问仓库。

异步支持（Python）

所有 agentevals 评估器都支持Python asyncio。对于使用工厂函数的评估器，通过在函数名中的 create_ 后添加 async 可以获得异步版本。以下是使用异步评判者和评估器的示例：

from agentevals.trajectory.llm import create_async_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT
from agentevals.trajectory.match import create_async_trajectory_match_evaluator

async_judge = create_async_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

async_evaluator = create_async_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

async def test_async_evaluation():
    result = await agent.ainvoke({
        "messages": [HumanMessage(content="What's the weather?")]
    })

    evaluation = await async_judge(outputs=result["messages"])
    assert evaluation["score"] is True

将这些文档连接到Claude、VSCode等，通过MCP获取实时答案。

在GitHub上编辑此页面或提交问题。

Datasets

Set up evaluations

Analyze experiment results

Annotation & human feedback

Common data types

如何使用轨迹评估来评估您的智能体

轨迹匹配

LLM作为评判者

安装AgentEvals

轨迹匹配评估器

严格匹配

无序匹配

子集和超集匹配

LLM作为评判者评估器

无参考轨迹

有参考轨迹

异步支持（Python）

轨迹匹配

LLM作为评判者

​安装AgentEvals

​轨迹匹配评估器

​严格匹配

​无序匹配

​子集和超集匹配

​LLM作为评判者评估器

​无参考轨迹

​有参考轨迹

​异步支持（Python）

安装AgentEvals

轨迹匹配评估器

严格匹配

无序匹配

子集和超集匹配

LLM作为评判者评估器

无参考轨迹

有参考轨迹

异步支持（Python）