Agent evaluation

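Evaluations ("evals") measure how well an agent performs by assessing its execution trajectory: the sequence of messages and tool calls it produces. Unlike integration tests, which verify basic correctness, evals score agent behavior against a reference or a rubric, which makes them useful for catching regressions when you change prompts, tools, or models. An evaluator is a function that takes the agent's outputs (and, optionally, reference outputs) and returns a score: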

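// An evaluator receives the agent's outputs and (optionally) reference outputs.
function evaluator({ outputs, referenceOutputs }: {
  outputs: Record<string, any>;
  referenceOutputs: Record<string, any>;
}) {
  const outputMessages = outputs.messages;
  const referenceMessages = referenceOutputs.messages;
  // compareMessages is a placeholder for your own comparison logic.
  const score = compareMessages(outputMessages, referenceMessages);
  // Return a named score so results can be tracked across runs.
  return { key: "evaluator_score", score };
}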
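
The compareMessages helper used above is not defined in this guide. A minimal sketch of what it could look like follows, assuming messages are plain objects with a role and an optional tool_calls array. The SimpleMessage shape and the scoring rule (fraction of positions where role and tool-call names agree) are illustrative assumptions, not part of agentevals:

```typescript
// Hypothetical message shape for illustration; real LangChain messages carry more fields.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface SimpleMessage {
  role: "user" | "assistant" | "tool";
  content: string;
  tool_calls?: ToolCall[];
}

// Serializes the ordered tool-call names of a message so they can be compared.
function toolNameKey(message: SimpleMessage): string {
  return JSON.stringify((message.tool_calls ?? []).map((call) => call.name));
}

// Scores two trajectories between 0 and 1: the fraction of positions where
// both the role and the ordered tool-call names agree. Content is ignored.
function compareMessages(
  outputs: SimpleMessage[],
  reference: SimpleMessage[]
): number {
  const longest = Math.max(outputs.length, reference.length);
  if (longest === 0) return 1; // two empty trajectories trivially match
  let matches = 0;
  for (let i = 0; i < longest; i++) {
    const out = outputs[i];
    const ref = reference[i];
    if (!out || !ref) continue; // a missing message counts as a mismatch
    if (out.role === ref.role && toolNameKey(out) === toolNameKey(ref)) {
      matches++;
    }
  }
  return matches / longest;
}
```

A graded score like this is one design choice among several; the prebuilt trajectory match evaluators described below return a pass/fail result instead.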

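The agentevals package provides prebuilt evaluators for agent trajectories. You can evaluate by trajectory match (a deterministic comparison) or with an LLM judge (a qualitative assessment):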

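Approach	When to use
Trajectory match	You know the expected tool calls and want a fast, deterministic check with no LLM cost
LLM-as-judge	You want to assess overall quality and reasoning without strict expectations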

Install AgentEvals

npm install agentevals @langchain/core

Alternatively, clone the AgentEvals repository directly.

Trajectory match evaluator

AgentEvals provides the createTrajectoryMatchEvaluator function to match your agent's trajectory against a reference trajectory. There are four modes:

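Mode	Description	Use case
`strict`	Message structure and tool calls match exactly, in the same order (message content may differ)	Testing a specific sequence (e.g., a policy lookup before authorization)
`unordered`	Message structure and tool calls are the same as the reference, but tool calls may occur in any order	Verifying information retrieval when order doesn't matter
`subset`	The agent calls only tools that appear in the reference (no extra tools)	Ensuring the agent stays within the expected scope
`superset`	The agent calls at least the tools in the reference (extra tools allowed)	Verifying that the minimum required actions were taken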
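
To make the four modes concrete, the sketch below compares only the tool names from two trajectories. It is a simplified illustration, not the agentevals implementation (which also checks message structure and tool arguments), and the toolNamesMatch function is a hypothetical name:

```typescript
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Compares the tool names an agent called against a reference list.
// Simplified: ignores message structure and tool arguments.
function toolNamesMatch(
  actual: string[],
  reference: string[],
  mode: MatchMode
): boolean {
  switch (mode) {
    case "strict":
      // Same tools, same order.
      return (
        actual.length === reference.length &&
        actual.every((name, i) => name === reference[i])
      );
    case "unordered": {
      // Same tools with the same counts, in any order.
      const a = [...actual].sort();
      const r = [...reference].sort();
      return a.length === r.length && a.every((name, i) => name === r[i]);
    }
    case "subset":
      // Every tool the agent called appears in the reference.
      return actual.every((name) => reference.indexOf(name) !== -1);
    case "superset":
      // Every tool in the reference was called by the agent.
      return reference.every((name) => actual.indexOf(name) !== -1);
  }
}
```

For example, an agent that calls get_weather and get_detailed_forecast against a reference containing only get_weather would pass under superset but fail under subset in this sketch.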

The following examples share a common setup, an agent with a get_weather tool:

import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "claude-sonnet-4-6",
  tools: [getWeather],
});

Strict match

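strict mode ensures that the trajectory contains the same messages in the same order with the same tool calls, while allowing differences in message content. This is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before authorizing an action.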

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "strict",
});

async function testWeatherToolCalledStrict() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in San Francisco?")]
  });

  const referenceTrajectory = [
    new HumanMessage("What's the weather in San Francisco?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "San Francisco" } }
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in San Francisco.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in San Francisco is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory
  });
  expect(evaluation.score).toBe(true);
}

Unordered match

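unordered mode allows the same tool calls to appear in any order. This helps when you want to verify that specific information was retrieved but don't care about the sequence. For example, an agent might check both the weather and the events for a city using separate tool calls.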

const getEvents = tool(
  async ({ city }: { city: string }) => {
    return `Concert at the park in ${city} tonight.`;
  },
  {
    name: "get_events",
    description: "Get events happening in a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "claude-sonnet-4-6",
  tools: [getWeather, getEvents],
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "unordered",
});

async function testMultipleToolsAnyOrder() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's happening in SF today?")]
  });

  const referenceTrajectory = [
    new HumanMessage("What's happening in SF today?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_events", args: { city: "SF" } },
        { id: "call_2", name: "get_weather", args: { city: "SF" } },
      ]
    }),
    new ToolMessage({
      content: "Concert at the park in SF tonight.",
      tool_call_id: "call_1"
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in SF.",
      tool_call_id: "call_2"
    }),
    new AIMessage("Today in SF: 75 degrees and sunny with a concert at the park tonight."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  expect(evaluation.score).toBe(true);
}

Subset and superset match

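The superset and subset modes match partial trajectories. superset mode verifies that the agent called at least the tools in the reference trajectory, allowing additional tool calls. subset mode ensures that the agent did not call any tools beyond those in the reference trajectory.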

const getDetailedForecast = tool(
  async ({ city }: { city: string }) => {
    return `Detailed forecast for ${city}: sunny all week.`;
  },
  {
    name: "get_detailed_forecast",
    description: "Get detailed weather forecast for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "claude-sonnet-4-6",
  tools: [getWeather, getDetailedForecast],
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "superset",
});

async function testAgentCallsRequiredToolsPlusExtra() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Boston?")]
  });

  const referenceTrajectory = [
    new HumanMessage("What's the weather in Boston?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "Boston" } },
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in Boston.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in Boston is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  expect(evaluation.score).toBe(true);
}

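You can also set the toolArgsMatchMode property and/or toolArgsMatchOverrides to customize how the evaluator decides whether tool calls in the actual trajectory and the reference trajectory are equal. By default, only tool calls to the same tool with the same arguments are considered equal. See the repository for more details.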
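
As an illustration of a per-tool comparison, the sketch below defines a comparator that treats city arguments as equal regardless of letter case and surrounding whitespace. The names citiesMatch and normalizeCity are hypothetical, and the assumption that an override is keyed by tool name and receives the two argument objects should be checked against the AgentEvals repository before use:

```typescript
type ToolArgs = Record<string, unknown>;

// Normalizes a city argument; returns undefined when it is not a string.
function normalizeCity(value: unknown): string | undefined {
  return typeof value === "string" ? value.trim().toLowerCase() : undefined;
}

// Treats two get_weather calls as equal when their city arguments match
// after normalization ("SF" and " sf " are considered the same).
function citiesMatch(actualArgs: ToolArgs, referenceArgs: ToolArgs): boolean {
  const actual = normalizeCity(actualArgs["city"]);
  const reference = normalizeCity(referenceArgs["city"]);
  return actual !== undefined && actual === reference;
}

// Assumed wiring (not verified here): one comparator per tool name.
const toolArgsMatchOverrides = {
  get_weather: citiesMatch,
};
```

An object like this could then be supplied as the toolArgsMatchOverrides option when creating the evaluator, so that a reference written with "San Francisco" still matches an agent call using "san francisco".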

LLM-as-judge evaluator

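You can use an LLM to evaluate the agent's execution path with the createTrajectoryLLMAsJudge function. Unlike the trajectory match evaluators, it does not require a reference trajectory, but you can provide one if available.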

Without a reference trajectory

import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

async function testTrajectoryQuality() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Seattle?")]
  });

  const evaluation = await evaluator({
    outputs: result.messages,
  });
  expect(evaluation.score).toBe(true);
}

With a reference trajectory

If you have a reference trajectory, use the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt:

import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE } from "agentevals";

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
});

const evaluation = await evaluator({
  outputs: result.messages,
  referenceOutputs: referenceTrajectory,
});

For more on configuring how the LLM evaluates the trajectory, see the repository.

Run evals in LangSmith

To track experiments over time, log evaluator results to LangSmith. First, set the required environment variables:

export LANGSMITH_API_KEY="your_langsmith_api_key"
export LANGSMITH_TRACING="true"

LangSmith offers two main ways to run evals: the Vitest/Jest integration and the evaluate function.

Using the Vitest/Jest integration

import * as ls from "langsmith/vitest";
// import * as ls from "langsmith/jest";

import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";

const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

ls.describe("trajectory accuracy", () => {
  ls.test("accurate trajectory", {
    inputs: {
      messages: [
        { role: "user", content: "What is the weather in SF?" }
      ]
    },
    referenceOutputs: {
      messages: [
        new HumanMessage("What is the weather in SF?"),
        new AIMessage({
          content: "",
          tool_calls: [
            { id: "call_1", name: "get_weather", args: { city: "SF" } }
          ]
        }),
        new ToolMessage({
          content: "It's 75 degrees and sunny in SF.",
          tool_call_id: "call_1"
        }),
        new AIMessage("The weather in SF is 75 degrees and sunny."),
      ],
    },
  }, async ({ inputs, referenceOutputs }) => {
    const result = await agent.invoke({
      messages: [new HumanMessage("What is the weather in SF?")]
    });

    ls.logOutputs({ messages: result.messages });

    await trajectoryEvaluator({
      inputs,
      outputs: result.messages,
      referenceOutputs,
    });
  });
});

Run the evals with your test runner:

vitest run test_trajectory.eval.ts
# or
jest test_trajectory.eval.ts

Using the evaluate function

Create a LangSmith dataset and use the evaluate function. The dataset must have the following schema:

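input: {"messages": [...]} The input messages used to invoke the agent.
output: {"messages": [...]} The expected message history in the agent's output. For trajectory evaluation, you may choose to keep only assistant messages.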
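
A dataset example with this shape, including the optional step of keeping only assistant messages in the expected output, might look like the following sketch. The DatasetMessage and DatasetExample types and the keepAssistantMessages helper are hypothetical names for illustration, the field names mirror the schema as written above, and messages are plain role/content objects rather than LangChain message classes:

```typescript
// Hypothetical plain-object message shape for dataset entries.
interface DatasetMessage {
  role: "user" | "assistant" | "tool";
  content: string;
  tool_calls?: { id: string; name: string; args: Record<string, unknown> }[];
  tool_call_id?: string;
}

// One dataset example following the schema described above.
interface DatasetExample {
  input: { messages: DatasetMessage[] };
  output: { messages: DatasetMessage[] };
}

// Drops user and tool messages, keeping only what the assistant produced.
function keepAssistantMessages(messages: DatasetMessage[]): DatasetMessage[] {
  return messages.filter((message) => message.role === "assistant");
}

const question: DatasetMessage = {
  role: "user",
  content: "What is the weather in SF?",
};

const fullTrajectory: DatasetMessage[] = [
  question,
  {
    role: "assistant",
    content: "",
    tool_calls: [{ id: "call_1", name: "get_weather", args: { city: "SF" } }],
  },
  {
    role: "tool",
    content: "It's 75 degrees and sunny in SF.",
    tool_call_id: "call_1",
  },
  { role: "assistant", content: "The weather in SF is 75 degrees and sunny." },
];

const example: DatasetExample = {
  input: { messages: [question] },
  output: { messages: keepAssistantMessages(fullTrajectory) },
};
```

Keeping only assistant messages makes the reference shorter to maintain, at the cost of no longer recording the tool results the agent was expected to see.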

import { evaluate } from "langsmith/evaluation";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";

const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

async function runAgent(inputs: any) {
  const result = await agent.invoke(inputs);
  return result.messages;
}

await evaluate(
  runAgent,
  {
    data: "your_dataset_name",
    evaluators: [trajectoryEvaluator],
  }
);

To learn more about evaluating agents, see the LangSmith documentation.
