Evaluate agent performance

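To evaluate your agent's performance, you can use LangSmith evaluations. First, define an evaluator function that judges the results of an agent run, such as its final output or its trajectory. Depending on your evaluation technique, this may or may not involve a reference output.

You can run evaluations in LangSmith using datasets and LLM-as-judge evaluators. See the evaluation quickstart to get started.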

type EvaluatorParams = {
    outputs: Record<string, any>;
    referenceOutputs: Record<string, any>;
};

function evaluator({ outputs, referenceOutputs }: EvaluatorParams) {
    // Compare agent outputs against reference outputs.
    // `compareMessages` is a placeholder for your own comparison logic.
    const outputMessages = outputs.messages;
    const referenceMessages = referenceOutputs.messages;
    const score = compareMessages(outputMessages, referenceMessages);
    return { key: "evaluator_score", score };
}
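The evaluator above relies on a `compareMessages` helper that is not defined. A minimal sketch of one possible implementation follows. The `Message` shape and the scoring rule (the fraction of reference messages matched at the same position) are assumptions made for illustration, not part of LangSmith or AgentEvals.

```typescript
// Hypothetical message shape; real agent messages may carry more fields.
type Message = { role: string; content: string };

// Returns the fraction of reference messages that the output matches at the
// same position (same role, same content after trimming and lowercasing).
function compareMessages(outputs: Message[], references: Message[]): number {
  if (references.length === 0) {
    // Nothing to compare against: full score only if the output is empty too.
    return outputs.length === 0 ? 1 : 0;
  }
  const normalize = (text: string) => text.trim().toLowerCase();
  let matches = 0;
  references.forEach((reference, index) => {
    const output = outputs[index];
    if (
      output !== undefined &&
      output.role === reference.role &&
      normalize(output.content) === normalize(reference.content)
    ) {
      matches += 1;
    }
  });
  return matches / references.length;
}

// Usage: the first message matches, the second does not.
const exampleScore = compareMessages(
  [
    { role: "user", content: "Hi" },
    { role: "assistant", content: "It is sunny." },
  ],
  [
    { role: "user", content: "hi " },
    { role: "assistant", content: "It is raining." },
  ]
);
```

Exact positional matching is deliberately strict. For free-form answers, a fuzzier comparison or an LLM-based judge is usually a better fit.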

To get started, you can use the prebuilt evaluators from the AgentEvals package:

npm install agentevals

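Create evaluator

A common way to evaluate agent performance is to compare its trajectory (the order in which it calls its tools) against a reference trajectory: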

import { createTrajectoryMatchEvaluator } from "agentevals/trajectory/match";

// Trajectory produced by the agent: two tool calls
const outputs = [
    {
        role: "assistant",
        tool_calls: [
            {
                function: {
                    name: "get_weather",
                    arguments: JSON.stringify({ city: "san francisco" }),
                },
            },
            {
                function: {
                    name: "get_directions",
                    arguments: JSON.stringify({ destination: "presidio" }),
                },
            },
        ],
    },
];

// Reference trajectory: only the weather tool call is required
const referenceOutputs = [
    {
        role: "assistant",
        tool_calls: [
            {
                function: {
                    name: "get_weather",
                    arguments: JSON.stringify({ city: "san francisco" }),
                },
            },
        ],
    },
];

// Create the evaluator
const evaluator = createTrajectoryMatchEvaluator({
    // Specify how the trajectories will be compared. `superset` will accept
    // the output trajectory as valid if it's a superset of the reference one.
    // Other options include: strict, unordered and subset
    trajectoryMatchMode: "superset",
});

// Run the evaluator (it returns a promise, so await the result)
const result = await evaluator({
    outputs: outputs,
    referenceOutputs: referenceOutputs,
});

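The trajectoryMatchMode option specifies how the trajectories are compared. superset accepts the output trajectory as valid if it is a superset of the reference trajectory. Other options include strict, unordered, and subset.

As a next step, learn more about how to customize the trajectory match evaluator.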
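To make the four match modes more concrete, here is a simplified sketch that compares only the names of the tool calls. It is an illustration of the idea and not the AgentEvals implementation, which works on full message lists and tool arguments. The function `matchesToolCalls` and its helpers are hypothetical names.

```typescript
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Count how many times each tool name occurs.
function countNames(names: string[]): Record<string, number> {
  const counts: Record<string, number> = Object.create(null);
  for (const name of names) {
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

// True when every name in `inner` occurs at least as often in `outer`.
function contains(outer: string[], inner: string[]): boolean {
  const outerCounts = countNames(outer);
  const innerCounts = countNames(inner);
  return Object.keys(innerCounts).every(
    (name) => (outerCounts[name] || 0) >= innerCounts[name]
  );
}

function matchesToolCalls(
  output: string[],
  reference: string[],
  mode: MatchMode
): boolean {
  switch (mode) {
    case "strict":
      // Same tool calls in the same order.
      return (
        output.length === reference.length &&
        output.every((name, index) => name === reference[index])
      );
    case "unordered":
      // Same tool calls, order ignored.
      return contains(output, reference) && contains(reference, output);
    case "subset":
      // The output calls nothing beyond what the reference calls.
      return contains(reference, output);
    case "superset":
      // The output calls at least everything the reference calls.
      return contains(output, reference);
  }
}

// The trajectories from the example above, reduced to tool names.
const outputTools = ["get_weather", "get_directions"];
const referenceTools = ["get_weather"];
const supersetOk = matchesToolCalls(outputTools, referenceTools, "superset");
const subsetOk = matchesToolCalls(outputTools, referenceTools, "subset");
```

With these inputs, superset passes because the agent made the required get_weather call plus an extra one, while subset fails because get_directions is not in the reference.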

LLM-as-a-judge

You can use an LLM-as-a-judge evaluator, which uses an LLM to compare the trajectory against the reference outputs and output a score:

import {
    createTrajectoryLlmAsJudge,
    TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
} from "agentevals/trajectory/llm";

const evaluator = createTrajectoryLlmAsJudge({
    prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
    model: "openai:o3-mini",
});

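Run evaluator

To run an evaluator, you first need to create a LangSmith dataset. To use the prebuilt AgentEvals evaluators, your dataset must have the following schema:

input: {"messages": [...]} the input messages to call the agent with.
output: {"messages": [...]} the expected message history in the agent output. For trajectory evaluation, you can choose to keep only the assistant messages.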
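As a rough sketch of what one dataset example with this schema might look like, the snippet below builds the input and expected output and keeps only the assistant messages for trajectory evaluation. The type names, the `keepAssistantMessages` helper, and the sample messages are assumptions for illustration. LangSmith stores these two parts in the `inputs` and `outputs` fields of an example.

```typescript
// Hypothetical message types modeled on the tool-call format used above.
type ToolCall = { function: { name: string; arguments: string } };
type Message = { role: string; content?: string; tool_calls?: ToolCall[] };

type DatasetExample = {
  inputs: { messages: Message[] };
  outputs: { messages: Message[] };
};

// Keep only the assistant messages, which carry the tool-call trajectory.
function keepAssistantMessages(messages: Message[]): Message[] {
  return messages.filter((message) => message.role === "assistant");
}

// A full reference conversation, including user and tool messages.
const fullHistory: Message[] = [
  { role: "user", content: "What is the weather in san francisco?" },
  {
    role: "assistant",
    tool_calls: [
      {
        function: {
          name: "get_weather",
          arguments: JSON.stringify({ city: "san francisco" }),
        },
      },
    ],
  },
  { role: "tool", content: "Sunny" },
  { role: "assistant", content: "It is sunny in san francisco." },
];

// One dataset example: the agent input plus the expected assistant messages.
const datasetExample: DatasetExample = {
  inputs: { messages: [fullHistory[0]] },
  outputs: { messages: keepAssistantMessages(fullHistory) },
};
```

Dropping user and tool messages from the reference keeps the comparison focused on what the agent decided to do, not on the text returned by the tools.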

import { Client } from "langsmith";
import { evaluate } from "langsmith/evaluation";
import { createAgent } from "langchain";
import { createTrajectoryMatchEvaluator } from "agentevals/trajectory/match";

const client = new Client();
const agent = createAgent({...});
const evaluator = createTrajectoryMatchEvaluator({...});

// Run the agent on every example in the dataset and score each run
const experimentResults = await evaluate(
    (inputs) => agent.invoke(inputs),
    {
        // replace with your dataset name
        data: "<Name of your dataset>",
        evaluators: [evaluator],
        client,
    }
);
