langchain Runnable objects (such as chat models, retrievers, and chains) can be passed directly to evaluate() / aevaluate().

Setup

Let's define a simple chain to evaluate. First, install all the required packages:
pip install -U langsmith "langchain[openai]"
Now define a chain:
from langchain.chat_models import init_chat_model
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

instructions = (
    "Please review the user query below and determine if it contains any form "
    "of toxic behavior, such as insults, threats, or highly negative comments. "
    "Respond with 'Toxic' if it does, and 'Not toxic' if it doesn't."
)

prompt = ChatPromptTemplate(
    [("system", instructions), ("user", "{text}")],
)

model = init_chat_model("gpt-5.4")
chain = prompt | model | StrOutputParser()
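
The chain takes a single input variable, text, which comes from the "{text}" placeholder in the user message. As a rough, standard-library-only sketch, the placeholders of an f-string-style template can be listed and compared with a candidate input dict before anything is sent to a model. The helpers template_variables and missing_keys are hypothetical names introduced for illustration; they are not part of LangChain or LangSmith.

```python
from string import Formatter


def template_variables(template: str) -> set[str]:
    """Return the placeholder names used in an f-string-style template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None or empty for literal-only segments, so those are skipped.
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def missing_keys(template: str, example_inputs: dict) -> set[str]:
    """Return the placeholders that have no matching key in example_inputs."""
    return template_variables(template) - set(example_inputs)


# Hypothetical example inputs, written by hand for this check.
print(missing_keys("{text}", {"text": "You are great!"}))   # prints set()
print(missing_keys("{text}", {"query": "You are great!"}))  # prints {'text'}
```

An empty set means every placeholder has a matching key; anything else names the keys the input dict would need to provide.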

Evaluate

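To evaluate our chain, we can pass it directly to the evaluate() / aevaluate() method. Note that the input variables of the chain must match the keys of the example inputs. In this case, the example inputs should have the form {"text": "..."}.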
import asyncio
from langsmith import Client, aevaluate

client = Client()

# Clone a dataset of texts with toxicity labels.
# Each example input has a "text" key and each output has a "label" key.
dataset = client.clone_public_dataset(
    "https://smith.langchain.com/public/3d6831e6-1680-4c88-94df-618c8e01fc55/d"
)

def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Since our chain outputs a string not a dict, this string
    # gets stored under the default "output" key in the outputs dict:
    actual = outputs["output"]
    expected = reference_outputs["label"]
    return actual == expected

async def main():
    results = await aevaluate(
        chain,
        data=dataset,
        evaluators=[correct],
        experiment_prefix="gpt-5.4, baseline",
        metadata={"models": "openai:gpt-5.4"},  # optional, used to populate model/prompt/tool columns in UI
    )
    print(results)

asyncio.run(main())
The runnable is traced appropriately for each output.
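
Because the evaluator is a plain function over two dicts, it can be tried locally without calling the model or LangSmith. The sketch below repeats the exact-match evaluator and feeds it hand-written dicts that stand in for what the evaluation run would pass; the sample values are made up. It also shows correct_lenient, a hypothetical variant (not from the original guide) that ignores case, surrounding whitespace, and a trailing period. Whether that leniency is appropriate depends on how strictly the model's wording should be graded.

```python
def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Exact string match between the chain's output and the reference label.
    return outputs["output"] == reference_outputs["label"]


def correct_lenient(outputs: dict, reference_outputs: dict) -> bool:
    # Hypothetical variant: compare after trimming whitespace,
    # dropping a trailing period, and lowercasing.
    def normalize(value: str) -> str:
        return value.strip().rstrip(".").strip().lower()

    return normalize(outputs["output"]) == normalize(reference_outputs["label"])


# Hand-written stand-ins for one evaluation row (assumed values).
sample_outputs = {"output": "Not toxic."}
sample_reference = {"label": "Not toxic"}

print(correct(sample_outputs, sample_reference))          # prints False
print(correct_lenient(sample_outputs, sample_reference))  # prints True
```

The exact-match version treats "Not toxic." and "Not toxic" as different answers, which is worth keeping in mind when reading the scores of an experiment.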
