DeepEval is a package for unit testing LLMs. Using Confident, everyone can build robust language models through faster iterations with both unit and integration testing. We provide support for each step of the iteration, from synthetic data creation to testing.
In this guide we will demonstrate how to test and evaluate an LLM's performance. We show how to use callbacks to measure performance, and how to define custom metrics and log them to the dashboard. DeepEval also offers:
  • How to generate synthetic data
  • How to measure performance
  • A dashboard to monitor and review results over time

Installation and Setup

pip install -qU langchain langchain-openai langchain-community deepeval langchain-chroma

Getting API Credentials

To get the DeepEval API credentials, follow these steps:
  1. Go to app.confident-ai.com
  2. Click on "Organization"
  3. Copy the API Key.
When you log in, you will also be asked to set an implementation name. The implementation name describes the type of implementation. (We recommend choosing a descriptive name that reflects your project.)
!deepeval login
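As an alternative to the interactive login, credentials can be supplied through environment variables before anything else runs. A minimal sketch with hypothetical placeholder values — the `CONFIDENT_API_KEY` variable name is an assumption, not something this guide confirms:

```python
import os

# Placeholder values -- substitute your real keys.
# CONFIDENT_API_KEY as the variable name is an assumption.
os.environ.setdefault("OPENAI_API_KEY", "sk-XXX")
os.environ.setdefault("CONFIDENT_API_KEY", "confident-XXX")
```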

Setting Up DeepEval

By default, you can use the DeepEvalCallbackHandler to set up the metrics you want to track. However, it has limited support for metrics at the moment (more to be added soon). It currently supports:
from deepeval.metrics.answer_relevancy import AnswerRelevancy

# Here we want to make sure the answer is minimally relevant
answer_relevancy_metric = AnswerRelevancy(minimum_score=0.5)

Get Started

To use the DeepEvalCallbackHandler, you need to provide the implementation_name:
from langchain_community.callbacks.confident_callback import DeepEvalCallbackHandler

deepeval_callback = DeepEvalCallbackHandler(
    implementation_name="langchainQuickstart", metrics=[answer_relevancy_metric]
)

Scenario 1: Feeding into an LLM

You can then feed the callback into your LLM with OpenAI.
from langchain_openai import OpenAI

llm = OpenAI(
    temperature=0,
    callbacks=[deepeval_callback],
    verbose=True,
    openai_api_key="<YOUR_API_KEY>",
)
output = llm.generate(
    [
        "What is the best evaluation tool out there? (no bias at all)",
    ]
)
LLMResult(generations=[[Generation(text='\n\nQ: What did the fish say when he hit the wall? \nA: Dam.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nThe Moon \n\nThe moon is high in the midnight sky,\nSparkling like a star above.\nThe night so peaceful, so serene,\nFilling up the air with love.\n\nEver changing and renewing,\nA never-ending light of grace.\nThe moon remains a constant view,\nA reminder of life’s gentle pace.\n\nThrough time and space it guides us on,\nA never-fading beacon of hope.\nThe moon shines down on us all,\nAs it continues to rise and elope.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nQ. What did one magnet say to the other magnet?\nA. "I find you very attractive!"', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text="\n\nThe world is charged with the grandeur of God.\nIt will flame out, like shining from shook foil;\nIt gathers to a greatness, like the ooze of oil\nCrushed. Why do men then now not reck his rod?\n\nGenerations have trod, have trod, have trod;\nAnd all is seared with trade; bleared, smeared with toil;\nAnd wears man's smudge and shares man's smell: the soil\nIs bare now, nor can foot feel, being shod.\n\nAnd for all this, nature is never spent;\nThere lives the dearest freshness deep down things;\nAnd though the last lights off the black West went\nOh, morning, at the brown brink eastward, springs —\n\nBecause the Holy Ghost over the bent\nWorld broods with warm breast and with ah! bright wings.\n\n~Gerard Manley Hopkins", generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nQ: What did one ocean say to the other ocean?\nA: Nothing, they just waved.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text="\n\nA poem for you\n\nOn a field of green\n\nThe sky so blue\n\nA gentle breeze, the sun above\n\nA beautiful world, for us to love\n\nLife is a journey, full of surprise\n\nFull of joy and full of surprise\n\nBe brave and take small steps\n\nThe future will be revealed with depth\n\nIn the morning, when dawn arrives\n\nA fresh start, no reason to hide\n\nSomewhere down the road, there's a heart that beats\n\nBelieve in yourself, you'll always succeed.", generation_info={'finish_reason': 'stop', 'logprobs': None})]], llm_output={'token_usage': {'completion_tokens': 504, 'total_tokens': 528, 'prompt_tokens': 24}, 'model_name': 'text-davinci-003'})
You can check whether the metric passed by calling the is_successful() method.
answer_relevancy_metric.is_successful()
# returns True/False
Once the run completes, you can see the results on our dashboard. (Dashboard screenshot)

Scenario 2: Tracking an LLM in a Chain Without Callbacks

To track an LLM in a chain without callbacks, you can plug in at the end. We can start by defining a simple chain as shown below.
import requests
from langchain_classic.chains import RetrievalQA
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

text_file_url = "https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt"

openai_api_key = "sk-XXX"

with open("state_of_the_union.txt", "w") as f:
    response = requests.get(text_file_url)
    f.write(response.text)

loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=openai_api_key),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)

# Providing a new question-answering pipeline
query = "Who is the president?"
result = qa.run(query)
After defining the chain, you can then manually check the answer relevancy.
answer_relevancy_metric.measure(result, query)
answer_relevancy_metric.is_successful()

What's Next?

You can create your own custom metrics here. DeepEval also offers other features such as automatically creating unit tests and tests for hallucination. If you are interested, check out our GitHub repository: https://github.com/confident-ai/deepeval. We welcome any PRs and discussions on how to improve LLM performance.
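To illustrate the shape a custom metric can take, here is a minimal, self-contained sketch following the same measure() / is_successful() calling pattern used above. This is a toy keyword-overlap scoring rule with hypothetical names, not deepeval's actual metric interface:

```python
class KeywordOverlapMetric:
    """Toy metric: fraction of query words that appear in the answer."""

    def __init__(self, minimum_score: float = 0.5):
        self.minimum_score = minimum_score
        self.score = None

    def measure(self, output: str, query: str) -> float:
        # Normalize to lowercase word sets, stripping trailing punctuation.
        normalize = lambda text: {w.strip("?.,!").lower() for w in text.split()}
        query_words, output_words = normalize(query), normalize(output)
        # Score is the share of query words covered by the answer.
        self.score = len(query_words & output_words) / max(len(query_words), 1)
        return self.score

    def is_successful(self) -> bool:
        return self.score is not None and self.score >= self.minimum_score


metric = KeywordOverlapMetric(minimum_score=0.5)
metric.measure("The president is Joe Biden.", "Who is the president?")
print(metric.score, metric.is_successful())  # → 0.75 True
```

A real custom metric would typically score with a model rather than word overlap, but the threshold-and-check pattern stays the same.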