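Activeloop Deep Memory is a suite of tools that lets you optimize your vector store for your use case and achieve higher accuracy in your LLM apps.
Retrieval-Augmented Generation (RAG) has recently gained significant attention. As advanced RAG techniques and agents emerge, they widen the range of what RAG can accomplish. However, several challenges can limit the integration of RAG into production. The primary factors to consider when implementing RAG in production are accuracy (recall), cost, and latency. For basic use cases, OpenAI's Ada model paired with a naive similarity search can produce satisfactory results. For searches that need higher accuracy or recall, advanced retrieval techniques may be required. These methods can involve varying data chunk sizes, rewriting queries multiple times, and more, all of which can increase latency and cost. Activeloop's Deep Memory, a feature available to Activeloop Deep Lake users, addresses these issues by introducing a tiny neural network layer trained to match user queries with relevant data from a corpus. While this addition incurs minimal latency during search, it can boost retrieval accuracy by up to 27%, and it remains cost-effective and simple to use without requiring any additional advanced RAG techniques. In this tutorial we will parse the DeepLake documentation and create a RAG system that can answer questions from the docs.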
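To make the idea of a small trained layer in front of the similarity search more concrete, here is a toy sketch. It assumes the layer is a single linear map applied to the query embedding before cosine similarity; the actual Deep Memory architecture is not described in this tutorial, and the two-dimensional embeddings, the `weights` matrix, and the helper names are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def apply_layer(weights, vector):
    """Multiply a vector by a weight matrix (one linear layer, no bias)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def rank(query, corpus):
    """Return document IDs ordered from most to least similar to the query."""
    return sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]), reverse=True)


# Made-up 2-D embeddings; real embeddings have hundreds of dimensions.
corpus = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}
query = [0.8, 0.6]  # the raw query embedding sits closer to doc_a

# Hypothetical "learned" weights; a real layer would be fitted on
# (query, relevant document) pairs rather than written by hand.
weights = [[0.0, 1.0], [1.0, 0.0]]

baseline = rank(query, corpus)
adapted = rank(apply_layer(weights, query), corpus)
print("without layer:", baseline)
print("with layer:   ", adapted)
```

The corpus embeddings stay untouched in this sketch; only the query is transformed, which is why such a layer can be added to an existing vector store without re-indexing.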

1. Dataset Creation

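In this tutorial we will parse Activeloop's documentation using the BeautifulSoup library and LangChain's document parsers, such as Html2TextTransformer and AsyncHtmlLoader. We therefore need to install the following libraries:
pip install -qU tiktoken langchain-openai langchain-community langchain-classic python-dotenv datasets langchain deeplake beautifulsoup4 html2text ragas
You also need to create an Activeloop account.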
ORG_ID = "..."
from langchain_classic.chains import RetrievalQA
from langchain_community.vectorstores import DeepLake
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API token: ")
# An Activeloop token is needed if you are not signed in through the CLI: `activeloop login -u <USERNAME> -p <PASSWORD>`
if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass(
        "Enter your ActiveLoop API token: "
    )  # Get your API token from https://app.activeloop.ai: click your profile picture in the top right corner and select "API Tokens"

token = os.getenv("ACTIVELOOP_TOKEN")
openai_embeddings = OpenAIEmbeddings()
db = DeepLake(
    dataset_path=f"hub://{ORG_ID}/deeplake-docs-deepmemory",  # ORG_ID is your Activeloop username or organization name
    embedding=openai_embeddings,
    runtime={"tensor_db": True},
    token=token,
    # overwrite=True,  # set the overwrite flag if you want to overwrite the whole dataset
    read_only=False,
)
Parse all links on the web page using BeautifulSoup:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_all_links(url):
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {url}")
        return []

    soup = BeautifulSoup(response.content, "html.parser")

    # Find all 'a' tags, which carry their link target in the href attribute
    links = [
        urljoin(url, a["href"]) for a in soup.find_all("a", href=True) if a["href"]
    ]

    return links


base_url = "https://docs.deeplake.ai/en/latest/"
all_links = get_all_links(base_url)
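A list collected this way can contain duplicates, in-page anchors (`#section`), and links to other sites, all of which would add noise to the corpus. The sketch below shows one way to filter such a list with the standard library. `filter_internal_links` is a hypothetical helper that is not part of the original tutorial, and the sample links are made up.

```python
from urllib.parse import urldefrag, urlparse


def filter_internal_links(links, base_url):
    """Keep unique, fragment-free links that live on the same host as base_url."""
    base_host = urlparse(base_url).netloc
    seen = set()
    kept = []
    for link in links:
        clean, _fragment = urldefrag(link)  # drop "#section" anchors
        if urlparse(clean).netloc != base_host:
            continue  # skip links that point to other sites (or mailto: etc.)
        if clean not in seen:
            seen.add(clean)
            kept.append(clean)
    return kept


# Made-up sample input in the shape returned by get_all_links
sample_base = "https://docs.deeplake.ai/en/latest/"
sample_links = [
    sample_base,
    sample_base + "#contents",
    sample_base + "page.html",
    "https://example.com/elsewhere",
]
filtered = filter_internal_links(sample_links, sample_base)
print(filtered)
```

In the tutorial's flow, the filtered list could be passed to the loader in place of `all_links`.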
Load the data:
from langchain_community.document_loaders.async_html import AsyncHtmlLoader

loader = AsyncHtmlLoader(all_links)
docs = loader.load()
Convert the data into a human-readable format:
from langchain_community.document_transformers import Html2TextTransformer

html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)
Now, let us chunk the documents further, since some of them contain too much text:
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 4096
docs_new = []

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
)

for doc in docs_transformed:
    if len(doc.page_content) < chunk_size:
        docs_new.append(doc)
    else:
        docs = text_splitter.create_documents([doc.page_content])
        docs_new.extend(docs)
Populate the vector store:
docs = db.add_documents(docs_new)

2. Generating synthetic queries and training Deep Memory

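The next step is to train a deep_memory model that aligns your users' queries with the dataset you already have. If you do not have any user queries yet, no worries: we will generate them with an LLM!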

TODO: Add image

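The overall deep_memory workflow is as follows: to train it, you need relevance, queries, and corpus data (the data we want to query). The corpus data was already populated in the previous section; here we will generate the questions and the relevance.
  1. questions - a list of strings, where each string is one query.
  2. relevance - links each question to its ground truth. Several documents may contain the answer to a given question, so relevance is a List[List[tuple[str, float]]]: the outer list corresponds to the queries and each inner list to the relevant documents. Each tuple is a (str, float) pair, where the string is the ID of the source document (it corresponds to the id tensor in the dataset) and the float indicates how strongly that document relates to the question.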
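As a concrete illustration of the two structures described above, the sketch below builds a tiny `questions` list and the matching `relevance` list, then checks that their shapes agree. The questions, the document IDs, and the `validate_training_data` helper are made up for illustration; real IDs come from the dataset's id tensor.

```python
from typing import List, Tuple

Relevance = List[List[Tuple[str, float]]]


def validate_training_data(questions: List[str], relevance: Relevance) -> None:
    """Raise if questions and relevance do not have the expected shape."""
    if len(questions) != len(relevance):
        raise ValueError("each question needs exactly one relevance entry")
    for entry in relevance:
        if not entry:
            raise ValueError("each question needs at least one relevant document")
        for doc_id, score in entry:
            if not isinstance(doc_id, str):
                raise TypeError("document IDs must be strings")
            if not isinstance(score, (int, float)):
                raise TypeError("relevance scores must be numbers")


# Made-up example: the second question is answered by two documents,
# one of them only partially.
questions = ["What is a tensor?", "How do I create a dataset?"]
relevance = [
    [("id-001", 1.0)],
    [("id-002", 1.0), ("id-007", 0.5)],
]

validate_training_data(questions, relevance)
print(len(questions), "questions with", sum(len(r) for r in relevance), "labels")
```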
Now, let us generate synthetic questions and relevance:
from typing import List

from langchain_classic.chains.openai_functions import (
    create_structured_output_chain,
)
from langchain.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
# Fetch the dataset docs and IDs if they exist (optional; you can also ingest directly)
docs = db.vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)["value"]
ids = db.vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)["value"]
# If we pass in a model explicitly, we need to make sure it supports the OpenAI function-calling API.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)


class Questions(BaseModel):
    """Structure for identifying relevant information."""

    question: str = Field(..., description="Questions about text")


prompt_msgs = [
    SystemMessage(
        content="You are a world class expert for generating questions based on provided context. \
                You make sure the question can be answered by the text."
    ),
    HumanMessagePromptTemplate.from_template(
        "Use the given text to generate a question from the following input: {input}"
    ),
    HumanMessage(content="Tips: Make sure to answer in the correct format"),
]
prompt = ChatPromptTemplate(messages=prompt_msgs)
chain = create_structured_output_chain(Questions, llm, prompt, verbose=True)

text = "# Understanding Hallucinations and Bias ## **Introduction** In this lesson, we'll cover the concept of **hallucinations** in LLMs, highlighting their influence on AI applications and demonstrating how to mitigate them using techniques like the retriever's architectures. We'll also explore **bias** within LLMs with examples."
questions = chain.run(input=text)
print(questions)
import random

from langchain_openai import OpenAIEmbeddings
from tqdm import tqdm


def generate_queries(docs: List[str], ids: List[str], n: int = 100):
    questions = []
    relevances = []
    pbar = tqdm(total=n)
    while len(questions) < n:
        # 1. Randomly draw a piece of text and its relevance ID
        r = random.randint(0, len(docs) - 1)
        text, label = docs[r], ids[r]

        # 2. Generate a query and assign the relevance ID
        generated_qs = [chain.run(input=text).question]
        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
        pbar.update(len(generated_qs))
        if len(questions) % 10 == 0:
            print(f"q: {len(questions)}")
    return questions[:n], relevances[:n]


chain = create_structured_output_chain(Questions, llm, prompt, verbose=False)
questions, relevances = generate_queries(docs, ids, n=200)

train_questions, train_relevances = questions[:100], relevances[:100]
test_questions, test_relevances = questions[100:], relevances[100:]
We have now created 100 training queries and 100 test queries. Let us train deep_memory:
job_id = db.vectorstore.deep_memory.train(
    queries=train_questions,
    relevance=train_relevances,
)
Let us track the training progress using the job ID returned by the training call:
db.vectorstore.deep_memory.status(job_id)
--------------------------------------------------------------
|                  6538e02ecda4691033a51c5b                  |
--------------------------------------------------------------
| status                     | completed                     |
--------------------------------------------------------------
| progress                   | eta: 1.4 seconds              |
|                            | recall@10: 79.00% (+34.00%)   |
--------------------------------------------------------------
| results                    | recall@10: 79.00% (+34.00%)   |
--------------------------------------------------------------

3. Evaluating Deep Memory performance

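Great, we have trained the model! It shows a substantial improvement in recall, but how can we use it now and evaluate it on new, unseen data? In this section we look at model evaluation and inference, and see how Deep Memory can be used with LangChain to increase retrieval accuracy.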

3.1 Deep Memory evaluation

To begin with, we can use deep_memory's built-in evaluation method. It computes several recall metrics and takes only a few lines of code:
recall = db.vectorstore.deep_memory.evaluate(
    queries=test_questions,
    relevance=test_relevances,
)
Embedding queries took 0.81 seconds
---- Evaluating without model ----
Recall@1:   9.0%
Recall@3:   19.0%
Recall@5:   24.0%
Recall@10:   42.0%
Recall@50:   93.0%
Recall@100:   98.0%
---- Evaluating with model ----
Recall@1:   19.0%
Recall@3:   42.0%
Recall@5:   49.0%
Recall@10:   69.0%
Recall@50:   97.0%
Recall@100:   97.0%
It shows a substantial improvement on the unseen test dataset too!
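For readers who want to see what a Recall@k figure measures, here is a minimal sketch of the metric under its common definition: for each query, the share of relevant documents that appear among the top k retrieved IDs, averaged over all queries. How `deep_memory.evaluate` computes it internally is not stated in this tutorial, and `recall_at_k` and the sample rankings are made up for illustration.

```python
def recall_at_k(ranked_ids, relevance, k):
    """Average share of relevant documents found in the top-k results per query."""
    total = 0.0
    for ranked, rel in zip(ranked_ids, relevance):
        relevant = {doc_id for doc_id, _score in rel}
        hits = relevant.intersection(ranked[:k])
        total += len(hits) / len(relevant)
    return total / len(relevance)


# Made-up retrieval results for two queries, best match first
ranked_ids = [
    ["d3", "d1", "d9"],
    ["d4", "d5", "d2"],
]
# One relevant document per query, as in the synthetic data generated earlier
relevance = [
    [("d1", 1.0)],
    [("d2", 1.0)],
]

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(ranked_ids, relevance, k):.1%}")
```

With exactly one relevant document per query, as in this tutorial's synthetic data, the metric reduces to the fraction of queries whose source document shows up in the top k.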

3.2 Deep Memory + RAGas

from ragas.langchain import RagasEvaluatorChain
from ragas.metrics import (
    context_recall,
)
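Let us convert the relevance labels into ground truths:
def convert_relevance_to_ground_truth(docs, ids, relevance):
    # Relevance stores document IDs, not list positions, so look texts up by ID
    text_by_id = dict(zip(ids, docs))
    ground_truths = []

    for rel in relevance:
        ground_truth = []
        for doc_id, _ in rel:
            ground_truth.append(text_by_id[doc_id])
        ground_truths.append(ground_truth)
    return ground_truths
ground_truths = convert_relevance_to_ground_truth(docs, ids, test_relevances)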

for deep_memory in [False, True]:
    print("\nEvaluating with deep_memory =", deep_memory)
    print("===================================")

    retriever = db.as_retriever()
    retriever.search_kwargs["deep_memory"] = deep_memory

    qa_chain = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-3.5-turbo"),
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
    )

    metrics = {
        "context_recall_score": 0,
    }

    eval_chains = {m.name: RagasEvaluatorChain(metric=m) for m in [context_recall]}

    for question, ground_truth in zip(test_questions, ground_truths):
        result = qa_chain({"query": question})
        result["ground_truths"] = ground_truth
        for name, eval_chain in eval_chains.items():
            score_name = f"{name}_score"
            metrics[score_name] += eval_chain(result)[score_name]

    for metric in metrics:
        metrics[metric] /= len(test_questions)
        print(f"{metric}: {metrics[metric]}")
    print("===================================")
Evaluating with deep_memory = False
===================================
context_recall_score = 0.3763423145
===================================

Evaluating with deep_memory = True
===================================
context_recall_score = 0.5634545323
===================================

3.3 Deep Memory inference

TODO: Add image

With deep_memory:
retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = True
retriever.search_kwargs["k"] = 10

query = "Deamination of cytidine to uridine on the minus strand of viral DNA results in catastrophic G-to-A mutations in the viral genome."
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"), chain_type="stuff", retriever=retriever
)
print(qa.run(query))
The base htype of the 'video_seq' tensor is 'video'.
Without deep_memory:
retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = False
retriever.search_kwargs["k"] = 10

query = "Deamination of cytidine to uridine on the minus strand of viral DNA results in catastrophic G-to-A mutations in the viral genome."
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"), chain_type="stuff", retriever=retriever
)
print(qa.run(query))
The text does not provide information on the base htype of the 'video_seq' tensor.

3.4 Deep Memory cost savings

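Deep Memory increases retrieval accuracy without altering your existing workflow. In addition, by reducing the top_k passed into the LLM, you can significantly cut inference costs through lower token usage.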
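The effect of a smaller top_k on prompt size can be estimated with simple arithmetic. The sketch below assumes every retrieved chunk contributes about the same number of tokens and that the context dominates the prompt. The figure of 1,000 tokens per chunk and the drop from k=10 to k=4 are made-up numbers for illustration, not measurements from this tutorial.

```python
def prompt_tokens(top_k, tokens_per_chunk, overhead_tokens=0):
    """Rough prompt size: retrieved chunks plus fixed instructions and question."""
    return top_k * tokens_per_chunk + overhead_tokens


def relative_saving(k_before, k_after, tokens_per_chunk, overhead_tokens=0):
    """Fraction of prompt tokens saved by lowering top_k."""
    before = prompt_tokens(k_before, tokens_per_chunk, overhead_tokens)
    after = prompt_tokens(k_after, tokens_per_chunk, overhead_tokens)
    return 1 - after / before


# Assumed values; measure your own chunks with a tokenizer for real estimates.
tokens_per_chunk = 1000
saving = relative_saving(k_before=10, k_after=4, tokens_per_chunk=tokens_per_chunk)
print(f"Estimated prompt-token saving: {saving:.0%}")
```

Whether a smaller top_k is acceptable depends on recall at that k, which is why the evaluation in section 3.1 is worth running at the k you intend to use.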