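RAGatouille makes it as simple as possible to use ColBERT! ColBERT is a fast and accurate retrieval model that enables scalable BERT-based search over large text collections in tens of milliseconds. See the paper ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.
There are multiple ways to use RAGatouille.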

Setup

The integration lives in the ragatouille package.
pip install -U ragatouille
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
[Jan 10, 10:53:28] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
/Users/harrisonchase/.pyenv/versions/3.10.1/envs/langchain/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(

Retriever

We can use RAGatouille as a retriever. For more information, see the RAGatouille Retriever page.

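Document Compressor

We can also use RAGatouille off the shelf as a reranker. This lets us use ColBERT to rerank the results returned by any generic retriever. The benefit is that reranking works on top of any existing index, so we don't need to build a new one.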
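To make the reranking idea concrete, here is a minimal sketch of the late-interaction (MaxSim) scoring that the ColBERT papers describe: each query token is matched to its most similar document token, and those per-token maxima are summed. This is not RAGatouille's implementation. The helper names (`dot`, `maxsim_score`, `rerank`) and the two-dimensional token vectors are made up for illustration; a real model produces one learned embedding per token.

```python
# Toy illustration of ColBERT-style late interaction (MaxSim) scoring.
# All vectors are hand-picked for illustration, not model outputs.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(query_vecs, candidates):
    """Score (name, doc_vecs) pairs with MaxSim and sort them best first."""
    scored = [(name, maxsim_score(query_vecs, vecs)) for name, vecs in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Two query "tokens" and two candidate documents with two "tokens" each.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("doc_a", [[1.0, 0.0], [0.5, 0.5]]),
    ("doc_b", [[0.0, 1.0], [1.0, 0.0]]),
]

ranking = rerank(query, candidates)
print(ranking)
```

Because the score only needs the candidate texts and the query, the candidates can come from any first-stage retriever, which is why no new index is required.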

Setup Vanilla Retriever

First, let's set up a vanilla retriever as an example.
import requests
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as a raw string,
        or None if the response contains no extract.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extract the page content (the "pages" mapping is keyed by page ID)
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None


text = get_wikipedia_page("Hayao_Miyazaki")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.create_documents([text])
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 10}
)
docs = retriever.invoke("What animation studio did Miyazaki found")
docs[0]
Document(page_content='collaborative projects. In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.')
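We can see that the top result isn't very relevant to the question asked: it mentions Miyazaki's office, Nibariki, not the animation studio he founded.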

Using ColBERT as a reranker

from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(
    base_compressor=RAG.as_langchain_document_compressor(), base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What animation studio did Miyazaki found"
)
/Users/harrisonchase/.pyenv/versions/3.10.1/envs/langchain/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
compressed_docs[0]
Document(page_content='In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates". Some of the architecture in the film was also inspired by a Welsh mining town; Miyazaki witnessed the mining strike upon his first', metadata={'relevance_score': 26.5194149017334})
This answer is much more relevant!
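The reranked document above carries its score in `metadata["relevance_score"]`. As a rough sketch of how that score could be used downstream, the snippet below sorts documents by it and drops low-scoring ones. The `Doc` class is a hypothetical stand-in for a LangChain `Document`, `top_by_relevance` is an illustrative helper, and the scores and the threshold of 15.0 are invented. ColBERT scores are unnormalized sums rather than probabilities, so any cutoff would need tuning on real data.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def top_by_relevance(docs, min_score):
    """Keep docs whose relevance_score is at least min_score, best first."""
    kept = [
        d for d in docs
        if d.metadata.get("relevance_score", float("-inf")) >= min_score
    ]
    return sorted(kept, key=lambda d: d.metadata["relevance_score"], reverse=True)


# Made-up scores for illustration only.
docs = [
    Doc("chunk about Nibariki", {"relevance_score": 18.2}),
    Doc("chunk about Studio Ghibli", {"relevance_score": 26.5}),
    Doc("unrelated chunk", {"relevance_score": 9.1}),
]

best = top_by_relevance(docs, min_score=15.0)
for d in best:
    print(d.metadata["relevance_score"], d.page_content)
```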