This notebook covers how to get started with VDMS as a vector store.
Intel's Visual Data Management System (VDMS) is a storage solution for efficient access to big "visual" data. It aims to achieve cloud scale by searching for relevant visual data via visual metadata stored as a graph, and by enabling machine-friendly enhancements to visual data for faster access. VDMS is licensed under MIT. For more information, see the VDMS project page and the LangChain API reference for the VDMS integration.
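VDMS supports:
  • K-nearest-neighbor search
  • Euclidean distance (L2) and inner product (IP)
  • Libraries for indexing and computing distances: FaissFlat (default), FaissHNSWFlat, FaissIVFFlat, Flinng, TileDBDense, TileDBSparse
  • Embeddings for text, images, and video
  • Vector and metadata searches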
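
To make the two distance options concrete, here is a minimal pure-Python sketch of exhaustive K-nearest-neighbor search, the strategy a flat index such as FaissFlat is based on. The helper names (`l2_distance`, `inner_product`, `knn`) are illustrative only and are not part of VDMS or langchain-vdms. Note the opposite ordering: with L2 a smaller score is a closer match, with IP a larger score is.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner product (IP): larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def knn(query, vectors, k, metric="L2"):
    """Return the indices of the k stored vectors that best match `query`."""
    if metric == "L2":
        scored = [(l2_distance(query, v), i) for i, v in enumerate(vectors)]
        scored.sort(key=lambda pair: pair[0])  # ascending distance
    elif metric == "IP":
        scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # descending score
    else:
        raise ValueError(f"unknown metric: {metric}")
    return [index for _, index in scored[:k]]


# Toy data: vector 1 points the same way as the query but is three times longer.
vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]

print(knn(query, vectors, k=2, metric="L2"))  # [0, 2]
print(knn(query, vectors, k=2, metric="IP"))  # [1, 0]
```

The two metrics disagree here because the vectors have different lengths; on unit-normalized embeddings they produce the same ranking.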

Setup

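To access VDMS vector stores, you need to install the langchain-vdms integration package and deploy a VDMS server via the publicly available Docker image. For simplicity, this notebook deploys a VDMS server on localhost using port 55555.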
pip install -qU "langchain-vdms>=0.1.3"
!docker run --no-healthcheck --rm -d -p 55555:55555 --name vdms_vs_test_nb intellabs/vdms:latest
!sleep 5
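
The fixed five-second sleep assumes the server is ready by then. As a hedged alternative, the sketch below polls the port with the standard library until it accepts TCP connections. `wait_for_port` is a hypothetical helper, not part of langchain-vdms, and an open port is only a proxy for the server being fully ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once (host, port) accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In this notebook it could replace the sleep with a call such as `wait_for_port("localhost", 55555)`, raising an error if it returns False.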

Credentials

You can use VDMS without any credentials. To enable automated tracing of your model calls, set your LangSmith API key:
import getpass
import os
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_TRACING"] = "true"

Initialization

Use the VDMS Client to connect to a VDMS vector store, using the FAISS IndexFlat index (default) and Euclidean distance (default) as the distance metric for similarity search.

! pip install -qU langchain-huggingface
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
from langchain_vdms.vectorstores import VDMS, VDMS_Client

collection_name = "test_collection_faiss_L2"

vdms_client = VDMS_Client(host="localhost", port=55555)

vector_store = VDMS(
    client=vdms_client,
    embedding=embeddings,
    collection_name=collection_name,
    engine="FaissFlat",
    distance_strategy="L2",
)

Manage vector store

Add items to vector store

import logging

logging.basicConfig()
logging.getLogger("langchain_vdms.vectorstores").setLevel(logging.INFO)

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
    id=2,
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
    id=3,
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
    id=4,
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
    id=5,
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
    id=6,
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
    id=7,
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
    id=8,
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
    id=9,
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
    id=10,
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]

doc_ids = [str(i) for i in range(1, 11)]
vector_store.add_documents(documents=documents, ids=doc_ids)
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
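If the same id is provided more than once, add_documents does not check whether the ids are unique. For this reason, use upsert, which deletes any existing entries with the same id before adding.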
vector_store.upsert(documents, ids=doc_ids)
{'succeeded': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
 'failed': []}
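
Because add_documents does not enforce uniqueness, a quick pre-check of the id list can catch duplicates within a single batch before anything is written. The sketch below uses a hypothetical helper, `find_duplicate_ids`, which is not part of langchain-vdms. It only inspects the list you pass in and does not look at ids already stored in the collection, which is what upsert handles.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the ids that occur more than once in `ids`, sorted."""
    counts = Counter(ids)
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


doc_ids = [str(i) for i in range(1, 11)]

print(find_duplicate_ids(doc_ids))               # []
print(find_duplicate_ids(doc_ids + ["3", "7"]))  # ['3', '7']
```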

Update items in vector store

updated_document_1 = Document(
    page_content="I had chocolate chip pancakes and fried eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

updated_document_2 = Document(
    page_content="The weather forecast for tomorrow is sunny and warm, with a high of 82 degrees.",
    metadata={"source": "news"},
    id=2,
)

vector_store.update_documents(
    ids=doc_ids[:2],
    documents=[updated_document_1, updated_document_2],
    batch_size=2,
)

Delete items from vector store

vector_store.delete(ids=[doc_ids[-1]])  # ids expects a list of id strings
True

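Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

Query directly

A simple similarity search can be performed as follows: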
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": ["==", "tweet"]},
)
for doc in results:
    print(f"* ID={doc.id}: {doc.page_content} [{doc.metadata}]")
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0063 seconds
* ID=3: Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* ID=8: LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
If you want to run a similarity search and receive the corresponding scores, you can run:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=1, filter={"source": ["==", "news"]}
)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0460 seconds
* [SIM=0.753577] The weather forecast for tomorrow is sunny and warm, with a high of 82 degrees. [{'source': 'news'}]
If you want to run a similarity search using an embedding vector, you can run:
results = vector_store.similarity_search_by_vector(
    embedding=embeddings.embed_query("I love green eggs and ham!"), k=1
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0044 seconds
* The weather forecast for tomorrow is sunny and warm, with a high of 82 degrees. [{'source': 'news'}]

Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)
results = retriever.invoke("Stealing from the bank is a crime")
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0042 seconds
* Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]
* The stock market is down 500 points today due to fears of a recession. [{'source': 'news'}]
* Is the new iPhone worth the price? Read this review to find out. [{'source': 'website'}]
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 1,
        "score_threshold": 0.0,  # >= score_threshold
    },
)
results = retriever.invoke("Stealing from the bank is a crime")
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0042 seconds
* Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 1, "fetch_k": 10},
)
results = retriever.invoke(
    "Stealing from the bank is a crime", filter={"source": ["==", "news"]}
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
INFO:langchain_vdms.vectorstores:VDMS mmr search took 0.0042 secs
* Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]
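
The "mmr" search type stands for maximal marginal relevance: fetch_k candidates are retrieved first, then k of them are chosen to balance relevance to the query against redundancy with results already picked. The following is a self-contained sketch of that selection step using cosine similarity. It illustrates the general algorithm and is not the langchain-vdms implementation; the function names and the toy vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query, candidates, k, lambda_mult=0.5):
    """Greedily pick k candidate indices by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]
```

With lambda_mult=0.5 the near-duplicate is skipped in favor of the more distinct candidate, which is the behavior MMR is designed to provide.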

Delete collection

Previously, we removed documents based on their id. Here, all documents are removed because no ID is provided.
print("Documents before deletion: ", vector_store.count())

vector_store.delete(collection_name=collection_name)

print("Documents after deletion: ", vector_store.count())
Documents before deletion:  10
Documents after deletion:  0

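Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the LangChain RAG documentation.

Similarity search using other engines

VDMS supports multiple libraries for indexing and computing distances: FaissFlat (the default), FaissHNSWFlat, FaissIVFFlat, Flinng, TileDBDense, and TileDBSparse. The examples below use some of the other engines.

Similarity search using Faiss HNSWFlat and Euclidean distance

Here, we add the documents to VDMS using the Faiss IndexHNSWFlat index and L2 as the distance metric for similarity search. We search for three documents (k=3) related to a query and return each document along with its score.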
db_FaissHNSWFlat = VDMS.from_documents(
    documents,
    client=vdms_client,
    ids=doc_ids,
    collection_name="my_collection_FaissHNSWFlat_L2",
    embedding=embeddings,
    engine="FaissHNSWFlat",
    distance_strategy="L2",
)
# Query
k = 3
query = "LangChain provides abstractions to make working with LLMs easy"
docs_with_score = db_FaissHNSWFlat.similarity_search_with_score(query, k=k, filter=None)

for res, score in docs_with_score:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
INFO:langchain_vdms.vectorstores:Descriptor set my_collection_FaissHNSWFlat_L2 created
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.1272 seconds
* [SIM=0.716791] Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* [SIM=0.936718] LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
* [SIM=1.834110] Is the new iPhone worth the price? Read this review to find out. [{'source': 'website'}]

Similarity search using Faiss IVFFlat and inner product (IP) distance

Here, we add the documents to VDMS using the Faiss IndexIVFFlat index and IP as the distance metric for similarity search. We search for three documents (k=3) related to a query and return each document along with its score.
db_FaissIVFFlat = VDMS.from_documents(
    documents,
    client=vdms_client,
    ids=doc_ids,
    collection_name="my_collection_FaissIVFFlat_IP",
    embedding=embeddings,
    engine="FaissIVFFlat",
    distance_strategy="IP",
)

k = 3
query = "LangChain provides abstractions to make working with LLMs easy"
docs_with_score = db_FaissIVFFlat.similarity_search_with_score(query, k=k, filter=None)
for res, score in docs_with_score:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
INFO:langchain_vdms.vectorstores:Descriptor set my_collection_FaissIVFFlat_IP created
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0052 seconds
* [SIM=0.641605] Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* [SIM=0.531641] LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
* [SIM=0.082945] Is the new iPhone worth the price? Read this review to find out. [{'source': 'website'}]
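
The FaissHNSWFlat/L2 and FaissIVFFlat/IP runs return the same three documents for the same query, and their scores appear to be linked. For unit-length vectors a and b, the squared Euclidean distance satisfies ||a - b||^2 = 2 - 2(a · b). The sketch below checks that identity against the scores printed in the two outputs. It rests on two assumptions not stated in this notebook: that the embeddings are unit-normalized, and that the reported L2 score is the squared distance.

```python
# (inner product score, L2 score) pairs copied from the two outputs above,
# in the same document order.
score_pairs = [
    (0.641605, 0.716791),
    (0.531641, 0.936718),
    (0.082945, 1.834110),
]


def squared_l2_from_ip(ip):
    """Squared L2 distance between two unit vectors with inner product `ip`."""
    return 2.0 - 2.0 * ip


for ip, l2 in score_pairs:
    predicted = squared_l2_from_ip(ip)
    print(f"IP={ip:.6f}  2-2*IP={predicted:.6f}  reported L2={l2:.6f}")
```

The predicted and reported values agree to within rounding, which is why a lower L2 score and a higher IP score both indicate a closer match here.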

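Similarity search using FLINNG and IP distance

In this section, we add the documents to VDMS using the Filters to Identify Near-Neighbor Groups (FLINNG) index and IP as the distance metric for similarity search. We search for three documents (k=3) related to a query and return each document along with its score.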
db_Flinng = VDMS.from_documents(
    documents,
    client=vdms_client,
    ids=doc_ids,
    collection_name="my_collection_Flinng_IP",
    embedding=embeddings,
    engine="Flinng",
    distance_strategy="IP",
)
# Query
k = 3
query = "LangChain provides abstractions to make working with LLMs easy"
docs_with_score = db_Flinng.similarity_search_with_score(query, k=k, filter=None)
for res, score in docs_with_score:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
INFO:langchain_vdms.vectorstores:Descriptor set my_collection_Flinng_IP created
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0042 seconds
* [SIM=0.000000] I had chocolate chip pancakes and scrambled eggs for breakfast this morning. [{'source': 'tweet'}]
* [SIM=0.000000] I had chocolate chip pancakes and scrambled eggs for breakfast this morning. [{'source': 'tweet'}]
* [SIM=0.000000] I had chocolate chip pancakes and scrambled eggs for breakfast this morning. [{'source': 'tweet'}]

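Filtering on metadata

It can be helpful to narrow down a collection before working with it. For example, collections can be filtered on metadata using the get_by_constraints method, which takes a dictionary of constraints. Here we retrieve the document where langchain_id = "2" and remove it from the vector store. NOTE: id is generated as additional metadata as an integer, whereas langchain_id (the internal ID) is a unique string for each entry.
response, response_array = db_FaissIVFFlat.get_by_constraints(
    db_FaissIVFFlat.collection_name,
    limit=1,
    include=["metadata", "embeddings"],
    constraints={"langchain_id": ["==", "2"]},
)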

# Delete id=2
db_FaissIVFFlat.delete(collection_name=db_FaissIVFFlat.collection_name, ids=["2"])

print("Deleted entry:")
for doc in response:
    print(f"* ID={doc.id}: {doc.page_content} [{doc.metadata}]")
Deleted entry:
* ID=2: The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
response, response_array = db_FaissIVFFlat.get_by_constraints(
    db_FaissIVFFlat.collection_name,
    include=["metadata"],
)
for doc in response:
    print(f"* ID={doc.id}: {doc.page_content} [{doc.metadata}]")
* ID=10: I have a bad feeling I am going to get deleted :( [{'source': 'tweet'}]
* ID=9: The stock market is down 500 points today due to fears of a recession. [{'source': 'news'}]
* ID=8: LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
* ID=7: The top 10 soccer players in the world right now. [{'source': 'website'}]
* ID=6: Is the new iPhone worth the price? Read this review to find out. [{'source': 'website'}]
* ID=5: Wow! That was an amazing movie. I can't wait to see it again. [{'source': 'tweet'}]
* ID=4: Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]
* ID=3: Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* ID=1: I had chocolate chip pancakes and scrambled eggs for breakfast this morning. [{'source': 'tweet'}]
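Here we filter on the source metadata field and retrieve only the entries whose source is news. Because id is stored as an integer, it can be used in the same way to filter for a range of IDs.
response, response_array = db_FaissIVFFlat.get_by_constraints(
    db_FaissIVFFlat.collection_name,
    include=["metadata", "embeddings"],
    constraints={"source": ["==", "news"]},
)
for doc in response:
    print(f"* ID={doc.id}: {doc.page_content} [{doc.metadata}]")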
* ID=9: The stock market is down 500 points today due to fears of a recession. [{'source': 'news'}]
* ID=4: Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]
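
As a mental model for the constraint dictionaries used in this section, the sketch below evaluates them against plain Python dictionaries. It assumes each constraint value is a flat list of operator/value pairs, so that a range can be written as `[">=", 5, "<=", 9]`. That range form is an assumption about the constraint syntax, and the `matches` helper is hypothetical; the real filtering happens inside the VDMS server.

```python
import operator

# Comparison operators assumed to be available in constraint lists
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(metadata, constraints):
    """Return True if `metadata` satisfies every constraint."""
    for key, spec in constraints.items():
        if key not in metadata:
            return False
        # spec is a flat list of (operator, value) pairs
        for op, value in zip(spec[0::2], spec[1::2]):
            if not OPS[op](metadata[key], value):
                return False
    return True


entries = [
    {"id": 4, "source": "news"},
    {"id": 7, "source": "website"},
    {"id": 9, "source": "news"},
]

news_ids = [e["id"] for e in entries if matches(e, {"source": ["==", "news"]})]
range_ids = [e["id"] for e in entries if matches(e, {"id": [">=", 5, "<=", 9]})]

print(news_ids)   # [4, 9]
print(range_ids)  # [7, 9]
```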

Stop VDMS server

!docker kill vdms_vs_test_nb
vdms_vs_test_nb