Amazon DocumentDB(兼容 MongoDB) 使您可以轻松地在云中设置、操作和扩展兼容 MongoDB 的数据库。 使用 Amazon DocumentDB,您可以运行与 MongoDB 相同的应用程序代码,并使用与 MongoDB 相同的驱动程序和工具。 Amazon DocumentDB 的向量搜索将基于 JSON 的文档数据库的灵活性和丰富查询能力与向量搜索的强大功能相结合。本 Notebook 展示如何使用 Amazon DocumentDB 向量搜索 在集合中存储文档、创建索引,并使用近似最近邻算法(如”cosine”、“euclidean”和”dotProduct”)执行向量搜索查询。默认情况下,DocumentDB 会创建层次可导航小世界(HNSW)索引。要了解其他支持的向量索引类型,请参阅上方链接的文档。 要使用 DocumentDB,您必须首先部署一个集群。请参阅开发者指南了解更多详情。 免费注册,立即开始使用。
Copy
!pip install pymongo
Copy
import getpass
# DocumentDB 连接字符串
# 例如:"mongodb://{username}:{pass}@{cluster_endpoint}:{port}/?{params}"
CONNECTION_STRING = getpass.getpass("DocumentDB 集群 URI:")
INDEX_NAME = "izzy-test-index"
NAMESPACE = "izzy_test_db.izzy_test_collection"
DB_NAME, COLLECTION_NAME = NAMESPACE.split(".")
OpenAIEmbeddings,因此需要设置 OpenAI 环境变量。
Copy
import getpass
import os
# 设置 OpenAI 环境变量
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
os.environ["OPENAI_EMBEDDINGS_DEPLOYMENT"] = (
"smart-agent-embedding-ada" # 嵌入模型的部署名称
)
os.environ["OPENAI_EMBEDDINGS_MODEL_NAME"] = "text-embedding-ada-002" # 模型名称
Copy
from langchain.vectorstores.documentdb import (
DocumentDBSimilarityType,
DocumentDBVectorSearch,
)
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
SOURCE_FILE_NAME = "../../how_to/state_of_the_union.txt"
loader = TextLoader(SOURCE_FILE_NAME)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# OpenAI 设置
model_deployment = os.getenv(
"OPENAI_EMBEDDINGS_DEPLOYMENT", "smart-agent-embedding-ada"
)
model_name = os.getenv("OPENAI_EMBEDDINGS_MODEL_NAME", "text-embedding-ada-002")
openai_embeddings: [`OpenAIEmbeddings`](https://reference.langchain.com/python/langchain-openai/embeddings/base/OpenAIEmbeddings) = OpenAIEmbeddings(
deployment=model_deployment, model=model_name
)
Copy
from pymongo import MongoClient
INDEX_NAME = "izzy-test-index-2"
NAMESPACE = "izzy_test_db.izzy_test_collection"
DB_NAME, COLLECTION_NAME = NAMESPACE.split(".")
client: MongoClient = MongoClient(CONNECTION_STRING)
collection = client[DB_NAME][COLLECTION_NAME]
model_deployment = os.getenv(
"OPENAI_EMBEDDINGS_DEPLOYMENT", "smart-agent-embedding-ada"
)
model_name = os.getenv("OPENAI_EMBEDDINGS_MODEL_NAME", "text-embedding-ada-002")
vectorstore = DocumentDBVectorSearch.from_documents(
documents=docs,
embedding=openai_embeddings,
collection=collection,
index_name=INDEX_NAME,
)
# 上述模型使用的维度数
dimensions = 1536
# 指定相似度算法,有效选项为:
# cosine (COS)、euclidean (EUC)、dotProduct (DOT)
similarity_algorithm = DocumentDBSimilarityType.COS
vectorstore.create_index(dimensions, similarity_algorithm)
Copy
{ 'createdCollectionAutomatically' : false,
'numIndexesBefore' : 1,
'numIndexesAfter' : 2,
'ok' : 1,
'operationTime' : Timestamp(1703656982, 1)}
Copy
# 在查询的嵌入与文档的嵌入之间执行相似度搜索
query = "What did the President say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(query)
Copy
print(docs[0].page_content)
Copy
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you're at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I'd like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation's top legal minds, who will continue Justice Breyer's legacy of excellence.
Copy
vectorstore = DocumentDBVectorSearch.from_connection_string(
connection_string=CONNECTION_STRING,
namespace=NAMESPACE,
embedding=openai_embeddings,
index_name=INDEX_NAME,
)
# 在查询与已导入文档之间执行相似度搜索
query = "What did the president say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(query)
Copy
print(docs[0].page_content)
Copy
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you're at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I'd like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation's top legal minds, who will continue Justice Breyer's legacy of excellence.
Copy
# 在查询与已导入文档之间执行相似度搜索
query = "Which stats did the President share about the U.S. economy"
docs = vectorstore.similarity_search(query)
Copy
print(docs[0].page_content)
Copy
And unlike the $2 Trillion tax cut passed in the previous administration that benefitted the top 1% of Americans, the American Rescue Plan helped working people—and left no one behind.
And it worked. It created jobs. Lots of jobs.
In fact—our economy created over 6.5 Million new jobs just last year, more jobs created in one year
than ever before in the history of America.
Our economy grew at a rate of 5.7% last year, the strongest growth in nearly 40 years, the first step in bringing fundamental change to an economy that hasn't worked for the working people of this nation for too long.
For the past 40 years we were told that if we gave tax breaks to those at the very top, the benefits would trickle down to everyone else.
But that trickle-down theory led to weaker economic growth, lower wages, bigger deficits, and the widest gap between those at the top and everyone else in nearly a century.
问答
Copy
qa_retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 25},
)
Copy
from langchain_core.prompts import PromptTemplate
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
"""
PROMPT = PromptTemplate(
template=prompt_template, input_variables=["context", "question"]
)
Copy
from langchain_classic.chains import RetrievalQA
from langchain_openai import OpenAI
qa = RetrievalQA.from_chain_type(
llm=OpenAI(),
chain_type="stuff",
retriever=qa_retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": PROMPT},
)
docs = qa({"query": "gpt-4 compute requirements"})
print(docs["result"])
print(docs["source_documents"])
通过 MCP 将这些文档连接到 Claude、VSCode 等,获取实时答案。

