This page shows how to use Google Vertex AI Vector Search as a vector store in LangChain.

Overview

Google Vertex AI Vector Search is a fully managed, high-scale, low-latency vector similarity search solution that supports both exact and approximate nearest neighbor (ANN) search using Google's ScaNN (Scalable Nearest Neighbors) technology. Vertex AI Vector Search comes in two versions:
  • Vector Search 2.0: stores Data Objects containing vectors, metadata, and content in Collections, offering a unified data model and simpler, faster operations.
  • Vector Search 1.0: uses Indexes deployed to Endpoints, with documents stored separately in Google Cloud Storage or Datastore.
Choose the section below that matches the version you use. For migrating from Vertex AI Vector Search 1.0 to 2.0, see the migration guide.

Installation

Install the LangChain Google Vertex AI integration package:
pip install -U langchain-google-vertexai

Vector Search 2.0

Vector Search 2.0 stores Data Objects in Collections. Each Data Object contains the vector, metadata, and content in a unified structure.

Prerequisites

  • A Google Cloud project with the Vertex AI API and Vector Search API enabled
    gcloud services enable vectorsearch.googleapis.com aiplatform.googleapis.com --project "{PROJECT_ID}"

  • An existing Vector Search Collection (see Creating a Collection below)
  • Appropriate IAM permissions (the Vertex AI User role or equivalent)

Creating a Collection (V2)

Before using Vector Search 2.0, you first need to create a Collection. Here is how to create one that is compatible with LangChain:
from google.cloud import vectorsearch_v1beta

# Configuration
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
COLLECTION_ID = "langchain-test-collection"

# Create the Vector Search service client
vector_search_service_client = vectorsearch_v1beta.VectorSearchServiceClient()

# Create the collection with schema compatible with LangChain
# IMPORTANT: To enable filtering, you must define filterable fields in data_schema.properties
request = vectorsearch_v1beta.CreateCollectionRequest(
    parent=f"projects/{PROJECT_ID}/locations/{LOCATION}",
    collection_id=COLLECTION_ID,
    collection={
        "display_name": "LangChain Test Collection",
        "description": "Collection for testing LangChain VectorSearchVectorStore with filtering",
        "data_schema": {
            "type": "object",
            "properties": {
                # Define fields you want to filter on
                "source": {"type": "string"},
                "category": {"type": "string"},
                "page": {"type": "number"},
                # Add more fields as needed for your specific use case
            },
        },
        "vector_schema": {
            # Vector field must be named "embedding" to match LangChain's default
            "embedding": {
                "dense_vector": {
                    "dimensions": 768  # For text-embedding-005
                }
            },
        },
    },
)

print(f"Creating collection: {COLLECTION_ID}")
operation = vector_search_service_client.create_collection(request=request)
print(f"Operation started: {operation.operation.name}")
print("Waiting for operation to complete...")

result = operation.result()
print("Collection created successfully!")
print(f"Resource name: {result.name}")
Important notes:
  • The vector field must be named "embedding" to match LangChain's default (or pass the vector_field_name parameter)
  • In V2, only fields defined in data_schema.properties can be used for filtering
  • The dimensions should match your embedding model (768 for text-embedding-005)

Initialization

from langchain_google_vertexai import VectorSearchVectorStore, VertexAIEmbeddings

# Initialize embeddings
embeddings = VertexAIEmbeddings(model_name="text-embedding-005")

# Create vector store from a Collection
# Use the same PROJECT_ID, LOCATION, and COLLECTION_ID from collection creation
vector_store = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=LOCATION,
    collection_id=COLLECTION_ID,
    embedding=embeddings,
    api_version="v2",
)
Key parameters:
  • collection_id: your Vector Search Collection ID (required)
  • api_version: must be set to "v2" (required)
  • project_id: your GCP project ID (required)
  • region: the GCP region where the Collection lives (required)
  • vector_field_name: the name of the vector field in your Collection schema (default: "embedding")

Adding documents

from langchain_core.documents import Document

# Create documents
docs = [
    Document(
        page_content="Google Vertex AI is a managed machine learning platform",
        metadata={"source": "docs", "category": "AI"}
    ),
    Document(
        page_content="LangChain integrates with Vertex AI Vector Search",
        metadata={"source": "blog", "category": "integration"}
    ),
]

# Add documents to vector store
ids = vector_store.add_documents(docs)
print(f"Added documents with IDs: {ids}")

Adding texts

texts = [
    "Vertex AI provides scalable ML infrastructure",
    "Vector Search enables similarity search at scale",
]

metadatas = [
    {"source": "website", "page": 1},
    {"source": "website", "page": 2},
]

ids = vector_store.add_texts(texts=texts, metadatas=metadatas)

Search

Basic similarity search

# Basic similarity search
query = "What is Vertex AI?"
results = vector_store.similarity_search(query, k=5)

for doc in results:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")

Similarity search with scores

# Get similarity scores along with documents
results_with_scores = vector_store.similarity_search_with_score(
    "What is Vertex AI?",
    k=5
)

for doc, score in results_with_scores:
    print(f"Score: {score}")
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")

Search by vector

# Search using a pre-computed embedding
embedding = embeddings.embed_query("Vertex AI features")

results = vector_store.similarity_search_by_vector_with_score(embedding, k=5)

for doc, score in results:
    print(f"Score: {score}")
    print(f"Content: {doc.page_content}\n")

Filtering

Vector Search 2.0 filters Data Objects using a dictionary-based query syntax:
# Simple equality filter
results = vector_store.similarity_search(
    "AI features",
    k=5,
    filter={"source": {"$eq": "docs"}}
)

# Comparison operators
results = vector_store.similarity_search(
    "recent pages",
    k=5,
    filter={"page": {"$gte": 10}}
)

# Logical AND
results = vector_store.similarity_search(
    "AI documentation",
    k=5,
    filter={
        "$and": [
            {"source": {"$eq": "docs"}},
            {"category": {"$eq": "AI"}}
        ]
    }
)

# Logical OR
results = vector_store.similarity_search(
    "documentation",
    k=5,
    filter={
        "$or": [
            {"source": {"$eq": "docs"}},
            {"source": {"$eq": "blog"}}
        ]
    }
)

# Less than
results = vector_store.similarity_search(
    "early pages",
    k=5,
    filter={"page": {"$lt": 5}}
)
Supported operators:
  • $eq: equal to
  • $ne: not equal to
  • $lt: less than
  • $lte: less than or equal to
  • $gt: greater than
  • $gte: greater than or equal to
  • $and: logical AND
  • $or: logical OR
  • $not: logical NOT
For more details, see the Vector Search 2.0 query documentation.
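The dictionary filter syntax composes like a small boolean query language. As an illustration of the operator semantics only (a hypothetical local evaluator for plain metadata dicts; the actual filtering runs server-side inside Vector Search), the operators could be sketched like this:

```python
# Hypothetical local evaluator illustrating how the V2 filter operators
# combine. Real filtering is executed server-side by Vector Search;
# this sketch only demonstrates the semantics of the syntax.

def matches(filter_: dict, metadata: dict) -> bool:
    """Return True if `metadata` satisfies `filter_`."""
    for key, condition in filter_.items():
        if key == "$and":
            if not all(matches(sub, metadata) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(sub, metadata) for sub in condition):
                return False
        elif key == "$not":
            if matches(condition, metadata):
                return False
        else:  # field comparison, e.g. {"page": {"$gte": 10}}
            value = metadata.get(key)
            for op, target in condition.items():
                ok = {
                    "$eq": lambda v, t: v == t,
                    "$ne": lambda v, t: v != t,
                    "$lt": lambda v, t: v is not None and v < t,
                    "$lte": lambda v, t: v is not None and v <= t,
                    "$gt": lambda v, t: v is not None and v > t,
                    "$gte": lambda v, t: v is not None and v >= t,
                }[op](value, target)
                if not ok:
                    return False
    return True

# Example: the "$and" filter shown earlier
f = {"$and": [{"source": {"$eq": "docs"}}, {"category": {"$eq": "AI"}}]}
print(matches(f, {"source": "docs", "category": "AI"}))  # True
print(matches(f, {"source": "blog", "category": "AI"}))  # False
```

Remember that in V2 only fields declared in data_schema.properties are filterable, regardless of what a filter dictionary requests.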

Delete operations

Delete by ID

# Delete specific documents by ID
ids_to_delete = ["id1", "id2", "id3"]
vector_store.delete(ids=ids_to_delete)

Delete by metadata filter

Note: The current V2 API has limitations when deleting by metadata filter. The recommended approach is to:
  1. Run similarity_search with a filter to retrieve document IDs
  2. Delete by ID
# Recommended: Search first, then delete by IDs
results = vector_store.similarity_search(
    "query",  # Use a broad query
    k=1000,   # Get more results
    filter={"source": {"$eq": "old_docs"}}
)
ids_to_delete = [doc.metadata.get("id") for doc in results if "id" in doc.metadata]
vector_store.delete(ids=ids_to_delete)
Alternatively, if your environment supports deleting directly by metadata:
# Direct deletion by metadata (may have limitations)
try:
    vector_store.delete(metadata={"source": {"$eq": "old_docs"}})
except Exception as e:
    # Fall back to search-then-delete approach
    print(f"Direct deletion failed: {e}")

Advanced features

Vector Search 2.0 offers several advanced search capabilities beyond traditional dense vector search.

Semantic search

Semantic search uses Vertex AI models to automatically generate embeddings from the query text. Your Collection must have vertex_embedding_config configured in its vector schema:
# Semantic search with auto-generated embeddings
results = vector_store.semantic_search(
    query="Tell me about animals",
    k=5,
    search_field="embedding",  # Vector field with auto-embedding config
    task_type="RETRIEVAL_QUERY",  # Optimizes embeddings for search queries
    filter={"category": {"$eq": "wildlife"}}  # Optional filtering
)

for doc in results:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")
Task types:
  • RETRIEVAL_QUERY: for search queries (default)
  • RETRIEVAL_DOCUMENT: for document indexing
  • SEMANTIC_SIMILARITY: for semantic similarity tasks
  • CLASSIFICATION: for classification tasks
  • CLUSTERING: for clustering tasks

Text search

Text search performs keyword/full-text matching on data fields, without using embeddings:
# Keyword search on data fields
results = vector_store.text_search(
    query="Python programming",
    k=10,
    data_field_names=["page_content", "title"]  # Fields to search in
)

for doc in results:
    print(f"Content: {doc.page_content}\n")
!!! note Text search does not support filtering. If you need filtering, use semantic_search() or similarity_search() instead.

Hybrid search

Hybrid search combines semantic search (auto-generated embeddings) with text search (keyword matching), using the reciprocal rank fusion (RRF) algorithm to produce optimally ranked results:
# Hybrid search: semantic understanding + keyword matching
results = vector_store.hybrid_search(
    query="Men's outfit for beach",
    k=10,
    search_field="embedding",  # Vector field with auto-embedding config
    data_field_names=["page_content"],  # Fields for text search
    task_type="RETRIEVAL_QUERY",
    filter={"price": {"$lt": 100}},  # Optional filter for semantic search
    semantic_weight=1.0,  # Weight for semantic results
    text_weight=1.0  # Weight for keyword results
)

for doc in results:
    print(f"Content: {doc.page_content}\n")
Weight parameters:
  • Higher semantic_weight: favors semantic understanding
  • Higher text_weight: favors exact keyword matching
  • Equal weights (the default): balanced results
Content that ranks highly in both the semantic and the text search results ranks highest in the merged results. For more information, see the Vector Search 2.0 documentation.
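The intuition behind reciprocal rank fusion can be shown with a minimal sketch of weighted RRF. This is an illustration under assumptions (the common smoothing constant c=60; the exact server-side formula Vector Search uses may differ), showing how semantic_weight and text_weight shape the merged ranking:

```python
# A minimal sketch of weighted reciprocal rank fusion (RRF). Each document's
# fused score is the weighted sum of 1 / (c + rank) over the rankings it
# appears in, so documents ranked highly in both lists come out on top.
# Illustrative only; not the exact formula used by Vector Search.

def weighted_rrf(semantic_ids, text_ids, semantic_weight=1.0, text_weight=1.0, c=60):
    """Fuse two ranked ID lists; a smaller rank contributes a larger score."""
    scores = {}
    for weight, ranking in ((semantic_weight, semantic_ids), (text_weight, text_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["a", "b", "c"]  # ranked by semantic similarity
text = ["c", "a", "d"]      # ranked by keyword match
print(weighted_rrf(semantic, text))  # "a" wins: high in both lists
```

Raising semantic_weight or text_weight in this sketch scales the corresponding list's contribution, which mirrors how the weight parameters above bias the merged results.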

Custom vector field names

If your Collection schema uses a custom field name for the vector:
vector_store = VectorSearchVectorStore.from_components(
    project_id="your-project-id",
    region="us-central1",
    collection_id="your-collection-id",
    embedding=embeddings,
    api_version="v2",
    vector_field_name="custom_embedding_field",  # Match your schema
)


Vector Search 1.0

This section covers functionality related to the Google Cloud Vertex AI Vector Search vector database.
Google Vertex AI Vector Search, formerly known as Vertex AI Matching Engine, provides the industry's leading high-scale, low-latency vector database service. These vector databases are commonly referred to as vector similarity matching or approximate nearest neighbor (ANN) services.
Note: The LangChain API expects an Endpoint and a deployed Index to already exist. Creating an Index can take up to an hour.
To create an Index, see the section Create an Index and deploy it to an Endpoint. If you already have a deployed Index, skip ahead to Create a vector store from texts.

Create an Index and deploy it to an Endpoint

  • This section demonstrates how to create a new Index and deploy it to an Endpoint
# TODO : Set values as per your requirements
# Project and Storage Constants
PROJECT_ID = "<my_project_id>"
REGION = "<my_region>"
BUCKET = "<my_gcs_bucket>"
BUCKET_URI = f"gs://{BUCKET}"

# The number of dimensions for text-embedding-005 is 768
# If a different embedding model is used, the dimensions will likely need to change.
DIMENSIONS = 768

# Index Constants
DISPLAY_NAME = "<my_matching_engine_index_id>"
DEPLOYED_INDEX_ID = "<my_matching_engine_endpoint_id>"
# Create a bucket.
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Use VertexAIEmbeddings as the embedding model

from google.cloud import aiplatform
from langchain_google_vertexai import VertexAIEmbeddings
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)
embedding_model = VertexAIEmbeddings(model_name="text-embedding-005")

Create an empty Index

Note: When creating the Index, you must specify an index_update_method, either BATCH_UPDATE or STREAM_UPDATE.
A batch index is for when you update your index in a batch over a fixed period, for example weekly or monthly. A streaming index is for when you want the index updated as new data is added to your data store, for instance when a bookstore wants new inventory to show online as soon as possible. Which type you choose matters, since the setup and requirements differ.
For more details on configuring an Index, see the official documentation.
# NOTE : This operation can take up to 30 seconds
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    index_update_method="STREAM_UPDATE",  # allowed values BATCH_UPDATE , STREAM_UPDATE
)

Create an Endpoint

# Create an endpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"{DISPLAY_NAME}-endpoint", public_endpoint_enabled=True
)

Deploy the Index to the Endpoint

# NOTE : This operation can take up to 20 minutes
my_index_endpoint = my_index_endpoint.deploy_index(
    index=my_index, deployed_index_id=DEPLOYED_INDEX_ID
)

my_index_endpoint.deployed_indexes

Create a vector store from texts

Note: If you already have an Index and an Endpoint, you can load them with the following code:
# TODO : replace 1234567890123456789 with your actual index ID
my_index = aiplatform.MatchingEngineIndex("1234567890123456789")

# TODO : replace 1234567890123456789 with your actual endpoint ID
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint("1234567890123456789")
from langchain_google_vertexai import (
    VectorSearchVectorStore,
    VectorSearchVectorStoreDatastore,
)

Create a simple vector store (without filtering)

# Input texts
texts = [
    "The cat sat on",
    "the mat.",
    "I like to",
    "eat pizza for",
    "dinner.",
    "The sun sets",
    "in the west.",
]

# Create a Vector Store
vector_store = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    gcs_bucket_name=BUCKET,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
    stream_update=True,
)

# Add vectors and mapped text chunks to your vector store
vector_store.add_texts(texts=texts)

Optional: You can also create the vectors and store the text chunks in Datastore

# NOTE : This operation can take up to 20 minutes
vector_store = VectorSearchVectorStoreDatastore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
    stream_update=True,
)

vector_store.add_texts(texts=texts, is_complete_overwrite=True)
# Try running a similarity search
vector_store.similarity_search("pizza")

Create a vector store with metadata filtering

# Input text with metadata
record_data = [
    {
        "description": "A versatile pair of dark-wash denim jeans."
        "Made from durable cotton with a classic straight-leg cut, these jeans"
        " transition easily from casual days to dressier occasions.",
        "price": 65.00,
        "color": "blue",
        "season": ["fall", "winter", "spring"],
    },
    {
        "description": "A lightweight linen button-down shirt in a crisp white."
        " Perfect for keeping cool with breathable fabric and a relaxed fit.",
        "price": 34.99,
        "color": "white",
        "season": ["summer", "spring"],
    },
    {
        "description": "A soft, chunky knit sweater in a vibrant forest green. "
        "The oversized fit and cozy wool blend make this ideal for staying warm "
        "when the temperature drops.",
        "price": 89.99,
        "color": "green",
        "season": ["fall", "winter"],
    },
    {
        "description": "A classic crewneck t-shirt in a soft, heathered blue. "
        "Made from comfortable cotton jersey, this t-shirt is a wardrobe essential "
        "that works for every season.",
        "price": 19.99,
        "color": "blue",
        "season": ["fall", "winter", "summer", "spring"],
    },
    {
        "description": "A flowing midi-skirt in a delicate floral print. "
        "Lightweight and airy, this skirt adds a touch of feminine style "
        "to warmer days.",
        "price": 45.00,
        "color": "white",
        "season": ["spring", "summer"],
    },
]
# Parse and prepare input data

texts = []
metadatas = []
for record in record_data:
    record = record.copy()
    page_content = record.pop("description")
    if isinstance(page_content, str):
        texts.append(page_content)
        metadatas.append(record)
# Inspect metadatas
metadatas
# NOTE : This operation can take more than 20 minutes
vector_store = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    gcs_bucket_name=BUCKET,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
)

vector_store.add_texts(texts=texts, metadatas=metadatas, is_complete_overwrite=True)
from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (
    Namespace,
    NumericNamespace,
)
# Try running a simple similarity search

# Below code should return 5 results
vector_store.similarity_search("shirt", k=5)
# Try running a similarity search with text filter
filters = [Namespace(name="season", allow_tokens=["spring"])]

# Below code should return 4 results now
vector_store.similarity_search("shirt", k=5, filter=filters)
# Try running a similarity search with combination of text and numeric filter
filters = [Namespace(name="season", allow_tokens=["spring"])]
numeric_filters = [NumericNamespace(name="price", value_float=40.0, op="LESS")]

# Below code should return 2 results now
vector_store.similarity_search(
    "shirt", k=5, filter=filters, numeric_filter=numeric_filters
)

Use the vector store as a retriever

# Initialize vector_store as a retriever
retriever = vector_store.as_retriever()
# perform simple similarity search on retriever
retriever.invoke("What are my options in breathable fabric?")
# Try running a similarity search with text filter
filters = [Namespace(name="season", allow_tokens=["spring"])]

retriever.search_kwargs = {"filter": filters}

# perform similarity search with filters on retriever
retriever.invoke("What are my options in breathable fabric?")
# Try running a similarity search with combination of text and numeric filter
filters = [Namespace(name="season", allow_tokens=["spring"])]
numeric_filters = [NumericNamespace(name="price", value_float=40.0, op="LESS")]


retriever.search_kwargs = {"filter": filters, "numeric_filter": numeric_filters}

retriever.invoke("What are my options in breathable fabric?")

Use filters with the retriever in a question-answering chain

from langchain_google_vertexai import VertexAI

llm = VertexAI(model_name="gemini-pro")
from langchain_classic.chains import RetrievalQA

filters = [Namespace(name="season", allow_tokens=["spring"])]
numeric_filters = [NumericNamespace(name="price", value_float=40.0, op="LESS")]

retriever.search_kwargs = {"k": 2, "filter": filters, "numeric_filter": numeric_filters}

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

question = "What are my options in breathable fabric?"
response = retrieval_qa.invoke({"query": question})
print(f"{response['result']}")
print("REFERENCES")
print(f"{response['source_documents']}")

Read, chunk, vectorize, and index a PDF

!pip install pypdf
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = PyPDFLoader("https://arxiv.org/pdf/1706.03762.pdf")
pages = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
doc_splits = text_splitter.split_documents(pages)
texts = [doc.page_content for doc in doc_splits]
metadatas = [doc.metadata for doc in doc_splits]
texts[0]
# Inspect Metadata of 1st page
metadatas[0]
vector_store = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    gcs_bucket_name=BUCKET,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
)

vector_store.add_texts(texts=texts, metadatas=metadatas, is_complete_overwrite=True)
vector_store = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    gcs_bucket_name=BUCKET,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
)

Hybrid search

Vector Search supports hybrid search, a popular architecture pattern in information retrieval (IR) that combines semantic search with keyword search (also known as token-based search). With hybrid search, developers can take advantage of the best of both approaches, effectively improving search quality. Click here to learn more.
To use hybrid search, you need to fit a sparse embedding vectorizer and handle the embeddings outside of the Vector Search integration. An example of a sparse embedding vectorizer is sklearn's TfidfVectorizer, but other techniques such as BM25 can be used as well.
# Define some sample data
texts = [
    "The cat sat on",
    "the mat.",
    "I like to",
    "eat pizza for",
    "dinner.",
    "The sun sets",
    "in the west.",
]

# optional IDs
ids = ["i_" + str(i + 1) for i in range(len(texts))]

# optional metadata
metadatas = [{"my_metadata": i} for i in range(len(texts))]
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TFIDF Vectorizer (This is usually done on a very large corpus of data to make sure that word statistics generalize well on new data)
vectorizer = TfidfVectorizer()
vectorizer.fit(texts)
# Utility function to transform text into a TF-IDF Sparse Vector
def get_sparse_embedding(tfidf_vectorizer, text):
    tfidf_vector = tfidf_vectorizer.transform([text])
    values = []
    dims = []
    for i, tfidf_value in enumerate(tfidf_vector.data):
        values.append(float(tfidf_value))
        dims.append(int(tfidf_vector.indices[i]))
    return {"values": values, "dimensions": dims}
# semantic (dense) embeddings
embeddings = embedding_model.embed_documents(texts)
# tfidf (sparse) embeddings
sparse_embeddings = [get_sparse_embedding(vectorizer, x) for x in texts]
sparse_embeddings[0]
# Add the dense and sparse embeddings in Vector Search

vector_store.add_texts_with_embeddings(
    texts=texts,
    embeddings=embeddings,
    sparse_embeddings=sparse_embeddings,
    ids=ids,
    metadatas=metadatas,
)
# Run hybrid search
query = "the cat"
embedding = embedding_model.embed_query(query)
sparse_embedding = get_sparse_embedding(vectorizer, query)

vector_store.similarity_search_by_vector_with_score(
    embedding=embedding,
    sparse_embedding=sparse_embedding,
    k=5,
    rrf_ranking_alpha=0.7,  # 0.7 weight to dense and 0.3 weight to sparse
)
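As an alternative to TfidfVectorizer, the BM25 technique mentioned above can also produce sparse embeddings in the same {"values", "dimensions"} format expected by add_texts_with_embeddings. The following is a minimal, self-contained sketch (an illustrative implementation, not a library API; in practice a dedicated library such as rank_bm25 may be preferable):

```python
# A minimal BM25 sketch emitting sparse vectors in the {"values",
# "dimensions"} format used by the Vector Search integration.
# Illustrative only: whitespace tokenization, no stemming or stop words.
import math
from collections import Counter

class SimpleBM25:
    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        tokenized = [doc.lower().split() for doc in corpus]
        self.avgdl = sum(len(t) for t in tokenized) / len(tokenized)
        self.vocab = {}   # term -> sparse dimension index
        df = Counter()    # term -> number of documents containing it
        for tokens in tokenized:
            for term in set(tokens):
                df[term] += 1
                self.vocab.setdefault(term, len(self.vocab))
        n = len(tokenized)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def sparse_embedding(self, text):
        tokens = text.lower().split()
        tf = Counter(t for t in tokens if t in self.vocab)
        values, dims = [], []
        for term, freq in tf.items():
            # Standard BM25 term weight: idf * saturated term frequency
            score = self.idf[term] * (freq * (self.k1 + 1)) / (
                freq + self.k1 * (1 - self.b + self.b * len(tokens) / self.avgdl)
            )
            values.append(float(score))
            dims.append(self.vocab[term])
        return {"values": values, "dimensions": dims}

corpus = ["the cat sat on", "the mat.", "eat pizza for dinner."]
bm25 = SimpleBM25(corpus)
emb = bm25.sparse_embedding("the cat")
print(emb)  # rare terms like "cat" get higher weights than common "the"
```

The resulting dictionaries can be passed as sparse_embeddings to add_texts_with_embeddings in the same way as the TF-IDF vectors above.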