Skip to main content
Oracle AI 数据库支持 AI 工作负载,你可以通过语义(含义)而非仅仅关键词来查询数据。它在单一系统中将对非结构化内容的语义搜索对业务数据的关系过滤相结合——因此你可以构建检索工作流(如 RAG),无需引入独立的向量数据库,也无需将数据分散到多个平台。 本指南演示如何使用 OracleVS(Oracle AI Vector Search 的 LangChain 向量存储集成)来:
  • 将文档和嵌入向量导入 Oracle
  • 运行相似性搜索
  • 创建 HNSW 和 IVF 索引
  • 应用元数据过滤进行高级检索
  • 在 Oracle Database 26ai 中启用混合搜索(关键词 + 语义)
  • 使用 Oracle Text 运行全文搜索

前置条件

安装 langchain-oracledbpython-oracledb 驱动将作为依赖项自动安装。
pip install -qU langchain-oracledb

连接到 Oracle Database

以下示例代码展示如何连接到 Oracle Database。默认情况下,python-oracledb 以”Thin”模式运行,直接连接到 Oracle Database,不需要 Oracle Client 库。但是,当 python-oracledb 使用 Oracle Client 库时,可以使用一些额外的功能。使用 Oracle Client 库时,python-oracledb 称为”Thick”模式。两种模式都全面支持 Python Database API v2.0 规范。请参阅以下指南,了解每种模式支持的功能。如果无法使用 thin 模式,可以切换到 thick 模式。
import oracledb

# Please update with your username, password, hostname, port and service_name
username = "<username>"
password = "<password>"
dsn = "<hostname>:<port>/<service_name>"

connection = oracledb.connect(user=username, password=password, dsn=dsn)
print("Connection successful!")

导入所需依赖项

from langchain_oracledb.vectorstores import oraclevs
from langchain_oracledb.vectorstores.oraclevs import OracleVS
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings

加载文档

# Define a list of documents (The examples below are 5 random documents from Oracle Concepts Manual )

documents_json_list = [
    {
        "id": "cncpt_15.5.3.2.2_P4",
        "text": "If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.",
        "link": "https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/logical-storage-structures.html#GUID-5387D7B2-C0CA-4C1E-811B-C7EB9B636442",
    },
    {
        "id": "cncpt_15.5.5_P1",
        "text": "A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.",
        "link": "https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/logical-storage-structures.html#GUID-D02B2220-E6F5-40D9-AFB5-BC69BCEF6CD4",
    },
    {
        "id": "cncpt_22.3.4.3.1_P2",
        "text": "The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.",
        "link": "https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/concepts-for-database-developers.html#GUID-3C50EAB8-FC39-4BB3-B680-4EACCE49E866",
    },
    {
        "id": "cncpt_22.3.4.3.1_P3",
        "text": "The LOB segment stores data in pieces called chunks. A chunk is a logically contiguous set of data blocks and is the smallest unit of allocation for a LOB. A row in the table stores a pointer called a LOB locator, which points to the LOB index. When the table is queried, the database uses the LOB index to quickly locate the LOB chunks.",
        "link": "https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/concepts-for-database-developers.html#GUID-3C50EAB8-FC39-4BB3-B680-4EACCE49E866",
    },
]
# Create LangChain Documents

documents_langchain = []

for doc in documents_json_list:
    metadata = {"id": doc["id"], "link": doc["link"]}
    doc_langchain = Document(page_content=doc["text"], metadata=metadata)
    documents_langchain.append(doc_langchain)

使用不同距离度量创建向量存储

首先,我们将创建三个分别使用不同距离函数的向量存储。由于我们尚未在其中创建索引,目前只会创建表。稍后,我们将使用这些向量存储创建 HNSW 索引。要了解更多关于 Oracle AI Vector Search 支持的不同索引类型,请参阅以下指南 你可以手动连接到 Oracle Database,将看到三张表: Documents_DOTDocuments_COSINEDocuments_EUCLIDEAN 然后我们将再创建三张额外的表 Documents_DOT_IVFDocuments_COSINE_IVFDocuments_EUCLIDEAN_IVF,用于在这些表上创建 IVF 索引而非 HNSW 索引。
# Ingest documents into Oracle Vector Store using different distance strategies

# When using our API calls, start by initializing your vector store with a subset of your documents
# through from_documents(), then incrementally add more documents using add_texts().
# This approach prevents system overload and ensures efficient document processing.

model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

vector_store_dot = OracleVS.from_documents(
    documents_langchain,
    model,
    client=connection,
    table_name="Documents_DOT",
    distance_strategy=DistanceStrategy.DOT_PRODUCT,
)
vector_store_max = OracleVS.from_documents(
    documents_langchain,
    model,
    client=connection,
    table_name="Documents_COSINE",
    distance_strategy=DistanceStrategy.COSINE,
)
vector_store_euclidean = OracleVS.from_documents(
    documents_langchain,
    model,
    client=connection,
    table_name="Documents_EUCLIDEAN",
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
)

# Ingest documents into Oracle Vector Store using different distance strategies
vector_store_dot_ivf = OracleVS.from_documents(
    documents_langchain,
    model,
    client=connection,
    table_name="Documents_DOT_IVF",
    distance_strategy=DistanceStrategy.DOT_PRODUCT,
)
vector_store_max_ivf = OracleVS.from_documents(
    documents_langchain,
    model,
    client=connection,
    table_name="Documents_COSINE_IVF",
    distance_strategy=DistanceStrategy.COSINE,
)
vector_store_euclidean_ivf = OracleVS.from_documents(
    documents_langchain,
    model,
    client=connection,
    table_name="Documents_EUCLIDEAN_IVF",
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
)

文本的添加和删除操作,以及基本相似性搜索

def manage_texts(vector_stores):
    """
    Adds texts to each vector store, demonstrates error handling for duplicate additions,
    and performs deletion of texts. Showcases similarity searches and index creation for each vector store.

    Args:
    - vector_stores (list): A list of OracleVS instances.
    """
    texts = ["Rohan", "Shailendra"]
    metadata = [
        {"id": "100", "link": "Document Example Test 1"},
        {"id": "101", "link": "Document Example Test 2"},
    ]

    for i, vs in enumerate(vector_stores, start=1):
        # Adding texts
        try:
            vs.add_texts(texts, metadata)
            print(f"\n\n\nAdd texts complete for vector store {i}\n\n\n")
        except Exception as ex:
            print(f"\n\n\nExpected error on duplicate add for vector store {i}\n\n\n")

        # Deleting texts using the value of 'id'
        vs.delete([metadata[0]["id"]])
        print(f"\n\n\nDelete texts complete for vector store {i}\n\n\n")

        # Similarity search
        results = vs.similarity_search("How are LOBS stored in Oracle Database", 2)
        print(f"\n\n\nSimilarity search results for vector store {i}: {results}\n\n\n")


vector_store_list = [
    vector_store_dot,
    vector_store_max,
    vector_store_euclidean,
    vector_store_dot_ivf,
    vector_store_max_ivf,
    vector_store_euclidean_ivf,
]
manage_texts(vector_store_list)

使用特定参数创建索引

def create_search_indices(connection):
    """
    Creates search indices for the vector stores, each with specific parameters tailored to their distance strategy.
    """
    # Index for DOT_PRODUCT strategy
    # Notice we are creating a HNSW index with default parameters
    # This will default to creating a HNSW index with 8 Parallel Workers and use the Default Accuracy used by Oracle AI Vector Search
    oraclevs.create_index(
        connection,
        vector_store_dot,
        params={"idx_name": "hnsw_idx1", "idx_type": "HNSW"},
    )

    # Index for COSINE strategy with specific parameters
    # Notice we are creating a HNSW index with parallel 16 and Target Accuracy Specification as 97 percent
    oraclevs.create_index(
        connection,
        vector_store_max,
        params={
            "idx_name": "hnsw_idx2",
            "idx_type": "HNSW",
            "accuracy": 97,
            "parallel": 16,
        },
    )

    # Index for EUCLIDEAN_DISTANCE strategy with specific parameters
    # Notice we are creating a HNSW index by specifying Power User Parameters which are neighbors = 64 and efConstruction = 100
    oraclevs.create_index(
        connection,
        vector_store_euclidean,
        params={
            "idx_name": "hnsw_idx3",
            "idx_type": "HNSW",
            "neighbors": 64,
            "efConstruction": 100,
        },
    )

    # Index for DOT_PRODUCT strategy with specific parameters
    # Notice we are creating an IVF index with default parameters
    # This will default to creating an IVF index with 8 Parallel Workers and use the Default Accuracy used by Oracle AI Vector Search
    oraclevs.create_index(
        connection,
        vector_store_dot_ivf,
        params={
            "idx_name": "ivf_idx1",
            "idx_type": "IVF",
        },
    )

    # Index for COSINE strategy with specific parameters
    # Notice we are creating an IVF index with parallel 32 and Target Accuracy Specification as 90 percent
    oraclevs.create_index(
        connection,
        vector_store_max_ivf,
        params={
            "idx_name": "ivf_idx2",
            "idx_type": "IVF",
            "accuracy": 90,
            "parallel": 32,
        },
    )

    # Index for EUCLIDEAN_DISTANCE strategy with specific parameters
    # Notice we are creating an IVF index by specifying Power User Parameters which is neighbor_part = 64
    oraclevs.create_index(
        connection,
        vector_store_euclidean_ivf,
        params={"idx_name": "ivf_idx3", "idx_type": "IVF", "neighbor_part": 64},
    )

    print("Index creation complete.")


create_search_indices(connection)

高级搜索

Oracle Database 23ai 支持预过滤、中间过滤和后过滤,以增强 AI Vector Search 的能力。这些过滤机制允许用户在向量相似性搜索之前、期间和之后应用约束条件,从而提高搜索性能和准确性。 Oracle 23ai 过滤的关键要点:
  1. 预过滤 在执行向量相似性搜索之前,应用传统 SQL 过滤器以缩小数据集范围。 通过限制 AI 算法处理的数据量来提高效率。
  2. 中间过滤 利用 AI Vector Search 直接对向量嵌入执行相似性搜索,使用优化的索引和算法。 无需全表扫描即可根据向量相似性高效过滤结果。
  3. 后过滤 在向量相似性搜索之后,应用额外的 SQL 过滤来细化结果。 允许根据业务逻辑或额外的元数据条件进一步细化。
为什么这很重要?
  • 性能优化:预过滤显著减少了查询执行时间,使海量数据集上的搜索更加高效。
  • 精度提升:中间过滤确保向量搜索在语义上有意义,提高搜索结果的质量。

过滤器详情

OracleVS 支持一组可通过 filter 参数应用于 metadata 字段的过滤器,让你能够根据各种条件选择和细化数据。 可用过滤运算符:
运算符描述
\$exists字段存在。
\$eq字段值等于操作数值(=)。
\$ne字段存在且值不等于操作数值(!=)。
\$gt字段值大于操作数值(>)。
\$lt字段值小于操作数值(<)。
\$gte字段值大于等于操作数值(>=)。
\$lte字段值小于等于操作数值(<=)。
\$between字段值在操作数数组的两个值之间(含边界)。
\$startsWith字段值以操作数值开头。
\$hasSubstring字段值包含操作数作为子字符串。
\$instr字段值包含操作数作为子字符串。
\$regex字段值匹配给定的正则表达式模式。
\$like字段值匹配操作数模式(使用类 SQL 语法)。
\$in字段值等于操作数数组中至少一个值。
\$nin字段存在但其值不等于操作数数组中的任何值,或字段不存在。
\$all字段值是包含操作数数组中所有项目的数组,或与单个操作数匹配的标量。
  • 你可以使用逻辑运算符组合这些过滤器:
逻辑运算符描述
\$and逻辑与
\$or逻辑或
\$nor逻辑或非
过滤器示例:
{
  "age": 65,
  "name": {"$regex": "*rk"},
  "$or": [
    {
      "$and": [
        {"name": "Jason"},
        {"drinks": {"$in": ["tea", "soda"]}}
      ]
    },
    {
      "$nor": [
        {"age": {"$lt": 65}},
        {"name": "Jason"}
      ]
    }
  ]
}
其他使用技巧:
  • 当对象中的所有过滤器都必须满足时,可以省略 $and。以下两者等价:
{ "$and": [
    { "name": { "$startsWith": "Fred" } },
    { "salary": { "$gt": 10000, "$lte": 20000 } }
]}
{
  "name": { "$startsWith": "Fred" },
  "salary": { "$gt": 10000, "$lte": 20000 }
}
  • $not 子句可以对比较运算符取反:
{ "address.zip": { "$not": { "$eq": "90001" } } }
  • 使用 field: scalar 等价于 field: { "$eq": scalar }
{ "animal": "cat" }
更多过滤器示例,请参阅测试规范
# Conduct advanced searches after creating the indices
def conduct_advanced_searches(vector_stores):
    query = "How are LOBS stored in Oracle Database"
    # Constructing a filter for direct comparison against document metadata
    # This filter aims to include documents whose metadata 'id' is exactly '2'
    db_filter = {
        "$and": [
            {"id": "101"},  # FilterCondition
            {
                "$or": [  # FilterGroup
                    {"status": "approved"},
                    {"link": "Document Example Test 2"},
                    {
                        "$and": [  # Nested FilterGroup
                            {"status": "approved"},
                            {"link": "Document Example Test 2"},
                        ]
                    },
                ]
            },
        ]
    }

    for i, vs in enumerate(vector_stores, start=1):
        print(f"\n--- Vector Store {i} Advanced Searches ---")
        # Similarity search without a filter
        print("\nSimilarity search results without filter:")
        print(vs.similarity_search(query, 2))

        # Similarity search with a filter
        print("\nSimilarity search results with filter:")
        print(vs.similarity_search(query, 2, filter=db_filter))

        # Similarity search with relevance score
        print("\nSimilarity search with relevance score:")
        print(vs.similarity_search_with_score(query, 2))

        # Similarity search with relevance score with filter
        print("\nSimilarity search with relevance score with filter:")
        print(vs.similarity_search_with_score(query, 2, filter=db_filter))

        # Max marginal relevance search
        print("\nMax marginal relevance search results:")
        print(vs.max_marginal_relevance_search(query, 2, fetch_k=20, lambda_mult=0.5))

        # Max marginal relevance search with filter
        print("\nMax marginal relevance search results with filter:")
        print(
            vs.max_marginal_relevance_search(
                query, 2, fetch_k=20, lambda_mult=0.5, filter=db_filter
            )
        )


conduct_advanced_searches(vector_store_list)

混合搜索

Oracle Database 26ai 支持混合搜索,将关键词(全文)搜索与语义(向量)搜索合并为单一检索能力。langchain-oracledb 集成提供:
  • OracleVectorizerPreference:创建混合索引使用的数据库端向量化偏好设置。
  • create_hybrid_index / acreate_hybrid_index:创建 HYBRID VECTOR INDEX。
  • OracleHybridSearchRetriever:在 OracleVS 表上执行关键词、语义或混合检索。

前置条件和模型配置

使用混合搜索时,请使用 OracleEmbeddings 配置你的 OracleVS,以使向量化偏好设置与嵌入配置完全匹配。你还可以通过 OracleVectorizerPreference 提供额外参数来进一步调整混合向量索引。详情请参阅文档
from langchain_core.documents import Document
from langchain_oracledb.embeddings import OracleEmbeddings
from langchain_oracledb.vectorstores.oraclevs import OracleVS
from langchain_oracledb.retrievers.hybrid_search import (
    OracleVectorizerPreference,
    create_hybrid_index,
    OracleHybridSearchRetriever,
)

# Use OracleEmbeddings (database-resident model shown)
embeddings = OracleEmbeddings(conn=connection, params={"provider": "database", "model": "DB_MODEL"})

# Create/load your vector store
vs = OracleVS(connection, table_name="DOCS", embedding_function=embeddings)

# Create a vectorizer preference
pref = OracleVectorizerPreference.create_preference(
    vector_store=vs, preference_name="PREF_DOCS"
)

# Create a HYBRID VECTOR INDEX
create_hybrid_index(
    connection,
    idx_name="IDX_DOCS_HYB",
    vectorizer_preference=pref
)

# Build a retriever and search
retriever = OracleHybridSearchRetriever(
    vector_store=vs,
    idx_name="IDX_DOCS_HYB",
    search_mode="hybrid",     # "hybrid" | "keyword" | "semantic"
    k=5,
    return_scores=True,       # includes score, text_score, vector_score in metadata
)

docs = retriever.invoke("refund policy for premium plan")
for d in docs:
    print(d.page_content, d.metadata.get("score"), d.metadata.get("text_score"), d.metadata.get("vector_score"))

# Optional cleanup when done with the preference:
# pref.drop_preference()
替代方案:不使用显式偏好设置创建索引:
  • 如果你不想管理命名偏好设置,可以传递 vector_store。该函数将创建一个临时偏好设置,构建索引,然后自动删除偏好设置。
create_hybrid_index(
    connection,
    idx_name="IDX_DOCS_HYB2",
    vector_store=vs,          # mutually exclusive with vectorizer_preference
    params={"parallel": 8},
)
注意事项和技巧:
  • search_mode 决定使用哪些信号:
    • “keyword”:仅关键词
    • “semantic”:仅向量
    • “hybrid”(默认):两者结合
  • 通过检索器的 params 参数传递 DBMS_HYBRID_VECTOR 参数。
  • return_scores=True 会将总分以及 text_score 和 vector_score 组件添加到 Document.metadata。
  • 异步用法通过 acreate_hybrid_indexOracleHybridSearchRetriever.ainvoke 支持。
更多信息:

全文搜索(Oracle Text)

你可以使用 Oracle Text 直接对 Oracle Database 运行高质量的关键词搜索。langchain-oracledb 集成提供:
  • create_text_index / acreate_text_index:在列上创建 Oracle Text SEARCH INDEX
  • OracleTextSearchRetriever:运行 CONTAINS 查询并返回 LangChain Document 对象。
索引选项:
  • 如果你有一个 OracleVS 向量存储,可以对其内置的”text”列建立索引。
  • 也可以通过直接提供 table_name + column_name 来对任何其他表/列建立索引。
from langchain_oracledb.retrievers.text_search import create_text_index, OracleTextSearchRetriever

# Using an OracleVS table (indexes the 'text' column)
retriever_text = None
create_text_index(
    connection,
    idx_name="IDX_DOCS_TEXT",
    vector_store=vs,
)

# Build a retriever. With OracleVS, returned_columns defaults to ["metadata"].
retriever_text = OracleTextSearchRetriever(
    vector_store=vs,
    k=5,
    fuzzy=True,          # applies Oracle Text FUZZY per token when operator_search=False
    return_scores=True,  # adds SCORE(1) as metadata["score"]
)
docs = retriever_text.invoke("refund policy")
for d in docs:
    print(d.page_content, d.metadata.get("score"))

# Alternatively, index an arbitrary table/column:
# create_text_index(connection, idx_name="IDX_MYDOCS_TEXT", table_name="MYDOCS", column_name="CONTENT")
# retriever_text = OracleTextSearchRetriever(client=connection, table_name="MYDOCS", column_name="CONTENT", k=5)
运算符模式和高级查询:
  • 默认行为(operator_search=False):
    • 输入被视为字面文本,按非单词字符进行分词,并重写为 ACCUM 表达式。
    • 使用 fuzzy=True 时,每个词元都会被包装为 FUZZY("token") 以匹配拼写错误。
  • 运算符模式(operator_search=True):
    • 原样传递 Oracle Text 表达式(NEARABOUTANDORNOTWITHIN 等)。在此模式下,忽略 fuzzy 参数。
返回列:
  • 针对原始表时,通过 returned_columns 在结果中包含额外列;这些列会附加到 Document.metadata
  • 使用 OracleVS 时,returned_columns 默认为 [“metadata”]。
# Operator mode example
retriever_text_ops = OracleTextSearchRetriever(
    vector_store=vs,
    operator_search=True,   # pass Oracle Text expression directly
    return_scores=True,
)
docs = retriever_text_ops.invoke('NEAR((policy, refund), 2, TRUE)')
注意事项和技巧:
  • 使用 operator_search=True 时,忽略 fuzzy 参数(设计如此)。
  • 异步用法通过 acreate_text_indexOracleTextSearchRetriever.ainvoke 支持。
更多信息:

端到端演示

请参阅我们的完整演示指南 Oracle AI Vector Search 端到端演示指南,借助 Oracle AI Vector Search 构建端到端 RAG 管道。