SAP HANA Cloud Vector Engine is a vector store fully integrated into the SAP HANA Cloud database.

Setup

Install the langchain-hana integration package, along with the other packages used in this notebook.
pip install langchain-hana

Credentials

Make sure your SAP HANA instance is running. Load the credentials from environment variables and create a connection:
import os

from dotenv import load_dotenv
from hdbcli import dbapi

load_dotenv()
# Use connection settings from the environment
connection = dbapi.connect(
    address=os.environ.get("HANA_DB_ADDRESS"),
    port=int(os.environ.get("HANA_DB_PORT")),
    user=os.environ.get("HANA_DB_USER"),
    password=os.environ.get("HANA_DB_PASSWORD")
)
To learn more about SAP HANA, see What is SAP HANA?

Initialization

To initialize the HanaDB vector store, you need a database connection and an embedding instance. SAP HANA Cloud Vector Engine supports both external and internal embeddings.
  • Using external embeddings

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
  • Using internal embeddings

Alternatively, you can compute embeddings directly in SAP HANA using its native VECTOR_EMBEDDING() function. To enable this, create an instance of HanaInternalEmbeddings with your internal model ID and pass it to HanaDB. Note that HanaInternalEmbeddings instances are designed exclusively for use with HanaDB and are not intended for other vector store implementations. For more information about internal embeddings, see the documentation of the SAP HANA VECTOR_EMBEDDING function.
Note: Make sure NLP is enabled in your SAP HANA Cloud instance.
from langchain_hana import HanaInternalEmbeddings

embeddings = HanaInternalEmbeddings(internal_embedding_model_id="SAP_NEB.20240715")

# optionally, you can specify a remote source to use models from your deployed SAP AI CORE instance

# embeddings = HanaInternalEmbeddings(
#     internal_embedding_model_id="your-embedding-model-id",
#     remote_source="your-remote-source-name",
# )
Once you have the connection and embedding instance ready, create the vector store by passing them to HanaDB along with the name of the table in which the vectors are stored:
from langchain_hana import HanaDB

db = HanaDB(
    embedding=embeddings, connection=connection, table_name="MY_TABLE"
)

Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting different items.

Add items to the vector store

Items can be added to the vector store using the add_documents function.
from langchain_core.documents import Document

docs = [Document(page_content="Some text"), Document(page_content="Other docs")]
db.add_documents(docs)
Add documents with metadata.
docs = [
    Document(
        page_content="foo",
        metadata={"start": 100, "end": 150, "doc_name": "foo.txt", "quality": "bad"},
    ),
    Document(
        page_content="bar",
        metadata={"start": 200, "end": 250, "doc_name": "bar.txt", "quality": "good"},
    ),
]
db.add_documents(docs)

Delete items from the vector store

db.delete(filter={"quality": "bad"})

Query the vector store

Query directly

Similarity search

A simple similarity search with filtering on metadata can be performed as follows:
docs = db.similarity_search("foobar", k=2, filter={"quality": "bad"})
# With filtering on "quality"=="bad", only one document should be returned
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
    print(doc.metadata)
--------------------------------------------------------------------------------
foo
{'start': 100, 'end': 150, 'doc_name': 'foo.txt', 'quality': 'bad'}

MMR search

A maximal marginal relevance (MMR) search with filtering on metadata can be performed as follows:
docs = db.max_marginal_relevance_search("foobar", k=2, fetch_k=5, filter={"quality": "bad"})
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
    print(doc.metadata)
--------------------------------------------------------------------------------
foo
{'start': 100, 'end': 150, 'doc_name': 'foo.txt', 'quality': 'bad'}

Query by turning into a retriever

You can also transform the vector store into a retriever for easier usage in your chains.
retriever = db.as_retriever()
docs = retriever.invoke("foobar", filter={"quality": "good"})
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
    print(doc.metadata)
--------------------------------------------------------------------------------
bar
{'start': 200, 'end': 250, 'doc_name': 'bar.txt', 'quality': 'good'}

Distance similarity algorithms

HanaDB supports the following distance similarity algorithms:
  • Cosine similarity (default)
  • Euclidean distance (L2)
You can specify the distance strategy via the distance_strategy parameter when initializing a HanaDB instance.
from langchain_hana.utils import DistanceStrategy
db = HanaDB(
    embedding=embeddings,
    connection=connection,
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
    # distance_strategy=DistanceStrategy.COSINE_SIMILARITY,  # (default)
    table_name="MY_TABLE",
)
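For embeddings normalized to unit length (as embedding models typically produce), the two strategies rank results identically, because squared L2 distance is a monotone function of cosine similarity: ||a − b||² = 2 − 2·cos(a, b). A quick self-contained check of this identity:

```python
import math

def cosine(a, b):
    # Cosine similarity on plain Python lists
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.dist(v, [0.0] * len(v))
    return dot / (norm(a) * norm(b))

def l2(a, b):
    return math.dist(a, b)

# Two unit-length vectors: the identity ||a-b||^2 == 2 - 2*cos(a,b) holds
a, b = [1.0, 0.0], [0.6, 0.8]
assert abs(l2(a, b) ** 2 - (2 - 2 * cosine(a, b))) < 1e-9
```

This is why, for normalized embeddings, switching distance_strategy changes the reported scores but not the ranking of results.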

Create an HNSW index

A vector index can significantly speed up top-k nearest-neighbor queries over vectors. You can create a Hierarchical Navigable Small World (HNSW) vector index using the create_hnsw_index function. For more information about creating an index at the database level, see the official documentation.
db = HanaDB(
    embedding=embeddings, connection=connection, table_name="MY_TABLE"
)
db.create_hnsw_index(
    index_name="MY_TABLE_index",
    m=100,  # Max number of neighbors per graph node (valid range: 4 to 1000)
    ef_construction=200,  # Max number of candidates during graph construction (valid range: 1 to 100000)
    ef_search=500,  # Min number of candidates during the search (valid range: 1 to 100000)
)
If no other parameters are specified, the default values will be used (m=64, ef_construction=128, ef_search=200). The default index name will be: <TABLE_NAME>_idx.

Advanced filtering

In addition to basic value-based filtering, more advanced filtering is available. The table below shows the available filter operators.
Operator     Semantics
$eq          Equal to (==)
$ne          Not equal to (!=)
$lt          Less than (<)
$lte         Less than or equal to (<=)
$gt          Greater than (>)
$gte         Greater than or equal to (>=)
$in          Contained in a set of given values (in)
$nin         Not contained in a set of given values (not in)
$between     Between the range of two boundary values
$like        Text equality based on the "LIKE" semantics in SQL (using "%" as a wildcard)
$contains    Filters documents containing a specific keyword
$and         Logical "and", supporting two or more operands
$or          Logical "or", supporting two or more operands
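For intuition, the mapping from a filter dictionary to a SQL-style WHERE clause can be sketched as follows. This is a simplified, hypothetical translation covering only a subset of the operators; the actual HanaDB implementation builds parameterized queries and differs in detail.

```python
# Illustrative only: NOT the real HanaDB filter translation
OPS = {"$eq": "=", "$ne": "<>", "$lt": "<", "$lte": "<=", "$gt": ">", "$gte": ">="}

def filter_to_where(flt):
    clauses = []
    for key, cond in flt.items():
        if key in ("$and", "$or"):
            joiner = " AND " if key == "$and" else " OR "
            clauses.append("(" + joiner.join(filter_to_where(c) for c in cond) + ")")
        elif isinstance(cond, dict):
            op, value = next(iter(cond.items()))
            clauses.append(f"{key} {OPS[op]} {value!r}")
        else:  # a plain value means equality
            clauses.append(f"{key} = {cond!r}")
    return " AND ".join(clauses)

print(filter_to_where({"id": {"$gt": 1}}))               # id > 1
print(filter_to_where({"$or": [{"id": 1}, {"id": 2}]}))  # (id = 1 OR id = 2)
```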
# Prepare some test documents
docs = [
    Document(
        page_content="First",
        metadata={"name": "Adam Smith", "is_active": True, "id": 1, "height": 10.0},
    ),
    Document(
        page_content="Second",
        metadata={"name": "Bob Johnson", "is_active": False, "id": 2, "height": 5.7},
    ),
    Document(
        page_content="Third",
        metadata={"name": "Jane Doe", "is_active": True, "id": 3, "height": 2.4},
    ),
]

db = HanaDB(
    connection=connection,
    embedding=embeddings,
    table_name="LANGCHAIN_DEMO_ADVANCED_FILTER",
)

# Delete already existing documents from the table
db.delete(filter={})
db.add_documents(docs)


# Helper function for printing filter results
def print_filter_result(result):
    if len(result) == 0:
        print("<empty result>")
    for doc in result:
        print(doc.metadata)
Filtering with $ne, $gt, $gte, $lt and $lte
advanced_filter = {"id": {"$ne": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"id": {"$gt": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"id": {"$gte": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"id": {"$lt": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"id": {"$lte": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'id': {'$ne': 1}}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'id': {'$gt': 1}}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'id': {'$gte': 1}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'id': {'$lt': 1}}
<empty result>
Filter: {'id': {'$lte': 1}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
Filtering with $between, $in and $nin
advanced_filter = {"id": {"$between": (1, 2)}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"name": {"$in": ["Adam Smith", "Bob Johnson"]}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"name": {"$nin": ["Adam Smith", "Bob Johnson"]}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'id': {'$between': (1, 2)}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$in': ['Adam Smith', 'Bob Johnson']}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$nin': ['Adam Smith', 'Bob Johnson']}}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
Text filtering with $like
advanced_filter = {"name": {"$like": "a%"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"name": {"$like": "%a%"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'name': {'$like': 'a%'}}
<empty result>
Filter: {'name': {'$like': '%a%'}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
Text filtering with $contains
advanced_filter = {"name": {"$contains": "bob"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"name": {"$contains": "bo"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"name": {"$contains": "Adam Johnson"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"name": {"$contains": "Adam Smith"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'name': {'$contains': 'bob'}}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$contains': 'bo'}}
<empty result>
Filter: {'name': {'$contains': 'Adam Johnson'}}
<empty result>
Filter: {'name': {'$contains': 'Adam Smith'}}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
Combined filtering with $and and $or
advanced_filter = {"$or": [{"id": 1}, {"name": "bob"}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"$and": [{"id": 1}, {"id": 2}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {"$or": [{"id": 1}, {"id": 2}, {"id": 3}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))

advanced_filter = {
    "$and": [{"name": {"$contains": "bob"}}, {"name": {"$contains": "johnson"}}]
}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'$or': [{'id': 1}, {'name': 'bob'}]}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
Filter: {'$and': [{'id': 1}, {'id': 2}]}
<empty result>
Filter: {'$or': [{'id': 1}, {'id': 2}, {'id': 3}]}
{'name': 'Adam Smith', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'Jane Doe', 'is_active': True, 'id': 3, 'height': 2.4}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'$and': [{'name': {'$contains': 'bob'}}, {'name': {'$contains': 'johnson'}}]}
{'name': 'Bob Johnson', 'is_active': False, 'id': 2, 'height': 5.7}

Usage for retrieval-augmented generation

For a guide on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:
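As a minimal, self-contained sketch of the retrieval step: chunks returned by the retriever are joined into a context block that is then sent to a chat model. The helper name and prompt wording below are illustrative, not part of langchain-hana.

```python
def build_rag_prompt(question: str, chunks: list) -> str:
    # Join the retrieved chunks into a single context block for the LLM
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# In a real pipeline, chunks would come from something like:
#   chunks = [d.page_content for d in db.as_retriever().invoke(question)]
chunks = ["SAP HANA Cloud stores embedding vectors natively."]
print(build_rag_prompt("Where are the vectors stored?", chunks))
```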

Standard tables vs. "custom" tables with vector data

As a default behavior, the table for the embeddings contains 3 columns:
  • A column VEC_TEXT, which contains the text of the Document
  • A column VEC_META, which contains the metadata of the Document
  • A column VEC_VECTOR, which contains the embedding vector of the Document's text
# Access the vector DB with a new table
db = HanaDB(
    connection=connection, embedding=embeddings, table_name="LANGCHAIN_DEMO_NEW_TABLE"
)

# Delete already existing entries from the table
db.delete(filter={})

# Add a simple document with some metadata
docs = [
    Document(
        page_content="A simple document",
        metadata={"start": 100, "end": 150, "doc_name": "simple.txt"},
    )
]
db.add_documents(docs)
Show the columns in table "LANGCHAIN_DEMO_NEW_TABLE":
cur = connection.cursor()
cur.execute(
    "SELECT COLUMN_NAME, DATA_TYPE_NAME FROM SYS.TABLE_COLUMNS WHERE SCHEMA_NAME = CURRENT_SCHEMA AND TABLE_NAME = 'LANGCHAIN_DEMO_NEW_TABLE'"
)
rows = cur.fetchall()
for row in rows:
    print(row)
cur.close()
('VEC_META', 'NCLOB')
('VEC_TEXT', 'NCLOB')
('VEC_VECTOR', 'REAL_VECTOR')
Show the values of the inserted document in the three columns. Since HANA's dbapi driver returns vector columns as fvecs byte objects by default, we create a helper function to convert them into a list of numbers.
import struct

# Helper function to parse the fvecs byte format used for REAL_VECTOR
def parseFvecs(fvecs):
    dim = struct.unpack_from("<I", fvecs, 0)[0]
    return list(struct.unpack_from(f"<{dim}f", fvecs, 4))

cur = connection.cursor()
cur.execute(
    "SELECT * FROM LANGCHAIN_DEMO_NEW_TABLE LIMIT 1"
)
rows = cur.fetchall()
print(rows[0][0])  # The text
print(rows[0][1])  # The metadata
embedding = parseFvecs(rows[0][2])
print(len(embedding), embedding[:3] + ['...'] + embedding[-3:])  # The vector
cur.close()
A simple document
{"start": 100, "end": 150, "doc_name": "simple.txt"}
768 [-0.01989901065826416, 0.02785174734890461, 0.0020877711940556765, '...', 0.0183248370885849, 0.009469633921980858, 0.04312701150774956]
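The fvecs layout parsed above (a 4-byte little-endian dimension header followed by that many 32-bit floats) can be exercised locally, without a database connection, by packing a sample vector the same way:

```python
import struct

def parse_fvecs(fvecs: bytes) -> list:
    # First 4 bytes: the vector dimension as a little-endian unsigned int
    dim = struct.unpack_from("<I", fvecs, 0)[0]
    # Remaining bytes: `dim` 32-bit little-endian floats
    return list(struct.unpack_from(f"<{dim}f", fvecs, 4))

# Pack a sample vector as the driver would return it, then round-trip it
vec = [0.5, -1.25, 2.0]  # exactly representable as float32
payload = struct.pack("<I", len(vec)) + struct.pack(f"<{len(vec)}f", *vec)
assert parse_fvecs(payload) == [0.5, -1.25, 2.0]
```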
A custom table must provide at least three columns that match the semantics of a standard table:
  • A column of type NCLOB or NVARCHAR for the text/context of the embeddings
  • A column of type NCLOB or NVARCHAR for the metadata
  • A column of type REAL_VECTOR or HALF_VECTOR for the embedding vector
The table can contain additional columns. When new documents are inserted into the table, these additional columns must allow NULL values.
# Create a new table "MY_OWN_TABLE_ADD" with three "standard" columns and one additional column
my_own_table_name = "MY_OWN_TABLE_ADD"
cur = connection.cursor()
cur.execute(
    (
        f"CREATE TABLE {my_own_table_name} ("
        "SOME_OTHER_COLUMN NVARCHAR(42), "
        "MY_TEXT NVARCHAR(2048), "
        "MY_METADATA NVARCHAR(1024), "
        "MY_VECTOR REAL_VECTOR )"
    )
)

# Create a HanaDB instance with the own table
db = HanaDB(
    connection=connection,
    embedding=embeddings,
    table_name=my_own_table_name,
    content_column="MY_TEXT",
    metadata_column="MY_METADATA",
    vector_column="MY_VECTOR",
)

# Add a simple document with some metadata
docs = [
    Document(
        page_content="Some other text",
        metadata={"start": 400, "end": 450, "doc_name": "other.txt"},
    )
]
db.add_documents(docs)

# Check if data has been inserted into our own table
cur.execute(f"SELECT SOME_OTHER_COLUMN, MY_TEXT, MY_METADATA, TO_NVARCHAR(MY_VECTOR) AS MY_VECTOR FROM {my_own_table_name} LIMIT 1")
rows = cur.fetchall()
print(rows[0][0])  # Value of column "SOME_OTHER_COLUMN". Should be NULL/None
print(rows[0][1])  # The text
print(rows[0][2])  # The metadata
embedding = parseFvecs(rows[0][3])
print(len(embedding), embedding[:3] + ['...'] + embedding[-3:])  # The vector

cur.close()
None
Some other text
{"start": 400, "end": 450, "doc_name": "other.txt"}
768 [0.016170687973499298, -0.01129427831619978, -0.0005921399570070207, '...', 0.017849743366241455, 0.0003932560794055462, -0.00045805066474713385]
Add another document and perform a similarity search on the custom table.
docs = [
    Document(
        page_content="Some more text",
        metadata={"start": 800, "end": 950, "doc_name": "more.txt"},
    )
]
db.add_documents(docs)

query = "What's up?"
docs = db.similarity_search(query, k=2)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
Some more text
--------------------------------------------------------------------------------
Some other text

Filter performance optimization with custom columns

To support flexible metadata values, all metadata is stored as JSON in the metadata column by default. If some of the used metadata keys and their value types are known, they can be stored in additional columns instead: create the target table with columns named after those keys and pass the key names to the HanaDB constructor via the specific_metadata_columns list. During inserts, metadata keys matching these columns are copied into the special columns. For keys in the specific_metadata_columns list, filters use the special columns instead of the metadata JSON column.
# Create a new table "PERFORMANT_CUSTOMTEXT_FILTER" with three "standard" columns and one additional column
my_own_table_name = "PERFORMANT_CUSTOMTEXT_FILTER"
cur = connection.cursor()
cur.execute(
    (
        f"CREATE TABLE {my_own_table_name} ("
        "CUSTOMTEXT NVARCHAR(500), "
        "MY_TEXT NVARCHAR(2048), "
        "MY_METADATA NVARCHAR(1024), "
        "MY_VECTOR REAL_VECTOR )"
    )
)

# Create a HanaDB instance with the own table
db = HanaDB(
    connection=connection,
    embedding=embeddings,
    table_name=my_own_table_name,
    content_column="MY_TEXT",
    metadata_column="MY_METADATA",
    vector_column="MY_VECTOR",
    specific_metadata_columns=["CUSTOMTEXT"],
)

# Add a simple document with some metadata
docs = [
    Document(
        page_content="Some other text",
        metadata={
            "start": 400,
            "end": 450,
            "doc_name": "other.txt",
            "CUSTOMTEXT": "Filters on this value are very performant",
        },
    )
]
db.add_documents(docs)

# Check if data has been inserted into our own table
cur.execute(f"SELECT * FROM {my_own_table_name} LIMIT 1")
rows = cur.fetchall()
print(
    rows[0][0]
)  # Value of column "CUSTOMTEXT". Should be "Filters on this value are very performant"
print(rows[0][1])  # The text
print(
    rows[0][2]
)  # The metadata without the "CUSTOMTEXT" data, as this is extracted into a separate column
embedding = parseFvecs(rows[0][3])
print(len(embedding), embedding[:3] + ['...'] + embedding[-3:])  # The vector

cur.close()
Filters on this value are very performant
Some other text
{"start": 400, "end": 450, "doc_name": "other.txt", "CUSTOMTEXT": "Filters on this value are very performant"}
768 [0.016170687973499298, -0.01129427831619978, -0.0005921399570070207, '...', 0.017849743366241455, 0.0003932560794055462, -0.00045805066474713385]
The special columns are completely transparent to the rest of the langchain interface. Everything works exactly as before, just more performant.
docs = [
    Document(
        page_content="Some more text",
        metadata={
            "start": 800,
            "end": 950,
            "doc_name": "more.txt",
            "CUSTOMTEXT": "Another customtext value",
        },
    )
]
db.add_documents(docs)

advanced_filter = {"CUSTOMTEXT": {"$like": "%value%"}}
query = "What's up?"
docs = db.similarity_search(query, k=2, filter=advanced_filter)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
Some more text
--------------------------------------------------------------------------------
Some other text

Simple example

Load the sample document "state_of_the_union.txt" and create chunks from it.
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

text_documents = TextLoader(
    "../../how_to/state_of_the_union.txt", encoding="UTF-8"
).load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
text_chunks = text_splitter.split_documents(text_documents)
print(f"Number of document chunks: {len(text_chunks)}")
Number of document chunks: 88
Add the loaded document chunks to the table. For this example, we delete any content that may already exist in the table from previous runs.
# Delete already existing documents from the table
db.delete(filter={})

# add the loaded document chunks
db.add_documents(text_chunks)
Query the two best-matching document chunks from the chunks added in the previous step. By default, "cosine similarity" is used for the search.
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query, k=2)

for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation's top legal minds, who will continue Justice Breyer's legacy of excellence.
--------------------------------------------------------------------------------
As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn't true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.
Query the same content with "Euclidean distance". The results should be the same as with "cosine similarity".
from langchain_hana.utils import DistanceStrategy

db = HanaDB(
    embedding=embeddings,
    connection=connection,
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
    table_name="STATE_OF_THE_UNION",
)

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query, k=2)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation's top legal minds, who will continue Justice Breyer's legacy of excellence.
--------------------------------------------------------------------------------
As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn't true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.

Maximal marginal relevance search (MMR)

Maximal marginal relevance optimizes for similarity to the query and diversity among the selected documents. The first 20 (fetch_k) items are retrieved from the database, and the MMR algorithm then finds the best 2 (k) matches.
docs = db.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation's top legal minds, who will continue Justice Breyer's legacy of excellence.
--------------------------------------------------------------------------------
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.

In this struggle as President Zelenskyy said in his speech to the European Parliament "Light will win over darkness." The Ukrainian Ambassador to the United States is here tonight.

Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.
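The greedy selection MMR performs can be sketched on toy vectors. This is a simplified illustration with a hypothetical diversity weight lam, not HanaDB's internal implementation:

```python
def mmr_select(query, candidates, k=2, lam=0.3):
    # Cosine similarity on plain Python lists
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: sum(x * x for x in v) ** 0.5
        return dot / (norm(a) * norm(b))

    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        # Greedy: trade query similarity against similarity to picks so far
        best = max(
            remaining,
            key=lambda i: lam * cos(query, candidates[i])
            - (1 - lam)
            * max((cos(candidates[i], candidates[j]) for j in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; MMR picks 0, then the diverse 2
print(mmr_select([1.0, 0.0], [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))  # [0, 2]
```

A pure similarity search would return the two near-duplicates; MMR trades the second duplicate for the more diverse candidate.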

Creating an HNSW vector index

A vector index can significantly speed up top-k nearest-neighbor queries over vectors. You can create a Hierarchical Navigable Small World (HNSW) vector index using the create_hnsw_index function.
# HanaDB instance uses cosine similarity as default:
db_cosine = HanaDB(
    embedding=embeddings, connection=connection, table_name="STATE_OF_THE_UNION"
)

# Attempting to create the HNSW index with default parameters
db_cosine.create_hnsw_index()  # If no other parameters are specified, the default values will be used
# Default values: m=64, ef_construction=128, ef_search=200
# The default index name will be: STATE_OF_THE_UNION_COSINE_idx


# Creating a HanaDB instance with L2 distance as the similarity function and defined values
db_l2 = HanaDB(
    embedding=embeddings,
    connection=connection,
    table_name="STATE_OF_THE_UNION",
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,  # Specify L2 distance
)

# This will create an index based on L2 distance strategy.
db_l2.create_hnsw_index(
    index_name="STATE_OF_THE_UNION_L2_index",
    m=100,  # Max number of neighbors per graph node (valid range: 4 to 1000)
    ef_construction=200,  # Max number of candidates during graph construction (valid range: 1 to 100000)
    ef_search=500,  # Min number of candidates during the search (valid range: 1 to 100000)
)

# Use L2 index to perform MMR
docs = db_l2.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation's top legal minds, who will continue Justice Breyer's legacy of excellence.
--------------------------------------------------------------------------------
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.

In this struggle as President Zelenskyy said in his speech to the European Parliament "Light will win over darkness." The Ukrainian Ambassador to the United States is here tonight.

Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.
Key points
  • Similarity function: The similarity function for the index defaults to cosine similarity. To use a different similarity function (e.g. L2 distance), specify it when initializing the HanaDB instance.
  • Default parameters: In the create_hnsw_index function, if no custom values are provided for parameters like m, ef_construction, or ef_search, the default values (e.g. m=64, ef_construction=128, ef_search=200) are used automatically. These values ensure the index is created with reasonable performance without requiring user intervention.