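Clarifai is an AI platform that supports the full AI lifecycle: data exploration, data labeling, model training, evaluation, and inference. Once inputs are uploaded, a Clarifai application can be used as a vector database.
This notebook shows how to use functionality related to the Clarifai vector database. The examples demonstrate text semantic search. Clarifai also supports semantic search over images and video frames, localized search (see Rank), and attribute search (see Filter). To use Clarifai, you need an account and a Personal Access Token (PAT) key. You can get or create a PAT at https://clarifai.com/settings/security.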

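Dependencies

# Install required dependencies
# (langchain-text-splitters is listed because the imports below use it)
pip install -qU clarifai langchain-community langchain-text-splitters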

Imports

Set your personal access token here. You can find your PAT under settings/security on the platform.
# Please log in and get your PAT from https://clarifai.com/settings/security
from getpass import getpass

CLARIFAI_PAT = getpass()
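For non-interactive runs, the prompt above can be replaced with a lookup that only falls back to `getpass` when no token is already configured. This is a minimal sketch: the helper name `resolve_pat` is hypothetical, and reading the token from an environment variable named `CLARIFAI_PAT` is an assumption here, not something this page specifies.

```python
import os
from getpass import getpass


def resolve_pat(env=os.environ, prompt=getpass):
    """Return the PAT from the environment, prompting only if it is missing.

    `env` and `prompt` are injectable so the helper can be tested without
    touching the real environment or blocking on user input.
    """
    # Assumption: the token is stored under the name CLARIFAI_PAT.
    pat = env.get("CLARIFAI_PAT", "").strip()
    if not pat:
        # Fall back to an interactive prompt that does not echo the input.
        pat = prompt("Clarifai PAT: ").strip()
    if not pat:
        raise ValueError("A Clarifai PAT is required")
    return pat
```

With this helper, `CLARIFAI_PAT = resolve_pat()` could stand in for the bare `getpass()` call.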
# Import the required modules
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Clarifai
from langchain_text_splitters import CharacterTextSplitter

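Setup

Set the user ID and app ID of the app where the text data will be uploaded. You must first create an account on Clarifai and then create an app. Note: when creating the app, select a base workflow suited to indexing text documents, such as the Language-Understanding workflow.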
USER_ID = "USERNAME_ID"
APP_ID = "APPLICATION_ID"
NUMBER_OF_DOCS = 2

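From Texts

Create a Clarifai vector store from a list of texts. This section uploads each text, together with its metadata, to a Clarifai application. The application can then be used for semantic search to find relevant texts.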
texts = [
    "I really enjoy spending time with you",
    "I hate spending time with my dog",
    "I want to go for a run",
    "I went to the movies yesterday",
    "I love playing soccer with my friends",
]

metadatas = [
    {"id": i, "text": text, "source": "book 1", "category": ["books", "modern"]}
    for i, text in enumerate(texts)
]
Alternatively, you can assign custom IDs to the inputs.
idlist = ["text1", "text2", "text3", "text4", "text5"]
metadatas = [
    {"id": idlist[i], "text": text, "source": "book 1", "category": ["books", "modern"]}
    for i, text in enumerate(texts)
]
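Because the custom IDs are matched to the texts by position, a mismatch in length or a duplicate ID is easy to miss. The following sketch builds the same metadata shape as above while checking both conditions first. The helper `build_metadatas` and the `sample_*` names are hypothetical, introduced only for illustration.

```python
def build_metadatas(texts, ids, source, category):
    """Pair each text with its custom ID and shared metadata fields."""
    if len(ids) != len(texts):
        raise ValueError(f"expected {len(texts)} ids, got {len(ids)}")
    if len(set(ids)) != len(ids):
        raise ValueError("custom ids must be unique")
    return [
        # Copy the category list so entries do not share one mutable object.
        {"id": input_id, "text": text, "source": source, "category": list(category)}
        for input_id, text in zip(ids, texts)
    ]


sample_texts = ["I really enjoy spending time with you", "I want to go for a run"]
sample_ids = ["text1", "text2"]
sample_metadatas = build_metadatas(sample_texts, sample_ids, "book 1", ["books", "modern"])
```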
# The Clarifai vector store can be initialized with the pat argument.
clarifai_vector_db = Clarifai(
    user_id=USER_ID,
    app_id=APP_ID,
    number_of_docs=NUMBER_OF_DOCS,
    pat=CLARIFAI_PAT,  # the token collected with getpass() above
)
Upload the data to the Clarifai app.
# Upload with metadata and custom input IDs.
response = clarifai_vector_db.add_texts(texts=texts, ids=idlist, metadatas=metadatas)

# Upload without metadata (not recommended): metadata-based search will not be possible.
# Custom input IDs are optional.
response = clarifai_vector_db.add_texts(texts=texts)
You can also create the Clarifai vector store and ingest all the inputs into your app directly:
clarifai_vector_db = Clarifai.from_texts(
    user_id=USER_ID,
    app_id=APP_ID,
    texts=texts,
    metadatas=metadatas,
)
Search for similar texts using the similarity search function.
docs = clarifai_vector_db.similarity_search("I would like to see you")
docs
[Document(page_content='I really enjoy spending time with you', metadata={'text': 'I really enjoy spending time with you', 'id': 'text1', 'source': 'book 1', 'category': ['books', 'modern']})]
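You can also filter search results by metadata.
# Metadata filters allow powerful filtering within the app.
# This example limits the similarity query to texts whose "source" key equals "book 1".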
book1_similar_docs = clarifai_vector_db.similarity_search(
    "I would love to see you", filter={"source": "book 1"}
)

# You can also use lists in an input's metadata and then select inputs that match an item in the list. This is useful for categories, as shown below:
book_category_similar_docs = clarifai_vector_db.similarity_search(
    "I would love to see you", filter={"category": ["books"]}
)
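To make the two filter forms above concrete, here is a small local model of the matching rule as this page describes it: a scalar filter value must equal the stored value, and a list filter value matches when at least one of its items appears in the stored list. This is only an illustrative sketch. The function `matches_filter` is hypothetical, and the actual filtering is performed by Clarifai on the server side and may differ in detail.

```python
def matches_filter(metadata, flt):
    """Return True if `metadata` satisfies every key in the filter `flt`."""
    for key, wanted in flt.items():
        if key not in metadata:
            return False
        actual = metadata[key]
        if isinstance(wanted, list):
            # List filter: any overlap between wanted and stored items matches.
            actual_items = actual if isinstance(actual, list) else [actual]
            if not any(item in actual_items for item in wanted):
                return False
        elif actual != wanted:
            # Scalar filter: require exact equality.
            return False
    return True


example_metadata = {"id": "text1", "source": "book 1", "category": ["books", "modern"]}
by_source = matches_filter(example_metadata, {"source": "book 1"})
by_category = matches_filter(example_metadata, {"category": ["books"]})
by_other = matches_filter(example_metadata, {"category": ["films"]})
```

In this model `by_source` and `by_category` are true, while `by_other` is false because "films" is not among the stored categories.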

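From Documents

Create a Clarifai vector store from a list of documents. This section uploads each document, together with its metadata, to a Clarifai application. The application can then be used for semantic search to find relevant documents.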
loader = TextLoader("your_local_file_path.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
USER_ID = "USERNAME_ID"
APP_ID = "APPLICATION_ID"
NUMBER_OF_DOCS = 4
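The splitter call above uses `chunk_size=1000` and `chunk_overlap=0`. As a rough picture of what that means, the sketch below splits text on a separator and greedily merges the pieces into chunks of at most `chunk_size` characters, with no overlap. It is a simplified stand-in, not the `CharacterTextSplitter` implementation. The function `split_text` is hypothetical, and the blank-line separator is an assumption.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of <= chunk_size."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits in the chunk being built.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Start a new chunk; a single oversized piece is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "a" * 600 + "\n\n" + "b" * 600 + "\n\n" + "c" * 100
sample_chunks = split_text(sample_text, chunk_size=1000)
```

Here the first two paragraphs cannot share a chunk (600 + 2 + 600 characters exceeds 1000), so the result is two chunks: the first paragraph alone, then the second and third merged.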
Create a Clarifai vector DB class and ingest all the documents into the Clarifai app.
clarifai_vector_db = Clarifai.from_documents(
    user_id=USER_ID,
    app_id=APP_ID,
    documents=docs,
    number_of_docs=NUMBER_OF_DOCS,
)
docs = clarifai_vector_db.similarity_search("Texts related to population")
docs
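Conceptually, a semantic search like the one above ranks the indexed chunks by how close their embeddings are to the query embedding and keeps the best few. The toy sketch below shows that ranking with cosine similarity, assuming `number_of_docs` acts as a cap on the number of results. The vectors, the labels, and the helpers `cosine` and `top_k` are made up for illustration; Clarifai computes real embeddings and performs the ranking server-side.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, indexed, k):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hand-written toy embeddings, not produced by any model.
toy_index = [
    ("population statistics", [0.9, 0.1, 0.0]),
    ("movie review", [0.0, 0.2, 0.9]),
    ("census data", [0.8, 0.3, 0.1]),
    ("soccer match", [0.1, 0.9, 0.2]),
]
toy_results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
```

With these toy vectors, `toy_results` contains "population statistics" followed by "census data".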

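From existing App

Clarifai provides tools for adding data to applications (essentially projects) via the API or the UI. Most users will already have done this before interacting with LangChain, so this example uses the data in an existing app to perform searches. See our API docs and UI docs. The Clarifai application can then be used for semantic search to find relevant documents.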
USER_ID = "USERNAME_ID"
APP_ID = "APPLICATION_ID"
NUMBER_OF_DOCS = 4
clarifai_vector_db = Clarifai(
    user_id=USER_ID,
    app_id=APP_ID,
    number_of_docs=NUMBER_OF_DOCS,
)
docs = clarifai_vector_db.similarity_search(
    "Texts related to ammunition and president wilson"
)
docs[0].page_content
"President Wilson, generally acclaimed as the leader of the world's democracies,\nphrased for civilization the arguments against autocracy in the great peace conference\nafter the war. The President headed the American delegation to that conclave of world\nre-construction. With him as delegates to the conference were Robert Lansing, Secretary\nof State; Henry White, former Ambassador to France and Italy; Edward M. House and\nGeneral Tasker H. Bliss.\nRepresenting American Labor at the International Labor conference held in Paris\nsimultaneously with the Peace Conference were Samuel Gompers, president of the\nAmerican Federation of Labor; William Green, secretary-treasurer of the United Mine\nWorkers of America; John R. Alpine, president of the Plumbers' Union; James Duncan,\npresident of the International Association of Granite Cutters; Frank Duffy, president of\nthe United Brotherhood of Carpenters and Joiners, and Frank Morrison, secretary of the\nAmerican Federation of Labor.\nEstimating the share of each Allied nation in the great victory, mankind will\nconclude that the heaviest cost in proportion to prewar population and treasure was paid\nby the nations that first felt the shock of war, Belgium, Serbia, Poland and France. All\nfour were the battle-grounds of huge armies, oscillating in a bloody frenzy over once\nfertile fields and once prosperous towns.\nBelgium, with a population of 8,000,000, had a casualty list of more than 350,000;\nFrance, with its casualties of 4,000,000 out of a population (including its colonies) of\n90,000,000, is really the martyr nation of the world. Her gallant poilus showed the world\nhow cheerfully men may die in defense of home and liberty. Huge Russia, including\nhapless Poland, had a casualty list of 7,000,000 out of its entire population of\n180,000,000. 
The United States out of a population of 110,000,000 had a casualty list of\n236,117 for nineteen months of war; of these 53,169 were killed or died of disease;\n179,625 were wounded; and 3,323 prisoners or missing."