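Memgraph is an open-source graph database, tuned for dynamic analytics environments and compatible with Neo4j. To query the database, Memgraph uses Cypher, the most widely adopted, fully specified, open query language for property graph databases.
This notebook shows how to query Memgraph with natural language and how to construct a knowledge graph from unstructured data. Before that, make sure your environment is set up.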

Setting up

To go over this guide, you will need Docker and Python 3.x installed.
To quickly run Memgraph Platform (Memgraph database + MAGE library + Memgraph Lab) for the first time, do the following:
On Linux/macOS:
curl https://install.memgraph.com | sh
On Windows:
iwr https://windows.memgraph.com | iex
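Both commands run a script that downloads a Docker Compose file to your system, then builds and starts the memgraph-mage and memgraph-lab Docker services in two separate containers. Memgraph is now up and running! Read more about the installation process in the Memgraph documentation.
To use LangChain, install and import all the necessary packages. We'll use the package manager pip with the --user flag to ensure proper permissions. If you've installed Python 3.4 or a later version, pip is included by default. Install all the required packages with the following command: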
pip install langchain langchain-openai langchain-memgraph --user
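You can either run the provided code blocks in this notebook or use a separate Python file to experiment with Memgraph and LangChain.

Natural language querying

Memgraph's integration with LangChain includes natural language querying. To use it, first do all the necessary imports; we will discuss them as they appear in the code.
First, instantiate MemgraphLangChain. This object holds the connection to the running Memgraph instance. Make sure all the environment variables are set properly.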
import os

from langchain_core.prompts import PromptTemplate
from langchain_memgraph.chains.graph_qa import MemgraphQAChain
from langchain_memgraph.graphs.memgraph import MemgraphLangChain
from langchain_openai import ChatOpenAI

url = os.environ.get("MEMGRAPH_URI", "bolt://localhost:7687")
username = os.environ.get("MEMGRAPH_USERNAME", "")
password = os.environ.get("MEMGRAPH_PASSWORD", "")

graph = MemgraphLangChain(
    url=url, username=username, password=password, refresh_schema=False
)
refresh_schema is initially set to False because there is no data in the database yet and we want to avoid unnecessary database calls.
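Since the connection depends on the MEMGRAPH_URI value being well formed, it can help to check it before handing it to the graph object. The following is a minimal sketch using only the standard library; the helper name check_memgraph_uri and the list of accepted schemes are assumptions made for illustration, not part of the Memgraph integration.

```python
from typing import Tuple
from urllib.parse import urlparse


def check_memgraph_uri(uri: str) -> Tuple[str, int]:
    """Return (host, port) if `uri` looks like a Bolt URI, else raise ValueError."""
    parsed = urlparse(uri)
    # Assumed list of Bolt-style schemes; adjust it to what your driver accepts.
    if parsed.scheme not in {"bolt", "bolt+s", "bolt+ssc"}:
        raise ValueError(f"unexpected scheme in {uri!r}")
    if not parsed.hostname:
        raise ValueError(f"missing host in {uri!r}")
    # Fall back to 7687, the port used in this guide's default URI.
    return parsed.hostname, parsed.port or 7687


host, port = check_memgraph_uri("bolt://localhost:7687")
print(f"Memgraph expected at {host}:{port}")
```

A malformed value such as "localhost:7687" (no scheme separator) raises ValueError here instead of failing later inside the driver.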

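Populating the database

To populate the database, first make sure it's empty. The most efficient way to do that is to switch to the in-memory analytical storage mode, drop the graph, and switch back to the in-memory transactional mode. Learn more about Memgraph's storage modes.
The data we'll add to the database is about video games of different genres, available on various platforms and linked to their publishers.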
# Drop graph
graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
graph.query("DROP GRAPH")
graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")

# Creating and executing the seeding query
query = """
    MERGE (g:Game {name: "Baldur's Gate 3"})
    WITH g, ["PlayStation 5", "Mac OS", "Windows", "Xbox Series X/S"] AS platforms,
            ["Adventure", "Role-Playing Game", "Strategy"] AS genres
    FOREACH (platform IN platforms |
        MERGE (p:Platform {name: platform})
        MERGE (g)-[:AVAILABLE_ON]->(p)
    )
    FOREACH (genre IN genres |
        MERGE (gn:Genre {name: genre})
        MERGE (g)-[:HAS_GENRE]->(gn)
    )
    MERGE (p:Publisher {name: "Larian Studios"})
    MERGE (g)-[:PUBLISHED_BY]->(p);
"""

graph.query(query)
[]
Notice that the graph object holds the query method. That method executes a query in Memgraph, and MemgraphQAChain also uses it to query the database.
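The seeding query above hard-codes its values into the Cypher text. If you want to seed more than one game, a parameterized variant may be easier to reuse. The sketch below keeps the same structure and moves the values into query parameters. The names SEED_GAME_QUERY and game_params are hypothetical, and the final call assumes that graph.query accepts a parameter dictionary as its second argument, so it is left commented out.

```python
from typing import Any, Dict, List

# Same shape as the seeding query above, with the values moved into $parameters.
SEED_GAME_QUERY = """
MERGE (g:Game {name: $name})
FOREACH (platform IN $platforms |
    MERGE (p:Platform {name: platform})
    MERGE (g)-[:AVAILABLE_ON]->(p)
)
FOREACH (genre IN $genres |
    MERGE (gn:Genre {name: genre})
    MERGE (g)-[:HAS_GENRE]->(gn)
)
MERGE (pub:Publisher {name: $publisher})
MERGE (g)-[:PUBLISHED_BY]->(pub)
"""


def game_params(
    name: str, platforms: List[str], genres: List[str], publisher: str
) -> Dict[str, Any]:
    """Bundle one game's values into a parameter dict for SEED_GAME_QUERY."""
    return {
        "name": name,
        "platforms": list(platforms),
        "genres": list(genres),
        "publisher": publisher,
    }


params = game_params(
    "Baldur's Gate 3",
    ["PlayStation 5", "Mac OS", "Windows", "Xbox Series X/S"],
    ["Adventure", "Role-Playing Game", "Strategy"],
    "Larian Studios",
)
print(sorted(params))

# Assumption: graph.query takes the parameter dict as its second argument.
# graph.query(SEED_GAME_QUERY, params)
```

Keeping values out of the query text also avoids quoting problems with names that contain apostrophes or double quotes.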

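Refresh graph schema

Since new data has been created in Memgraph, the schema needs to be refreshed. The generated schema will be used by MemgraphQAChain to instruct the LLM to generate better Cypher queries.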
graph.refresh_schema()
To familiarize yourself with the data and verify the updated graph schema, you can print it with the following statement:
print(graph.get_schema)
Node labels and properties (name and type) are:
- labels: (:Platform)
  properties:
    - name: string
- labels: (:Genre)
  properties:
    - name: string
- labels: (:Game)
  properties:
    - name: string
- labels: (:Publisher)
  properties:
    - name: string

Nodes are connected with the following relationships:
(:Game)-[:HAS_GENRE]->(:Genre)
(:Game)-[:PUBLISHED_BY]->(:Publisher)
(:Game)-[:AVAILABLE_ON]->(:Platform)
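If you want to check the schema programmatically rather than by eye, the relationship lines in the printed text can be parsed into triples. This is a minimal sketch that assumes the schema text keeps the `(:A)-[:REL]->(:B)` line format shown above; the regular expression and the function name parse_relationships are illustrative, not part of the integration.

```python
import re
from typing import List, Tuple

# Matches lines such as (:Game)-[:HAS_GENRE]->(:Genre)
REL_PATTERN = re.compile(r"\(:(\w+)\)-\[:(\w+)\]->\(:(\w+)\)")


def parse_relationships(schema_text: str) -> List[Tuple[str, str, str]]:
    """Extract (source label, relationship type, target label) triples."""
    return REL_PATTERN.findall(schema_text)


# A copy of the relationship part of the schema printed above.
schema_text = """Nodes are connected with the following relationships:
(:Game)-[:HAS_GENRE]->(:Genre)
(:Game)-[:PUBLISHED_BY]->(:Publisher)
(:Game)-[:AVAILABLE_ON]->(:Platform)
"""

for source, rel_type, target in parse_relationships(schema_text):
    print(f"{source} --{rel_type}--> {target}")
```

In a live session you would pass graph.get_schema instead of the copied text.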

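Querying the database

To interact with the OpenAI API, you must configure your API key as an environment variable. This ensures proper authorization for your requests. OpenAI's documentation explains how to obtain an API key. To configure the API key, you can use the Python os package: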
os.environ["OPENAI_API_KEY"] = "your-key-here"
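Run the above code snippet if you're running the code within a Jupyter notebook.
Next, create MemgraphQAChain, which will be used in the question-answering process based on your graph data. The temperature parameter is set to zero to make the answers as predictable and consistent as possible. You can set the verbose parameter to True to receive more detailed messages about query generation.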
chain = MemgraphQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=graph,
    model_name="gpt-4-turbo",
    allow_dangerous_requests=True,
)
Now you can start asking questions!
response = chain.invoke("Which platforms is Baldur's Gate 3 available on?")
print(response["result"])
MATCH (:Game{name: "Baldur's Gate 3"})-[:AVAILABLE_ON]->(platform:Platform)
RETURN platform.name
Baldur's Gate 3 is available on PlayStation 5, Mac OS, Windows, and Xbox Series X/S.
response = chain.invoke("Is Baldur's Gate 3 available on Windows?")
print(response["result"])
MATCH (:Game{name: "Baldur's Gate 3"})-[:AVAILABLE_ON]->(:Platform{name: "Windows"})
RETURN "Yes"
Yes, Baldur's Gate 3 is available on Windows.

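Chain modifiers

To modify the behavior of your chain and obtain more context or additional information, you can modify the chain's parameters.

Return direct query results

The return_direct modifier specifies whether to return the direct results of the executed Cypher query or the processed natural language response.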
# Return the result of querying the graph directly
chain = MemgraphQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=graph,
    return_direct=True,
    allow_dangerous_requests=True,
    model_name="gpt-4-turbo",
)

response = chain.invoke("Which studio published Baldur's Gate 3?")
print(response["result"])
MATCH (g:Game {name: "Baldur's Gate 3"})-[:PUBLISHED_BY]->(p:Publisher)
RETURN p.name
[{'p.name': 'Larian Studios'}]

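Return query intermediate steps

The return_intermediate_steps chain modifier enhances the returned response by including the intermediate steps of the query in addition to the initial query result.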
# Return all the intermediate steps of query execution
chain = MemgraphQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=graph,
    allow_dangerous_requests=True,
    return_intermediate_steps=True,
    model_name="gpt-4-turbo",
)

response = chain.invoke("Is Baldur's Gate 3 an Adventure game?")
print(f"Intermediate steps: {response['intermediate_steps']}")
print(f"Final response: {response['result']}")
MATCH (:Game {name: "Baldur's Gate 3"})-[:HAS_GENRE]->(:Genre {name: "Adventure"})
RETURN "Yes"
Intermediate steps: [{'query': 'MATCH (:Game {name: "Baldur\'s Gate 3"})-[:HAS_GENRE]->(:Genre {name: "Adventure"})\nRETURN "Yes"'}, {'context': [{'"Yes"': 'Yes'}]}]
Final response: Yes.

Limit the number of query results

The top_k modifier can be used when you want to restrict the maximum number of query results.
# Limit the maximum number of results returned by query
chain = MemgraphQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=graph,
    top_k=2,
    allow_dangerous_requests=True,
    model_name="gpt-4-turbo",
)

response = chain.invoke("What genres are associated with Baldur's Gate 3?")
print(response["result"])
MATCH (:Game {name: "Baldur's Gate 3"})-[:HAS_GENRE]->(g:Genre)
RETURN g.name;
Adventure, Role-Playing Game
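To see how the three modifiers relate to each other, it can help to picture the chain as a small pipeline: generate Cypher, run it, then either return the rows or summarize them. The toy model below is a simplification written for this guide, not the MemgraphQAChain implementation. The function run_qa and the three stub callables are hypothetical stand-ins for the LLM and the database.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def run_qa(
    question: str,
    generate_cypher: Callable[[str], str],
    run_query: Callable[[str], List[Row]],
    summarize: Callable[[str, List[Row]], str],
    *,
    top_k: int = 10,
    return_direct: bool = False,
    return_intermediate_steps: bool = False,
) -> Dict[str, Any]:
    """Toy question-answering pipeline with the three modifiers."""
    cypher = generate_cypher(question)
    # top_k caps how many rows are kept from the query result.
    context = run_query(cypher)[:top_k]
    # return_direct skips the natural language summary.
    result = context if return_direct else summarize(question, context)
    response: Dict[str, Any] = {"result": result}
    if return_intermediate_steps:
        response["intermediate_steps"] = [{"query": cypher}, {"context": context}]
    return response


# Stubs standing in for the LLM and the database (no network access needed).
def fake_generate(question: str) -> str:
    return "MATCH (:Game)-[:HAS_GENRE]->(g:Genre) RETURN g.name"


def fake_run(cypher: str) -> List[Row]:
    return [
        {"g.name": "Adventure"},
        {"g.name": "Role-Playing Game"},
        {"g.name": "Strategy"},
    ]


def fake_summarize(question: str, rows: List[Row]) -> str:
    return ", ".join(row["g.name"] for row in rows)


question = "What genres are associated with Baldur's Gate 3?"
print(run_qa(question, fake_generate, fake_run, fake_summarize, top_k=2))
print(run_qa(question, fake_generate, fake_run, fake_summarize, return_direct=True))
```

In this model, top_k trims the rows before they reach the summarizer, which is why the summary with top_k=2 mentions only two genres.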

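Advanced querying

As the complexity of your solution grows, you might encounter different use cases that require careful handling. Ensuring your application's scalability is essential to maintain a smooth user flow.
Let's instantiate our chain once again and try some questions that users might ask.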
chain = MemgraphQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=graph,
    model_name="gpt-4-turbo",
    allow_dangerous_requests=True,
)

response = chain.invoke("Is Baldur's Gate 3 available on PS5?")
print(response["result"])
MATCH (:Game{name: "Baldur's Gate 3"})-[:AVAILABLE_ON]->(:Platform{name: "PS5"})
RETURN "Yes"
I don't know the answer.
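The generated Cypher query looks fine, but we didn't receive any information in response. This illustrates a common challenge when working with LLMs: the misalignment between how users phrase queries and how data is stored. In this case, the user says "PS5" while the platform node is stored as "PlayStation 5", so the query matches nothing. Prompt refinement, the process of honing the model's prompts so that it better grasps these distinctions, is an efficient way to tackle the issue. With a refined prompt, the model becomes better at generating precise and relevant queries, so the desired data is retrieved.

Prompt refinement

To address this, we can adjust the initial Cypher prompt of the QA chain. This involves adding guidance to the LLM on how users may refer to specific platforms, such as PS5 in our case. We achieve this with the LangChain PromptTemplate by creating a modified initial prompt, which is then supplied as an argument to our refined MemgraphQAChain instance.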
MEMGRAPH_GENERATION_TEMPLATE = """Your task is to directly translate natural language inquiry into precise and executable Cypher query for Memgraph database.
You will utilize a provided database schema to understand the structure, nodes and relationships within the Memgraph database.
Instructions:
- Use provided node and relationship labels and property names from the
schema which describes the database's structure. Upon receiving a user
question, synthesize the schema to craft a precise Cypher query that
directly corresponds to the user's intent.
- Generate valid executable Cypher queries on top of Memgraph database.
Any explanation, context, or additional information that is not a part
of the Cypher query syntax should be omitted entirely.
- Use Memgraph MAGE procedures instead of Neo4j APOC procedures.
- Do not include any explanations or apologies in your responses.
- Do not include any text except the generated Cypher statement.
- For queries that ask for information or functionalities outside the direct
generation of Cypher queries, use the Cypher query format to communicate
limitations or capabilities. For example: RETURN "I am designed to generate
Cypher queries based on the provided schema only."
Schema:
{schema}

With all the above information and instructions, generate Cypher query for the
user question.
If the user asks about PS5, Play Station 5 or PS 5, that is the platform called PlayStation 5.

The question is:
{question}"""

MEMGRAPH_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=MEMGRAPH_GENERATION_TEMPLATE
)

chain = MemgraphQAChain.from_llm(
    ChatOpenAI(temperature=0),
    cypher_prompt=MEMGRAPH_GENERATION_PROMPT,
    graph=graph,
    model_name="gpt-4-turbo",
    allow_dangerous_requests=True,
)

response = chain.invoke("Is Baldur's Gate 3 available on PS5?")
print(response["result"])
MATCH (:Game{name: "Baldur's Gate 3"})-[:AVAILABLE_ON]->(:Platform{name: "PlayStation 5"})
RETURN "Yes"
Yes, Baldur's Gate 3 is available on PS5.
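Now, with the revised initial Cypher prompt that includes guidance on platform naming, we obtain accurate and relevant results that align more closely with user queries.
This approach allows for further improvement of your QA chain. You can easily integrate extra prompt refinement data into your chain, thereby enhancing the overall user experience of your app.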
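As the number of naming hints grows, writing each one by hand into the template becomes tedious. One possible approach, sketched below, is to keep the aliases in a dictionary and generate the hint lines from it. The helper alias_hints is hypothetical, and the "Windows" entry is an invented example; only the PlayStation 5 aliases come from the prompt above.

```python
from typing import Dict, List

PLATFORM_ALIASES: Dict[str, List[str]] = {
    # Taken from the refined prompt above.
    "PlayStation 5": ["PS5", "Play Station 5", "PS 5"],
    # Hypothetical extra entry, for illustration only.
    "Windows": ["PC"],
}


def alias_hints(aliases: Dict[str, List[str]]) -> str:
    """Render one hint sentence per canonical platform name."""
    hints = []
    for canonical, variants in aliases.items():
        if len(variants) > 1:
            listed = ", ".join(variants[:-1]) + " or " + variants[-1]
        else:
            listed = variants[0]
        hints.append(
            f"If the user asks about {listed}, that is the platform called {canonical}."
        )
    return "\n".join(hints)


print(alias_hints(PLATFORM_ALIASES))
```

The generated lines can be pasted into the template text just before "The question is:", leaving the {schema} and {question} placeholders untouched.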

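Constructing knowledge graph

Transforming unstructured data into structured data is not an easy or straightforward task. This guide shows how LLMs can help with it and how to construct a knowledge graph in Memgraph. Once the knowledge graph is created, you can use it for your GraphRAG application.
Constructing a knowledge graph from text takes two steps, covered in turn below: extracting structured information from the text, and storing it into Memgraph.

Extracting structured information from text

Besides all the imports from the setup section, import LLMGraphTransformer and Document, which will be used to extract structured information from text. LLMGraphTransformer comes from the langchain-experimental package; install it with pip if it is not already present in your environment.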
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
Below is an example text about Charles Darwin from which a knowledge graph will be constructed.
text = """
    Charles Robert Darwin was an English naturalist, geologist, and biologist,
    widely known for his contributions to evolutionary biology. His proposition that
    all species of life have descended from a common ancestor is now generally
    accepted and considered a fundamental scientific concept. In a joint
    publication with Alfred Russel Wallace, he introduced his scientific theory that
    this branching pattern of evolution resulted from a process he called natural
    selection, in which the struggle for existence has a similar effect to the
    artificial selection involved in selective breeding. Darwin has been
    described as one of the most influential figures in human history and was
    honoured by burial in Westminster Abbey.
"""
The next step is to initialize LLMGraphTransformer from the desired LLM and convert the document to the graph structure.
llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")
llm_transformer = LLMGraphTransformer(llm=llm)
documents = [Document(page_content=text)]
graph_documents = llm_transformer.convert_to_graph_documents(documents)
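Under the hood, the LLM extracts important entities from the text and returns them as a list of nodes and relationships. Here's how it looks: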
print(graph_documents)
[GraphDocument(nodes=[Node(id='Charles Robert Darwin', type='Person', properties={}), Node(id='English', type='Nationality', properties={}), Node(id='Naturalist', type='Profession', properties={}), Node(id='Geologist', type='Profession', properties={}), Node(id='Biologist', type='Profession', properties={}), Node(id='Evolutionary Biology', type='Field', properties={}), Node(id='Common Ancestor', type='Concept', properties={}), Node(id='Scientific Concept', type='Concept', properties={}), Node(id='Alfred Russel Wallace', type='Person', properties={}), Node(id='Natural Selection', type='Concept', properties={}), Node(id='Selective Breeding', type='Concept', properties={}), Node(id='Westminster Abbey', type='Location', properties={})], relationships=[Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='English', type='Nationality', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Naturalist', type='Profession', properties={}), type='PROFESSION', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Geologist', type='Profession', properties={}), type='PROFESSION', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Biologist', type='Profession', properties={}), type='PROFESSION', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Evolutionary Biology', type='Field', properties={}), type='CONTRIBUTION', properties={}), Relationship(source=Node(id='Common Ancestor', type='Concept', properties={}), target=Node(id='Scientific Concept', type='Concept', properties={}), type='BASIS', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Alfred Russel Wallace', type='Person', 
properties={}), type='COLLABORATION', properties={}), Relationship(source=Node(id='Natural Selection', type='Concept', properties={}), target=Node(id='Selective Breeding', type='Concept', properties={}), type='COMPARISON', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Westminster Abbey', type='Location', properties={}), type='BURIAL', properties={})], source=Document(metadata={}, page_content='\n    Charles Robert Darwin was an English naturalist, geologist, and biologist,\n    widely known for his contributions to evolutionary biology. His proposition that\n    all species of life have descended from a common ancestor is now generally\n    accepted and considered a fundamental scientific concept. In a joint\n    publication with Alfred Russel Wallace, he introduced his scientific theory that\n    this branching pattern of evolution resulted from a process he called natural\n    selection, in which the struggle for existence has a similar effect to the\n    artificial selection involved in selective breeding. Darwin has been\n    described as one of the most influential figures in human history and was\n    honoured by burial in Westminster Abbey.\n'))]
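The raw printout is hard to read. A short helper that renders each relationship as a one-line triple can make the extraction easier to inspect. The sketch below relies only on the attribute names visible in the output above (nodes, relationships, source, target, id, type). To stay runnable without an LLM call, it uses SimpleNamespace stand-ins instead of real GraphDocument objects, and the function name relationship_triples is hypothetical.

```python
from types import SimpleNamespace
from typing import Any, List


def relationship_triples(graph_document: Any) -> List[str]:
    """Render each relationship as (source)-[:TYPE]->(target)."""
    return [
        f"({rel.source.id})-[:{rel.type}]->({rel.target.id})"
        for rel in graph_document.relationships
    ]


# Stand-ins mimicking part of the structure printed above.
darwin = SimpleNamespace(id="Charles Robert Darwin", type="Person")
wallace = SimpleNamespace(id="Alfred Russel Wallace", type="Person")
abbey = SimpleNamespace(id="Westminster Abbey", type="Location")
example_document = SimpleNamespace(
    nodes=[darwin, wallace, abbey],
    relationships=[
        SimpleNamespace(source=darwin, target=wallace, type="COLLABORATION"),
        SimpleNamespace(source=darwin, target=abbey, type="BURIAL"),
    ],
)

for triple in relationship_triples(example_document):
    print(triple)
```

With real data you would call relationship_triples(graph_documents[0]) instead.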

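Storing into Memgraph

Once you have the data ready in the GraphDocument format, that is, as nodes and relationships, you can use the add_graph_documents method to import it into Memgraph. That method transforms the list of graph_documents into the appropriate Cypher queries that need to be executed in Memgraph. Once that's done, the knowledge graph is stored in Memgraph.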
# Empty the database
graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
graph.query("DROP GRAPH")
graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")

# Create KG
graph.add_graph_documents(graph_documents)
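The graph construction process is non-deterministic, since the LLM used to generate nodes and relationships from unstructured data is itself non-deterministic.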
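To give a rough idea of what turning a relationship into a Cypher query can look like, here is an illustrative sketch. It is not the query that add_graph_documents actually generates; the function relationship_to_cypher and its label check are assumptions made for this example. Node ids travel as query parameters, while labels and relationship types, which Cypher does not accept as parameters, are validated and interpolated.

```python
import re
from typing import Any, Dict, Tuple


def _safe_name(name: str) -> str:
    """Allow only word characters in labels and relationship types."""
    if not re.fullmatch(r"\w+", name):
        raise ValueError(f"unsafe label or relationship type: {name!r}")
    return name


def relationship_to_cypher(
    source_id: str, source_label: str, rel_type: str, target_id: str, target_label: str
) -> Tuple[str, Dict[str, Any]]:
    """Build a MERGE statement and its parameters for one relationship."""
    query = (
        f"MERGE (a:`{_safe_name(source_label)}` {{id: $source_id}}) "
        f"MERGE (b:`{_safe_name(target_label)}` {{id: $target_id}}) "
        f"MERGE (a)-[:`{_safe_name(rel_type)}`]->(b)"
    )
    return query, {"source_id": source_id, "target_id": target_id}


query, params = relationship_to_cypher(
    "Charles Robert Darwin", "Person", "COLLABORATION", "Alfred Russel Wallace", "Person"
)
print(query)
print(params)
```

Using MERGE rather than CREATE means that running the statement twice does not duplicate nodes or relationships.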

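Additional options

Additionally, you have the flexibility to define specific types of nodes and relationships for extraction according to your requirements.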
llm_transformer_filtered = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Nationality", "Concept"],
    allowed_relationships=["NATIONALITY", "INVOLVED_IN", "COLLABORATES_WITH"],
)
graph_documents_filtered = llm_transformer_filtered.convert_to_graph_documents(
    documents
)

print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")
Nodes:[Node(id='Charles Robert Darwin', type='Person', properties={}), Node(id='English', type='Nationality', properties={}), Node(id='Evolutionary Biology', type='Concept', properties={}), Node(id='Natural Selection', type='Concept', properties={}), Node(id='Alfred Russel Wallace', type='Person', properties={})]
Relationships:[Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='English', type='Nationality', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Evolutionary Biology', type='Concept', properties={}), type='INVOLVED_IN', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Natural Selection', type='Concept', properties={}), type='INVOLVED_IN', properties={}), Relationship(source=Node(id='Charles Robert Darwin', type='Person', properties={}), target=Node(id='Alfred Russel Wallace', type='Person', properties={}), type='COLLABORATES_WITH', properties={})]
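Because the extraction is driven by an LLM, you may want to double-check that the result really stays within the types you allowed. The following is a minimal sketch of such a check. The helper find_disallowed is hypothetical, and the sample data uses SimpleNamespace stand-ins, including a deliberately invalid Location node, rather than real extraction output.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def find_disallowed(
    nodes: Iterable[Any],
    relationships: Iterable[Any],
    allowed_nodes: List[str],
    allowed_relationships: List[str],
) -> Tuple[List[str], List[str]]:
    """Return node ids and relationship types that fall outside the allowed sets."""
    bad_nodes = [node.id for node in nodes if node.type not in allowed_nodes]
    bad_rels = [
        rel.type for rel in relationships if rel.type not in allowed_relationships
    ]
    return bad_nodes, bad_rels


# Stand-in data: the Location node and BURIAL relationship are invented violations.
sample_nodes = [
    SimpleNamespace(id="Charles Robert Darwin", type="Person"),
    SimpleNamespace(id="Westminster Abbey", type="Location"),
]
sample_relationships = [
    SimpleNamespace(type="NATIONALITY"),
    SimpleNamespace(type="BURIAL"),
]

print(
    find_disallowed(
        sample_nodes,
        sample_relationships,
        allowed_nodes=["Person", "Nationality", "Concept"],
        allowed_relationships=["NATIONALITY", "INVOLVED_IN", "COLLABORATES_WITH"],
    )
)
```

Two empty lists mean that the extracted graph respects both constraints.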
Your graph can also have the __Entity__ label on all nodes, which will be indexed for faster retrieval.
# Drop graph
graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
graph.query("DROP GRAPH")
graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")

# Store to Memgraph with Entity label
graph.add_graph_documents(graph_documents, baseEntityLabel=True)
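There is also an option to include the source of the information obtained in the graph. To do that, set include_source to True; the source document is then stored and linked to the nodes in the graph using the MENTIONS relationship.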
# Drop graph
graph.query("STORAGE MODE IN_MEMORY_ANALYTICAL")
graph.query("DROP GRAPH")
graph.query("STORAGE MODE IN_MEMORY_TRANSACTIONAL")

# Store to Memgraph with source included
graph.add_graph_documents(graph_documents, include_source=True)
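Notice how the content of the source is stored, and that an id property is generated because the document didn't have any id.
You can combine the __Entity__ label with the document source. Still, be aware that both take up memory, especially the included source, because of its long content strings.
Finally, you can query the knowledge graph as explained in the earlier sections: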
chain = MemgraphQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=graph,
    model_name="gpt-4-turbo",
    allow_dangerous_requests=True,
)
print(chain.invoke("Who Charles Robert Darwin collaborated with?")["result"])
MATCH (:Person {id: "Charles Robert Darwin"})-[:COLLABORATION]->(collaborator)
RETURN collaborator;
Alfred Russel Wallace