Ontotext GraphDB is a graph database and knowledge discovery tool compliant with RDF and SPARQL.
This notebook shows how to use LLMs to provide natural language querying (NLQ to SPARQL, also called text2sparql) for Ontotext GraphDB.

GraphDB LLM Functionalities

GraphDB supports some LLM integration functionalities as described here: gpt-queries
  • magic predicates to ask an LLM for text, list or table using data from your knowledge graph (KG)
  • query explanation
  • result explanation, summarization, rephrasing, translation
retrieval-graphdb-connector
  • indexing of KG entities in a vector database
  • supports any text embedding algorithm and vector database
  • uses the same powerful connector (indexing) language that GraphDB uses for Elastic, Solr and Lucene
  • automatic synchronization of changes in RDF data to the KG entity index
  • supports nested objects (no UI support in GraphDB version 10.5)
  • serializes KG entities to text, e.g. (based on a wine dataset):
Franvino:
- is a RedWine.
- made from grape Merlo.
- made from grape Cabernet Franc.
- has sugar dry.
- has year 2012.
talk-to-graph
  • a simple chatbot using a defined KG entity index
This tutorial won't use the GraphDB LLM integration, but will generate SPARQL from natural language queries (NLQ). We'll use the Star Wars API (SWAPI) ontology and dataset, which you can examine here.

Setup

You need a running GraphDB instance. This tutorial shows how to run the database locally using the GraphDB Docker image. It provides a Docker Compose setup, which populates GraphDB with the Star Wars dataset. All necessary files, including this notebook, can be downloaded from the GitHub repository langchain-graphdb-qa-chain-demo.
docker build --tag graphdb .
docker compose up -d graphdb
You need to wait a couple of seconds for the database to start on http://localhost:7200/. The Star Wars dataset starwars-data.trig is automatically loaded into the langchain repository. The local SPARQL endpoint http://localhost:7200/repositories/langchain can be used to run queries against. You can also open the GraphDB Workbench at http://localhost:7200/sparql, where you can make queries interactively.
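To sanity-check the endpoint, you can talk to it over plain HTTP before involving any LLM. The snippet below is a minimal stdlib-only sketch of a SPARQL Protocol GET request against the local repository (it assumes the Docker setup above; the actual `urlopen` call is left commented out so the snippet also runs without a live database):

```python
# Minimal sketch: build a SPARQL Protocol GET request against the local
# GraphDB repository (assumes the Docker Compose setup above is running).
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://localhost:7200/repositories/langchain"
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"

# The SPARQL Protocol passes the query in the 'query' parameter and
# negotiates the result format via the Accept header.
url = f"{ENDPOINT}?{urlencode({'query': query})}"
request = Request(url, headers={"Accept": "application/sparql-results+json"})
print(url)

# With GraphDB up, this would execute the query and return JSON bindings:
# print(urlopen(request).read().decode())
```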
  • Set up a working environment
If you use conda, create and activate a new conda environment, e.g.:
conda create -n graph_ontotext_graphdb_qa python=3.12
conda activate graph_ontotext_graphdb_qa
Install the following libraries:
pip install jupyter==1.1.1
pip install rdflib==7.1.1
pip install langchain-community==0.3.4
pip install langchain-openai==0.2.4
Run Jupyter with:
jupyter notebook

Specifying the ontology

In order for the LLM to be able to generate SPARQL, it needs to know the knowledge graph schema (the ontology). It can be provided using one of two parameters on the OntotextGraphDBGraph class:
  • query_ontology: a CONSTRUCT query that is executed on the SPARQL endpoint and returns the KG schema statements. We recommend that you store the ontology in its own named graph, which makes it easier to get only the relevant statements (as in the example below). DESCRIBE queries are not supported, because DESCRIBE returns the Symmetric Concise Bounded Description (SCBD), i.e. also the incoming class links. In case of large graphs with millions of instances, this is not efficient. See github.com/eclipse-rdf4j/rdf4j/issues/4857
  • local_file: a local RDF ontology file. Supported RDF formats are Turtle, RDF/XML, JSON-LD, N-Triples, Notation-3, Trig, Trix and N-Quads.
In either case, the ontology dump should:
  • Include enough information about classes, properties, property attachment to classes (using rdfs:domain, schema:domainIncludes or OWL restrictions) and taxonomies (important individuals).
  • Not include overly verbose and irrelevant definitions and examples that do not help SPARQL construction.
from langchain_community.graphs import OntotextGraphDBGraph

# feeding the schema using a user construct query

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    query_ontology="CONSTRUCT {?s ?p ?o} FROM <https://swapi.co/ontology/> WHERE {?s ?p ?o}",
)
# feeding the schema using a local RDF file

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    local_file="/path/to/langchain_graphdb_tutorial/starwars-ontology.nt",  # change the path here
)
In either case, the ontology (schema) is fed to the LLM as Turtle, since Turtle with appropriate prefixes is most compact and easiest for the LLM to remember. The Star Wars ontology is a bit unusual in that it includes a lot of specific triples about classes, e.g. that the species :Aleena live on <planet/38>, that they are a subclass of :Reptile, that they have certain typical characteristics (average height, average lifespan, skin color), and that specific individuals (characters) are representatives of that class:
@prefix : <https://swapi.co/vocabulary/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:Aleena a owl:Class, :Species ;
    rdfs:label "Aleena" ;
    rdfs:isDefinedBy <https://swapi.co/ontology/> ;
    rdfs:subClassOf :Reptile, :Sentient ;
    :averageHeight 80.0 ;
    :averageLifespan "79" ;
    :character <https://swapi.co/resource/aleena/47> ;
    :film <https://swapi.co/resource/film/4> ;
    :language "Aleena" ;
    :planet <https://swapi.co/resource/planet/38> ;
    :skinColor "blue", "gray" .

    ...

In order to keep this tutorial simple, we use an unsecured GraphDB. If GraphDB is secured, you should set the environment variables 'GRAPHDB_USERNAME' and 'GRAPHDB_PASSWORD' before initializing OntotextGraphDBGraph.
os.environ["GRAPHDB_USERNAME"] = "graphdb-user"
os.environ["GRAPHDB_PASSWORD"] = "graphdb-password"

graph = OntotextGraphDBGraph(
    query_endpoint=...,
    query_ontology=...
)

Question answering against the StarWars dataset

We can now use the OntotextGraphDBQAChain to ask some questions.
import os

from langchain_classic.chains import OntotextGraphDBQAChain
from langchain_openai import ChatOpenAI

# We'll be using an OpenAI model which requires an OpenAI API Key.
# However, other models are available as well:
# https://python.langchain.com/docs/integrations/chat/

# Set the environment variable `OPENAI_API_KEY` to your OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-***"

# Any available OpenAI model can be used here.
# We use 'gpt-4-1106-preview' because of the bigger context window.
# The 'gpt-4-1106-preview' model_name will deprecate in the future and will change to 'gpt-4-turbo' or similar,
# so be sure to consult with the OpenAI API https://platform.openai.com/docs/models for the correct naming.

chain = OntotextGraphDBQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,
)
Let's ask a simple one.
chain.invoke({chain.input_key: "What is the climate on Tatooine?"})[chain.output_key]


> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?climate
WHERE {
  ?planet rdfs:label "Tatooine" ;
          :climate ?climate .
}

> Finished chain.
'The climate on Tatooine is arid.'
And now a slightly more complicated question.
chain.invoke({chain.input_key: "What is the climate on Luke Skywalker's home planet?"})[
    chain.output_key
]
> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?climate
WHERE {
  ?character rdfs:label "Luke Skywalker" .
  ?character :homeworld ?planet .
  ?planet :climate ?climate .
}

> Finished chain.
"The climate on Luke Skywalker's home planet is arid."
We can also ask more complicated questions like
chain.invoke(
    {
        chain.input_key: "What is the average box office revenue for all the Star Wars movies?"
    }
)[chain.output_key]
> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT (AVG(?boxOffice) AS ?averageBoxOfficeRevenue)
WHERE {
  ?film a :Film .
  ?film :boxOffice ?boxOfficeValue .
  BIND(xsd:decimal(?boxOfficeValue) AS ?boxOffice)
}


> Finished chain.
'The average box office revenue for all the Star Wars movies is approximately 754.1 million dollars.'

Chain modifiers

The Ontotext GraphDB QA chain allows prompt refinement for further improvement of your QA chain and enhancing the overall user experience of your app.

"SPARQL generation" prompt

The prompt is used for SPARQL query generation based on the user question and the KG schema.
  • sparql_generation_prompt default value:
    GRAPHDB_SPARQL_GENERATION_TEMPLATE = """
    Write a SPARQL SELECT query for querying a graph database.
    The ontology schema delimited by triple backticks in Turtle format is:
    ```
    {schema}
    ```
    Use only the classes and properties provided in the schema to construct the SPARQL query.
    Do not use any classes or properties that are not explicitly provided in the SPARQL query.
    Include all necessary prefixes.
    Do not include any explanations or apologies in your responses.
    Do not wrap the query in backticks.
    Do not include any text except the SPARQL query generated.
    The question delimited by triple backticks is:
    ```
    {prompt}
    ```
    """
    GRAPHDB_SPARQL_GENERATION_PROMPT = PromptTemplate(
        input_variables=["schema", "prompt"],
        template=GRAPHDB_SPARQL_GENERATION_TEMPLATE,
    )
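The `input_variables` above correspond to the `{schema}` and `{prompt}` placeholders in the template string. As a plain-Python illustration (deliberately not using LangChain itself, and with the template shortened), the substitution behaves like standard `str.format`:

```python
# Plain-Python illustration of how the prompt's input variables are filled in.
# This is a shortened stand-in for the default template, not the real one.
template = (
    "Write a SPARQL SELECT query for querying a graph database.\n"
    "The ontology schema delimited by triple backticks in Turtle format is:\n"
    "```\n{schema}\n```\n"
    "The question delimited by triple backticks is:\n"
    "```\n{prompt}\n```\n"
)

filled = template.format(
    schema=":Planet a owl:Class .",
    prompt="What is the climate on Tatooine?",
)
print(filled)
```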
    

"SPARQL fixing" prompt

Sometimes, the LLM may generate a SPARQL query with syntactic errors, missing prefixes, etc. The chain will try to amend this by prompting the LLM to correct it a certain number of times.
  • sparql_fix_prompt default value:
    GRAPHDB_SPARQL_FIX_TEMPLATE = """
    This following SPARQL query delimited by triple backticks
    ```
    {generated_sparql}
    ```
    is not valid.
    The error delimited by triple backticks is
    ```
    {error_message}
    ```
    Give me a correct version of the SPARQL query.
    Do not change the logic of the query.
    Do not include any explanations or apologies in your responses.
    Do not wrap the query in backticks.
    Do not include any text except the SPARQL query generated.
    The ontology schema delimited by triple backticks in Turtle format is:
    ```
    {schema}
    ```
    """
    
    GRAPHDB_SPARQL_FIX_PROMPT = PromptTemplate(
        input_variables=["error_message", "generated_sparql", "schema"],
        template=GRAPHDB_SPARQL_FIX_TEMPLATE,
    )
    
  • max_fix_retries default value: 5
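The retry behavior can be pictured as a simple loop: validate the generated query, and on an error re-prompt the LLM up to `max_fix_retries` times. The following is only a schematic sketch of that logic, not the chain's actual implementation; `validate` and `ask_llm_to_fix` are hypothetical stand-ins for SPARQL parsing and the LLM call driven by GRAPHDB_SPARQL_FIX_PROMPT:

```python
# Schematic sketch of the fix-retry loop (not the chain's real code).
# `validate` returns an error message or None; `ask_llm_to_fix` stands in
# for re-prompting the LLM with the broken query and the error.
def repair_sparql(query, validate, ask_llm_to_fix, max_fix_retries=5):
    for _ in range(max_fix_retries):
        error = validate(query)
        if error is None:
            return query  # query parses: use it
        query = ask_llm_to_fix(query, error)
    raise ValueError("Could not produce a valid SPARQL query")

# Toy usage: the "LLM" fixes a missing PREFIX on the first retry.
broken = 'SELECT ?c WHERE { ?p rdfs:label "Tatooine" ; :climate ?c }'
fixed_by_llm = "PREFIX : <https://swapi.co/vocabulary/>\n" + broken

result = repair_sparql(
    broken,
    validate=lambda q: None if q.startswith("PREFIX") else "Unknown prefix",
    ask_llm_to_fix=lambda q, err: fixed_by_llm,
)
print(result.splitlines()[0])
```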

"Answer generation" prompt

The prompt is used for answering the question based on the results returned from the database and the initial user question. By default, the LLM is instructed to only use the information from the returned result(s). If the result set is empty, the LLM should inform us that it can't answer the question.
  • qa_prompt default value:
      GRAPHDB_QA_TEMPLATE = """Task: Generate a natural language response from the results of a SPARQL query.
      You are an assistant that creates well-written and human understandable answers.
      The information part contains the information provided, which you can use to construct an answer.
      The information provided is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
      Make your response sound like the information is coming from an AI assistant, but don't add any information.
      Don't use internal knowledge to answer the question, just say you don't know if no information is available.
      Information:
      {context}
    
      Question: {prompt}
      Helpful Answer:"""
      GRAPHDB_QA_PROMPT = PromptTemplate(
          input_variables=["context", "prompt"], template=GRAPHDB_QA_TEMPLATE
      )
    
Once you're finished playing with QA against GraphDB, you can shut down the Docker environment by running the following command from the directory with the Docker Compose file: docker compose down -v --remove-orphans