Grobid Integration

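GROBID is a machine learning library for extracting, parsing, and restructuring raw documents. It is designed for parsing academic papers, where it works particularly well. Note: large documents that exceed a certain number of elements (for example, dissertations) may fail to be processed by Grobid. This page covers how to use Grobid to parse articles for LangChain.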

Installation

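Grobid installation is described in detail at https://grobid.readthedocs.io/en/latest/Install-Grobid/. However, running Grobid in a Docker container is probably easier and less troublesome; the Grobid documentation covers that option as well.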
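After starting the server by either route, it can be handy to confirm from Python that it is reachable before pointing LangChain at it. The following is a minimal sketch, not part of LangChain: the helper name grobid_is_alive is hypothetical, the default address http://localhost:8070 is an assumption, and the check assumes the server's /api/isalive health-check endpoint answers with the text "true".

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if a Grobid server answers its health-check endpoint.

    Hypothetical helper: assumes GET <base_url>/api/isalive returns the
    plain-text body "true" when the service is up.
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8").strip().lower()
            return response.status == 200 and body == "true"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL:
        # treat all of these as "server not available".
        return False


if __name__ == "__main__":
    if grobid_is_alive():
        print("Grobid is reachable")
    else:
        print("Grobid is not reachable; check that the server or container is running")
```

Returning a boolean instead of raising keeps the check easy to use as a guard before a long batch of PDFs is submitted.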

Using Grobid with LangChain

Once Grobid is installed and running (you can check by visiting http://localhost:8070), you are ready to go. You can now use GrobidParser to produce documents:

from langchain_community.document_loaders.parsers import GrobidParser
from langchain_community.document_loaders.generic import GenericLoader

# Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False),
)
docs = loader.load()

# Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()

Chunk metadata will include bounding boxes. These are somewhat involved to parse, but they are explained at https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/
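As a rough illustration of working with those bounding boxes, the sketch below groups them by page. It rests on assumptions that should be checked against your installed version: that each document's metadata carries a "bboxes" entry holding the string form of a nested list of dictionaries, and that each dictionary has the keys "page", "x", "y", "h", and "w" with string values. The helper name boxes_by_page is hypothetical, and the key labels are passed through as found; consult the coordinates documentation linked above for their exact meaning.

```python
import ast
from collections import defaultdict


def boxes_by_page(bboxes_field):
    """Group bounding boxes by page number.

    Hypothetical helper. Assumes `bboxes_field` is either a nested list of
    dicts or the string representation of one, where each dict has the keys
    "page", "x", "y", "h", "w" (an assumption about the parser's metadata).
    """
    if isinstance(bboxes_field, str):
        # literal_eval only parses Python literals, so it is safe on this input.
        nested = ast.literal_eval(bboxes_field)
    else:
        nested = bboxes_field

    pages = defaultdict(list)

    def walk(node):
        # Leaves are dicts describing one box; anything else is a container.
        if isinstance(node, dict):
            box = {key: float(node[key]) for key in ("x", "y", "h", "w")}
            pages[int(node["page"])].append(box)
        else:
            for child in node:
                walk(child)

    walk(nested)
    return dict(pages)


if __name__ == "__main__":
    # Made-up sample value in the assumed format, for illustration only.
    sample = "[[{'page': '1', 'x': '10.0', 'y': '20.0', 'h': '30.0', 'w': '5.0'}]]"
    print(boxes_by_page(sample))
```

With real output, the call would look like boxes_by_page(docs[0].metadata["bboxes"]), assuming the metadata key is present.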
