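Dedoc integration

This example demonstrates how to use Dedoc with LangChain as a DocumentLoader.

Overview

Dedoc is an open-source library/service that extracts text, tables, attached files, and document structure (e.g., titles, list items) from files in various formats. Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images, and more. The full list of supported formats is available in the Dedoc documentation.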

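Integration details

Class	Package	Local	Serializable	JS support
DedocFileLoader	langchain_community	❌	beta	❌
DedocPDFLoader	langchain_community	❌	beta	❌
DedocAPIFileLoader	langchain_community	❌	beta	❌

Loader features

The lazy-loading and async-loading methods are available, but documents are actually loaded synchronously, so neither feature is marked as supported in the table below.

Source	Document lazy loading	Async support
DedocFileLoader	❌	❌
DedocPDFLoader	❌	❌
DedocAPIFileLoader	❌	❌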
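
Because loading is synchronous, calling a Dedoc loader directly inside an async application would block the event loop while the file is parsed. The following is a minimal sketch of one possible workaround, assuming Python 3.9+ (for asyncio.to_thread). The names load_in_thread and StubLoader are hypothetical and used only for illustration; the stub stands in for a real Dedoc loader so the snippet runs without dedoc installed.

```python
import asyncio
from typing import Any, List


async def load_in_thread(loader: Any) -> List[Any]:
    """Run the blocking loader.load() call in a worker thread."""
    return await asyncio.to_thread(loader.load)


class StubLoader:
    """Hypothetical stand-in for a Dedoc loader; only load() is needed."""

    def load(self) -> List[str]:
        return ["first document", "second document"]


# Replace StubLoader() with a real loader instance in practice.
docs = asyncio.run(load_in_thread(StubLoader()))
print(len(docs))
```

Any object that exposes a load() method can be passed in, so the same helper should work for all three loader classes listed above.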

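Setup

To access the DedocFileLoader and DedocPDFLoader document loaders, you need to install the dedoc integration package.
To access DedocAPIFileLoader, you need to run the Dedoc service, e.g. in a Docker container (see the documentation for more details):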

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc

Installation instructions for Dedoc are available in the Dedoc documentation.

# Install package
pip install --quiet "dedoc[torch]"

Instantiation

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader("./example_data/state_of_the_union.txt")

Load

docs = loader.load()
docs[0].page_content[:100]

'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t'

Lazy load

docs = loader.lazy_load()

for doc in docs:
    print(doc.page_content[:100])
    break

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t
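
The loop above stops after the first document by using break. When more than one document is needed, itertools.islice expresses the same idea more directly. This is a small sketch, not part of the Dedoc API: take and stub_lazy_load are hypothetical names, and the stub generator stands in for loader.lazy_load(). Since loading is synchronous, the file may still be parsed in full before the first document is yielded, so this limits how many documents are consumed rather than how much parsing is done.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def take(documents: Iterable[Any], n: int) -> List[Any]:
    """Return at most n items from an iterable without consuming the rest."""
    return list(islice(documents, n))


def stub_lazy_load() -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load()."""
    for i in range(1000):
        yield f"document {i}"


# With a real loader this would be: take(loader.lazy_load(), 3)
first_three = take(stub_lazy_load(), 3)
print(first_three)
```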

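API reference

For detailed information on configuring and calling Dedoc loaders, please see the API reference for DedocFileLoader, DedocPDFLoader, and DedocAPIFileLoader.

Loading any file

DedocFileLoader is useful for automatically handling any file in a supported format. The file loader detects the file type automatically, provided the file has a correct extension. File parsing can be configured through dedoc_kwargs when the DedocFileLoader class is initialized. Basic examples of some options are given below; please see the documentation of DedocFileLoader and the dedoc documentation for more details about configuration parameters.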

Basic example

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader("./example_data/state_of_the_union.txt")

docs = loader.load()

docs[0].page_content[:400]

'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '

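Modes of split

DedocFileLoader supports several types of document splitting into parts (each part is returned separately). For this purpose, use the split parameter with one of the following options:

document (default): the document text is returned as a single langchain Document object (no splitting);
page: the document text is split into pages (works for PDF, DJVU, PPTX, PPT, ODP);
node: the document text is split into Dedoc tree nodes (title nodes, list item nodes, raw text nodes);
line: the document text is split into text lines.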

loader = DedocFileLoader(
    "./example_data/layout-parser-paper.pdf",
    split="page",
    pages=":2",
)

docs = loader.load()

len(docs)
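
To get a feel for how the split modes differ on a given file, it can help to count the documents each mode produces. The sketch below assumes only that the loader class accepts a file path plus a split keyword and exposes load(), as in the example above. The names count_documents_per_split and StubLoader are hypothetical; the stub replaces DedocFileLoader so the snippet runs without dedoc installed, and its counts are made up.

```python
from typing import Any, Callable, Dict, Sequence

# The four split modes listed above.
SPLIT_MODES = ("document", "page", "node", "line")


def count_documents_per_split(
    make_loader: Callable[..., Any],
    path: str,
    modes: Sequence[str] = SPLIT_MODES,
) -> Dict[str, int]:
    """Load the same file once per split mode and count the results."""
    counts = {}
    for mode in modes:
        loader = make_loader(path, split=mode)
        counts[mode] = len(loader.load())
    return counts


class StubLoader:
    """Hypothetical stand-in for DedocFileLoader with invented sizes."""

    _sizes = {"document": 1, "page": 2, "node": 5, "line": 9}

    def __init__(self, path: str, split: str = "document") -> None:
        self.path = path
        self.split = split

    def load(self) -> list:
        return [f"part {i}" for i in range(self._sizes[self.split])]


counts = count_documents_per_split(StubLoader, "./example_data/layout-parser-paper.pdf")
print(counts)
```

With dedoc installed, passing DedocFileLoader in place of StubLoader should give real counts, at the cost of parsing the file once per mode.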

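Handling tables

DedocFileLoader handles tables when the with_tables parameter is set to True during loader initialization (with_tables=True by default). Tables are not split: each table corresponds to one langchain Document object. For tables, the Document object has the additional metadata fields type="table" and text_as_html, which holds the HTML representation of the table.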

loader = DedocFileLoader("./example_data/mlb_teams_2012.csv")

docs = loader.load()

docs[1].metadata["type"], docs[1].metadata["text_as_html"][:200]

('table',
 '<table border="1" style="border-collapse: collapse; width: 100%;">\n<tbody>\n<tr>\n<td colspan="1" rowspan="1">Team</td>\n<td colspan="1" rowspan="1"> &quot;Payroll (millions)&quot;</td>\n<td colspan="1" r')
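
Since table documents are marked only through their metadata, picking them out of a mixed result list is a simple filter. The following sketch relies on the two metadata fields described above (type and text_as_html). The function name table_html is hypothetical, and SimpleNamespace objects stand in for langchain Document objects so the snippet runs on its own.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def table_html(docs: Iterable[Any]) -> List[str]:
    """Collect the HTML of every document whose metadata marks it as a table."""
    return [
        doc.metadata["text_as_html"]
        for doc in docs
        if doc.metadata.get("type") == "table"
    ]


# Stand-ins for the Document objects returned by loader.load().
sample_docs = [
    SimpleNamespace(page_content="Team, Payroll", metadata={}),
    SimpleNamespace(
        page_content="",
        metadata={"type": "table", "text_as_html": "<table><tbody></tbody></table>"},
    ),
]

tables = table_html(sample_docs)
print(len(tables))
```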

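Handling attachments

DedocFileLoader handles attachments when with_attachments is set to True during loader initialization (with_attachments=False by default). Attachments are split according to the split parameter. For attachments, the langchain Document object has the additional metadata field type="attachment".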

loader = DedocFileLoader(
    "./example_data/fake-email-attachment.eml",
    with_attachments=True,
)

docs = loader.load()

docs[1].metadata["type"], docs[1].page_content

('attachment',
 '\nContent-Type\nmultipart/mixed; boundary="0000000000005d654405f082adb7"\nDate\nFri, 23 Dec 2022 12:08:48 -0600\nFrom\nMallori Harrell [mallori@unstructured.io](mailto:mallori@unstructured.io)\nMIME-Version\n1.0\nMessage-ID\n[CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com](mailto:CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com)\nSubject\nFake email with attachment\nTo\nMallori Harrell [mallori@unstructured.io](mailto:mallori@unstructured.io)')
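
When attachments are enabled, the result list mixes the main document with its attachments. A possible way to separate the two, assuming only the type="attachment" metadata field described above, is sketched here. The name split_attachments is hypothetical, and SimpleNamespace objects again stand in for langchain Document objects.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def split_attachments(docs: Iterable[Any]) -> Tuple[List[Any], List[Any]]:
    """Partition documents into (main content, attachments), keeping order."""
    body, attachments = [], []
    for doc in docs:
        if doc.metadata.get("type") == "attachment":
            attachments.append(doc)
        else:
            body.append(doc)
    return body, attachments


# Stand-ins for the Document objects returned by loader.load().
sample_docs = [
    SimpleNamespace(page_content="email body", metadata={}),
    SimpleNamespace(page_content="attached text", metadata={"type": "attachment"}),
]

body, attachments = split_attachments(sample_docs)
print(len(body), len(attachments))
```

Table documents end up in the first list, because they carry type="table" rather than type="attachment".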

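Loading PDF file

If you only need to handle PDF documents, you can use DedocPDFLoader, which supports PDF only. The loader supports the same parameters for document splitting and for table and attachment extraction. Dedoc can extract text from PDFs with or without a textual layer, and it can automatically detect whether a textual layer is present and correct. Several PDF handlers are available; use the pdf_with_text_layer parameter to choose one of them. Please see the parameters description for more details. For PDFs without a textual layer, Tesseract OCR and its language packs must be installed; in that case, the Tesseract installation instructions can be useful.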

from langchain_community.document_loaders import DedocPDFLoader

loader = DedocPDFLoader(
    "./example_data/layout-parser-paper.pdf", pdf_with_text_layer="true", pages="2:2"
)

docs = loader.load()

docs[0].page_content[:400]

'\n2\n\nZ. Shen et al.\n\n37], layout detection [38, 22], table detection [26], and scene text detection [4].\n\nA generalized learning-based framework dramatically reduces the need for the\n\nmanual speciﬁcation of complicated rules, which is the status quo with traditional\n\nmethods. DL has the potential to transform DIA pipelines and beneﬁt a broad\n\nspectrum of large-scale document digitization projects.\n'
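
The PDF examples on this page pass the pages parameter as a string slice (":2" and "2:2"). Building that string by hand is error-prone, so a small helper can be handy. This sketch assumes the slice is 1-based and inclusive on both ends, which matches the "2:2" example above returning the content of page 2; check the Dedoc parameters description before relying on it. The name page_slice is hypothetical.

```python
from typing import Optional


def page_slice(start: Optional[int] = None, end: Optional[int] = None) -> str:
    """Build a pages string such as ':2' or '2:2'.

    Assumes 1-based page numbers with both ends included.
    """
    for value in (start, end):
        if value is not None and value < 1:
            raise ValueError("page numbers start at 1")
    if start is not None and end is not None and start > end:
        raise ValueError("start must not be greater than end")
    left = "" if start is None else str(start)
    right = "" if end is None else str(end)
    return f"{left}:{right}"


print(page_slice(end=2))
print(page_slice(2, 2))
```

The result can be passed directly as the pages argument, for example pages=page_slice(2, 2).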

Dedoc API

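If you want to get up and running with less setup, you can use Dedoc as a service. DedocAPIFileLoader can be used without installing the dedoc library. The loader supports the same parameters as DedocFileLoader and also detects input file types automatically. To use DedocAPIFileLoader, you need to run the Dedoc service, e.g. in a Docker container (see the documentation for more details):

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc

The example below uses our demo URL for illustration only. Please do not use https://dedoc-readme.hf.space in your own code.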

from langchain_community.document_loaders import DedocAPIFileLoader

loader = DedocAPIFileLoader(
    "./example_data/state_of_the_union.txt",
    url="https://dedoc-readme.hf.space",
)

docs = loader.load()

docs[0].page_content[:400]

'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '
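
To keep the demo URL out of real code, the service address can be read from configuration instead of being hard-coded. The sketch below is one possible approach under stated assumptions: DEDOC_URL is a hypothetical environment variable name, resolve_dedoc_url is a hypothetical helper, and the default http://localhost:1231 assumes the Docker container above is running locally with port 1231 published.

```python
import os

# Host of the public demo, which should not be used in real code.
DEMO_HOST = "dedoc-readme.hf.space"


def resolve_dedoc_url(default: str = "http://localhost:1231") -> str:
    """Return the Dedoc service URL from DEDOC_URL, falling back to a local one."""
    url = os.environ.get("DEDOC_URL", default).rstrip("/")
    if DEMO_HOST in url:
        raise ValueError("the Dedoc demo URL must not be used; run your own service")
    return url
```

The returned value can then be passed as the url argument, for example DedocAPIFileLoader("./example_data/state_of_the_union.txt", url=resolve_dedoc_url()).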
