Amazon Textract integration

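Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies extract data from scanned documents (such as PDFs, images, tables, and forms) by hand, or rely on simple OCR software that requires manual configuration and often has to be updated whenever a form changes. To replace these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.

Textract supports the JPEG, PNG, PDF, and TIFF file formats; more information is available in the Textract documentation. The following examples show how to use Amazon Textract with LangChain as a DocumentLoader.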

pip install -qU  boto3 langchain-openai tiktoken python-dotenv

pip install -qU  "amazon-textract-caller>=0.2.0"

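Example 1: Loading from a local file

The first example uses a local file, which is sent internally to the Amazon Textract synchronous API DetectDocumentText. For Textract, local files and URL endpoints such as HTTP:// are limited to single-page documents. Multi-page documents must be stored on S3. The sample file here is a JPEG.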

from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()

Output from the file

documents

[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No ', metadata={'source': 'example_data/alejandro_rosalez_sample-small.jpeg', 'page': 1})]

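Example 2: Loading from a URL

The next example loads a file from an HTTPS endpoint. The file must be single-page, because Amazon Textract requires all multi-page documents to be stored on S3.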

from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader(
    "https://amazon-textract-public-content.s3.us-east-2.amazonaws.com/langchain/alejandro_rosalez_sample_1.jpg"
)
documents = loader.load()

documents

[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No ', metadata={'source': 'example_data/alejandro_rosalez_sample-small.jpeg', 'page': 1})]
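The first two examples and the S3 case that follows differ only in the form of the path given to the loader. The sketch below shows one way to tell the three forms apart before calling Textract, so that a multi-page document is not sent from a local path or URL by mistake. The helper names classify_source and supports_multipage are hypothetical and are not part of LangChain or Textract; the single-page rule they encode is the one stated in the text above.

```python
from urllib.parse import urlparse


def classify_source(path: str) -> str:
    """Return 's3', 'url', or 'local' for a document path (hypothetical helper)."""
    scheme = urlparse(path).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "url"
    # Anything else (no scheme, or a drive letter on Windows) is treated as local.
    return "local"


def supports_multipage(path: str) -> bool:
    """Per the text above, only documents stored on S3 may have several pages."""
    return classify_source(path) == "s3"


for candidate in (
    "example_data/alejandro_rosalez_sample-small.jpeg",
    "https://amazon-textract-public-content.s3.us-east-2.amazonaws.com/langchain/alejandro_rosalez_sample_1.jpg",
    "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf",
):
    print(classify_source(candidate), supports_multipage(candidate))
```

A check like this only inspects the string; it does not open the file or count its pages.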

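Example 3: Loading a multi-page PDF document

Processing a multi-page document requires the document to be on S3. The sample document resides in a bucket in us-east-2, and Textract must be called in that same Region to succeed, so we set region_name on the client and pass the client to the loader to make sure Textract is called from us-east-2. Alternatively, you can run your notebook in us-east-2, set AWS_DEFAULT_REGION to us-east-2, or, when running in a different environment, pass in a boto3 Textract client with that Region name, as in the cell below.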

import boto3

textract_client = boto3.client("textract", region_name="us-east-2")

file_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"
loader = AmazonTextractPDFLoader(file_path, client=textract_client)
documents = loader.load()

Now get the number of pages to validate the response (printing the full response would be quite long...). We expect 16 pages.

len(documents)
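Beyond the raw count, it can be useful to see which pages came back and how much text each one holds. The following is a minimal sketch, assuming each returned Document carries a 1-based 'page' key in its metadata, as in the outputs shown for the earlier examples. The helper names are hypothetical, and stand-in objects replace real loader output so the snippet runs without calling Textract.

```python
from types import SimpleNamespace


def summarize_pages(documents):
    """Return (page, character_count) pairs in the order the loader produced them."""
    return [(doc.metadata.get("page"), len(doc.page_content)) for doc in documents]


def missing_pages(documents, expected_count):
    """Return the 1-based page numbers in 1..expected_count that have no Document."""
    seen = {doc.metadata.get("page") for doc in documents}
    return [n for n in range(1, expected_count + 1) if n not in seen]


# Stand-in objects (assumption) mimicking the page_content/metadata attributes.
fake_documents = [
    SimpleNamespace(page_content="first page text", metadata={"page": 1}),
    SimpleNamespace(page_content="third page", metadata={"page": 3}),
]

print(summarize_pages(fake_documents))   # [(1, 15), (3, 10)]
print(missing_pages(fake_documents, 3))  # [2]
```

With the real loader output, missing_pages(documents, 16) would be expected to return an empty list for the sample PDF.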

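Example 4: Customizing the output format

When Amazon Textract processes a PDF, it extracts all text, including elements such as headers, footers, and page numbers. This extra information can be "noisy" and reduce the usefulness of the output. The process of converting a document's two-dimensional layout into a clean one-dimensional text string is called linearization. AmazonTextractPDFLoader gives you fine-grained control over this process through the linearization_config parameter, which you can use to specify which elements to exclude from the final output. The following example shows how to hide headers, footers, and figures, producing a cleaner block of text. For more advanced use cases, see the related AWS blog post.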

from langchain_community.document_loaders import AmazonTextractPDFLoader
from textractor.data.text_linearization_config import TextLinearizationConfig

loader = AmazonTextractPDFLoader(
    "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf",
    linearization_config=TextLinearizationConfig(
        hide_header_layout=True,
        hide_footer_layout=True,
        hide_figure_layout=True,
    ),
)
documents = loader.load()
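To make the idea of linearization concrete, here is a toy sketch of the general technique: drop the unwanted block types, order the rest by position, and join them into one string. It is an illustration only and is not how Textract or the textractor library implements linearization. The LayoutBlock class, its fields, and the block labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayoutBlock:
    kind: str    # hypothetical label such as "header", "text", "figure", "footer"
    top: float   # vertical position, 0.0 is the top of the page
    left: float  # horizontal position, 0.0 is the left edge
    text: str


def linearize(blocks, hidden_kinds=frozenset()):
    """Flatten 2D blocks into one string, skipping any kinds listed in hidden_kinds."""
    visible = [b for b in blocks if b.kind not in hidden_kinds]
    # Reading order in this toy model: top to bottom, then left to right.
    ordered = sorted(visible, key=lambda b: (b.top, b.left))
    return "\n".join(b.text for b in ordered)


blocks = [
    LayoutBlock("footer", 0.95, 0.1, "Page 3"),
    LayoutBlock("text", 0.20, 0.1, "First paragraph."),
    LayoutBlock("header", 0.02, 0.1, "Running title"),
    LayoutBlock("text", 0.40, 0.1, "Second paragraph."),
]

print(linearize(blocks))
print("---")
print(linearize(blocks, hidden_kinds={"header", "footer"}))
```

The second call prints only the two paragraphs, which mirrors the effect of the hide_header_layout and hide_footer_layout flags used above.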

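Using the AmazonTextractPDFLoader in a LangChain chain (e.g. OpenAI)

The AmazonTextractPDFLoader can be used in a chain in the same way as other loaders. Textract itself also has a Query feature, which offers functionality similar to the QA chain in this example and is worth checking out as well.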

# You can also store your OPENAI_API_KEY in a .env file
# import os
# from dotenv import load_dotenv

# load_dotenv()

# Or set the OpenAI key directly in the environment
import os

os.environ["OPENAI_API_KEY"] = "your-OpenAI-API-key"

from langchain_classic.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI

chain = load_qa_chain(llm=OpenAI(), chain_type="map_reduce")
query = ["Who are the authors?"]

chain.run(input_documents=documents, question=query)

' The authors are Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li, Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., Zettlemoyer, L., Lukasz Garncarek, Powalski, R., Stanislawek, T., Topolski, B., Halama, P., Gralinski, F., Graves, A., Fernández, S., Gomez, F., Schmidhuber, J., Harley, A.W., Ufkes, A., Derpanis, K.G., He, K., Gkioxari, G., Dollár, P., Girshick, R., He, K., Zhang, X., Ren, S., Sun, J., Kay, A., Lamiroy, B., Lopresti, D., Mears, J., Jakeway, E., Ferriter, M., Adams, C., Yarasavage, N., Thomas, D., Zwaard, K., Li, M., Cui, L., Huang,'
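The answer above appears to mix the paper's own authors with names formatted like citation entries, presumably because all pages, including the reference list, were passed to the chain. One possible way to narrow the input is to keep only selected pages before running the chain. This is a minimal sketch, assuming each Document carries a 1-based 'page' key in its metadata as in the earlier outputs; select_pages is a hypothetical helper, and stand-in objects are used so the snippet runs without Textract or OpenAI.

```python
from types import SimpleNamespace


def select_pages(documents, pages):
    """Keep only the documents whose metadata 'page' value is in pages."""
    wanted = set(pages)
    return [doc for doc in documents if doc.metadata.get("page") in wanted]


# Stand-in objects (assumption) mimicking the page_content/metadata attributes.
fake_documents = [
    SimpleNamespace(page_content=f"text of page {n}", metadata={"page": n})
    for n in range(1, 5)
]

first_page_only = select_pages(fake_documents, [1])
print([doc.metadata["page"] for doc in first_page_only])  # [1]
```

With real loader output, the filtered list could be passed as input_documents in place of documents. Whether that improves the answer depends on the document and the model.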
