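PaddleOCR is a powerful, lightweight OCR toolkit developed by Baidu that connects images and PDFs with large language models. It supports more than 100 languages and turns document content into structured, AI-ready data. This integration exposes PaddleOCR's large-model document parsing through the PaddleOCRVLLoader document loader.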

Overview

Integration details

Class | Package | Local | Serializable | JS support
PaddleOCRVLLoader | langchain-paddleocr

Loader features

Source | Document lazy loading | Native async support
PaddleOCRVLLoader
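PaddleOCRVLLoader supports the following:
  • Extracting text and layout information from PDF and image files with Baidu's PaddleOCR-VL family of models (for example PaddleOCR-VL and PaddleOCR-VL-1.5)
  • Processing documents from local files or remote URLs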

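Prerequisites

To use the PaddleOCR-VL loader you need:
  1. API access: access to a PaddleOCR-VL API endpoint
  2. Authentication: an access token for the API (passed directly or through the PADDLEOCR_ACCESS_TOKEN environment variable)
Both the API URL and the access token are available from the PaddleOCR website. Click the API button and copy the URL and token from the API call example provided there.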
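The two ways of supplying the token can be sketched with a small helper. This is an illustration only: `resolve_access_token` is a hypothetical name, and the "explicit value first, then environment variable" order is an assumption, not a description of the loader's internal logic.

```python
import os
from typing import Optional

ENV_VAR = "PADDLEOCR_ACCESS_TOKEN"  # environment variable named in this guide


def resolve_access_token(explicit: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: prefer an explicit token, else fall back to the env var."""
    if explicit:
        return explicit
    # Treat an unset or empty variable as "no token available".
    return os.environ.get(ENV_VAR) or None


token = resolve_access_token()
print("token available:", token is not None)
```

Checking for a token up front like this can surface a missing credential before any request is sent.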

Installation

pip install langchain-paddleocr

Initialization

Basic initialization requires the API endpoint URL and a file path:
from langchain_paddleocr import PaddleOCRVLLoader
from pydantic import SecretStr

loader = PaddleOCRVLLoader(
    file_path="path/to/document.pdf",
    api_url="your-api-endpoint",
    access_token=SecretStr("your-access-token")  # Optional if using environment variable
)
To authenticate through the environment variable instead:
export PADDLEOCR_ACCESS_TOKEN="your-access-token"
Then initialize the loader without the access_token argument:
loader = PaddleOCRVLLoader(
    file_path="path/to/document.pdf",
    api_url="your-api-endpoint"
)

Advanced configuration

The loader accepts a range of configuration options for fine-tuning document processing:
loader = PaddleOCRVLLoader(
    file_path=["document1.pdf", "document2.jpg"],  # Multiple files
    api_url="your-api-endpoint",

    access_token=None,  # Optional: SecretStr for API authentication
    file_type="pdf",  # Optional: "pdf" or "image", or None for auto-detection

    use_doc_orientation_classify=False,  # Enable document orientation classification
    use_doc_unwarping=False,  # Enable document unwarping
    use_layout_detection=None,  # Enable layout detection (None = use service default)
    use_chart_recognition=None,  # Enable chart recognition (None = use service default)
    use_seal_recognition=None,  # Enable seal recognition (None = use service default)
    use_ocr_for_image_block=None,  # Run OCR on image blocks (None = use service default)

    layout_threshold=None,  # Detection threshold (None = use service default)
    layout_nms=None,  # Apply non-maximum suppression (None = use service default)
    layout_unclip_ratio=None,  # Layout unclip ratio (None = use service default)
    layout_merge_bboxes_mode=None,  # Mode for merging layout bounding boxes (None = use service default)
    layout_shape_mode=None,  # Layout shape mode (None = use service default)

    prompt_label=None,  # Prompt label for VLM (None = use service default)
    format_block_content=None,  # Format block content (None = use service default)
    repetition_penalty=None,  # Repetition penalty for VLM sampling (None = use service default)
    temperature=None,  # Temperature for VLM sampling (None = use service default)
    top_p=None,  # Top-p sampling value for VLM (None = use service default)
    min_pixels=None,  # Minimum pixels allowed in preprocessing (None = use service default)
    max_pixels=None,  # Maximum pixels allowed in preprocessing (None = use service default)
    max_new_tokens=None,  # Maximum tokens generated by VLM (None = use service default)

    merge_layout_blocks=None,  # Merge layout blocks across columns (None = use service default)
    markdown_ignore_labels=None,  # Layout labels to ignore in Markdown (None = use service default)
    vlm_extra_args=None,  # Additional VLM configuration parameters (None = use service default)

    prettify_markdown=None,  # Prettify Markdown output (None = use service default)
    show_formula_number=None,  # Include formula numbers in Markdown (None = use service default)
    restructure_pages=None,  # Restructure results across pages (None = use service default)
    merge_tables=None,  # Merge tables across pages (None = use service default)
    relevel_titles=None,  # Relevel titles (None = use service default)
    visualize=None,  # Include visualization results (None = use service default)

    additional_params=None,  # Additional API parameters
    timeout=300,  # Request timeout in seconds
)
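Since most options above default to None ("use service default"), one way to keep call sites short is to collect only the options you actually set. The sketch below assumes a hypothetical helper, `build_loader_kwargs`, which is not part of langchain-paddleocr; it only filters a dictionary.

```python
from typing import Any, Dict


def build_loader_kwargs(**options: Any) -> Dict[str, Any]:
    """Hypothetical helper: keep only the options that were explicitly set."""
    # False is a real choice and is kept; only None (service default) is dropped.
    return {name: value for name, value in options.items() if value is not None}


kwargs = build_loader_kwargs(
    file_path=["document1.pdf", "document2.jpg"],
    api_url="your-api-endpoint",
    use_doc_orientation_classify=True,
    use_chart_recognition=None,  # left unset: the service default applies
    temperature=None,
    timeout=600,
)
print(sorted(kwargs))
# The resulting dict could then be passed as PaddleOCRVLLoader(**kwargs).
```

Filtering on `is not None` rather than truthiness matters here, because `False` and `0` are meaningful overrides.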

Basic usage

Loading documents

# Load a single document
loader = PaddleOCRVLLoader(
    file_path="https://arxiv.org/pdf/2408.09869",
    api_url="your-api-endpoint"
)
docs = loader.load()

# Inspect the results
for doc in docs[:2]:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Source: {doc.metadata['source']}")
    print("---")
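When several files are loaded at once, it can help to bucket the returned documents by their `source` metadata. The following is a minimal sketch: `group_by_source` and `FakeDocument` are hypothetical names used for illustration, and the stand-in class only mimics the `page_content` and `metadata` attributes used above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class FakeDocument:
    """Stand-in with the same two attributes used above (illustration only)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def group_by_source(docs: Iterable[Any]) -> Dict[str, List[Any]]:
    """Hypothetical helper: bucket documents by their metadata['source'] value."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("source", "<unknown>")].append(doc)
    return dict(grouped)


sample = [
    FakeDocument("first chunk", {"source": "a.pdf"}),
    FakeDocument("second chunk", {"source": "a.pdf"}),
    FakeDocument("scan", {"source": "b.jpg"}),
]
for source, items in group_by_source(sample).items():
    print(source, len(items))
```

The same function should work on the list returned by `loader.load()`, however many documents the loader produces per file.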

Handling multiple file types

The loader detects the file type automatically from the file extension:
# Mixed file types - auto-detected
files = [
    "document.pdf",      # PDF file
    "image.jpg",         # Image file
    "https://example.com/report.pdf"  # Remote PDF
]

loader = PaddleOCRVLLoader(file_path=files, api_url="your-api-endpoint")
Supported image formats: .jpg, .jpeg, .png, .bmp, .tiff, .tif, .webp
Supported document formats: .pdf
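Extension-based detection can be approximated as follows. This is a sketch under assumptions, not the loader's actual implementation: `guess_file_type` is a hypothetical function, and the real loader may resolve ambiguous inputs differently. It returns None, for example, for an extension-less URL such as https://arxiv.org/pdf/2408.09869.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extension lists taken from the supported formats in this guide.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp"}
DOCUMENT_EXTENSIONS = {".pdf"}


def guess_file_type(path_or_url: str) -> Optional[str]:
    """Illustrative guess of "pdf" or "image" from the extension; None if unknown."""
    parsed = urlparse(path_or_url)
    # For URLs, look only at the path so a query string does not hide the extension.
    path = parsed.path if parsed.scheme in ("http", "https") else path_or_url
    suffix = PurePosixPath(path.replace("\\", "/")).suffix.lower()
    if suffix in DOCUMENT_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None


for item in ["document.pdf", "image.JPG", "https://example.com/report.pdf?download=1"]:
    print(item, "->", guess_file_type(item))
```

When a guess like this comes back as None, passing `file_type` explicitly is one way to remove the ambiguity.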

Advanced features

Accessing the raw API response

The loader stores the complete API response in each document's metadata:
docs = loader.load()
first_doc = docs[0]

# Access raw API response for advanced processing
raw_response = first_doc.metadata["paddleocr_vl_raw_response"]
print(f"Layout results: {len(raw_response['result']['layoutParsingResults'])}")
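Indexing the raw response directly raises an exception if any level is missing. A defensive accessor can be sketched as below; `get_layout_results` is a hypothetical helper, and the sample payload is made up and mirrors only the two keys used above (`result` and `layoutParsingResults`).

```python
from typing import Any, Dict, List, Optional


def get_layout_results(raw_response: Optional[Dict[str, Any]]) -> List[Any]:
    """Hypothetical helper: return layoutParsingResults, or [] if any level is missing."""
    if not isinstance(raw_response, dict):
        return []
    result = raw_response.get("result")
    if not isinstance(result, dict):
        return []
    layout = result.get("layoutParsingResults")
    return layout if isinstance(layout, list) else []


# Made-up payload that only mirrors the two keys used above.
sample_response = {"result": {"layoutParsingResults": [{}, {}]}}
print(len(get_layout_results(sample_response)))
print(len(get_layout_results(None)))
```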

Error handling

The loader raises errors with detailed messages to help with troubleshooting:
try:
    docs = loader.load()
except ValueError as e:
    print(f"Processing failed: {e}")
    # Common issues: invalid API endpoint, authentication errors, unsupported file types
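In a batch job it is often more convenient to record a failure and continue than to stop at the first exception. The wrapper below is a sketch of that pattern: `safe_load` and `FailingLoader` are hypothetical names, and the wrapper relies only on the object having a `load()` method that may raise ValueError, as in the snippet above.

```python
import logging
from typing import Any, List, Optional, Tuple

logger = logging.getLogger("paddleocr_loading")


def safe_load(loader: Any) -> Tuple[List[Any], Optional[str]]:
    """Hypothetical wrapper: (docs, None) on success, ([], message) on ValueError."""
    try:
        return list(loader.load()), None
    except ValueError as exc:
        logger.warning("Loading failed: %s", exc)
        return [], str(exc)


class FailingLoader:
    """Stand-in that fails the way the guide describes (illustration only)."""

    def load(self):
        raise ValueError("unsupported file type")


docs, error = safe_load(FailingLoader())
print(docs, error)
```

Only ValueError is caught here; other exception types still propagate so that unexpected failures are not hidden.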

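Best practices

Error handling

  • Network timeouts: set an appropriate timeout value for large documents
  • Authentication: use environment variables to manage tokens securely
  • File validation: check that files are accessible before processing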
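The file validation step above can be sketched as a pre-check that runs before the loader is constructed. `split_sources` is a hypothetical helper. It checks local paths on disk and only inspects the shape of URLs, so a well-formed URL can still fail once the request is made.

```python
import os
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def split_sources(sources: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical pre-check: return (usable, rejected) sources.

    URLs pass if they use http(s) and name a host; local paths must be
    readable files. No network request is made.
    """
    usable: List[str] = []
    rejected: List[str] = []
    for source in sources:
        parsed = urlparse(source)
        if parsed.scheme in ("http", "https"):
            ok = bool(parsed.netloc)
        else:
            ok = os.path.isfile(source) and os.access(source, os.R_OK)
        (usable if ok else rejected).append(source)
    return usable, rejected


usable, rejected = split_sources(
    ["https://example.com/report.pdf", "no-such-file-for-this-example.pdf"]
)
print("usable:", usable)
print("rejected:", rejected)
```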

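Troubleshooting

Common issues

  1. Authentication errors: make sure PADDLEOCR_ACCESS_TOKEN is set or that access_token is provided
  2. File type errors: verify the file extension and that the file is accessible
  3. API connection problems: check the endpoint URL and your network connection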

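Debug mode

For detailed debugging, inspect the raw API response:
docs = loader.load()
if docs:
    raw_response = docs[0].metadata.get("paddleocr_vl_raw_response")
    # .get() returns None when the key is absent, so guard before calling .keys()
    if raw_response is not None:
        print("API Response structure:", list(raw_response.keys()))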

API reference