LLM Sherpa 加载多种类型的文件。LLM Sherpa 支持多种文件格式,包括 DOCX、PPTX、HTML、TXT 和 XML。
LLMSherpaFileLoader 使用 LayoutPDFReader,它是 LLMSherpa 库的一部分。该工具在解析 PDF 时会保留布局信息,而大多数 PDF 转文本解析器通常会丢失这些信息。
LayoutPDFReader 的主要功能如下:
- 可识别并提取章节和子章节及其层级。
- 将行合并为段落。
- 可识别章节与段落之间的链接。
- 可提取表格及表格所在的章节。
- 可识别并提取列表和嵌套列表。
- 可合并跨页内容。
- 可移除重复的页眉和页脚。
- 可移除水印。
注意:此库在处理部分 PDF 文件时可能会失败,请谨慎使用。
Copy
# Install package
# !pip install -qU llmsherpa
LLMSherpaFileLoader
LLMSherpaFileLoader 在底层定义了几种加载文件内容的策略:[“sections”、“chunks”、“html”、“text”],可配置 nlm-ingestor 以获取llmsherpa_api_url 或使用默认值。
sections 策略:将文件解析为章节并返回
Copy
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
file_path="https://arxiv.org/pdf/2402.14207.pdf",
new_indent_parser=True,
apply_ocr=True,
strategy="sections",
llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
Copy
docs[1]
Copy
Document(page_content='Abstract\nWe study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.\nThis underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing.\nWe propose STORM, a writing system for the Synthesis of Topic Outlines through\nReferences\nFull-length Article\nTopic\nOutline\n2022 Winter Olympics\nOpening Ceremony\nResearch via Question Asking\nRetrieval and Multi-perspective Question Asking.\nSTORM models the pre-writing stage by\nLLM\n(1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline.\nFor evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage.\nWe further gather feedback from experienced Wikipedia editors.\nCompared to articles generated by an outlinedriven retrieval-augmented baseline, more of STORM's articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%).\nThe expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.\n1. Can you provide any information about the transportation arrangements for the opening ceremony?\nLLM\n2. Can you provide any information about the budget for the 2022 Winter Olympics opening ceremony?…\nLLM- Role1\nLLM- Role2\nLLM- Role1', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'section_number': 1, 'section_title': 'Abstract'})
Copy
len(docs)
Copy
79
chunks 策略:将文件解析为分块并返回
Copy
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
file_path="https://arxiv.org/pdf/2402.14207.pdf",
new_indent_parser=True,
apply_ocr=True,
strategy="chunks",
llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
Copy
docs[1]
Copy
Document(page_content='Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'chunk_number': 1, 'chunk_type': 'para'})
Copy
len(docs)
Copy
306
html 策略:将文件作为单个 HTML 文档返回
Copy
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
file_path="https://arxiv.org/pdf/2402.14207.pdf",
new_indent_parser=True,
apply_ocr=True,
strategy="html",
llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
Copy
docs[0].page_content[:400]
Copy
'<html><h1>Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models</h1><table><th><td colSpan=1>Yijia Shao</td><td colSpan=1>Yucheng Jiang</td><td colSpan=1>Theodore A. Kanell</td><td colSpan=1>Peter Xu</td></th><tr><td colSpan=1></td><td colSpan=1>Omar Khattab</td><td colSpan=1>Monica S. Lam</td><td colSpan=1></td></tr></table><p>Stanford University {shaoyj, yuchengj, '
Copy
len(docs)
Copy
1
text 策略:将文件作为单个文本文档返回
Copy
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
file_path="https://arxiv.org/pdf/2402.14207.pdf",
new_indent_parser=True,
apply_ocr=True,
strategy="text",
llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
Copy
docs[0].page_content[:400]
Copy
'Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\n | Yijia Shao | Yucheng Jiang | Theodore A. Kanell | Peter Xu\n | --- | --- | --- | ---\n | | Omar Khattab | Monica S. Lam | \n\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu\nAbstract\nWe study how to apply large language models to write grounded and organized long'
Copy
len(docs)
Copy
1
将这些文档连接 到 Claude、VSCode 等,通过 MCP 获取实时答案。

