The RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Document objects.

Overview

Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| RecursiveUrlLoader | langchain-community | ✅ | ❌ | ✅ |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| :--- | :---: | :---: |
| RecursiveUrlLoader | ✅ | ❌ |

Setup

Credentials

No credentials are required to use the RecursiveUrlLoader.

Installation

The RecursiveUrlLoader lives in the langchain-community package. There are no other required packages, though you will get richer default Document metadata if beautifulsoup4 is installed.
pip install -qU langchain-community beautifulsoup4 lxml

Instantiation

Now we can instantiate our document loader object and load Documents:
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)

Loading

Use .load() to synchronously load all Documents into memory, with one Document per visited URL. Starting from the initial URL, we recurse through all linked URLs up to the specified max_depth. Let's run through a basic example of how to use the RecursiveUrlLoader on the Python 3.9 docs:
docs = loader.load()
docs[0].metadata
/Users/bagatur/.pyenv/versions/3.9.1/lib/python3.9/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  k = self.parse_starttag(i)
{'source': 'https://docs.python.org/3.9/',
 'content_type': 'text/html',
 'title': '3.9.19 Documentation',
 'language': None}
The first document looks like the root page we started from. Let's look at the metadata of the next document:
docs[1].metadata
{'source': 'https://docs.python.org/3.9/using/index.html',
 'content_type': 'text/html',
 'title': 'Python Setup and Usage — Python 3.9.19 documentation',
 'language': None}
That URL looks like a child of our root page, which is great! Let's move on from metadata and inspect the content of one of these documents:
print(docs[0].page_content[:300])
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">

    <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
    <link rel=
This certainly looks like HTML from docs.python.org/3.9/, which is what we expected. Let's now look at some variations that can be useful in different scenarios.

Lazy loading

If we're loading a large number of Documents and our downstream operations can be done over subsets of them, we can lazily load the Documents one at a time to minimize our memory footprint:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pages = []
In this example we never have more than 10 Documents loaded into memory at a time.

Adding an extractor

By default, the loader sets the raw HTML of each linked page as the Document page content. To parse this HTML into a more human/LLM-friendly format, you can pass in a custom extractor method:
import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=bs4_extractor)
docs = loader.load()
print(docs[0].page_content[:200])
3.9.19 Documentation

Download
Download these documents
Docs by version

Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (securit
This looks much better! You can similarly pass in a metadata_extractor to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
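As a sketch of what that could look like, here is a minimal, hypothetical metadata_extractor that pulls the page title with a regex instead of BeautifulSoup. In recent langchain-community versions the callable receives the raw HTML, the URL, and the HTTP response object (older versions pass only the first two arguments, so check your installed version):

```python
import re


def simple_metadata_extractor(raw_html: str, url: str, response) -> dict:
    """Build a minimal metadata dict for a crawled page.

    `response` is the HTTP response for the page (e.g. a
    requests.Response); we only read its headers here.
    """
    match = re.search(
        r"<title[^>]*>(.*?)</title>", raw_html, re.IGNORECASE | re.DOTALL
    )
    return {
        "source": url,
        "title": match.group(1).strip() if match else None,
        "content_type": response.headers.get("Content-Type") if response else None,
    }


# Hypothetical usage:
# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/", metadata_extractor=simple_metadata_extractor
# )
```

The returned dict replaces the loader's default metadata for that Document, so include everything you want to keep (e.g. "source").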

API reference

These examples show just a few of the ways you can modify the default RecursiveUrlLoader, and there are many more configurations to best fit your use case. The link_regex and exclude_dirs parameters can help you filter out unwanted URLs, and aload() and alazy_load() can be used for asynchronous loading, and so on. For detailed information on configuring and calling the RecursiveUrlLoader, see the API reference: python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html
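For instance, a link_regex filter might look like the sketch below. As we understand the API, the pattern is matched against each page's raw HTML and capture group 1 is treated as the link, so the pattern must capture the URL itself (the "tutorial" filter here is purely illustrative):

```python
import re

# Illustrative pattern: only follow links whose href contains "tutorial".
# Capture group 1 is what the loader would treat as the link to follow.
link_regex = r'href=["\']([^"\']*tutorial[^"\']*)["\']'

html = (
    '<a href="/3.9/tutorial/index.html">Tutorial</a>'
    '<a href="/3.9/library/os.html">os</a>'
)
links = re.findall(link_regex, html)
print(links)  # only the tutorial link matches

# Hypothetical usage:
# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/", link_regex=link_regex
# )
```

Asynchronous loading works analogously to the synchronous calls shown above: docs = await loader.aload() inside a coroutine, or async for doc in loader.alazy_load() for lazy iteration.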