RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.

Overview

Integration details

Class | Package | Local | Serializable | JS support
RecursiveUrlLoader | langchain-community

Loader features

Source | Document lazy loading | Native async support
RecursiveUrlLoader

Setup

Credentials

No credentials are required to use RecursiveUrlLoader.

Installation

RecursiveUrlLoader lives in the langchain-community package. No other packages are required, though you will get richer default Document metadata if you also install beautifulsoup4.
pip install -qU langchain-community beautifulsoup4 lxml

Instantiation

Now we can instantiate the document loader object and load Documents:
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)

Load

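Use .load() to synchronously load all Documents into memory, with one Document per visited URL. Starting from the initial URL, the loader recurses through all linked URLs until it reaches the specified max_depth. Let's walk through a basic example of using RecursiveUrlLoader on the Python 3.9 documentation.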
docs = loader.load()
docs[0].metadata
/Users/bagatur/.pyenv/versions/3.9.1/lib/python3.9/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  k = self.parse_starttag(i)
{'source': 'https://docs.python.org/3.9/',
 'content_type': 'text/html',
 'title': '3.9.19 Documentation',
 'language': None}
Great! The first document looks like the root page we started from. Let's look at the metadata of the next document.
docs[1].metadata
{'source': 'https://docs.python.org/3.9/using/index.html',
 'content_type': 'text/html',
 'title': 'Python Setup and Usage — Python 3.9.19 documentation',
 'language': None}
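That URL looks like a child of our root page, which is what we want. Let's move on from metadata and examine the content of one of the documents.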
print(docs[0].page_content[:300])
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">

    <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
    <link rel=
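That certainly looks like HTML from the URL https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations on the basic example that can be helpful in different situations.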
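To make the depth-limited traversal described in this section concrete, here is a minimal sketch of a recursive crawl over a toy in-memory link graph. It is not the loader's implementation: the SITE mapping, the crawl function, and the exact depth convention (the root counts as depth 0 and links are followed while depth < max_depth) are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical in-memory "site": each URL maps to the URLs it links to.
SITE: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


def crawl(
    url: str,
    max_depth: int,
    depth: int = 0,
    visited: Optional[Set[str]] = None,
) -> List[str]:
    """Return URLs in visit order, following links while depth < max_depth."""
    if visited is None:
        visited = set()
    # Stop when the depth budget is used up or the page was already seen.
    if depth >= max_depth or url in visited:
        return []
    visited.add(url)
    pages = [url]  # one entry per visited URL
    for link in SITE.get(url, []):
        pages.extend(crawl(link, max_depth, depth + 1, visited))
    return pages


pages_depth_2 = crawl("/", max_depth=2)
pages_depth_3 = crawl("/", max_depth=3)
print(pages_depth_2)
print(pages_depth_3)
```

Under these assumptions, raising max_depth by one adds the pages that sit one more link away from the root, while the visited set keeps the back-link from /a to / from causing a loop.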

Lazy loading

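If we are loading a large number of Documents and our downstream operations can be done over subsets of the loaded Documents, we can lazily load Documents one at a time to minimize the memory footprint: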
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pages = []
/var/folders/4j/2rz3865x6qg07tx43146py8h0000gn/T/ipykernel_73962/2110507528.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  soup = BeautifulSoup(html, "lxml")
In this example we never have more than 10 Documents loaded into memory at a time.
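The same batching idea can be factored into a small reusable helper. The sketch below is an illustration only: in_batches and fake_lazy_load are hypothetical names, and fake_lazy_load stands in for loader.lazy_load() so that the snippet runs without network access. Unlike the loop above, the helper also yields the final, possibly smaller batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items, consuming `items` lazily."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load() -> Iterator[str]:
    """Stand-in for loader.lazy_load(); yields 25 placeholder documents."""
    for i in range(25):
        yield f"doc-{i}"


# Only one batch is held in memory at a time.
batch_sizes = [len(batch) for batch in in_batches(fake_lazy_load(), 10)]
print(batch_sizes)
```

With a real loader, the call would presumably look like in_batches(loader.lazy_load(), 10), with each batch passed to whatever downstream step you use.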

Adding an Extractor

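By default the loader sets the raw HTML from each link as the Document page content. To parse this HTML into a format that is easier for humans and LLMs to read, you can pass in a custom extractor function: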
import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=bs4_extractor)
docs = loader.load()
print(docs[0].page_content[:200])
/var/folders/td/vzm913rx77x21csd90g63_7c0000gn/T/ipykernel_10935/1083427287.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  soup = BeautifulSoup(html, "lxml")
/Users/isaachershenson/.pyenv/versions/3.11.9/lib/python3.11/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  k = self.parse_starttag(i)
3.9.19 Documentation

Download
Download these documents
Docs by version

Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (securit
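This looks much nicer! You can similarly pass in a metadata_extractor to customize how Document metadata is extracted from the HTTP response. See the API reference for more information.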
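As a rough illustration of what a custom metadata_extractor might look like, here is a minimal sketch using only the standard library. The three-argument signature (raw HTML, URL, response object) is an assumption that should be checked against the API reference; simple_metadata_extractor and FakeResponse are hypothetical names, and FakeResponse merely mimics an HTTP response with a headers mapping so the function can be tried offline.

```python
import re
from typing import Any, Dict


def simple_metadata_extractor(raw_html: str, url: str, response: Any) -> Dict[str, Any]:
    """Build a metadata dict from the page URL, response headers, and <title>."""
    # Assumption: the response object exposes a dict-like `headers` attribute.
    headers = getattr(response, "headers", {}) or {}
    metadata: Dict[str, Any] = {
        "source": url,
        "content_type": headers.get("Content-Type"),
    }
    match = re.search(
        r"<title[^>]*>(.*?)</title>", raw_html, flags=re.IGNORECASE | re.DOTALL
    )
    if match:
        metadata["title"] = match.group(1).strip()
    return metadata


class FakeResponse:
    """Hypothetical stand-in for an HTTP response, used only for this demo."""

    headers = {"Content-Type": "text/html"}


example = simple_metadata_extractor(
    "<html><head><title> Demo page </title></head></html>",
    "https://example.com/",
    FakeResponse(),
)
print(example)
```

If the signature matches what the loader expects, the function would be passed as RecursiveUrlLoader(..., metadata_extractor=simple_metadata_extractor).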

API reference

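These examples show just a few of the ways in which you can modify the default RecursiveUrlLoader, but there are many more modifications you can make to fit your use case. The link_regex and exclude_dirs parameters can help you filter out unwanted URLs, aload() and alazy_load() can be used for asynchronous loading, and more. For detailed information on configuring and calling RecursiveUrlLoader, see the API reference.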
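For the asynchronous methods mentioned above, the consumption pattern is an async for loop over an async iterator. The sketch below shows that pattern with a stand-in: fake_alazy_load and collect_first are hypothetical names, and fake_alazy_load replaces loader.alazy_load() so that the snippet runs without network access or extra packages.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_alazy_load() -> AsyncIterator[str]:
    """Stand-in for loader.alazy_load(); yields placeholder documents."""
    for i in range(5):
        await asyncio.sleep(0)  # simulate yielding control during I/O
        yield f"doc-{i}"


async def collect_first(source: AsyncIterator[str], limit: int) -> List[str]:
    """Consume an async iterator and stop after `limit` items."""
    collected: List[str] = []
    async for doc in source:
        collected.append(doc)
        if len(collected) >= limit:
            break
    return collected


first_three = asyncio.run(collect_first(fake_alazy_load(), 3))
print(first_three)
```

With a real loader, the loop body would presumably be the same, written as async for doc in loader.alazy_load(), while await loader.aload() would return the full list at once.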