SitemapLoader integration - Docs by LangChain

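SitemapLoader extends WebBaseLoader. It loads a sitemap from a given URL, then scrapes and loads every page listed in the sitemap, returning each page as a Document. Scraping is done concurrently, with a reasonable limit on concurrent requests that defaults to 2 per second. If you are not concerned about being a "good citizen", or you control the server being scraped, or you do not care about load, you can raise this limit. Note that while this speeds up scraping, it may cause the server to block you. Be careful!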

Overview

Integration details

Class	Package	Local	Serializable	JS support
`SitemapLoader`	`langchain-community`	✅	❌	✅

Loader features

Source	Document lazy loading	Native async support
`SitemapLoader`	✅	❌

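Setup

To access the Sitemap document loader, you need to install the langchain-community integration package.

Credentials

No credentials are needed to run this loader. To enable automated tracing of your model calls, set your LangSmith API key: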

import getpass
import os
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_TRACING"] = "true"

Installation

Install langchain-community.

pip install -qU langchain-community

Fix notebook asyncio bug

import nest_asyncio

nest_asyncio.apply()

Initialization

Now we can instantiate the loader object and load documents:

from langchain_community.document_loaders.sitemap import SitemapLoader

sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")

Load

docs = sitemap_loader.load()
docs[0]

Fetching pages: 100%|##########| 28/28 [00:04<00:00,  6.83it/s]

Document(metadata={'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}, page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n')

print(docs[0].metadata)

{'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}

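You can change the requests_per_second parameter to raise the maximum number of concurrent requests, and use requests_kwargs to pass keyword arguments when sending requests.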

sitemap_loader.requests_per_second = 2
# Optional: avoid the `[SSL: CERTIFICATE_VERIFY_FAILED]` issue
sitemap_loader.requests_kwargs = {"verify": False}
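To make the idea of a cap on concurrent requests concrete, here is a minimal sketch of one common way to bound the number of requests in flight, using an asyncio.Semaphore. It is an illustration only, not necessarily how the loader implements its limit; the fetch_all function, the example.com URLs, and the simulated request are all hypothetical.

```python
import asyncio


async def fetch_all(urls, max_concurrent=2):
    """Simulate fetching URLs with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0  # requests currently running
    peak = 0       # highest number of simultaneous requests observed

    async def fetch(url):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a real HTTP request
            in_flight -= 1
            return f"content of {url}"

    # gather() returns results in the same order as the input URLs
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    return pages, peak


# Hypothetical URLs, used only for the simulation
urls = [f"https://example.com/page{i}" for i in range(6)]
pages, peak = asyncio.run(fetch_all(urls, max_concurrent=2))
print(len(pages), peak)
```

Raising max_concurrent in this sketch shortens the total run time but puts more simultaneous load on the target server, which is the trade-off described above.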

Lazy load

You can also load pages lazily to minimize memory load.

page = []
for doc in sitemap_loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # Do some paged operation, e.g.
        # index.upsert(page)

        page = []

Fetching pages: 100%|##########| 28/28 [00:01<00:00, 19.06it/s]
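Note that the loop above leaves any final partial batch (fewer than 10 documents) sitting in page once the iterator is exhausted. A small batching helper is one way to handle that case; the sketch below uses only the standard library, and the in_batches name and the stand-in documents are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including the final partial batch."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for sitemap_loader.lazy_load(): a generator of 28 fake documents
fake_docs = (f"doc-{i}" for i in range(28))
batch_sizes = [len(batch) for batch in in_batches(fake_docs, 10)]
print(batch_sizes)  # [10, 10, 8]
```

With the loader, the same helper could wrap sitemap_loader.lazy_load() directly, so that each batch is handed to your own paged operation (for example the index.upsert call hinted at above) without holding all documents in memory.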

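Filtering sitemap URLs

Sitemaps can be massive files with thousands of URLs, and you often do not need every one of them. You can filter the URLs by passing a list of strings or regex patterns to the filter_urls parameter. Only URLs that match one of the patterns will be loaded.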

loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest"],
)
documents = loader.load()

documents[0]

Document(page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n', metadata={'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-02-12T05:26:10.971077+00:00', 'changefreq': 'daily', 'priority': '0.9'})
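The effect of a pattern list can be previewed without fetching anything. The sketch below assumes each pattern is applied as a regular expression anchored at the start of the URL (re.match semantics); that is an assumption about the matching rule, so check it against the API reference. The filter_by_patterns helper and the index.html URL are hypothetical.

```python
import re
from typing import List


def filter_by_patterns(urls: List[str], patterns: List[str]) -> List[str]:
    """Keep URLs that match at least one pattern from the start of the string."""
    return [url for url in urls if any(re.match(p, url) for p in patterns)]


urls = [
    "https://api.python.langchain.com/en/latest/",
    "https://api.python.langchain.com/en/stable/",
    "https://api.python.langchain.com/en/latest/index.html",  # hypothetical page
]

# Escaping the dots keeps "." from matching arbitrary characters
kept = filter_by_patterns(urls, [r"https://api\.python\.langchain\.com/en/latest"])
print(kept)
```

Under that assumption, a plain URL prefix behaves as a "starts with" filter, while a pattern such as r".*/latest/.*" would be needed to match text in the middle of a URL.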

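Add custom scraping rules

SitemapLoader uses beautifulsoup4 for scraping, and by default it scrapes every element on the page. The SitemapLoader constructor accepts a custom scraping function. This is useful for tailoring the scraping process to your needs; for example, you may want to skip headers or navigation elements. The following example shows how to write and use a custom function that drops navigation and header elements. Import the beautifulsoup4 library and define the custom function.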

pip install beautifulsoup4

from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())

Add your custom function to the SitemapLoader object.

loader = SitemapLoader(
    "https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest/"],
    parsing_function=remove_nav_and_header_elements,
)

Local sitemap

The sitemap loader can also be used to load local files.

sitemap_loader = SitemapLoader(web_path="example_data/sitemap.xml", is_local=True)

docs = sitemap_loader.load()
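For readers unfamiliar with the file format, a sitemap is an XML document whose url entries carry the loc, lastmod, changefreq, and priority fields seen in the document metadata earlier. The sketch below reads a small inline sitemap with the standard library only to illustrate that structure; it is not the loader's own parsing code, and the read_sitemap_entries helper and the example.com entries are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up sitemap in the sitemaps.org format
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-05-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://example.com/docs/</loc>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_entries(xml_text):
    """Return one dict per <url> entry, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for child in url:
            tag = child.tag.split("}", 1)[-1]  # strip the XML namespace prefix
            entry[tag] = (child.text or "").strip()
        entries.append(entry)
    return entries


entries = read_sitemap_entries(SITEMAP_XML)
for entry in entries:
    print(entry)
```

Only loc is required by the sitemap format, which is why the second entry yields a dict with a single key.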

API reference

For detailed documentation of all SitemapLoader features and configurations, head to the API reference.
