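This page covers how to use WebBaseLoader to load all text from HTML webpages into a document format that can be used downstream. For more custom logic when loading webpages, see child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader. If you would rather not deal with website crawling, bypassing JS-blocking sites, and data cleaning, consider using FireCrawlLoader or the faster SpiderLoader.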

Overview

Integration details

Class | Package | Local | Serializable | JS support
WebBaseLoader | langchain-community

Loader features

Source | Document lazy loading | Native async support
WebBaseLoader

Setup

Credentials

WebBaseLoader does not require any credentials.

Installation

To use WebBaseLoader, first install the langchain-community Python package, along with beautifulsoup4 for HTML parsing.
pip install -qU langchain-community beautifulsoup4

Initialization

Now you can instantiate the loader object and load documents:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.example.com/")
To bypass SSL verification errors during fetching, you can set the "verify" option:
loader.requests_kwargs = {"verify": False}

Initialization with multiple pages

You can also pass in a list of pages to load in one batch.
loader_multiple_pages = WebBaseLoader(
    ["https://www.example.com/", "https://google.com"]
)

Load

docs = loader.load()

docs[0]
Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')
print(docs[0].metadata)
{'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}
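
The page_content shown above keeps the runs of blank lines and the indentation left behind by the HTML layout. If that gets in the way downstream, one option is to normalize it after loading. The following is a minimal sketch using only the standard library; the helper name squeeze_whitespace is hypothetical and is not part of WebBaseLoader.

```python
import re


def squeeze_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines (hypothetical helper)."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# The page_content string from the example output above.
raw = (
    "\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\n"
    "This domain is for use in illustrative examples in documents. You may use this\n"
    "    domain in literature without prior coordination or asking for permission.\n"
    "More information...\n\n\n\n"
)

cleaned = squeeze_whitespace(raw)
print(cleaned)
```

With a loaded document, the same helper could be applied as squeeze_whitespace(docs[0].page_content).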

Load multiple URLs concurrently

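You can speed up scraping by fetching and parsing multiple URLs concurrently. Concurrent requests are subject to a reasonable limit, which defaults to 2 per second. If you are not concerned about the load you place on the server, or you control the server being scraped, you can change the requests_per_second parameter to raise the maximum number of concurrent requests. Note that while this speeds up scraping, it may cause the server to block you. Be careful!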
pip install -qU nest_asyncio

# fixes a bug with asyncio and jupyter
import nest_asyncio

nest_asyncio.apply()
loader = WebBaseLoader(["https://www.example.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs
Fetching pages: 100%|###########################################################################| 2/2 [00:00<00:00,  8.28it/s]
[Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
 Document(metadata={'source': 'https://google.com', 'title': 'Google', 'description': "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.", 'language': 'en'}, page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced search5 ways Gemini can help during the HolidaysAdvertisingBusiness SolutionsAbout Google© 2024 - Privacy - Terms  ')]
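
To picture what a cap on concurrent requests does, here is a minimal, self-contained sketch that bounds the number of in-flight fetches with an asyncio.Semaphore. It illustrates the general idea only and is not presented as WebBaseLoader's own code; fetch_all and fake_fetch are hypothetical names, and the network call is replaced by a short sleep so the example runs offline.

```python
import asyncio


async def fetch_all(urls, max_concurrent=2):
    """Fetch every URL while keeping at most max_concurrent fetches in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0  # highest number of simultaneous fetches observed

    async def fake_fetch(url):
        nonlocal in_flight, peak
        async with semaphore:  # wait here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for real network I/O
            in_flight -= 1
            return f"<html>{url}</html>"

    # gather() returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(fake_fetch(u) for u in urls))
    return pages, peak


urls = [f"https://www.example.com/page{i}" for i in range(5)]
pages, peak = asyncio.run(fetch_all(urls, max_concurrent=2))
print(len(pages), "pages fetched, peak concurrency:", peak)
```

Raising max_concurrent shortens the total run time but sends more simultaneous requests to the server, which is the trade-off described above.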

Loading an XML file, or using a different BeautifulSoup parser

You can also look at SitemapLoader, which uses this feature, for an example of how to load a sitemap file.
loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
loader.default_parser = "xml"
docs = loader.load()
docs
[Document(metadata={'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}, page_content='\n\n10\nEnergy\n3\n2018-01-01\n2018-01-01\nfalse\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n§ 431.86\nSection § 431.86\n\nEnergy\nDEPARTMENT OF ENERGY\nENERGY CONSERVATION\nENERGY EFFICIENCY PROGRAM FOR CERTAIN COMMERCIAL AND INDUSTRIAL EQUIPMENT\nCommercial Packaged Boilers\nTest Procedures\n\n\n\n\n§\u2009431.86\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n(a) Scope. This section provides test procedures, pursuant to the Energy Policy and Conservation Act (EPCA), as amended, which must be followed for measuring the combustion efficiency and/or thermal efficiency of a gas- or oil-fired commercial packaged boiler.\n(b) Testing and Calculations. Determine the thermal efficiency or combustion efficiency of commercial packaged boilers by conducting the appropriate test procedure(s) indicated in Table 1 of this section.\n\nTable 1—Test Requirements for Commercial Packaged Boiler Equipment Classes\n\nEquipment category\nSubcategory\nCertified rated inputBtu/h\n\nStandards efficiency metric(§\u2009431.87)\n\nTest procedure(corresponding to\nstandards efficiency\nmetric required\nby §\u2009431.87)\n\n\n\nHot Water\nGas-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nGas-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nHot Water\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nOil-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nSteam\nGas-fired (all*)\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nGas-fired (all*)\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix 
A, Section 3 with Section 2.4.3.2.\n\n\n\nSteam\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nOil-fired\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix A, Section 3. with Section 2.4.3.2.\n\n\n\n*\u2009Equipment classes for commercial packaged boilers as of July 22, 2009 (74 FR 36355) distinguish between gas-fired natural draft and all other gas-fired (except natural draft).\n\n(c) Field Tests. The field test provisions of appendix A may be used only to test a unit of commercial packaged boiler with rated input greater than 5,000,000 Btu/h.\n[81 FR 89305, Dec. 9, 2016]\n\n\nEnergy Efficiency Standards\n\n')]

Lazy load

You can use lazy loading to load one page at a time and reduce memory usage.
pages = []
for doc in loader.lazy_load():
    pages.append(doc)

print(pages[0].page_content[:100])
print(pages[0].metadata)
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
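
Note that the loop above still appends every page to a list, so all pages end up in memory together. To keep memory bounded, one option is to consume the iterator in fixed-size batches and process each batch before pulling the next. The sketch below uses only the standard library; in_batches is a hypothetical helper, and fake_lazy_load is a stand-in for loader.lazy_load() so that the example runs without network access.

```python
from itertools import islice


def in_batches(iterable, size):
    """Yield lists of up to `size` items, pulling from the iterator on demand."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load():
    # Hypothetical stand-in for loader.lazy_load(); yields one item at a time.
    for i in range(5):
        yield f"doc-{i}"


batches = list(in_batches(fake_lazy_load(), 2))
print(batches)
```

In real use you would iterate with "for batch in in_batches(loader.lazy_load(), 10):" and handle each batch (for example, index it) inside the loop rather than collecting the batches into a list.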

Async

pages = []
async for doc in loader.alazy_load():
    pages.append(doc)

print(pages[0].page_content[:100])
print(pages[0].metadata)
Fetching pages: 100%|###########################################################################| 1/1 [00:00<00:00, 10.51it/s]
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
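
The top-level "async for" above works in a notebook, where an event loop is already running. In a regular Python script it is a syntax error, so the loop has to live inside a coroutine that is started with asyncio.run. The following is a minimal sketch of that pattern; collect is a hypothetical helper, and demo_source is a stand-in for loader.alazy_load() so that the example runs offline.

```python
import asyncio


async def collect(async_iterable):
    """Drain any async iterator into a list."""
    return [item async for item in async_iterable]


async def demo_source():
    # Hypothetical stand-in for loader.alazy_load().
    for text in ("page one", "page two"):
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield text


pages = asyncio.run(collect(demo_source()))
print(pages)
```

In a script that uses the loader, the call would become asyncio.run(collect(loader.alazy_load())).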

Using proxies

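Sometimes you may need to use proxies to get around IP blocks. You can pass a dictionary of proxies to the loader (which uses requests underneath) to route traffic through them.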
loader = WebBaseLoader(
    "https://www.walmart.com/search?q=parrots",
    proxies={
        # Replace {username} and {password} with your proxy credentials
        "http": "http://{username}:{password}@proxy.service.com:6666/",
        "https": "https://{username}:{password}@proxy.service.com:6666/",
    },
)
docs = loader.load()
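
The {username} and {password} parts of the proxy URLs are placeholders that must be filled in before use. Credentials containing reserved characters such as "@" or ":" need to be percent-encoded, otherwise the URL is parsed incorrectly. Below is a minimal sketch of building the dictionary with the standard library; build_proxies is a hypothetical helper, and the credentials are made up for illustration.

```python
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Return a requests-style proxies dict with percent-encoded credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return {
        scheme: f"{scheme}://{user}:{pwd}@{host}:{port}/"
        for scheme in ("http", "https")
    }


# Made-up credentials; the password contains characters that must be escaped.
proxies = build_proxies("alice", "p@ss:word", "proxy.service.com", 6666)
print(proxies["http"])
```

The resulting dictionary can then be passed as the proxies argument shown in the example above.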

API reference

For detailed documentation of all WebBaseLoader features and configurations, see the API reference: python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html