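Compatibility: Only available on Node.js.
This notebook provides a quick overview for getting started with the RecursiveUrlLoader document loader. For detailed documentation of all RecursiveUrlLoader features and configurations, head to the API reference.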

Overview

Integration details

| Class | Package | Local | Serializable | PY support |
| :--- | :--- | :---: | :---: | :---: |
| RecursiveUrlLoader | @langchain/community | | beta | |

Loader features

| Source | Web loader | Node environments only |
| :--- | :---: | :---: |
| RecursiveUrlLoader | ✅ | ✅ |
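When loading content from a website, we may want to process and load all the URLs on a page. For example, take the LangChain.js introduction docs. They have many interesting child pages that we may want to load, split, and later retrieve in bulk. The challenge is traversing the tree of child pages and assembling a list! We do this with the RecursiveUrlLoader. It also gives us the flexibility to exclude some child pages, customize the extractor, and more.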
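
To make the traversal concrete, here is a minimal sketch of a depth-limited, breadth-first crawl over an in-memory link map. It is not the loader's actual implementation: `LinkMap`, `crawl`, and the sample paths are hypothetical, depth here counts link hops from the root (the loader's own depth semantics may differ), and a real crawl would fetch each page and parse its links rather than look them up.

```typescript
// Hypothetical in-memory site: each path maps to the links found on that page.
type LinkMap = Record<string, string[]>;

// Collect every page reachable from `root` within `maxDepth` link hops,
// skipping any path that starts with one of the excluded prefixes.
function crawl(
  site: LinkMap,
  root: string,
  maxDepth: number,
  excludeDirs: string[] = []
): string[] {
  const isExcluded = (path: string): boolean =>
    excludeDirs.some((dir) => path.startsWith(dir));
  if (isExcluded(root)) return [];

  const visited = new Set<string>([root]);
  let frontier: string[] = [root];

  // Expand one level of links per iteration until the depth limit is reached.
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const page of frontier) {
      for (const child of site[page] ?? []) {
        if (visited.has(child) || isExcluded(child)) continue;
        visited.add(child);
        next.push(child);
      }
    }
    frontier = next;
  }
  return [...visited];
}

const site: LinkMap = {
  "/docs/": ["/docs/intro/", "/docs/api/"],
  "/docs/intro/": ["/docs/intro/install/"],
  "/docs/api/": [],
  "/docs/intro/install/": [],
};

const pages = crawl(site, "/docs/", 1, ["/docs/api/"]);
console.log(pages); // [ '/docs/', '/docs/intro/' ]
```

The visited set keeps pages that link to each other from being processed twice, and the exclusion check plays the role that `excludeDirs` plays in the loader's options.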

Setup

To access the RecursiveUrlLoader document loader, you'll need to install the @langchain/community integration and the jsdom package.

Credentials

If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:
# export LANGSMITH_TRACING="true"
# export LANGSMITH_API_KEY="your-api-key"

Installation

The LangChain RecursiveUrlLoader integration lives in the @langchain/community package:
<CodeGroup>
```bash npm
npm install @langchain/community @langchain/core jsdom
```
</CodeGroup>

We also suggest adding a package like [`html-to-text`](https://www.npmjs.com/package/html-to-text) or
[`@mozilla/readability`](https://www.npmjs.com/package/@mozilla/readability) for extracting the raw text from the page.

<CodeGroup>
```bash npm
npm install html-to-text
```
</CodeGroup>

Instantiation

Now we can instantiate our loader object and load documents:
import { RecursiveUrlLoader } from "@langchain/community/document_loaders/web/recursive_url";
import { compile } from "html-to-text";

// Compile the HTML-to-text converter once; returns (text: string) => string.
const compiledConvert = compile({ wordwrap: 130 });

const loader = new RecursiveUrlLoader("https://langchain.com/", {
  extractor: compiledConvert, // converts each page's HTML to plain text
  maxDepth: 1, // limits how deep the crawl follows links
  excludeDirs: ["/docs/api/"], // skips pages under this directory
});
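
The `extractor` option takes any function of type `(text: string) => string`, so `html-to-text` is not strictly required. Below is a minimal dependency-free sketch. `stripTags` is a hypothetical helper, and its regex-based tag removal is a rough assumption that will mishandle malformed or unusual HTML, so a real parser remains the safer choice.

```typescript
// Hypothetical extractor: crude regex-based HTML-to-text conversion.
const stripTags = (html: string): string =>
  html
    // Drop script and style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Replace every remaining tag with a space.
    .replace(/<[^>]+>/g, " ")
    // Collapse runs of whitespace and trim the ends.
    .replace(/\s+/g, " ")
    .trim();

console.log(
  stripTags("<h1>Title</h1><script>var x = 1;</script><p>Body text</p>")
); // Title Body text
```

A function like this could be passed as `extractor: stripTags` in place of `compiledConvert`.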

Load

const docs = await loader.load()
docs[0]
{
  pageContent: '\n' +
    '/\n' +
    'Products\n' +
    '\n' +
    'LangChain [/langchain]LangSmith [/langsmith]LangGraph [/langgraph]\n' +
    'Methods\n' +
    '\n' +
    'Retrieval [/retrieval]Agents [/agents]Evaluation [/evaluation]\n' +
    'Resources\n' +
    '\n' +
    'Blog [https://blog.langchain.dev/]Case Studies [/case-studies]Use Case Inspiration [/use-cases]Experts [/experts]Changelog\n' +
    '[https://changelog.langchain.com/]\n' +
    'Docs\n' +
    '\n' +
    'LangChain Docs [https://python.langchain.com/v0.2/docs/introduction/]LangSmith Docs [https://docs.smith.langchain.com/]\n' +
    'Company\n' +
    '\n' +
    'About [/about]Careers [/careers]\n' +
    'Pricing [/pricing]\n' +
    'Get a demo [/contact-sales]\n' +
    'Sign up [https://smith.langchain.com/]\n' +
    '\n' +
    '\n' +
    '\n' +
    '\n' +
    'LangChain’s suite of products supports developers along each step of the LLM application lifecycle.\n' +
    '\n' +
    '\n' +
    'APPLICATIONS THAT CAN REASON. POWERED BY LANGCHAIN.\n' +
    '\n' +
    'Get a demo [/contact-sales]Sign up for free [https://smith.langchain.com/]\n' +
    '\n' +
    '\n' +
    '\n' +
    'FROM STARTUPS TO GLOBAL ENTERPRISES,\n' +
    'AMBITIOUS BUILDERS CHOOSE\n' +
    'LANGCHAIN PRODUCTS.\n' +
    '\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c22746faa78338532_logo_Ally.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c08e67bb7eefba4c2_logo_Rakuten.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c576fdde32d03c1a0_logo_Elastic.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c6d5592036dae24e5_logo_BCG.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667f19528c3557c2c19c3086_the-home-depot-2%201.png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7cbcf6473519b06d84_logo_IDEO.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7cb5f96dcc100ee3b7_logo_Zapier.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/6606183e52d49bc369acc76c_mdy_logo_rgb_moodysblue.png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c8ad7db6ed6ec611e_logo_Adyen.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c737d50036a62768b_logo_Infor.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667f59d98444a5f98aabe21c_acxiom-vector-logo-2022%201.png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c09a158ffeaab0bd2_logo_Replit.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c9d2b23d292a0cab0_logo_Retool.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c44e67a3d0a996bf3_logo_Databricks.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667f5a1299d6ba453c78a849_image%20(19).png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c63af578816bafcc3_logo_Instacart.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/665dc1dabc940168384d9596_podium%20logo.svg]\n' +
    '\n' +
    'Build\n' +
    '\n' +
    'LangChain is a framework to build with LLMs by chaining interoperable components. LangGraph is the framework for building\n' +
    'controllable agentic workflows.\n' +
    '\n' +
    '\n' +
    '\n' +
    'Run\n' +
    '\n' +
    'Deploy your LLM applications at scale with LangGraph Cloud, our infrastructure purpose-built for agents.\n' +
    '\n' +
    '\n' +
    '\n' +
    'Manage\n' +
    '\n' +
    "Debug, collaborate, test, and monitor your LLM app in LangSmith - whether it's built with a LangChain framework or not. \n" +
    '\n' +
    '\n' +
    '\n' +
    '\n' +
    'BUILD YOUR APP WITH LANGCHAIN\n' +
    '\n' +
    'Build context-aware, reasoning applications with LangChain’s flexible framework that leverages your company’s data and APIs.\n' +
    'Future-proof your application by making vendor optionality part of your LLM infrastructure design.\n' +
    '\n' +
    'Learn more about LangChain\n' +
    '\n' +
    '[/langchain]\n' +
    '\n' +
    '\n' +
    'RUN AT SCALE WITH LANGGRAPH CLOUD\n' +
    '\n' +
    'Deploy your LangGraph app with LangGraph Cloud for fault-tolerant scalability - including support for async background jobs,\n' +
    'built-in persistence, and distributed task queues.\n' +
    '\n' +
    'Learn more about LangGraph\n' +
    '\n' +
    '[/langgraph]\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667c6d7284e58f4743a430e6_Langgraph%20UI-home-2.webp]\n' +
    '\n' +
    '\n' +
    'MANAGE LLM PERFORMANCE WITH LANGSMITH\n' +
    '\n' +
    'Ship faster with LangSmith’s debug, test, deploy, and monitoring workflows. Don’t rely on “vibes” – add engineering rigor to your\n' +
    'LLM-development workflow, whether you’re building with LangChain or not.\n' +
    '\n' +
    'Learn more about LangSmith\n' +
    '\n' +
    '[/langsmith]\n' +
    '\n' +
    '\n' +
    'HEAR FROM OUR HAPPY CUSTOMERS\n' +
    '\n' +
    'LangChain, LangGraph, and LangSmith help teams of all sizes, across all industries - from ambitious startups to established\n' +
    'enterprises.\n' +
    '\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308aee06d9826765c897_Retool_logo%201.png]\n' +
    '\n' +
    '“LangSmith helped us improve the accuracy and performance of Retool’s fine-tuned models. Not only did we deliver a better product\n' +
    'by iterating with LangSmith, but we’re shipping new AI features to our users in a fraction of the time it would have taken without\n' +
    'it.”\n' +
    '\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308abdd2dbbdde5a94a1_Jamie%20Cuffe.png]\n' +
    'Jamie Cuffe\n' +
    'Head of Self-Serve and New Products\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308a04d37cf7d3eb1341_Rakuten_Global_Brand_Logo.png]\n' +
    '\n' +
    '“By combining the benefits of LangSmith and standing on the shoulders of a gigantic open-source community, we’re able to identify\n' +
    'the right approaches of using LLMs in an enterprise-setting faster.”\n' +
    '\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308a8b6137d44c621cb4_Yusuke%20Kaji.png]\n' +
    'Yusuke Kaji\n' +
    'General Manager of AI\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308aea1371b447cc4af9_elastic-ar21.png]\n' +
    '\n' +
    '“Working with LangChain and LangSmith on the Elastic AI Assistant had a significant positive impact on the overall pace and\n' +
    'quality of the development and shipping experience. We couldn’t have achieved  the product experience delivered to our customers\n' +
    'without LangChain, and we couldn’t have done it at the same pace without LangSmith.”\n' +
    '\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308a4095d5a871de7479_James%20Spiteri.png]\n' +
    'James Spiteri\n' +
    'Director of Security Products\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c530539f4824b828357352_Logo_de_Fintual%201.png]\n' +
    '\n' +
    '“As soon as we heard about LangSmith, we moved our entire development stack onto it. We could have built evaluation, testing and\n' +
    'monitoring tools in house, but with LangSmith it took us 10x less time to get a 1000x better tool.”\n' +
    '\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c53058acbff86f4c2dcee2_jose%20pena.png]\n' +
    'Jose Peña\n' +
    'Senior Manager\n' +
    '\n' +
    '\n' +
    '\n' +
    '\n' +
    'THE REFERENCE ARCHITECTURE ENTERPRISES ADOPT FOR SUCCESS.\n' +
    '\n' +
    'LangChain’s suite of products can be used independently or stacked together for multiplicative impact – guiding you through\n' +
    'building, running, and managing your LLM apps.\n' +
    '\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/6695b116b0b60c78fd4ef462_15.07.24%20-Updated%20stack%20diagram%20-%20lightfor%20website-3.webp][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667d392696fc0bc3e17a6d04_New%20LC%20stack%20-%20light-2.webp]\n' +
    '15M+\n' +
    'Monthly Downloads\n' +
    '100K+\n' +
    'Apps Powered\n' +
    '75K+\n' +
    'GitHub Stars\n' +
    '3K+\n' +
    'Contributors\n' +
    '\n' +
    '\n' +
    'THE BIGGEST DEVELOPER COMMUNITY IN GENAI\n' +
    '\n' +
    'Learn alongside the 1M+ developers who are pushing the industry forward.\n' +
    '\n' +
    'Explore LangChain\n' +
    '\n' +
    '[/langchain]\n' +
    '\n' +
    '\n' +
    'GET STARTED WITH THE LANGSMITH PLATFORM TODAY\n' +
    '\n' +
    'Get a demo [/contact-sales]Sign up for free [https://smith.langchain.com/]\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ccf12801bc39bf912a58f3_Home%20C.webp]\n' +
    '\n' +
    'Teams building with LangChain are driving operational efficiency, increasing discovery & personalization, and delivering premium\n' +
    'products that generate revenue.\n' +
    '\n' +
    'Discover Use Cases\n' +
    '\n' +
    '[/use-cases]\n' +
    '\n' +
    '\n' +
    'GET INSPIRED BY COMPANIES WHO HAVE DONE IT.\n' +
    '\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65bcd7ee85507bdf350399c3_Ally_Financial%201.svg]\n' +
    'Financial Services\n' +
    '\n' +
    '[https://blog.langchain.dev/ally-financial-collaborates-with-langchain-to-deliver-critical-coding-module-to-mask-personal-identifying-information-in-a-compliant-and-safe-manner/]\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65bcd8b3ae4dc901daa3037a_Adyen_Corporate_Logo%201.svg]\n' +
    'FinTech\n' +
    '\n' +
    '[https://blog.langchain.dev/llms-accelerate-adyens-support-team-through-smart-ticket-routing-and-support-agent-copilot/]\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c534b3fa387379c0f4ebff_elastic-ar21%20(1).png]\n' +
    'Technology\n' +
    '\n' +
    '[https://blog.langchain.dev/langchain-partners-with-elastic-to-launch-the-elastic-ai-assistant/]\n' +
    '\n' +
    '\n' +
    'LANGSMITH IS THE ENTERPRISE DEVOPS PLATFORM BUILT FOR LLMS.\n' +
    '\n' +
    'Explore LangSmith\n' +
    '\n' +
    '[/langsmith]\n' +
    'Gain visibility to make trade offs between cost, latency, and quality.\n' +
    'Increase developer productivity.\n' +
    'Eliminate manual, error-prone testing.\n' +
    'Reduce hallucinations and improve reliability.\n' +
    'Enterprise deployment options to keep data secure.\n' +
    '\n' +
    '\n' +
    'READY TO START SHIPPING RELIABLE GENAI APPS FASTER?\n' +
    '\n' +
    'Get started with LangChain, LangGraph, and LangSmith to enhance your LLM app development, from prototype to production.\n' +
    '\n' +
    'Get a demo [/contact-sales]Sign up for free [https://smith.langchain.com/]\n' +
    'Products\n' +
    'LangChain [/langchain]LangSmith [/langsmith]LangGraph [/langgraph]Agents [/agents]Evaluation [/evaluation]Retrieval [/retrieval]\n' +
    'Resources\n' +
    'Python Docs [https://python.langchain.com/]JS/TS Docs [https://js.langchain.com/docs/get_started/introduction/]GitHub\n' +
    '[https://github.com/langchain-ai]Integrations [https://python.langchain.com/v0.2/docs/integrations/platforms/]Templates\n' +
    '[https://templates.langchain.com/]Changelog [https://changelog.langchain.com/]LangSmith Trust Portal\n' +
    '[https://trust.langchain.com/]\n' +
    'Company\n' +
    'About [/about]Blog [https://blog.langchain.dev/]Twitter [https://twitter.com/LangChain]LinkedIn\n' +
    '[https://www.linkedin.com/company/langchain/]YouTube [https://www.youtube.com/@LangChain]Community [/join-community]Marketing\n' +
    'Assets [https://drive.google.com/drive/folders/17xybjzmVBdsQA-VxouuGLxF6bDsHDe80?usp=sharing]\n' +
    'Sign up for our newsletter to stay up to date\n' +
    'Thank you! Your submission has been received!\n' +
    'Oops! Something went wrong while submitting the form.\n' +
    '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c6a38f9c53ec71f5fc73de_langchain-word.svg]\n' +
    'All systems operational\n' +
    '[https://status.smith.langchain.com/]Privacy Policy [/'... 111 more characters,
  metadata: {
    source: 'https://langchain.com/',
    title: 'LangChain',
    description: 'LangChain’s suite of products supports developers along each step of their development journey.',
    language: 'en'
  }
}
console.log(docs[0].metadata)
{
  source: 'https://langchain.com/',
  title: 'LangChain',
  description: 'LangChain’s suite of products supports developers along each step of their development journey.',
  language: 'en'
}
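
Printing a whole document is noisy for large pages. As a sketch, the helper below reduces each document to its source, title, and text length. `LoadedDoc` and `summarize` are hypothetical names, the metadata keys are assumed to match the output shown above, and the sample array stands in for the result of `loader.load()`.

```typescript
// Minimal shape of a loaded document, assumed from the output above.
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Build one summary line per document: source, title, and character count.
function summarize(docs: LoadedDoc[]): string[] {
  return docs.map((doc) => {
    const source = String(doc.metadata.source ?? "(unknown source)");
    const title = String(doc.metadata.title ?? "(untitled)");
    return `${source} | ${title} | ${doc.pageContent.length} chars`;
  });
}

// Stand-in data for illustration only.
const sample: LoadedDoc[] = [
  {
    pageContent: "Hello",
    metadata: { source: "https://langchain.com/", title: "LangChain" },
  },
];

console.log(summarize(sample).join("\n"));
// https://langchain.com/ | LangChain | 5 chars
```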

Options

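interface Options {
  excludeDirs?: string[]; // Webpage directories to exclude.
  extractor?: (text: string) => string; // A function that extracts the document text from the webpage. By default, it returns the page as is. Using a tool such as html-to-text is recommended.
  maxDepth?: number; // The maximum depth to crawl. Defaults to 2. To crawl the whole website, set it to a sufficiently large number.
  timeout?: number; // The timeout for each request, in milliseconds. Defaults to 10000 (10 seconds).
  preventOutside?: boolean; // Whether to prevent crawling outside the root URL. Defaults to true.
  callerOptions?: AsyncCallerConstructorParams; // Options for the AsyncCaller, for example the maximum concurrency (defaults to 64).
}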
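However, since perfect filtering is hard to achieve, you may still see some irrelevant documents in the results. If needed, you can filter the returned documents yourself. Most of the time, the returned results are good enough.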
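
Such post-filtering can be sketched as a plain array filter on `metadata.source`. `CrawledDoc`, `filterBySource`, and the sample documents are hypothetical, and the sketch assumes each document's `metadata.source` holds the page URL, as in the output shown earlier.

```typescript
// Minimal shape of a returned document, assumed from the earlier output.
interface CrawledDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Keep documents whose source URL starts with an allowed prefix
// and whose extracted text is not empty.
function filterBySource(
  docs: CrawledDoc[],
  allowedPrefixes: string[]
): CrawledDoc[] {
  return docs.filter((doc) => {
    const source = doc.metadata.source;
    if (typeof source !== "string") return false;
    const hasText = doc.pageContent.trim().length > 0;
    return hasText && allowedPrefixes.some((prefix) => source.startsWith(prefix));
  });
}

// Stand-in data for illustration only.
const crawled: CrawledDoc[] = [
  { pageContent: "LangSmith overview", metadata: { source: "https://langchain.com/langsmith" } },
  { pageContent: "Careers", metadata: { source: "https://langchain.com/careers" } },
  { pageContent: "   ", metadata: { source: "https://langchain.com/langsmith/empty" } },
];

const kept = filterBySource(crawled, ["https://langchain.com/langsmith"]);
console.log(kept.map((doc) => doc.metadata.source));
// [ 'https://langchain.com/langsmith' ]
```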

API reference

For detailed documentation of all RecursiveUrlLoader features and configurations, head to the API reference.