Overview
Integration details
| Class | Package | Serializable | JS support | Version |
|---|---|---|---|---|
| SmartScraperTool | langchain-scrapegraph | ✅ | ❌ | |
| SmartCrawlerTool | langchain-scrapegraph | ✅ | ❌ | |
| MarkdownifyTool | langchain-scrapegraph | ✅ | ❌ | |
| AgenticScraperTool | langchain-scrapegraph | ✅ | ❌ | |
| GetCreditsTool | langchain-scrapegraph | ✅ | ❌ | |
Tool features
| Tool | Purpose | Input | Output |
|---|---|---|---|
| SmartScraperTool | Extract structured data from websites | URL + prompt | JSON |
| SmartCrawlerTool | Extract data from multiple pages via crawling | URL + prompt + crawl options | JSON |
| MarkdownifyTool | Convert webpages to Markdown | URL | Markdown text |
| GetCreditsTool | Check API credits | None | Credit info |
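As a rough illustration of the input shapes in the table above, a small payload check might look like the following. This is a hypothetical helper for illustration only, not part of the langchain-scrapegraph package:

```python
# Hypothetical helper: check invoke() payloads against the
# input shapes listed in the table above.
REQUIRED_KEYS = {
    "SmartScraper": {"user_prompt", "website_url"},
    "SmartCrawler": {"url", "prompt"},
    "Markdownify": {"website_url"},
    "GetCredits": set(),
}


def validate_payload(tool_name: str, payload: dict) -> bool:
    """Return True if payload contains every required key for the tool."""
    return REQUIRED_KEYS[tool_name] <= payload.keys()


print(validate_payload(
    "SmartScraper",
    {"user_prompt": "Extract the title", "website_url": "https://scrapegraphai.com"},
))  # True
```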
Installation
The integration requires the following package:

```bash
pip install --quiet -U langchain-scrapegraph
```
Credentials
You need a ScrapeGraph AI API key to use these tools. Get one at scrapegraphai.com.

```python
import getpass
import os

if not os.environ.get("SGAI_API_KEY"):
    os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key:\n")
```

Optionally, set your LangSmith credentials to enable tracing of your tool calls:

```python
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
```
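In non-interactive environments (CI, scripts), a prompt is not an option, so failing fast with a clear message is friendlier than a late authentication error. A minimal sketch, assuming the tools read SGAI_API_KEY from the environment as shown above:

```python
import os


def require_sgai_key() -> str:
    """Return the ScrapeGraph API key, raising a clear error if unset."""
    key = os.environ.get("SGAI_API_KEY")
    if not key:
        raise RuntimeError(
            "SGAI_API_KEY is not set; get a key at scrapegraphai.com"
        )
    return key


# Placeholder value for demonstration only.
os.environ.setdefault("SGAI_API_KEY", "sgai-demo-key")
print(require_sgai_key())
```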
Instantiation
Here we show how to instantiate the ScrapeGraph tools:

```python
import json

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    MarkdownifyTool,
    SmartCrawlerTool,
    SmartScraperTool,
)
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

smartscraper = SmartScraperTool()
smartcrawler = SmartCrawlerTool()
markdownify = MarkdownifyTool()
credits = GetCreditsTool()
```
Invocation
Invoke directly with args
Let's try out each tool. The SmartCrawlerTool crawls multiple pages of a website and extracts structured data, with support for advanced crawling options such as depth control, page limits, and domain restrictions:
```python
# SmartScraper
result = smartscraper.invoke(
    {
        "user_prompt": "Extract the company name and description",
        "website_url": "https://scrapegraphai.com",
    }
)
print("SmartScraper Result:", result)

# Markdownify
markdown = markdownify.invoke({"website_url": "https://scrapegraphai.com"})
print("\nMarkdownify Result (first 200 chars):", markdown[:200])

# SmartCrawler
url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? and I need text content from their privacy and terms"
)

# Use the tool with crawling parameters
result_crawler = smartcrawler.invoke(
    {
        "url": url,
        "prompt": prompt,
        "cache_website": True,
        "depth": 2,
        "max_pages": 2,
        "same_domain_only": True,
    }
)
print("\nSmartCrawler Result:")
print(json.dumps(result_crawler, indent=2))

# Check credits
credits_info = credits.invoke({})
print("\nCredits Info:", credits_info)
```
```
SmartScraper Result: {'company_name': 'ScrapeGraphAI', 'description': "ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis."}

Markdownify Result (first 200 chars): [ScrapeGraphAI](https://scrapegraphai.com/)
PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up
Op

LocalScraper Result: {'company_name': 'Company Name', 'description': 'We are a technology company focused on AI solutions.', 'contact': {'email': 'contact@example.com', 'phone': '(555) 123-4567'}}

Credits Info: {'remaining_credits': 49679, 'total_credits_used': 914}
```
Here is a standalone SmartCrawler example:

```python
# SmartCrawler example
import json

from langchain_scrapegraph.tools import SmartCrawlerTool
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = SmartCrawlerTool()

url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? and I need text content from their privacy and terms"
)

# Use the tool with crawling parameters
result = tool.invoke(
    {
        "url": url,
        "prompt": prompt,
        "cache_website": True,
        "depth": 2,
        "max_pages": 2,
        "same_domain_only": True,
    }
)
print(json.dumps(result, indent=2))
```
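The crawl itself happens server-side, but the intuition behind `depth`, `max_pages`, and `same_domain_only` can be sketched as a toy breadth-first traversal over an in-memory link graph. This is an illustration of the semantics, not the service's exact behavior:

```python
from collections import deque
from urllib.parse import urlparse

# Toy link graph standing in for real pages.
LINKS = {
    "https://example.com/": [
        "https://example.com/privacy",
        "https://example.com/terms",
        "https://other.com/",
    ],
    "https://example.com/privacy": ["https://example.com/terms/archive"],
    "https://example.com/terms": [],
}


def crawl(start: str, depth: int, max_pages: int, same_domain_only: bool):
    """Breadth-first crawl honoring the three limits."""
    domain = urlparse(start).netloc
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, d = queue.popleft()
        order.append(url)
        if d == depth:
            continue  # don't follow links past the depth limit
        for nxt in LINKS.get(url, []):
            if same_domain_only and urlparse(nxt).netloc != domain:
                continue  # skip off-domain links
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return order


print(crawl("https://example.com/", depth=2, max_pages=2, same_domain_only=True))
# ['https://example.com/', 'https://example.com/privacy']
```

With `max_pages=2` the crawl stops after two pages even though more links are reachable; raising the limit lets it follow links until the depth limit cuts it off.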
Invoke with ToolCall
We can also invoke the tool with a model-generated ToolCall:

```python
model_generated_tool_call = {
    "args": {
        "user_prompt": "Extract the main heading and description",
        "website_url": "https://scrapegraphai.com",
    },
    "id": "1",
    "name": smartscraper.name,
    "type": "tool_call",
}
smartscraper.invoke(model_generated_tool_call)
```
```
ToolMessage(content='{"main_heading": "Get the data you need from any website", "description": "Easily extract and gather information with just a few lines of code with a simple api. Turn websites into clean and usable structured data."}', name='SmartScraper', tool_call_id='1')
```
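Note that the tool result arrives as a JSON string in the message content, so it can be loaded back into a dict for downstream use. A self-contained sketch using the content string from the output above:

```python
import json

# Content string as returned in the ToolMessage above.
content = (
    '{"main_heading": "Get the data you need from any website", '
    '"description": "Easily extract and gather information with just a few '
    'lines of code with a simple api. Turn websites into clean and usable '
    'structured data."}'
)

data = json.loads(content)
print(data["main_heading"])  # Get the data you need from any website
```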
Chaining
We can combine the tool with an LLM in a chain to analyze a website:

```python
# | output: false
# | echo: false

# pip install -qU langchain langchain-openai
from langchain.chat_models import init_chat_model

model = init_chat_model(model="gpt-4.1", model_provider="openai")
```
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig, chain

prompt = ChatPromptTemplate(
    [
        (
            "system",
            "You are a helpful assistant that can use tools to extract structured information from websites.",
        ),
        ("human", "{user_input}"),
        ("placeholder", "{messages}"),
    ]
)

model_with_tools = model.bind_tools([smartscraper], tool_choice=smartscraper.name)
model_chain = prompt | model_with_tools


@chain
def tool_chain(user_input: str, config: RunnableConfig):
    input_ = {"user_input": user_input}
    ai_msg = model_chain.invoke(input_, config=config)
    tool_msgs = smartscraper.batch(ai_msg.tool_calls, config=config)
    return model_chain.invoke(
        {**input_, "messages": [ai_msg, *tool_msgs]}, config=config
    )


tool_chain.invoke(
    "What does ScrapeGraph AI do? Extract this information from their website https://scrapegraphai.com"
)
```
```
AIMessage(content='ScrapeGraph AI is an AI-powered web scraping tool that efficiently extracts and converts website data into structured formats via a simple API. It caters to developers, data scientists, and AI researchers, offering features like easy integration, support for dynamic content, and scalability for large projects. It supports various website types, including business, e-commerce, and educational sites. Contact: contact@scrapegraphai.com.', additional_kwargs={'tool_calls': [{'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'function': {'arguments': '{"user_prompt":"Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.","website_url":"https://scrapegraphai.com"}', 'name': 'SmartScraper'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 480, 'total_tokens': 527, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_c7ca0ebaca', 'finish_reason': 'stop', 'logprobs': None}, id='run-45a12c86-d499-4273-8c59-0db926799bc7-0', tool_calls=[{'name': 'SmartScraper', 'args': {'user_prompt': 'Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.', 'website_url': 'https://scrapegraphai.com'}, 'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'type': 'tool_call'}], usage_metadata={'input_tokens': 480, 'output_tokens': 47, 'total_tokens': 527, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
```
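The tool_chain above follows a common pattern: the model emits tool calls, each call is dispatched to the matching tool by name, and the results are fed back to the model. Stripped of the LLM, the dispatch step can be sketched with a plain registry. The stub tool here is hypothetical and only mimics the JSON-string return shape:

```python
import json


def smart_scraper_stub(args: dict) -> str:
    """Stand-in for a tool invocation; returns a JSON string like the real tool."""
    return json.dumps({"scraped_from": args["website_url"]})


# Map tool names (as emitted in tool calls) to callables.
TOOL_REGISTRY = {"SmartScraper": smart_scraper_stub}


def dispatch(tool_calls: list) -> list:
    """Route each model-generated tool call to the matching tool."""
    return [TOOL_REGISTRY[call["name"]](call["args"]) for call in tool_calls]


results = dispatch([
    {
        "name": "SmartScraper",
        "args": {
            "user_prompt": "Extract the title",
            "website_url": "https://scrapegraphai.com",
        },
    }
])
print(results[0])  # {"scraped_from": "https://scrapegraphai.com"}
```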
API reference
For detailed documentation of all ScrapeGraph features and configurations, see the LangChain API reference, or visit the official SDK repository.

