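Apify Actors are cloud programs designed for a wide range of web scraping, crawling, and data extraction tasks. These Actors automate data collection from the web, letting users extract, process, and store information efficiently. Actors can be used for tasks such as scraping product details from e-commerce sites, monitoring price changes, or collecting search engine results. They integrate seamlessly with Apify Datasets, so the structured data an Actor collects can be stored, managed, and exported in formats such as JSON, CSV, or Excel for further analysis or use.

Overview

This notebook walks you through using Apify Actors with LangChain to automate web scraping and data extraction. The langchain-apify package integrates Apify's cloud-based tools with LangChain agents, giving AI applications efficient data collection and processing.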

Integration details

| Class | Package | Serializable | JS support | Version |
| ApifyActorsTool | langchain-apify | | | PyPI - Version |

Tool features

| Returns artifact | Native async | Return data | Pricing |
| | | Actor output (varies by Actor) | Pay for usage, with a free tier |

Setup

The integration lives in the langchain-apify package, which can be installed with pip.
pip install langchain-apify

Prerequisites

  • Apify account: Register for a free Apify account.
  • Apify API token: See the Apify documentation to learn how to get your API token.
import os

os.environ["APIFY_TOKEN"] = "your-apify-token"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
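Hardcoding tokens in a notebook makes them easy to leak. As a minimal sketch (the helper name `ensure_env` is hypothetical and not part of langchain-apify), the following prompts for a value only when the environment variable is not already set, using the standard-library `getpass` module:

```python
import getpass
import os


def ensure_env(name, prompt=getpass.getpass):
    """Return os.environ[name], prompting for it only if it is unset or empty."""
    if not os.environ.get(name):
        # getpass hides the typed value, so the token never appears in the notebook.
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]
```

Calling `ensure_env("APIFY_TOKEN")` and `ensure_env("OPENAI_API_KEY")` in place of the assignments above would then leave any values already exported in the shell untouched.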

Pricing

Apify uses pay-for-usage pricing and offers a free tier. Pricing varies by Actor: some Actors are free (you pay only for platform usage), while others charge per result or per event.

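Instantiation

The following example instantiates ApifyActorsTool to call the RAG Web Browser Apify Actor. This Actor provides web browsing functionality for AI and LLM applications, similar to the web browsing feature in ChatGPT. Any Actor from the Apify Store can be used in this way.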
from langchain_apify import ApifyActorsTool

tool = ApifyActorsTool("apify/rag-web-browser")

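Invocation

ApifyActorsTool takes a single argument, run_input, a dictionary that is passed to the Actor as its run input. The run input schema is documented in the input section of the Actor's details page. See the RAG Web Browser input schema.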
tool.invoke({"run_input": {"query": "what is apify?", "maxResults": 2}})
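Because the Actor input has to be nested under the `run_input` key, it can help to build the payload in one place. The following is a minimal sketch that assumes only the `query` and `maxResults` fields used above; the helper name `rag_browser_payload` is hypothetical:

```python
def rag_browser_payload(query, max_results=3):
    """Wrap RAG Web Browser input in the single `run_input` argument the tool expects."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    return {"run_input": {"query": query, "maxResults": max_results}}


payload = rag_browser_payload("what is apify?", max_results=2)
print(payload)
```

The result can be passed straight to the tool, for example `tool.invoke(payload)`.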

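Chaining

We can provide the created tool to an agent. When asked to search for information, the agent calls the Apify Actor, which searches the web and returns the search results.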
pip install langchain langgraph langchain-openai
from langchain.messages import ToolMessage
from langchain_openai import ChatOpenAI
from langchain.agents import create_agent


model = ChatOpenAI(model="gpt-5-mini")
tools = [tool]
graph = create_agent(model, tools=tools)
inputs = {"messages": [("user", "search for what is Apify")]}
for s in graph.stream(inputs, stream_mode="values"):
    message = s["messages"][-1]
    # skip tool messages
    if isinstance(message, ToolMessage):
        continue
    message.pretty_print()
================================ Human Message =================================

search for what is Apify
================================== Ai Message ==================================
Tool Calls:
  apify_actor_apify_rag-web-browser (call_27mjHLzDzwa5ZaHWCMH510lm)
 Call ID: call_27mjHLzDzwa5ZaHWCMH510lm
  Args:
    run_input: {"run_input":{"query":"Apify","maxResults":3,"outputFormats":["markdown"]}}
================================== Ai Message ==================================

Apify is a comprehensive platform for web scraping, browser automation, and data extraction. It offers a wide array of tools and services that cater to developers and businesses looking to extract data from websites efficiently and effectively. Here's an overview of Apify:

1. **Ecosystem and Tools**:
   - Apify provides an ecosystem where developers can build, deploy, and publish data extraction and web automation tools called Actors.
   - The platform supports various use cases such as extracting data from social media platforms, conducting automated browser-based tasks, and more.

2. **Offerings**:
   - Apify offers over 10,000 ready-made scraping tools and code templates.
   - Users can also build custom solutions or hire Apify's professional services for more tailored data extraction needs.

3. **Technology and Integration**:
   - The platform supports integration with popular tools and services like Zapier, GitHub, Google Sheets, Pinecone, and more.
   - Apify supports open-source tools and technologies such as JavaScript, Python, Puppeteer, Playwright, Selenium, and its own Crawlee library for web crawling and browser automation.

4. **Community and Learning**:
   - Apify hosts a community on Discord where developers can get help and share expertise.
   - It offers educational resources through the Web Scraping Academy to help users become proficient in data scraping and automation.

5. **Enterprise Solutions**:
   - Apify provides enterprise-grade web data extraction solutions with high reliability, 99.95% uptime, and compliance with SOC2, GDPR, and CCPA standards.

For more information, you can visit [Apify's official website](https://apify.com/) or their [GitHub page](https://github.com/apify) which contains their code repositories and further details about their projects.
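The filtering loop in the example can also be factored into a small generator, which is convenient when the same stream is consumed in more than one place. This is a sketch only: `final_messages` is a hypothetical helper, and it assumes each streamed state is a dict with a `messages` list, as in the example.

```python
def final_messages(states, skip_types=()):
    """Yield the newest message of each streamed state, skipping unwanted types."""
    for state in states:
        message = state["messages"][-1]
        # skip_types may be a single class or a tuple of classes, as isinstance allows.
        if isinstance(message, skip_types):
            continue
        yield message
```

With the agent above, this could be used as `for m in final_messages(graph.stream(inputs, stream_mode="values"), ToolMessage): m.pretty_print()`.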

More Actor examples

The Apify Store contains thousands of prebuilt Actors. Here are examples of other popular Actors:

Instagram scraper

from langchain_apify import ApifyActorsTool

instagram_tool = ApifyActorsTool("apify/instagram-scraper")

# Scrape Instagram posts
result = instagram_tool.invoke({
    "run_input": {
        "directUrls": ["https://www.instagram.com/humansofny/"],
        "resultsLimit": 10
    }
})

Google Search results scraper

google_search_tool = ApifyActorsTool("apify/google-search-scraper")

# Scrape Google Search results
result = google_search_tool.invoke({
    "run_input": {
        "queries": "langchain python tutorial",
        "maxPagesPerQuery": 1
    }
})
Browse the Apify Store to find more Actors suited to your use case.
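In the Google Search example, `queries` is a single string rather than a list. The Actor is understood to accept several search terms as one newline-separated string; treat that as an assumption and confirm it in the Actor's input schema. Under that assumption, a hypothetical helper `google_search_payload` could assemble the input from a list of terms:

```python
def google_search_payload(queries, max_pages_per_query=1):
    """Build a run_input payload, joining search terms with newlines (assumed format)."""
    cleaned = [q.strip() for q in queries if q.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty query is required")
    return {
        "run_input": {
            "queries": "\n".join(cleaned),
            "maxPagesPerQuery": max_pages_per_query,
        }
    }
```

For a single term this produces the same input as the example, so `google_search_tool.invoke(google_search_payload(["langchain python tutorial"]))` would be equivalent.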

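When to use Apify

Apify is a good fit when you need:
  • Access to thousands of prebuilt Actors covering many platforms (social media, e-commerce, search engines, and more)
  • Custom web scraping and automation workflows that go beyond simple search
  • Scraping without running your own infrastructure (the serverless platform handles scaling and maintenance)
  • A flexible Actor ecosystem in which any Actor from the Apify Store can be run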

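API reference

For more information on using this integration, see the git repository or the Apify integration documentation.

Using the Apify MCP server

Not sure which Actor to use or which parameters it requires? The Apify MCP (Model Context Protocol) server can help you discover available Actors, explore their input schemas, and understand parameter requirements. To use the Apify MCP server with LangChain: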
import os
from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain.agents import create_agent

client = MultiServerMCPClient({
    "apify": {
        "transport": "http",
        "url": "https://mcp.apify.com",
        "headers": {
            "Authorization": f"Bearer {os.environ['APIFY_TOKEN']}",
        },
    }
})

tools = await client.get_tools()
agent = create_agent("gpt-5-mini", tools)
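The `os.environ['APIFY_TOKEN']` lookup above raises a bare `KeyError` when the token is missing. A minimal sketch of building the same connection settings with a clearer error follows; `apify_mcp_connection` is a hypothetical helper, and the dictionary it returns simply mirrors the one shown above:

```python
import os


def apify_mcp_connection(token=None):
    """Return the MCP connection settings used above, failing early without a token."""
    token = token or os.environ.get("APIFY_TOKEN")
    if not token:
        raise RuntimeError("Set APIFY_TOKEN before connecting to the Apify MCP server")
    return {
        "apify": {
            "transport": "http",
            "url": "https://mcp.apify.com",
            "headers": {"Authorization": f"Bearer {token}"},
        }
    }
```

The client could then be created with `MultiServerMCPClient(apify_mcp_connection())`.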
For more information, see the LangChain MCP documentation and the Apify MCP server.