Hugging Face Endpoints integration

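The Hugging Face Hub is a platform with over 120k models, 20k datasets, and 50k demo apps (Spaces). All of them are open source and publicly available, and the online platform makes it easy to collaborate and build machine learning projects together.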

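The Hugging Face Hub also offers several kinds of endpoints for building machine learning applications. This example shows how to connect to the different endpoint types. In particular, text generation inference is powered by Text Generation Inference: a custom-built Rust, Python, and gRPC server for very fast text generation inference.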

from langchain_huggingface import HuggingFaceEndpoint

Installation and setup

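To use it, you need the huggingface_hub Python package installed, along with langchain-huggingface, which provides the HuggingFaceEndpoint class imported in the examples.

pip install -qU huggingface_hub langchain-huggingface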

# Get a token: https://huggingface.co/docs/api-inference/quicktour#get-your-api-token

from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass()

import os

os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN
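If the token may already be exported in your shell, prompting every time is unnecessary. The following is a minimal sketch of a variation that prompts only when the variable is missing. The helper name ensure_hf_token and its prompt_fn parameter are hypothetical, introduced here for illustration; they are not part of huggingface_hub or LangChain.

```python
import os
from getpass import getpass


def ensure_hf_token(env_var="HUGGINGFACEHUB_API_TOKEN", prompt_fn=getpass):
    """Return the token from the environment, prompting only if it is missing.

    Hypothetical helper for illustration; not part of any library.
    """
    token = os.environ.get(env_var)
    if not token:
        # Nothing exported yet: ask interactively and store it for later cells.
        token = prompt_fn("Hugging Face API token: ")
        os.environ[env_var] = token
    return token
```

With this in place, the assignment above could be written as HUGGINGFACEHUB_API_TOKEN = ensure_hf_token().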

Prepare examples

from langchain_huggingface import HuggingFaceEndpoint

from langchain_core.prompts import PromptTemplate

question = "Who won the FIFA World Cup in the year 1994? "

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)
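To see what text the model will actually receive, it can help to render the template by hand. PromptTemplate.from_template uses f-string style placeholders by default, so for a simple template like this one the result should match Python's own str.format. The sketch below uses only the standard library and repeats the template and question from above so that it runs on its own.

```python
# Same template and question as in the example above.
template = """Question: {question}

Answer: Let's think step by step."""

question = "Who won the FIFA World Cup in the year 1994? "

# Plain-Python stand-in for what the prompt produces when the chain is invoked.
rendered = template.format(question=question)
print(rendered)
```

The trailing "Answer: Let's think step by step." line is what nudges the model toward step-by-step reasoning before it answers.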

Examples

The following example shows how to use the HuggingFaceEndpoint integration with the serverless Inference Providers API.

repo_id = "deepseek-ai/DeepSeek-R1-0528"

llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    max_new_tokens=128,
    temperature=0.5,
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
    provider="auto",  # set your provider here: hf.co/settings/inference-providers
    # provider="hyperbolic",
    # provider="nebius",
    # provider="together",
)
llm_chain = prompt | llm
print(llm_chain.invoke({"question": question}))

Dedicated endpoint

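The free serverless API lets you implement solutions and iterate quickly, but it may be rate limited under heavy use, since the load is shared with other requests. For enterprise workloads, the best option is Inference Endpoints - Dedicated. This gives access to fully managed infrastructure that offers more flexibility and speed. These resources come with continued support and uptime guarantees, as well as options such as AutoScaling.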

# Set the URL of your Inference Endpoint below
your_endpoint_url = "https://fayjubiy2xqn36z0.us-east-1.aws.endpoints.huggingface.cloud"

llm = HuggingFaceEndpoint(
    endpoint_url=f"{your_endpoint_url}",
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)
llm.invoke("What did foo say about bar?")

Streaming

from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    endpoint_url=f"{your_endpoint_url}",
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    streaming=True,
)
llm.invoke(
    "What did foo say about bar?",
    config={"callbacks": [StreamingStdOutCallbackHandler()]},
)

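The same HuggingFaceEndpoint class can also be used with a local Hugging Face TGI instance that serves the LLM. See the TGI repository for details on support for various hardware (GPU, TPU, Gaudi, and so on).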
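As a rough sketch of the local case, the only change is the endpoint_url. The host and port below (localhost:8080) are assumptions about how the TGI container was started, not values taken from this page, and the helper names local_tgi_kwargs and make_local_llm are hypothetical. The generation parameters mirror the dedicated endpoint example above.

```python
def local_tgi_kwargs(host="localhost", port=8080, max_new_tokens=512):
    """Build keyword arguments for a HuggingFaceEndpoint aimed at a local TGI server.

    host and port are assumptions; adjust them to match your own deployment.
    """
    return {
        "endpoint_url": f"http://{host}:{port}",
        "max_new_tokens": max_new_tokens,
        "temperature": 0.01,
        "repetition_penalty": 1.03,
    }


def make_local_llm(host="localhost", port=8080):
    """Create the endpoint client; requires langchain-huggingface to be installed."""
    # Imported lazily so the kwargs helper can be used without the package.
    from langchain_huggingface import HuggingFaceEndpoint

    return HuggingFaceEndpoint(**local_tgi_kwargs(host, port))
```

Once a TGI server is listening at that address, make_local_llm().invoke("What did foo say about bar?") should behave like the dedicated endpoint example.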
