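llama-cpp-python is a simple Python binding for @ggerganov's llama.cpp. The package provides:

Low-level access to the C API via a ctypes interface

A high-level Python API for text completion

An OpenAI-like API

LangChain compatibility

LlamaIndex compatibility

An OpenAI-compatible web server

A local Copilot replacement

Function Calling support

Vision API support

Support for multiple models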

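Overview

Integration details

Class	Package	Serializable	JavaScript support
`ChatLlamaCpp`	`langchain-community`	❌	❌

Model features

Tool calling	Structured output	Image input	Audio input	Video input	Token-level streaming	Native async	Token usage	Logprobs
✅	✅	❌	❌	❌	✅	❌	❌	✅

Setup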

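To get started and use all the features shown below, we recommend a model that has been fine-tuned for tool calling. We will use Hermes-2-Pro-Llama-3-8B-GGUF from NousResearch.

Hermes 2 Pro is an upgraded version of Nous Hermes 2, consisting of an updated and cleaned version of the OpenHermes 2.5 dataset, as well as a newly introduced Function Calling and JSON Mode dataset developed in-house. This new version of Hermes maintains its excellent general task and conversation capabilities, and also excels at Function Calling.

For a deeper dive, see our guide on running models locally.

Installation

The LangChain LlamaCpp integration lives in the langchain-community and llama-cpp-python packages: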

pip install -qU langchain-community llama-cpp-python

Instantiation

Now we can instantiate our model object and generate chat completions:

import multiprocessing

from langchain_community.chat_models import ChatLlamaCpp

# Path to your model weights
local_model = "local/path/to/Hermes-2-Pro-Llama-3-8B-Q8_0.gguf"

llm = ChatLlamaCpp(
    temperature=0.5,
    model_path=local_model,
    n_ctx=10000,
    n_gpu_layers=8,
    n_batch=300,  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU.
    max_tokens=512,
    n_threads=multiprocessing.cpu_count() - 1,
    repeat_penalty=1.5,
    top_p=0.5,
    verbose=True,
)
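
Two of the values above depend on the machine: n_threads becomes 0 on a single-core host when computed as cpu_count() - 1, and n_batch is only meaningful between 1 and n_ctx. The following is a minimal sketch of guarding both before they are passed to the constructor. The helper names pick_n_threads and clamp_n_batch are hypothetical and are not part of LangChain or llama-cpp-python.

```python
import multiprocessing


def pick_n_threads(reserve: int = 1) -> int:
    """Leave `reserve` cores free for the rest of the system, but never go below 1."""
    return max(1, multiprocessing.cpu_count() - reserve)


def clamp_n_batch(n_batch: int, n_ctx: int) -> int:
    """Keep the batch size inside the range 1..n_ctx."""
    return max(1, min(n_batch, n_ctx))


# Hypothetical settings dict; the keys mirror the constructor arguments used above.
settings = {
    "n_ctx": 10000,
    "n_batch": clamp_n_batch(300, 10000),
    "n_threads": pick_n_threads(),
}
print(settings)
```

The resulting values could then be passed as keyword arguments in place of the literals.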

Invocation

messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]

ai_msg = llm.invoke(messages)
ai_msg

print(ai_msg.content)

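J'aime programmer. (In France, "programming" is commonly used in its original sense of scheduling or organizing events.)

If you meant computer programming:
Je suis amoureux de la programmation informatique.

(You could also simply say 'programmation', and both meanings would be understood depending on context).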

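Tool calling

Tool calling with ChatLlamaCpp works largely the same way as OpenAI function calling. OpenAI has a tool calling API (we use "tool calling" and "function calling" interchangeably here) that lets you describe tools and their arguments, and have the model return a JSON object naming the tool to invoke and the inputs to pass to it. Tool calling is very useful for building tool-using chains and agents, and more generally for getting structured output from a model.

With ChatLlamaCpp.bind_tools, we can easily pass Pydantic classes, dict schemas, LangChain tools, or even plain functions to the model as tools. Under the hood, these are converted to an OpenAI tool schema, which looks like: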

{
    "name": "...",
    "description": "...",
    "parameters": {...}  # JSONSchema
}

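and passed in on every model invocation. However, the model cannot trigger a function/tool automatically; we have to force it by specifying the tool_choice parameter, which typically takes the following form: {"type": "function", "function": {"name": <<tool_name>>}}.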
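
To make the two shapes above concrete, here is a stdlib-only sketch that builds a tool schema by hand along with the matching tool_choice value. The helper names make_tool_schema and force_tool are hypothetical and exist only for illustration; in practice bind_tools performs the schema conversion for you.

```python
import json


def make_tool_schema(name: str, description: str, properties: dict, required: list) -> dict:
    """Assemble a tool description in the name/description/parameters shape."""
    return {
        "name": name,
        "description": description,
        "parameters": {  # JSON Schema describing the tool's arguments
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def force_tool(tool_name: str) -> dict:
    """Build the tool_choice value that forces the model to call `tool_name`."""
    return {"type": "function", "function": {"name": tool_name}}


weather_schema = make_tool_schema(
    name="get_current_weather",
    description="Get the current weather in a given location",
    properties={
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    required=["location", "unit"],
)
tool_choice = force_tool(weather_schema["name"])

print(json.dumps(weather_schema, indent=2))
print(json.dumps(tool_choice))
```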

from langchain.tools import tool
from pydantic import BaseModel, Field


class WeatherInput(BaseModel):
    location: str = Field(description="The city and state, e.g. San Francisco, CA")
    unit: str = Field(enum=["celsius", "fahrenheit"])


@tool("get_current_weather", args_schema=WeatherInput)
def get_weather(location: str, unit: str):
    """Get the current weather in a given location"""
    return f"Now the weather in {location} is 22 {unit}"


llm_with_tools = llm.bind_tools(
    tools=[get_weather],
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)

ai_msg = llm_with_tools.invoke(
    "what is the weather like in HCMC in celsius",
)

ai_msg.tool_calls

[{'name': 'get_current_weather',
  'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'},
  'id': 'call__0_get_current_weather_cmpl-394d9943-0a1f-425b-8139-d2826c1431f2'}]

class MagicFunctionInput(BaseModel):
    magic_function_input: int = Field(description="The input value for magic function")


@tool("get_magic_function", args_schema=MagicFunctionInput)
def magic_function(magic_function_input: int):
    """Get the value of magic function for an input."""
    return magic_function_input + 2


llm_with_tools = llm.bind_tools(
    tools=[magic_function],
    tool_choice={"type": "function", "function": {"name": "get_magic_function"}},
)

ai_msg = llm_with_tools.invoke(
    "What is magic function of 3?",
)

ai_msg

ai_msg.tool_calls

[{'name': 'get_magic_function',
  'args': {'magic_function_input': 3},
  'id': 'call__0_get_magic_function_cmpl-cd83a994-b820-4428-957c-48076c68335a'}]
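
The model only returns a description of the call; nothing above actually executes the tool. As a rough sketch of the missing step, the code below looks up each returned call by name in a registry and invokes it with the returned arguments. It uses a plain Python function in place of the decorated LangChain tool so that it runs without a model, and the names TOOL_REGISTRY and run_tool_calls are hypothetical.

```python
def magic_function(magic_function_input: int) -> int:
    """Plain-function stand-in for the decorated tool defined above."""
    return magic_function_input + 2


# Hypothetical registry mapping tool names to callables.
TOOL_REGISTRY = {"get_magic_function": magic_function}


def run_tool_calls(tool_calls: list, registry: dict) -> list:
    """Execute each tool call and pair its output with the originating call id."""
    results = []
    for call in tool_calls:
        func = registry.get(call["name"])
        if func is None:
            raise KeyError(f"No tool registered under {call['name']!r}")
        results.append({"tool_call_id": call["id"], "output": func(**call["args"])})
    return results


# Same shape as ai_msg.tool_calls in the output above.
tool_calls = [
    {
        "name": "get_magic_function",
        "args": {"magic_function_input": 3},
        "id": "call__0_get_magic_function_cmpl-cd83a994-b820-4428-957c-48076c68335a",
    }
]

results = run_tool_calls(tool_calls, TOOL_REGISTRY)
print(results[0]["output"])
```

With a real LangChain tool object, the call would go through the tool's invoke method rather than a direct function call.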

Structured output

from langchain_core.utils.function_calling import convert_to_openai_tool
from pydantic import BaseModel


class Joke(BaseModel):
    """A setup to a joke and the punchline."""

    setup: str
    punchline: str


dict_schema = convert_to_openai_tool(Joke)
structured_llm = llm.with_structured_output(dict_schema)
result = structured_llm.invoke("Tell me a joke about birds")
result

{'setup': '- Why did the chicken cross the playground?',
 'punchline': '\n\n- To get to its gilded cage on the other side!'}
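
Because the structured result comes back as a plain dict, and the values shown above carry stray leading dashes and newlines, it can help to validate and tidy it before use. Below is a small stdlib-only sketch; the JokeRecord dataclass and the parse_joke helper are hypothetical names, and the cleanup rule (dropping leading dashes and whitespace) is an assumption based on the sample output.

```python
from dataclasses import dataclass, fields


@dataclass
class JokeRecord:
    setup: str
    punchline: str


def parse_joke(raw: dict) -> JokeRecord:
    """Check that the expected keys are present and strip list-style prefixes."""
    missing = [f.name for f in fields(JokeRecord) if f.name not in raw]
    if missing:
        raise ValueError(f"Structured output is missing keys: {missing}")

    def clean(text: str) -> str:
        # Drop leading dashes/whitespace, then trailing whitespace.
        return str(text).lstrip(" -\n").rstrip()

    return JokeRecord(setup=clean(raw["setup"]), punchline=clean(raw["punchline"]))


# Dict shaped like the structured output shown above.
raw_result = {
    "setup": "- Why did the chicken cross the playground?",
    "punchline": "\n\n- To get to its gilded cage on the other side!",
}
joke = parse_joke(raw_result)
print(joke.setup)
print(joke.punchline)
```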

Streaming

for chunk in llm.stream("what is 25x5"):
    print(chunk.content, end="\n", flush=True)
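
If the full response is needed after streaming finishes, the chunks can be accumulated while they are consumed. The sketch below assumes only that each chunk exposes a content attribute, as in the loop above. It uses a fake stream built from SimpleNamespace objects so that it runs without a model; collect_stream is a hypothetical helper name, and the fake pieces are made-up sample data, not real model output.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the content of every streamed chunk into one string."""
    parts = []
    for chunk in chunks:
        parts.append(chunk.content)
    return "".join(parts)


# Stand-in for llm.stream(...): any iterable of objects with a `content` attribute.
fake_stream = (SimpleNamespace(content=piece) for piece in ["25 x 5", " = ", "125"])

text = collect_stream(fake_stream)
print(text)
```

With a real model, the same helper could be called as collect_stream(llm.stream("what is 25x5")).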
