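This notebook gives an overview of using the langchain-bodo integration package to create agents that answer questions over large datasets. Under the hood, the integration uses Bodo DataFrames and the Python agent.

Bodo DataFrames is a high-performance DataFrame library that automatically accelerates and scales Pandas code with a one-line import change (see the example below). Because it is highly compatible with Pandas, Bodo DataFrames lets LLMs, which are typically good at generating Pandas code, answer questions about large datasets more efficiently and scale the generated code beyond the limits of Pandas.

Note: the Python agent executes LLM-generated Python code, which can be dangerous if the generated code is harmful. Use with caution.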

Setup

Before running the examples, download the Titanic dataset and save it locally as titanic.csv.

Installing langchain-bodo also installs Bodo and Pandas as dependencies:
pip install --quiet -U langchain-bodo langchain-openai
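
Before moving on, it can help to confirm that the downloaded file contains the columns the later examples rely on. The following is a minimal sketch using only the standard library; the helper name `missing_columns` is hypothetical (it is not part of langchain-bodo), and the column list is simply collected from the examples in this notebook.

```python
import csv

# Columns referenced by the examples in this notebook.
REQUIRED_COLUMNS = ["Age", "Pclass", "Survived", "Fare", "Sex"]


def missing_columns(path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    # utf-8-sig also accepts plain UTF-8 and strips a leading BOM if present.
    with open(path, newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])
    return [name for name in required if name not in header]
```

Calling `missing_columns("titanic.csv")` should return an empty list when the file has all of the expected columns. Only the header row is read, so the check stays cheap even for large files.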

Credentials

Bodo DataFrames is free to use and requires no additional credentials. The examples use OpenAI models; if you have not already configured it, set your OPENAI_API_KEY:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key:\n")
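
The `getpass` prompt above suits interactive notebooks. In a script or a scheduled job there is nobody to answer the prompt, so failing fast may be preferable. A minimal sketch of that alternative follows; `require_env` is a hypothetical helper introduced here for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the agent.")
    return value
```

For example, `require_env("OPENAI_API_KEY")` either returns the key or stops the program with a clear message before any agent is constructed.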

Create and invoke the agent

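The following examples are adapted from the Pandas DataFrame agent notebook, with some modifications to highlight the key differences. The first example shows how to pass a Bodo DataFrame directly to create_bodo_dataframes_agent and ask a simple question.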
import bodo.pandas as pd
from langchain.agents.agent_types import AgentType
from langchain_bodo import create_bodo_dataframes_agent
from langchain_openai import ChatOpenAI, OpenAI

# Path to the local Titanic data
datapath = "titanic.csv"

# bodo.pandas takes the place of `import pandas as pd` (the one-line import change)
df = pd.read_csv(datapath)

Using ZERO_SHOT_REACT_DESCRIPTION

The following shows how to initialize the agent with the ZERO_SHOT_REACT_DESCRIPTION agent type.
agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0), df, verbose=True, allow_dangerous_code=True
)

Using OpenAI functions

The following shows how to initialize the agent with the OPENAI_FUNCTIONS agent type, as an alternative to the approach above.
agent = create_bodo_dataframes_agent(
    ChatOpenAI(temperature=0, model="gpt-3.5-turbo-1106"),
    df,
    verbose=True,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    allow_dangerous_code=True,
)
agent.invoke("how many rows are there?")
> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': 'len(df)'}`

891There are 891 rows in the dataframe.

> Finished chain.
{'input': 'how many rows are there?', 'output': 'There are 891 rows in the dataframe.'}
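
As the printed result shows, `agent.invoke` returns a dictionary with `input` and `output` keys, where `output` is free-form text. If downstream code needs a number, one option is to extract it from that text. The sketch below works on a literal copy of the result above so that it runs without calling a model; `first_number` is a hypothetical helper, and regex extraction is only a heuristic because the model's wording can vary.

```python
import re

# Result dictionary copied from the run shown above.
result = {
    "input": "how many rows are there?",
    "output": "There are 891 rows in the dataframe.",
}


def first_number(text):
    """Return the first integer or decimal number in text as a float, or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


row_count = first_number(result["output"])
print(row_count)  # 891.0
```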

Create and invoke an agent with Bodo DataFrames and preprocessing

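This example shows a slightly more complex use case: passing a Bodo DataFrame with some additional preprocessing to create_bodo_dataframes_agent. Because Bodo DataFrames are lazily evaluated, computation can be saved when not all columns are needed to answer the question. Note that the DataFrame passed to the agent can also be larger than available memory.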
df2 = df[["Age", "Pclass", "Survived", "Fare"]]

# Potentially expensive computation using df.apply:
df2["Age"] = df2.apply(lambda x: x["Age"] if x["Pclass"] == 3 else 0, axis=1)

agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0), df2, verbose=True, allow_dangerous_code=True
)
# The df2["Age"] column is lazy and will not be evaluated unless the agent explicitly uses it.
agent.invoke("Out of the people who survived, what was their average fare?")
> Entering new AgentExecutor chain...
Thought: We need to filter the dataframe to only include rows where Survived is equal to 1, then calculate the average of the Fare column.
Action: python_repl_ast
Action Input: df[df["Survived"] == 1]["Fare"].mean()48.3954076023391748.39540760233917 is the average fare for people who survived.
Final Answer: 48.39540760233917

> Finished chain.
{'input': 'Out of the people who survived, what was their average fare?', 'output': '48.39540760233917'}
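
To make the lazy-evaluation idea concrete, here is a toy sketch in plain Python. It is not how Bodo DataFrames is implemented; `LazyColumn` is a hypothetical class, and the three rows are made-up values for illustration, not taken from the Titanic dataset. It only shows the principle: a deferred column costs nothing if the question never touches it.

```python
class LazyColumn:
    """Toy deferred column: stores a function and runs it only on first access."""

    def __init__(self, compute):
        self._compute = compute
        self._values = None
        self.evaluated = False

    def values(self):
        if not self.evaluated:
            self._values = self._compute()
            self.evaluated = True
        return self._values


# Made-up rows for illustration only.
rows = [
    {"Age": 22.0, "Pclass": 3, "Survived": 0, "Fare": 10.0},
    {"Age": 38.0, "Pclass": 1, "Survived": 1, "Fare": 70.0},
    {"Age": 26.0, "Pclass": 3, "Survived": 1, "Fare": 30.0},
]

# The deferred column mirrors the df2["Age"] preprocessing above.
age = LazyColumn(lambda: [r["Age"] if r["Pclass"] == 3 else 0 for r in rows])

# A question about fares never touches the lazy Age column.
survivor_fares = [r["Fare"] for r in rows if r["Survived"] == 1]
average_fare = sum(survivor_fares) / len(survivor_fares)

print(average_fare)   # 50.0
print(age.evaluated)  # False: the Age computation was never needed
```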

Multi-DataFrame example

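You can also pass multiple DataFrames to the agent. Note that although Bodo DataFrames supports most compute-intensive operations in Pandas, if the agent generates code that is not currently supported (see the warnings below), the DataFrame falls back to Pandas to avoid errors. For the currently supported features, see the Bodo DataFrames API documentation.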
agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0), [df, df2], verbose=True, allow_dangerous_code=True
)
agent.invoke("how many rows in the age column are different?")
> Entering new AgentExecutor chain...
Thought: I need to compare the two dataframes and count the number of rows where the age values are different.
Action: python_repl_ast
Action Input: len(df1[df1["Age"] != df2["Age"]])

... BodoLibFallbackWarning: Series._cmp_method is not implemented in Bodo DataFrames for the specified arguments yet. Falling back to Pandas (may be slow or run out of memory).
Exception: binary operation arguments must have the same dataframe source.
    warnings.warn(BodoLibFallbackWarning(msg))
... BodoLibFallbackWarning: DataFrame.__getitem__ is not implemented in Bodo DataFrames for the specified arguments yet. Falling back to Pandas (may be slow or run out of memory).
Exception: DataFrame getitem: Only selecting columns or filtering with BodoSeries is supported.
    warnings.warn(BodoLibFallbackWarning(msg))

359359 rows have different age values.
Final Answer: 359

> Finished chain.
{'input': 'how many rows in the age column are different?', 'output': '359'}

Optimizing agent invocation with number_of_head_rows

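By default, the head of the DataFrame is embedded in the prompt as a Markdown table. Because Bodo DataFrames are lazily evaluated, this head operation can be optimized, but it may still be slow in some cases. As an optimization, you can set the number of head rows to 0 so that no evaluation is triggered at the prompt stage.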
agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0),
    df,
    verbose=True,
    number_of_head_rows=0,
    allow_dangerous_code=True,
)
agent.invoke("What is the average age of all female passengers?")
> Entering new AgentExecutor chain...
Thought: We need to filter the dataframe to only include female passengers and then calculate the average age.
Action: python_repl_ast
Action Input: df[df["Sex"] == "female"]["Age"].mean()27.91570881226053727.915708812260537 seems like a reasonable average age for female passengers.
Final Answer: 27.915708812260537

> Finished chain.
{'input': 'What is the average age of all female passengers?', 'output': '27.915708812260537'}
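
For intuition about what embedding the head in the prompt involves, the toy sketch below renders the first `n` rows of a small in-memory table as Markdown. It is an illustration only, not the code the agent uses; `head_as_markdown` and the sample rows are hypothetical. With `n` set to 0 the function returns immediately without reading any rows, which is the effect `number_of_head_rows=0` aims for at the prompt stage.

```python
def head_as_markdown(columns, rows, n):
    """Render the first n rows as a Markdown table; return "" when n is 0."""
    if n <= 0:
        return ""  # nothing is read from rows, so nothing needs evaluating
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows[:n]:
        lines.append("| " + " | ".join(str(row[c]) for c in columns) + " |")
    return "\n".join(lines)


# Made-up sample rows for illustration only.
sample_columns = ["Sex", "Age"]
sample_rows = [{"Sex": "female", "Age": 38.0}, {"Sex": "male", "Age": 22.0}]

print(head_as_markdown(sample_columns, sample_rows, 1))
```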

Passing Pandas DataFrames

You can also pass one or more Pandas DataFrames to create_bodo_dataframes_agent; they are converted to Bodo DataFrames before being passed to the agent.
import pandas

pdf = pandas.read_csv(datapath)

agent = create_bodo_dataframes_agent(
    OpenAI(temperature=0), pdf, verbose=True, allow_dangerous_code=True
)
agent.invoke("What is the square root of the average age?")
> Entering new AgentExecutor chain...
Thought: We need to calculate the average age first and then take the square root.
Action: python_repl_ast
Action Input: df["Age"].mean()29.69911764705882 Now we have the average age, we can take the square root.
Action: python_repl_ast
Action Input: math.sqrt(df["Age"].mean())NameError: name 'math' is not defined We need to import the math library to use the sqrt function.
Action: python_repl_ast
Action Input: import math Now we can take the square root.
Action: python_repl_ast
Action Input: math.sqrt(df["Age"].mean())5.449689683556195 I now know the final answer.
Final Answer: 5.449689683556195

> Finished chain.
{'input': 'What is the square root of the average age?', 'output': '5.449689683556195'}

API reference

Bodo DataFrames API documentation