Facebook Messenger 集成

本 notebook 演示如何将 Facebook 数据加载为可供微调使用的格式。整体步骤如下：

将您的 Messenger 数据下载到磁盘。
创建 Chat Loader 并调用 loader.load()（或 loader.lazy_load()）执行转换。
可选地使用 merge_chat_runs 合并同一发送者连续发送的消息，和/或使用 map_ai_messages 将指定发送者的消息转换为”AIMessage”类。完成后，调用 convert_messages_for_finetuning 准备微调数据。

完成上述步骤后，您可以对模型进行微调。具体步骤如下：

将您的消息上传到 OpenAI 并运行微调任务。
在您的 LangChain 应用中使用生成的模型！

让我们开始吧。

1. 下载数据

要下载您自己的 Messenger 数据，请按照此处的说明操作。重要提示 —— 请确保以 JSON 格式（而非 HTML）下载。我们在此 Google Drive 链接托管了一个示例转储文件，本演示将使用该文件。

# This uses some example data
import zipfile

import requests


def download_and_unzip(url: str, output_path: str = "file.zip") -> None:
    file_id = url.split("/")[-2]
    download_url = f"https://drive.google.com/uc?export=download&id={file_id}"

    response = requests.get(download_url)
    if response.status_code != 200:
        print("Failed to download the file.")
        return

    with open(output_path, "wb") as file:
        file.write(response.content)
        print(f"File {output_path} downloaded.")

    with zipfile.ZipFile(output_path, "r") as zip_ref:
        zip_ref.extractall()
        print(f"File {output_path} has been unzipped.")


# URL of the file to download
url = (
    "https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing"
)

# Download and unzip
download_and_unzip(url)

File file.zip downloaded.
File file.zip has been unzipped.

2. 创建聊天加载器

我们有两个不同的 FacebookMessengerChatLoader 类，一个用于整个聊天目录，另一个用于加载单个文件。

directory_path = "./hogwarts"

from langchain_community.chat_loaders.facebook_messenger import (
    FolderFacebookMessengerChatLoader,
    SingleFileFacebookMessengerChatLoader,
)

loader = SingleFileFacebookMessengerChatLoader(
    path="./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json",
)

chat_session = loader.load()[0]
chat_session["messages"][:3]

[HumanMessage(content="Hi Hermione! How's your summer going so far?", additional_kwargs={'sender': 'Harry Potter'}),
 HumanMessage(content="Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?", additional_kwargs={'sender': 'Hermione Granger'}),
 HumanMessage(content="I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!", additional_kwargs={'sender': 'Harry Potter'})]

loader = FolderFacebookMessengerChatLoader(
    path="./hogwarts",
)

chat_sessions = loader.load()
len(chat_sessions)

3. 准备微调

调用 load() 会以人类消息的形式返回所有可提取的聊天消息。与聊天机器人对话时，对话通常遵循比真实对话更严格的交替对话模式。您可以选择合并消息”连续块”（来自同一发送者的连续消息），并选择一个发送者代表”AI”。经过微调的 LLM 将学习生成这些 AI 消息。

from langchain_community.chat_loaders.utils import (
    map_ai_messages,
    merge_chat_runs,
)

merged_sessions = merge_chat_runs(chat_sessions)
alternating_sessions = list(map_ai_messages(merged_sessions, "Harry Potter"))

# Now all of Harry Potter's messages will take the AIMessage class
# which maps to the 'assistant' role in OpenAI's training format
alternating_sessions[0]["messages"][:3]

[AIMessage(content="Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.", additional_kwargs={'sender': 'Harry Potter'}),
 HumanMessage(content="What is it, Potter? I'm quite busy at the moment.", additional_kwargs={'sender': 'Severus Snape'}),
 AIMessage(content="I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.", additional_kwargs={'sender': 'Harry Potter'})]

现在我们可以将其转换为 OpenAI 格式字典

from langchain_community.adapters.openai import convert_messages_for_finetuning

training_data = convert_messages_for_finetuning(alternating_sessions)
print(f"Prepared {len(training_data)} dialogues for training")

Prepared 9 dialogues for training

training_data[0][:3]

[{'role': 'assistant',
  'content': "Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately."},
 {'role': 'user',
  'content': "What is it, Potter? I'm quite busy at the moment."},
 {'role': 'assistant',
  'content': "I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister."}]

OpenAI 目前要求微调任务至少有 10 个训练示例，但建议大多数任务使用 50-100 个。由于我们只有 9 个聊天会话，我们可以将其分割（可选地设置部分重叠），使每个训练示例都包含完整对话的一部分。 Facebook 聊天会话（每人一个）通常跨越多天和多次对话，因此长范围依赖关系可能并不那么重要。

# Our chat is alternating, we will make each datapoint a group of 8 messages,
# with 2 messages overlapping
chunk_size = 8
overlap = 2

training_examples = [
    conversation_messages[i : i + chunk_size]
    for conversation_messages in training_data
    for i in range(0, len(conversation_messages) - chunk_size + 1, chunk_size - overlap)
]

len(training_examples)

4. 微调模型

现在是微调模型的时候了。请确保已安装 openai 并适当设置了 OPENAI_API_KEY。

pip install -qU  langchain-openai

import json
import time
from io import BytesIO

import openai

# We will write the jsonl file in memory
my_file = BytesIO()
for m in training_examples:
    my_file.write((json.dumps({"messages": m}) + "\n").encode("utf-8"))

my_file.seek(0)
training_file = openai.files.create(file=my_file, purpose="fine-tune")

# OpenAI audits each training file for compliance reasons.
# This make take a few minutes
status = openai.files.retrieve(training_file.id).status
start_time = time.time()
while status != "processed":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    status = openai.files.retrieve(training_file.id).status
print(f"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.")

File file-ULumAXLEFw3vB6bb9uy6DNVC ready after 0.00 seconds.

文件准备好后，是时候启动训练任务了。

job = openai.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

请耐心等待，模型准备可能需要一些时间！

status = openai.fine_tuning.jobs.retrieve(job.id).status
start_time = time.time()
while status != "succeeded":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    job = openai.fine_tuning.jobs.retrieve(job.id)
    status = job.status

Status=[running]... 874.29s. 56.93s

print(job.fine_tuned_model)

ft:gpt-3.5-turbo-0613:personal::8QnAzWMr

5. 在 LangChain 中使用

您可以直接在 ChatOpenAI 模型类中使用生成的模型 ID。

from langchain_openai import ChatOpenAI

model = ChatOpenAI(
    model=job.fine_tuned_model,
    temperature=1,
)

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
    ]
)

chain = prompt | model | StrOutputParser()

for tok in chain.stream({"input": "What classes are you taking?"}):
    print(tok, end="", flush=True)

I'm taking Charms, Defense Against the Dark Arts, Herbology, Potions, Transfiguration, and Ancient Runes. How about you?

在 GitHub 上编辑此页面或提交问题。

通过 MCP 将这些文档连接到 Claude、VSCode 等以获得实时解答。

Popular Providers

Integrations by component

1. 下载数据

2. 创建聊天加载器

3. 准备微调

现在我们可以将其转换为 OpenAI 格式字典

4. 微调模型

5. 在 LangChain 中使用

Popular Providers

Integrations by component

​1. 下载数据

​2. 创建聊天加载器

​3. 准备微调

​现在我们可以将其转换为 OpenAI 格式字典

​4. 微调模型

​5. 在 LangChain 中使用

1. 下载数据

2. 创建聊天加载器

3. 准备微调

现在我们可以将其转换为 OpenAI 格式字典

4. 微调模型

5. 在 LangChain 中使用