Split by tokens - Text splitter integration guide

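Language models have a token limit, and you should not exceed it. When you split text into chunks, it is therefore a good idea to count the tokens in each chunk. Many tokenizers exist, so when you count the tokens in your text, use the same tokenizer that the language model uses.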

js-tiktoken

js-tiktoken is a JavaScript version of the BPE tokenizer created by OpenAI.

We can use tiktoken, through TokenTextSplitter, to estimate the number of tokens used. The estimate will probably be more accurate for OpenAI models.

How the text is split: by the character passed in.
How the chunk size is measured: by the tiktoken tokenizer.
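
To illustrate how a chunk size and an overlap interact when both are measured in tokens, here is a minimal sketch. It is not the implementation used by @langchain/textsplitters: `toyTokenize` and `splitByTokenWindow` are hypothetical names, and the tokenizer is a whitespace stand-in (an assumption) rather than a real BPE tokenizer.

```typescript
// Toy tokenizer (assumption): one token per whitespace-separated word.
// A real BPE tokenizer such as js-tiktoken would produce different tokens.
function toyTokenize(text: string): string[] {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

// Slide a window of `chunkSize` tokens over the text, advancing by
// `chunkSize - chunkOverlap` tokens so consecutive chunks share the overlap.
function splitByTokenWindow(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const tokens = toyTokenize(text);
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    // Stop once the current window has reached the end of the token list.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

const sampleText =
  "Last year COVID-19 kept us apart. This year we are finally together again.";
const demoChunks = splitByTokenWindow(sampleText, 5, 1);
console.log(demoChunks);
```

With an overlap of 1, the last token of each chunk reappears as the first token of the next one, which helps keep context across chunk boundaries at the cost of some repeated text.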

npm install @langchain/textsplitters

import { TokenTextSplitter } from "@langchain/textsplitters";
import { readFileSync } from "fs";

// Example: read a long document
const stateOfTheUnion = readFileSync("state_of_the_union.txt", "utf8");

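To split with TokenTextSplitter and then merge the chunks with tiktoken, pass an encodingName (for example, cl100k_base) when initializing the TokenTextSplitter. Note that splits from this method can be larger than the chunk size measured by the tiktoken tokenizer.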

import { TokenTextSplitter } from "@langchain/textsplitters";

// Example: use the cl100k_base encoding
const splitter = new TokenTextSplitter({ encodingName: "cl100k_base", chunkSize: 10, chunkOverlap: 0 });

// splitText is asynchronous and returns a Promise, so await the result
const texts = await splitter.splitText(stateOfTheUnion);
console.log(texts[0]);

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.
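
Because splits can come out larger than the configured chunk size, it can be worth checking the resulting chunks against the token budget. The following is a minimal sketch of such a check. `findOversizedChunks`, `TokenCounter`, and `countWords` are hypothetical names that are not part of any library, and the word-based counter is only a stand-in (an assumption) for a real tokenizer.

```typescript
// A function that returns the number of tokens in a string.
type TokenCounter = (text: string) => number;

// Stand-in counter (assumption): counts whitespace-separated words.
// Swap in a counter backed by the model's own tokenizer for real use.
const countWords: TokenCounter = (text) =>
  text.split(/\s+/).filter((t) => t.length > 0).length;

interface OversizedChunk {
  index: number; // position of the chunk in the input array
  tokens: number; // token count reported by the counter
}

// Return every chunk whose token count exceeds `maxTokens`.
function findOversizedChunks(
  chunks: string[],
  countTokens: TokenCounter,
  maxTokens: number
): OversizedChunk[] {
  const oversized: OversizedChunk[] = [];
  chunks.forEach((chunk, index) => {
    const tokens = countTokens(chunk);
    if (tokens > maxTokens) oversized.push({ index, tokens });
  });
  return oversized;
}

const exampleChunks = [
  "Madam Speaker, Madam Vice President,",
  "our First Lady and Second Gentleman. Members of Congress and the Cabinet.",
];
const oversizedReport = findOversizedChunks(exampleChunks, countWords, 10);
console.log(oversizedReport);
```

In practice, the counter would wrap the tokenizer the target model uses. With js-tiktoken, for instance, that could be the length of the array returned by `getEncoding("cl100k_base").encode(text)`.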
