TensorFlow Datasets integration

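TensorFlow Datasets is a collection of datasets ready to use with TensorFlow or other Python machine learning frameworks, such as JAX. All datasets are exposed as tf.data.Dataset objects, enabling easy-to-use and high-performance input pipelines. To get started, see the guide and the list of datasets.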

This notebook shows how to load TensorFlow Datasets into the Document format for downstream use.

Installation

You need to install the tensorflow and tensorflow-datasets Python packages.

pip install -qU  tensorflow

pip install -qU  tensorflow-datasets

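Example

As an example, we use the mlqa/en dataset.

MLQA (Multilingual Question Answering Dataset) is a benchmark dataset for evaluating multilingual question answering performance. It covers 7 languages: Arabic, German, Spanish, English, Hindi, Vietnamese, and Chinese.

Homepage: github.com/facebookresearch/MLQA

Source code: tfds.datasets.mlqa.Builder

Download size: 72.21 MiB

# Feature structure of the `mlqa/en` dataset: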

FeaturesDict(
    {
        "answers": Sequence(
            {
                "answer_start": int32,
                "text": Text(shape=(), dtype=string),
            }
        ),
        "context": Text(shape=(), dtype=string),
        "id": string,
        "question": Text(shape=(), dtype=string),
        "title": Text(shape=(), dtype=string),
    }
)

import tensorflow as tf
import tensorflow_datasets as tfds

# Access the dataset directly:
ds = tfds.load("mlqa/en", split="test")
ds = ds.take(1)  # Take only a single example
ds

<_TakeDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>

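Now we need to create a custom function that converts a dataset sample into a Document. This is required because TensorFlow datasets have no standard format, so the conversion has to be written per dataset. Here the context field is used as Document.page_content, and the id, title, question, and first answer text are placed in Document.metadata.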

from langchain_core.documents import Document


def decode_to_str(item: tf.Tensor) -> str:
    return item.numpy().decode("utf-8")


def mlqaen_example_to_document(example: dict) -> Document:
    return Document(
        page_content=decode_to_str(example["context"]),
        metadata={
            "id": decode_to_str(example["id"]),
            "title": decode_to_str(example["title"]),
            "question": decode_to_str(example["question"]),
            "answer": decode_to_str(example["answers"]["text"][0]),
        },
    )


for example in ds:
    doc = mlqaen_example_to_document(example)
    print(doc)
    break
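
The conversion function above indexes the first entry of the answers list, so it would raise an error on a sample whose answer list is empty. As a minimal sketch of a more defensive variant, the following uses plain Python bytes and lists of bytes in place of tensors (roughly the kind of values tfds.as_numpy produces) so that it runs without TensorFlow or LangChain installed. SimpleDocument, to_text, and example_to_simple_document are hypothetical names introduced for illustration and are not part of either library, and the sample is made up rather than taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for langchain_core.documents.Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_text(value: Union[bytes, str]) -> str:
    """Decode UTF-8 bytes to str; pass str values through unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def example_to_simple_document(example: Dict[str, Any]) -> SimpleDocument:
    """Convert one decoded sample, tolerating an empty answer list."""
    answers = list(example["answers"]["text"])
    return SimpleDocument(
        page_content=to_text(example["context"]),
        metadata={
            "id": to_text(example["id"]),
            "title": to_text(example["title"]),
            "question": to_text(example["question"]),
            # Fall back to None instead of raising IndexError.
            "answer": to_text(answers[0]) if answers else None,
        },
    )


# Made-up sample shaped like the mlqa/en features (not real dataset content).
sample = {
    "answers": {"answer_start": [], "text": []},
    "context": b"An illustrative context passage.",
    "id": b"example-id",
    "question": b"An illustrative question?",
    "title": b"Example title",
}
doc = example_to_simple_document(sample)
print(doc.metadata["answer"])
```

Whether None is the right fallback depends on how the metadata is consumed downstream; an empty string or skipping the sample entirely are equally reasonable choices.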

from langchain_community.document_loaders import TensorflowDatasetLoader

loader = TensorflowDatasetLoader(
    dataset_name="mlqa/en",
    split_name="test",
    load_max_docs=3,
    sample_to_document_function=mlqaen_example_to_document,
)

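TensorflowDatasetLoader has the following parameters:

dataset_name: the name of the dataset to load
split_name: the name of the split to load; defaults to "train"
load_max_docs: the maximum number of documents to load; defaults to 100
sample_to_document_function: a function that converts a dataset sample to a Document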
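
To make the roles of load_max_docs and sample_to_document_function more concrete, here is a rough sketch of the behavior the parameter list describes: take at most load_max_docs samples and apply the conversion function to each one. This is not the loader's actual implementation. load_documents is a hypothetical helper, and plain dictionaries stand in for both dataset samples and documents.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional


def load_documents(
    samples: Iterable[Dict[str, Any]],
    sample_to_document: Callable[[Dict[str, Any]], Any],
    load_max_docs: Optional[int] = 100,
) -> List[Any]:
    """Convert at most load_max_docs samples (all of them if None)."""
    limited = islice(samples, load_max_docs)
    return [sample_to_document(sample) for sample in limited]


# Made-up samples; a real loader would read them from the chosen split.
samples = [{"context": f"passage {i}", "id": str(i)} for i in range(10)]

converted = load_documents(
    samples,
    sample_to_document=lambda s: {
        "page_content": s["context"],
        "metadata": {"id": s["id"]},
    },
    load_max_docs=3,
)
print(len(converted))  # 3
```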

docs = loader.load()
len(docs)

docs[0].page_content

docs[0].metadata

{'id': '5116f7cccdbf614d60bcd23498274ffd7b1e4ec7',
 'title': 'RMS Queen Mary 2',
 'question': 'What year did Queen Mary 2 complete her journey around South America?',
 'answer': '2006'}
