NLP Star Space Intelligent Chatbot Series: Transformer Model for Language Understanding - Subword Tokenizer


This article works through an advanced example: translating Portuguese to English with a Transformer.


Installing and setting up TensorFlow

import tensorflow_datasets as tfds
import tensorflow as tf

import time
import numpy as np
import matplotlib.pyplot as plt

Running this raises an error:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-4c94d8100fcf> in <module>
----> 1 import tensorflow_datasets as tfds
      2 import tensorflow as tf
      3 
      4 import time
      5 import numpy as np

ModuleNotFoundError: No module named 'tensorflow_datasets'

Installing tensorflow_datasets

(base) C:\Users\admin>activate my_star_space

(my_star_space) C:\Users\admin>pip install tensorflow-datasets
Collecting tensorflow-datasets
  Using cached tensorflow_datasets-4.4.0-py3-none-any.whl (4.0 MB)
Requirement already satisfied: dill in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.3.4)
Collecting tensorflow-metadata
  Downloading tensorflow_metadata-1.2.0-py3-none-any.whl (48 kB)
     |████████████████████████████████| 48 kB 21 kB/s
Requirement already satisfied: dataclasses in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.8)
Requirement already satisfied: importlib-resources in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (5.2.2)
Requirement already satisfied: promise in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.3)
Requirement already satisfied: tqdm in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (4.62.2)
Requirement already satisfied: attrs>=18.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (21.2.0)
Requirement already satisfied: requests>=2.19.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.26.0)
Requirement already satisfied: six in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.16.0)
Requirement already satisfied: future in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.18.2)
Requirement already satisfied: numpy in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.19.5)
Requirement already satisfied: absl-py in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.13.0)
Requirement already satisfied: typing-extensions in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.7.4.3)
Requirement already satisfied: protobuf>=3.12.2 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.17.3)
Requirement already satisfied: termcolor in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.1.0)
Requirement already satisfied: certifi>=2017.4.17 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2021.5.30)
Requirement already satisfied: idna<4,>=2.5 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (3.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (1.25.11)
Requirement already satisfied: charset-normalizer~=2.0.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2.0.4)
Requirement already satisfied: zipp>=3.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from importlib-resources->tensorflow-datasets) (3.5.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-metadata->tensorflow-datasets) (1.53.0)
Collecting absl-py
  Downloading absl_py-0.12.0-py3-none-any.whl (129 kB)
     |████████████████████████████████| 129 kB 14 kB/s
Requirement already satisfied: colorama in e:\anaconda3\envs\my_star_space\lib\site-packages (from tqdm->tensorflow-datasets) (0.4.4)
Installing collected packages: absl-py, tensorflow-metadata, tensorflow-datasets
  Attempting uninstall: absl-py
    Found existing installation: absl-py 0.13.0
    Uninstalling absl-py-0.13.0:
      Successfully uninstalled absl-py-0.13.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.6.0 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
Successfully installed absl-py-0.12.0 tensorflow-datasets-4.4.0 tensorflow-metadata-1.2.0
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the 'e:\anaconda3\envs\my_star_space\python.exe -m pip install --upgrade pip' command.
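Note: the conflict reported above comes from TensorFlow 2.6.0 pinning six~=1.15.0 while six 1.16.0 is installed. It did not block anything in this walkthrough; if it ever does, reinstalling six with pip install "six~=1.15.0" in the same environment should satisfy the constraint.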

(my_star_space) C:\Users\admin>pip install tensorflow-datasets
Requirement already satisfied: tensorflow-datasets in e:\anaconda3\envs\my_star_space\lib\site-packages (4.4.0)
Requirement already satisfied: promise in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.3)
Requirement already satisfied: future in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.18.2)
Requirement already satisfied: numpy in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.19.5)
Requirement already satisfied: absl-py in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.12.0)
Requirement already satisfied: termcolor in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.1.0)
Requirement already satisfied: six in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.16.0)
Requirement already satisfied: tensorflow-metadata in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.2.0)
Requirement already satisfied: dataclasses in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.8)
Requirement already satisfied: requests>=2.19.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.26.0)
Requirement already satisfied: importlib-resources in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (5.2.2)
Requirement already satisfied: typing-extensions in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.7.4.3)
Requirement already satisfied: protobuf>=3.12.2 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.17.3)
Requirement already satisfied: tqdm in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (4.62.2)
Requirement already satisfied: dill in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.3.4)
Requirement already satisfied: attrs>=18.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (21.2.0)
Requirement already satisfied: certifi>=2017.4.17 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2021.5.30)
Requirement already satisfied: charset-normalizer~=2.0.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (1.25.11)
Requirement already satisfied: idna<4,>=2.5 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (3.2)
Requirement already satisfied: zipp>=3.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from importlib-resources->tensorflow-datasets) (3.5.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-metadata->tensorflow-datasets) (1.53.0)
Requirement already satisfied: colorama in e:\anaconda3\envs\my_star_space\lib\site-packages (from tqdm->tensorflow-datasets) (0.4.4)
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the 'e:\anaconda3\envs\my_star_space\python.exe -m pip install --upgrade pip' command.

(my_star_space) C:\Users\admin>

Setting up the input pipeline

Use TFDS to load the Portuguese-English translation dataset from the TED Talks Open Translation Project. The dataset contains approximately 50,000 training examples, 1,100 validation examples, and 2,000 test examples.

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

The download can take a long time; the output is as follows:

  Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\admin\tensorflow_datasets\ted_hrlr_translate\pt_to_en\1.0.0...
Dl Completed...: 100%
1/1 [2:57:36<00:00, 10649.11s/ url]
Dl Size...: 100%
124/124 [2:57:36<00:00, 93.26s/ MiB]
Extraction completed...: 100%
1/1 [2:57:36<00:00, 10656.49s/ file]
Dataset ted_hrlr_translate downloaded and prepared to C:\Users\admin\tensorflow_datasets\ted_hrlr_translate\pt_to_en\1.0.0. Subsequent calls will reuse this data.
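
To confirm the splits loaded correctly, we can print a couple of raw sentence pairs. This quick check is not part of the original run, but it only uses the dataset objects created above:

for pt_example, en_example in train_examples.take(2):
  print(pt_example.numpy().decode('utf-8'))
  print(en_example.numpy().decode('utf-8'))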

The downloaded files are saved locally under C:\Users\admin\tensorflow_datasets\ted_hrlr_translate\pt_to_en\1.0.0.
The contents of dataset_info.json are:

{
  "citation": "@inproceedings{Ye2018WordEmbeddings,\n  author  = {Ye, Qi and Devendra, Sachan and Matthieu, Felix and Sarguna, Padmanabhan and Graham, Neubig},\n  title   = {When and Why are pre-trained word embeddings useful for Neural Machine Translation},\n  booktitle = {HLT-NAACL},\n  year    = {2018},\n  }",
  "configDescription": "Translation dataset from pt to en in plain text.",
  "configName": "pt_to_en",
  "description": "Data sets derived from TED talk transcripts for comparing similar language pairs\nwhere one is high resource and the other is low resource.",
  "downloadSize": "131005909",
  "fileFormat": "tfrecord",
  "location": {
    "urls": [
      "https://github.com/neulab/word-embeddings-for-nmt"
    ]
  },
  "moduleName": "tensorflow_datasets.translate.ted_hrlr",
  "name": "ted_hrlr_translate",
  "splits": [
    {
      "name": "train",
      "numBytes": "10806586",
      "shardLengths": [
        "51785"
      ]
    },
    {
      "name": "validation",
      "numBytes": "231285",
      "shardLengths": [
        "1193"
      ]
    },
    {
      "name": "test",
      "numBytes": "383883",
      "shardLengths": [
        "1803"
      ]
    }
  ],
  "supervisedKeys": {
    "input": "pt",
    "output": "en"
  },
  "version": "1.0.0"
}
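
The same information is also available programmatically through the metadata object (a tfds.core.DatasetInfo) returned by tfds.load above, so there is no need to parse the JSON by hand. A minimal sketch:

# Should match the shardLengths above: 51785, 1193 and 1803 examples.
print(metadata.splits['train'].num_examples)
print(metadata.splits['validation'].num_examples)
print(metadata.splits['test'].num_examples)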

The contents of features.json are:

{
    "type": "tensorflow_datasets.core.features.translation_feature.Translation",
    "content": {
        "languages": [
            "en",
            "pt"
        ]
    }
}

The test split itself is stored in TFRecord format as ted_hrlr_translate-test.tfrecord-00000-of-00001.
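
To peek at one serialized record, the file can be read directly with tf.data.TFRecordDataset. This is only a sketch; the path assumes the default TFDS data directory shown in the download log above:

path = (r'C:\Users\admin\tensorflow_datasets\ted_hrlr_translate\pt_to_en'
        r'\1.0.0\ted_hrlr_translate-test.tfrecord-00000-of-00001')
raw_dataset = tf.data.TFRecordDataset(path)
for raw_record in raw_dataset.take(1):
  example = tf.train.Example()           # parse the record as a tf.train.Example proto
  example.ParseFromString(raw_record.numpy())
  print(example)                         # shows the raw 'pt' and 'en' byte features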

Creating a custom subword tokenizer from the training dataset

tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)

tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

Running this script produces the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-c90f5c60daf2> in <module>
----> 1 tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
      2     (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
      3 
      4 tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
      5     (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'

The error shows this API has been moved; use tfds.deprecated.text instead:

tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)

tokenizer_pt = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

sample_string = 'Transformer is awesome.'

tokenized_string = tokenizer_en.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer_en.decode(tokenized_string)
print('The original string: {}'.format(original_string))

assert original_string == sample_string

The output:

Tokenized string is [7915, 1248, 7946, 7194, 13, 2799, 7877]
The original string: Transformer is awesome.
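
Each tokenizer also exposes its vocabulary size, which the next step uses as the start and end token IDs. Judging from the padded batch output later in this article, the sizes came out to 8214 (Portuguese) and 8087 (English) in this run:

print(tokenizer_pt.vocab_size)  # 8214 in this run
print(tokenizer_en.vocab_size)  # 8087 in this run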

If a word is not in the vocabulary, the tokenizer encodes the string by breaking it into subwords.

Let's decode each token individually to see this:

for ts in tokenized_string:
  print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))

Output:

7915 ----> T
1248 ----> ran
7946 ----> s
7194 ----> former 
13 ----> is 
2799 ----> awesome
7877 ----> .
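
Encoding a rarer word shows the subword fallback more clearly. The exact IDs and splits depend on the learned vocabulary, so this is only a sketch with a hypothetical word:

sample_word = 'unforgettable'  # hypothetical example, not from the original article
for ts in tokenizer_en.encode(sample_word):
  print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))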

Add start and end tokens to the input and target. Since the subword vocabulary only uses IDs 0 through vocab_size - 1, the IDs vocab_size and vocab_size + 1 are free to serve as the start and end tokens:

BUFFER_SIZE = 20000
BATCH_SIZE = 64

def encode(lang1, lang2):
  # Prepend the start token (vocab_size) and append the end token (vocab_size + 1)
  # to both the Portuguese input and the English target.
  lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(
      lang1.numpy()) + [tokenizer_pt.vocab_size+1]

  lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(
      lang2.numpy()) + [tokenizer_en.vocab_size+1]

  return lang1, lang2
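
Before wiring encode into the pipeline, it can be sanity-checked eagerly on a single raw pair. This check is not part of the original run; it relies on train_examples yielding eager string tensors:

for pt, en in train_examples.take(1):
  ids_pt, ids_en = encode(pt, en)
  print(ids_pt[0], ids_en[0])  # both should equal the respective vocab_size (start token)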

To keep this example small and relatively fast, drop examples longer than 40 tokens.

MAX_LENGTH = 40

def filter_max_length(x, y, max_length=MAX_LENGTH):
  return tf.logical_and(tf.size(x) <= max_length,
                        tf.size(y) <= max_length)

def tf_encode(pt, en):
  # tf.py_function wraps the eager encode() so it can run inside Dataset.map;
  # the resulting tensors have no static shape, so set_shape restores it.
  result_pt, result_en = tf.py_function(encode, [pt, en], [tf.int64, tf.int64])
  result_pt.set_shape([None])
  result_en.set_shape([None])

  return result_pt, result_en

train_dataset = train_examples.map(tf_encode)
train_dataset = train_dataset.filter(filter_max_length)
# Cache the dataset to memory to get a speedup while reading from it.
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)


val_dataset = val_examples.map(tf_encode)
val_dataset = val_dataset.filter(filter_max_length).padded_batch(BATCH_SIZE)
pt_batch, en_batch = next(iter(val_dataset))
pt_batch, en_batch

The output is shown below. Note that every sequence begins with the start-token ID (8214 for Portuguese, 8087 for English) and is zero-padded to the longest sequence in the batch:

(<tf.Tensor: shape=(64, 38), dtype=int64, numpy=
 array([[8214,  342, 3032, ...,    0,    0,    0],
        [8214,   95,  198, ...,    0,    0,    0],
        [8214, 4479, 7990, ...,    0,    0,    0],
        ...,
        [8214,  584,   12, ...,    0,    0,    0],
        [8214,   59, 1548, ...,    0,    0,    0],
        [8214,  118,   34, ...,    0,    0,    0]], dtype=int64)>,
 <tf.Tensor: shape=(64, 40), dtype=int64, numpy=
 array([[8087,   98,   25, ...,    0,    0,    0],
        [8087,   12,   20, ...,    0,    0,    0],
        [8087,   12, 5453, ...,    0,    0,    0],
        ...,
        [8087,   18, 2059, ...,    0,    0,    0],
        [8087,   16, 1436, ...,    0,    0,    0],
        [8087,   15,   57, ...,    0,    0,    0]], dtype=int64)>)

Appendix: final results

Portuguese translated to English:
[Screenshot: final Portuguese-to-English translation results]

References

https://tensorflow.google.cn/tutorials/text/transformer

Star Space Intelligent Chatbot series blog

Previous: Translation of "Optimizing Deeper Transformers on Small Datasets"

Next: Fixing the tensorflow_datasets import error in PyCharm