[Paper Notes] TinyBERT: Distilling BERT for Natural Language Understanding

To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages.


The Transformer is a fairly complex architecture, and more complex models (with more parameters) tend to carry more redundancy; however, naively cutting down the structure hurts accuracy considerably. The key point of this paper is a distillation method designed specifically for the Transformer.


There have been many model compression techniques (Han et al., 2016) proposed to accelerate deep model inference and reduce model size while maintaining accuracy. The most commonly used techniques include quantization (Gong et al., 2014), weights pruning (Han et al., 2015), and knowledge distillation (KD) (Romero et al., 2014).


  • Model quantization: a quantization algorithm compresses and decompresses the numerical values, thereby reducing model size and speeding up computation. Almost all quantization methods achieve compression, but not all of them achieve acceleration. Two conditions are needed for quantization to actually accelerate inference: the quantization algorithm must be simple enough not to introduce much extra computation, and the hardware must provide a suitable arithmetic library to exploit it. This makes quantization hard to use in practice, especially for someone like me who does not know much about hardware.
  • Model pruning: from my brief reading, pruning methods fall into structured and unstructured pruning. Structured pruning trims the model at a coarse, structural granularity, e.g., removing redundant convolution kernels from a convolutional layer (this seems to be the only use case I came across). Unstructured pruning works at the finer granularity of individual parameters: all parameters are scored, e.g., by their L1 or L2 norm, and a fixed fraction of the least important ones is set to zero, yielding a sparse model (the architecture and parameter count do not change; the weights merely become sparse). Compression can then be obtained via sparse-matrix factorization tricks, but acceleration requires hardware that supports sparse-matrix operations, so in practice it is less useful than structured pruning. A minimal sketch of this idea follows the list.
  • Knowledge distillation: training a student model to mimic a teacher model. It is worth noting that knowledge distillation and structured pruning are somewhat similar: both try to shrink the model structure in order to obtain a small, de-redundant model. Pruning has the advantage that it can be done layer by layer and during training, while distillation has the advantage of transferring more of the teacher's generalization behavior. The two may complement each other, or they may just be different roads to the same destination.
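
To make the unstructured case above concrete, here is a minimal sketch of global magnitude pruning in PyTorch. It is not from the TinyBERT paper; the helper name `magnitude_prune`, the pruning ratio, and the toy model are all placeholders of mine.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, ratio: float = 0.3) -> None:
    """Zero out the `ratio` fraction of weights with the smallest absolute values.

    Plain unstructured (parameter-level) pruning: the architecture and parameter
    count are unchanged; the weight tensors simply become sparse.
    """
    # Gather all weight magnitudes to compute a single global threshold.
    all_weights = torch.cat([p.detach().abs().flatten()
                             for name, p in model.named_parameters() if "weight" in name])
    threshold = torch.quantile(all_weights, ratio)

    with torch.no_grad():
        for name, param in model.named_parameters():
            if "weight" in name:
                param.mul_((param.abs() > threshold).float())

# Toy usage: prune 30% of the weights of a small feed-forward network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
magnitude_prune(model, ratio=0.3)
```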

The pre-training-then-fine-tuning paradigm firstly pretrains BERT on a large-scale unsupervised text corpus, then fine-tunes it on a task-specific dataset, which greatly increases the difficulty of BERT distillation. Therefore, it is required to design an effective KD strategy for both training stages.


Training a pre-trained model has two stages: the unsupervised pre-training stage and the downstream fine-tuning stage. The distillation strategies for the two stages differ, and this issue should be specific to distilling pre-trained models.


Specifically, we design three types of loss functions to fit different representations from BERT layers:

  1. the output of the embedding layer;
  2. the hidden states and attention matrices derived from the Transformer layer;
  3. the logits output by the prediction layer.

When distillation was first proposed, the student's goal was to imitate the teacher's input-to-output mapping. As models have grown more complex, the jump from input to output has become too large, so this paper proposes fitting several intermediate layers as well. This gives the student model more information to learn from, but it also makes the method harder to implement.


We propose a novel two-stage learning framework including the general distillation and the task-specific distillation, as illustrated in Figure 1.
[Figure 1: the two-stage learning framework of TinyBERT, consisting of general distillation and task-specific distillation]


Assuming that the student model has M Transformer layers and the teacher model has N Transformer layers, we start with choosing M out of N layers from the teacher model for the Transformer-layer distillation. Then a function n = g(m) is defined as the mapping function between indices from student layers to teacher layers, which means that the m-th layer of the student model learns the information from the g(m)-th layer of the teacher model.
Formally, the student can acquire knowledge from the teacher by minimizing the following objective:
$$\mathcal{L}_{\text{model}} = \sum_{x \in \mathcal{X}} \sum_{m=0}^{M+1} \lambda_m \, \mathcal{L}_{\text{layer}}\big(f_m^S(x),\, f_{g(m)}^T(x)\big)$$
where $\mathcal{L}_{\text{layer}}$ refers to the loss function of a given model layer (e.g., Transformer layer or embedding layer), $f_m(x)$ denotes the behavior function induced from the $m$-th layer, and $\lambda_m$ is the hyper-parameter that represents the importance of the $m$-th layer's distillation.


Because the number of layers is reduced and only a few intermediate nodes are chosen for learning, there is inevitably a choice to make: how to map student layers to teacher layers. Which mapping is optimal has to be verified experimentally. The final distillation loss is the sum over all the intermediate nodes. In the experimental part of the paper, the authors use a uniform mapping, compressing 12 layers into 4, i.e., one distillation node every 3 layers.
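
As a minimal sketch of this layer-wise objective (covering only the Transformer-layer terms m = 1..M; the embedding and prediction terms are added analogously), assuming the per-layer features have already been extracted and that the per-layer loss handles any dimension mismatch itself. The function and variable names are mine, not from the released code.

```python
from typing import Callable, List
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_feats: List[torch.Tensor],
                           teacher_feats: List[torch.Tensor],
                           layer_loss: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
                           g: Callable[[int], int],
                           lambdas: List[float]) -> torch.Tensor:
    """Weighted sum of per-layer losses: student layer m learns from teacher layer g(m)."""
    total = torch.zeros(())
    for m, lam in enumerate(lambdas, start=1):      # m = 1 .. M (student Transformer layers)
        total = total + lam * layer_loss(student_feats[m - 1], teacher_feats[g(m) - 1])
    return total

# Toy usage: M = 4 student layers, N = 12 teacher layers, uniform mapping g(m) = 3m,
# with a plain MSE as the per-layer loss and equal layer weights.
M, d = 4, 16
s_feats = [torch.randn(2, 8, d) for _ in range(M)]
t_feats = [torch.randn(2, 8, d) for _ in range(12)]
loss = layerwise_distill_loss(s_feats, t_feats, layer_loss=F.mse_loss,
                              g=lambda m: 3 * m, lambdas=[1.0] * M)
```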


The proposed Transformer-layer distillation includes the attention based distillation and hidden states based distillation, which is shown in Figure 2.
[Figure 2: Transformer-layer distillation, consisting of attention-based distillation and hidden-states-based distillation]
The student learns to fit the matrices of multi-head attention in the teacher network, and the objective is defined as:
$$\mathcal{L}_{\text{attn}} = \frac{1}{h} \sum_{i=1}^{h} \text{MSE}\big(A_i^S, A_i^T\big)$$
where $h$ is the number of attention heads, $A_i \in \mathbb{R}^{l \times l}$ refers to the attention matrix corresponding to the $i$-th head of the teacher or student, $l$ is the input text length, and $\text{MSE}(\cdot)$ means the mean squared error loss function.


A point the paper stresses repeatedly is that BERT's attention matrices carry a great deal of linguistic knowledge and are therefore very important, so they are used as one of the distillation nodes, with a mean-squared-error loss. The paper also emphasizes that no softmax is applied to the matrices, i.e., the unnormalized attention scores are matched (reported to converge faster).
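
A minimal PyTorch sketch of this attention term, assuming the per-head attention scores (before softmax, following the note above) have already been pulled out of both models; the tensor shapes and names are my own.

```python
import torch
import torch.nn.functional as F

def attention_distill_loss(attn_student: torch.Tensor,
                           attn_teacher: torch.Tensor) -> torch.Tensor:
    """MSE between student and teacher attention matrices, averaged over heads.

    Both tensors have shape (batch, h, l, l): h attention heads, sequence length l.
    The mean over all elements equals (1/h) * sum_i MSE(A_i^S, A_i^T).
    """
    return F.mse_loss(attn_student, attn_teacher)

# Toy usage: batch of 2, 4 heads, sequence length 16.
a_s = torch.randn(2, 4, 16, 16)
a_t = torch.randn(2, 4, 16, 16)
loss = attention_distill_loss(a_s, a_t)
```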

In addition to the attention based distillation, we also distill the knowledge from the output of Transformer layer, and the objective is as follows:
$$\mathcal{L}_{\text{hidn}} = \text{MSE}\big(H^S W_h,\, H^T\big)$$
where the matrices $H^S \in \mathbb{R}^{l \times d'}$ and $H^T \in \mathbb{R}^{l \times d}$ refer to the hidden states of the student and teacher networks respectively, which are calculated by Equation 4. The scalar values $d$ and $d'$ denote the hidden sizes of the teacher and student models, and $d'$ is often smaller than $d$ to obtain a smaller student network.
The matrix $W_h \in \mathbb{R}^{d' \times d}$ is a learnable linear transformation, which transforms the hidden states of the student network into the same space as the teacher network's states.


A major difficulty of hidden-state distillation is that the hidden dimensions of the teacher and student may differ. The authors solve this by introducing a learnable matrix $W_h$ that projects the student's hidden states into the teacher's space.
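
A minimal sketch of this hidden-state term with the learnable projection, assuming the TinyBERT4/BERT-base sizes d' = 312 and d = 768; the class name is a placeholder of mine. The embedding-layer term in the next quote works the same way with its own projection $W_e$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateDistill(nn.Module):
    """MSE(H_S @ W_h, H_T) with a learnable projection from d' (student) to d (teacher)."""

    def __init__(self, d_student: int = 312, d_teacher: int = 768):
        super().__init__()
        # W_h in R^{d' x d}; a bias-free linear layer plays the role of the projection.
        self.proj = nn.Linear(d_student, d_teacher, bias=False)

    def forward(self, h_student: torch.Tensor, h_teacher: torch.Tensor) -> torch.Tensor:
        # h_student: (batch, l, d'), h_teacher: (batch, l, d)
        return F.mse_loss(self.proj(h_student), h_teacher)

# Toy usage: batch of 2, sequence length 16.
distill = HiddenStateDistill()
loss = distill(torch.randn(2, 16, 312), torch.randn(2, 16, 768))
```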


Similar to the hidden states based distillation, we also perform embedding-layer distillation and the objective is:
$$\mathcal{L}_{\text{embd}} = \text{MSE}\big(E^S W_e,\, E^T\big)$$
where the matrices $E^S$ and $E^T$ refer to the embeddings of the student and teacher networks, respectively. In this paper, they have the same shape as the hidden state matrices. The matrix $W_e$ is a linear transformation playing a similar role as $W_h$.


In BERT, the embedding dimension equals the hidden-state dimension, so this distillation node is handled in the same way as the hidden-state one.


In addition to imitating the behaviors of intermediate layers, we also use the knowledge distillation to fit the predictions of teacher model as in Hinton et al. (2015).
Specifically, we penalize the soft cross-entropy loss between the student network’s logits against the teacher’s logits:
$$\mathcal{L}_{\text{pred}} = \text{CE}\big(z^T / t,\, z^S / t\big)$$
where $z^S$ and $z^T$ are the logits vectors predicted by the student and teacher respectively, $\text{CE}$ means the cross entropy loss, and $t$ means the temperature value. In our experiment, we find that $t = 1$ performs well.


This is simply distilling the classifier's logits, the most standard form of distillation. The "soft cross-entropy" mentioned in the paper was a bit confusing at first; it just means the cross entropy is computed against the teacher's softened softmax distribution (soft labels) rather than against one-hot ground-truth labels.
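
A minimal sketch of this soft cross-entropy as read above: both logit vectors are divided by the temperature, the teacher side becomes a probability distribution and the student side log-probabilities; names are mine.

```python
import torch
import torch.nn.functional as F

def prediction_distill_loss(z_student: torch.Tensor,
                            z_teacher: torch.Tensor,
                            t: float = 1.0) -> torch.Tensor:
    """Soft cross-entropy CE(z_T / t, z_S / t) between teacher and student logits."""
    p_teacher = F.softmax(z_teacher / t, dim=-1)          # soft labels from the teacher
    log_p_student = F.log_softmax(z_student / t, dim=-1)  # student log-probabilities
    return -(p_teacher * log_p_student).sum(dim=-1).mean()

# Toy usage: batch of 8, 3-way classification, temperature t = 1 as in the paper.
loss = prediction_distill_loss(torch.randn(8, 3), torch.randn(8, 3), t=1.0)
```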


[Algorithm: data augmentation procedure for task-specific distillation]
For the task-specific (fine-tuning stage) distillation, the authors also perform data augmentation. The method, outlined above, uses BERT and GloVe to replace words with similar words.
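
A heavily simplified, hypothetical sketch of word-level augmentation in this spirit: each token is swapped for its nearest neighbor under pre-loaded GloVe vectors with some probability. The `glove` dictionary, the replacement probability, and the neighbor selection are my own assumptions; the paper's actual procedure additionally uses BERT's masked-language-model predictions for single-piece words and has its own hyper-parameters.

```python
import random
from typing import Dict, List
import numpy as np

def nearest_neighbor(word: str, glove: Dict[str, np.ndarray]) -> str:
    """Return the most similar other word under cosine similarity (brute force)."""
    if word not in glove:
        return word
    v = glove[word]
    best, best_sim = word, -1.0
    for cand, u in glove.items():
        if cand == word:
            continue
        sim = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-8))
        if sim > best_sim:
            best, best_sim = cand, sim
    return best

def augment(tokens: List[str], glove: Dict[str, np.ndarray], p: float = 0.4) -> List[str]:
    """Replace each token by its GloVe nearest neighbor with probability p."""
    return [nearest_neighbor(t, glove) if random.random() < p else t for t in tokens]
```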


Experimental results

[Table: main evaluation results of TinyBERT]


The proposed two-stage TinyBERT learning framework consists of three key procedures: GD (General Distillation), TD (Task-specific Distillation) and DA (Data Augmentation). The performances of removing each individual learning procedure are analyzed and presented in Table 2.
[Table 2: ablation study of the different learning procedures]
The results indicate that all of the three procedures are crucial for the proposed method. The TD and DA have comparable effects in all the four tasks. We note that the task-specific procedures (TD and DA) are more helpful than the pre-training procedure (GD) on all of the tasks.


The experiments verify the importance of all three procedures; notably, data augmentation turns out to be quite critical.


We investigate the effects of distillation objectives on the TinyBERT learning. Several baselines are proposed including the learning without the Transformer-layer distillation (w/o Trm), the embedding-layer distillation (w/o Emb) or the prediction-layer distillation (w/o Pred) respectively. The results are illustrated in Table 3 and show that all the proposed distillation objectives are useful.
[Table 3: ablation study of the different distillation objectives]


All of the distillation objectives are effective, and the most important is the distillation of the intermediate representations: the working matters, the final answer less so.


We also investigate the effects of different mapping functions n = g(m) on the TinyBERT learning. Our original TinyBERT as described in section 4.2 uses the uniform strategy, and we compare with two typical baselines including top-strategy (g(m) = m + N − M; 0 < m ≤ M) and bottom-strategy (g(m) = m; 0 < m ≤ M). The comparison results are presented in Table 4.
[Table 4: results of the different layer mapping strategies]
We find that the top-strategy performs better than the bottom-strategy on MNLI, while being worse on MRPC and CoLA, which confirms the observations that different tasks depend on the knowledge from different BERT layers. The uniform strategy covers the knowledge from bottom to top layers of BERTBASE, and it achieves better performances than the other two baselines in all the tasks.


This result shows that different tasks rely on knowledge from different depths, i.e., the knowledge captured by each Transformer layer differs. For most tasks, the uniform strategy is the better choice because it covers knowledge from the bottom layers all the way to the top. The small sketch below makes the three mapping strategies concrete.
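
A small sketch of the three mapping strategies for M = 4 student layers and N = 12 teacher layers, directly following the formulas quoted above; the function names are mine.

```python
def uniform_mapping(m: int, M: int = 4, N: int = 12) -> int:
    """Uniform strategy: g(m) = m * (N / M), one distillation node every N/M teacher layers."""
    return m * (N // M)

def top_mapping(m: int, M: int = 4, N: int = 12) -> int:
    """Top strategy: g(m) = m + N - M, i.e. distill only from the last M teacher layers."""
    return m + N - M

def bottom_mapping(m: int, M: int = 4, N: int = 12) -> int:
    """Bottom strategy: g(m) = m, i.e. distill only from the first M teacher layers."""
    return m

for name, g in [("uniform", uniform_mapping), ("top", top_mapping), ("bottom", bottom_mapping)]:
    print(name, [g(m) for m in range(1, 5)])
# uniform [3, 6, 9, 12] | top [9, 10, 11, 12] | bottom [1, 2, 3, 4]
```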
