(ICASSP 2020) Paper reading: TRAINING ASR MODELS BY GENERATION OF CONTEXTUAL INFORMATION
Download link: https://arxiv.org/abs/1910.12367
Main idea:
Train an end-to-end model on a large amount of weakly supervised data combined with a smaller amount of conventionally labeled data. Here, "weakly supervised data" refers to audio that comes only with contextually related text rather than transcripts (English social media videos along with their respective titles and post text).
Model structure:
The paper introduces no architectural novelty: it uses a standard encoder-decoder model built around multi-head attention.
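As background, the multi-head attention that the model is built around can be sketched as below. This is a minimal NumPy illustration of the generic mechanism, not code from the paper; the learned query/key/value projection matrices are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention split across heads.

    q: (T_q, d_model); k, v: (T_k, d_model); num_heads must divide d_model.
    In a real layer q, k, v would first pass through learned projections.
    """
    t_q, d_model = q.shape
    d_head = d_model // num_heads
    # Split the model dimension into heads: (num_heads, T, d_head)
    qs = q.reshape(t_q, num_heads, d_head).transpose(1, 0, 2)
    ks = k.reshape(-1, num_heads, d_head).transpose(1, 0, 2)
    vs = v.reshape(-1, num_heads, d_head).transpose(1, 0, 2)
    # Per-head attention weights over the key positions: (H, T_q, T_k)
    scores = qs @ ks.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    # Weighted sum of values, then merge heads back: (T_q, d_model)
    out = weights @ vs
    return out.transpose(1, 0, 2).reshape(t_q, d_model)
```

In the decoder's cross-attention, the queries come from the decoder states and the keys/values from the encoder's acoustic features, which is why the burn-in phase described below focuses on this component.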
Model training:
Unlike conventional training, this paper trains the model in three phases.
(1) Burn-in: an initial supervised phase in which the decoder cross-attention learns to properly communicate gradient information to adjust the encoder's acoustic features.
(2) Train-main: a phase driven by a mixture of the supervised and the weakly supervised loss functions, in which the model expands its inventory of audio features and its mappings between acoustic and linguistic cues.
(3) Fine-tune: a final supervised-only phase that uses either the full encoder-decoder model trained in the train-main step, or just the encoder component, refined with the CTC loss.
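The three-phase schedule can be sketched as a per-phase loss combination. This is a hypothetical illustration of the structure described above; the phase names follow the paper, but the function name and the mixing weight `lam` are assumptions, not values from the paper.

```python
def phase_loss(phase, sup_loss, weak_loss, lam=0.5):
    """Combine supervised and weakly supervised losses for one training phase.

    sup_loss  : loss on conventionally labeled (transcribed) audio
    weak_loss : loss on audio paired only with contextual text (titles, posts)
    lam       : assumed mixing weight for the train-main phase
    """
    if phase == "burn-in":
        # Supervised only: warm up the decoder cross-attention.
        return sup_loss
    if phase == "train-main":
        # Mixture of the supervised and weakly supervised objectives.
        return lam * sup_loss + (1.0 - lam) * weak_loss
    if phase == "fine-tune":
        # Supervised only again (full seq2seq model, or encoder + CTC).
        return sup_loss
    raise ValueError(f"unknown phase: {phase}")
```

For example, `phase_loss("train-main", 2.0, 4.0)` returns 3.0 with the default equal mixing, while the burn-in and fine-tune phases simply pass the supervised loss through.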