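In 2019, INTERSPEECH, the 20th annual conference of the International Speech Communication Association, will be held from September 15 to 19 in Graz, Austria. Interspeech is the world's largest and most comprehensive conference on speech, with nearly 2,000 leading practitioners and researchers from industry and academia taking part in keynote talks, tutorials, paper presentations, and the main exhibition. Eight papers from Alibaba were accepted this year. This article covers the paper "Multi-Task Multi-Network Joint-Learning of Deep Residual Networks and Cycle-Consistency Generative Adversarial Networks for Robust Speech Recognition" by Shengkui Zhao, Chongjia Ni, Rong Tong, and Bin Ma.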
Paper Interpretation
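Automatic speech recognition (ASR) systems have a wide range of real-world applications, but noise and reverberation in the surrounding environment often make recognition results erroneous and unstable. Improving the robustness of ASR systems is therefore a key issue for their wider adoption. To address it, adding speech enhancement modules and model adaptation training have been studied for a long time. More recently, multi-task joint-learning schemes that train noise reduction and speech recognition together in a unified modeling framework have shown encouraging progress, but model training still relies heavily on paired clean and noisy data. To overcome this limitation, researchers have begun introducing generative adversarial networks (GANs) and adversarial training into acoustic model training. Because neither a complex front-end design nor paired training data is required, this greatly simplifies the training process and its requirements. Although GANs have developed rapidly in computer vision, only regular GANs have so far been introduced for robust ASR, with limited model training experiments, and regular GANs suffer from mode collapse, which often causes training to fail.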
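In this work, we adopt the more advanced cycle-consistency GAN (CycleGAN) to address the training failures caused by mode collapse in regular GANs. In addition, by combining it with the recently popular deep residual networks (ResNets), we extend the multi-task learning scheme to a multi-task multi-network joint-learning scheme that achieves stronger noise reduction and model adaptation.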
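The cycle-consistency idea can be illustrated with a minimal sketch. It follows the standard CycleGAN formulation (an L1 penalty on the round-trip reconstruction in both directions) and is not taken from the paper's implementation. The function names and the toy "generators" below are hypothetical stand-ins for the real networks.

```python
from typing import Callable, List, Sequence

FeatureMap = Callable[[Sequence[float]], List[float]]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(
    noisy_batch: Sequence[Sequence[float]],
    clean_batch: Sequence[Sequence[float]],
    noisy_to_clean: FeatureMap,
    clean_to_noisy: FeatureMap,
) -> float:
    """L1 cycle loss: noisy -> clean -> noisy plus clean -> noisy -> clean.

    The two batches do not need to be paired: each sample is compared
    only with its own round-trip reconstruction.
    """
    forward = sum(
        l1_distance(clean_to_noisy(noisy_to_clean(x)), x) for x in noisy_batch
    ) / len(noisy_batch)
    backward = sum(
        l1_distance(noisy_to_clean(clean_to_noisy(y)), y) for y in clean_batch
    ) / len(clean_batch)
    return forward + backward


# Toy stand-ins for the two generators (hypothetical, not the paper's networks).
def toy_denoise(features: Sequence[float]) -> List[float]:
    return [v - 0.5 for v in features]


def toy_add_noise(features: Sequence[float]) -> List[float]:
    return [v + 0.5 for v in features]


def toy_collapsed(features: Sequence[float]) -> List[float]:
    # A "collapsed" generator that ignores its input entirely.
    return [0.0 for _ in features]


noisy = [[1.5, 2.5], [0.5, 3.5]]
clean = [[1.0, 2.0]]  # unpaired with the noisy batch

consistent_loss = cycle_consistency_loss(noisy, clean, toy_denoise, toy_add_noise)
collapsed_loss = cycle_consistency_loss(noisy, clean, toy_collapsed, toy_add_noise)
print(consistent_loss, collapsed_loss)
```

In this toy setting, the generator that maps every input to the same output cannot be inverted, so it receives a larger cycle loss than the invertible pair. This is the usual intuition for why a cycle-consistency term discourages mode collapse.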
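Experimental results on single-channel ASR with CHiME-4 show that, compared with the state-of-the-art joint-learning approach (B), our proposed method significantly improves the noise robustness of the ASR system by achieving a lower word error rate (WER).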
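For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below uses that common definition; the function names and the example sentences are illustrative assumptions, not the scoring tool or data used in the paper.

```python
from typing import Sequence


def edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Made-up example: one deleted word ("the") and one substitution ("lights" -> "light").
wer = word_error_rate("turn the kitchen lights off", "turn kitchen light off")
print(f"WER = {wer:.2f}")
```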
Built on the cycle-consistency GAN, our proposed multi-task multi-network joint-learning scheme handles the mode collapse problem well.
Abstract
Robustness of automatic speech recognition (ASR) systems is a critical issue due to noise and reverberations. Speech enhancement and model adaptation have been studied for a long time to address this issue. Recently, the developments of multi-task joint-learning schemes that address noise reduction and ASR criteria in a unified modeling framework show promising improvements, but the model training highly relies on paired clean-noisy data. To overcome this limit, the generative adversarial networks (GANs) and the adversarial training method are deployed, which have greatly simplified the model training process without the requirements of complex front-end design and paired training data. Despite the fast developments of GANs for computer vision, only regular GANs have been adopted for robust ASR. In this work, we adopt a more advanced cycle-consistency GAN (CycleGAN) to address the training failure problem due to mode collapse of regular GANs. Using deep residual networks (ResNets), we further expand the multi-task scheme to a multi-task multi-network joint-learning scheme for more robust noise reduction and model adaptation. Experiment results on CHiME-4 show that our proposed approach significantly improves the noise robustness of the ASR system by achieving much lower word error rates (WERs) than the state-of-the-art joint-learning approaches.
Index Terms: Robust speech recognition, convolutional neural networks, acoustic model, generative adversarial networks
Compiled by the Alibaba Cloud Developer Community