This is a PhD thesis from Darwin College, University of Cambridge (author: Yongqiang Wang), 231 pages.
Model-based approaches are a powerful and flexible framework for robust speech recognition. This framework has been extensively investigated during the past decades and has been extended in a number of ways to handle distortions caused by various acoustic factors, including speaker differences, channel distortions and environment noise. This thesis investigates model-based approaches to robust speech recognition in diverse conditions and proposes two extensions to this framework. Many speech recognition applications will benefit from distant-talking speech capture, which avoids the problems caused by hand-held or body-worn equipment. However, due to the large speaker-to-microphone distance, both background noise and reverberant noise will significantly corrupt the speech signal and degrade speech recognition accuracy. This work proposes a new model-based scheme for applications in which only a single distant microphone is available. To compensate for the influence of previous speech frames on the current speech frame in reverberant environments, extended statistics are appended to the standard acoustic model to represent the distribution of a window of contextual clean feature vectors at the Gaussian component level. Given these statistics and the reverberant noise model parameters, the standard vector Taylor series (VTS) expansion is extended to compensate the acoustic model parameters for the effects of reverberation and background noise. A maximum likelihood (ML) estimation algorithm is also developed to estimate the reverberant noise model parameters. Adaptive training of the acoustic model parameters on data recorded in multiple reverberant environments is also proposed; this allows a consistent ML framework for estimating both the reverberant noise parameters and the acoustic model parameters. Experiments are performed on an artificially corrupted corpus and on a corpus recorded in real reverberant environments.
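The VTS compensation described above can be illustrated with a minimal sketch. This is not the thesis's extended reverberant formulation: it is the standard static-noise case for a single diagonal-covariance Gaussian in the log-spectral domain, ignoring the DCT, the channel term and the contextual frame window, and all names are illustrative.

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS compensation of a diagonal-covariance Gaussian.

    Mismatch function in the log-spectral domain (no channel term):
        y = log(exp(x) + exp(n))
    The corrupted-speech mean is the mismatch function evaluated at the
    clean and noise means; the variance combines both sources via the
    (diagonal) Jacobian of y with respect to x.
    """
    # Jacobian of y w.r.t. x, evaluated at the means; elementwise in (0, 1)
    G = np.exp(mu_x) / (np.exp(mu_x) + np.exp(mu_n))
    mu_y = np.log(np.exp(mu_x) + np.exp(mu_n))
    var_y = G**2 * var_x + (1.0 - G)**2 * var_n
    return mu_y, var_y
```

When the clean speech dominates (`mu_x >> mu_n`), `G` approaches 1 and the compensated Gaussian stays close to the clean one; when the noise dominates, the compensated statistics are pulled towards the noise model, which is the behaviour first-order VTS is meant to capture.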
The proposed model-based schemes are observed to significantly improve model robustness to reverberation for both clean-trained and adaptively trained acoustic models. As speech signals are usually affected by multiple acoustic factors simultaneously, another important aspect of the model-based framework is the ability to adapt canonical models to a target acoustic condition with multiple acoustic factors in a flexible manner. An acoustic factorisation framework has been proposed to factorise the variability caused by different acoustic factors; this is achieved by associating each acoustic factor with a distinct factor transform. In this way it enables factorised adaptation, which gives extra flexibility to model-based approaches. The second part of this thesis proposes several extensions to acoustic factorisation. It is first established that the key to acoustic factorisation is keeping the factor transforms independent of each other, and several approaches to constructing such independent factor transforms are discussed. The first is the widely used data-constrained approach, which relies solely on the adaptation data to achieve the independence attribute. The second, transform-constrained approach exploits partial knowledge of how acoustic factors affect the speech signal and relies on different forms of transform to achieve factorisation. Based on a mathematical analysis of the dependence between ML-estimated factor transforms, the third approach explicitly enforces the independence constraint, so it does not rely on balanced data or particular forms of transform. The transform-constrained and explicit independence-constrained factorisation approaches are applied to speaker and noise factorisation for speech recognition, yielding two flexible model-based schemes which can use speaker transforms estimated in one noise condition in other, unseen noise conditions.
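The transform-constrained idea, where independence is encouraged by giving each factor a different functional form, can be sketched as follows. This is an illustrative toy rather than the thesis's formulation: the speaker factor is modelled as an MLLR-style linear mean transform and the noise factor as a VTS-style log-domain combination, and all names are hypothetical.

```python
import numpy as np

def factorised_adapt(mu, A_spk, b_spk, mu_n):
    """Adapt a canonical Gaussian mean with two factor transforms in turn.

    Speaker factor: linear mean transform (MLLR-style), mu -> A mu + b.
    Noise factor:   log-domain combination with a noise mean (VTS-style).
    Because the two factors take different forms, a speaker transform
    estimated under one noise condition can be reused with a noise model
    estimated under a different, unseen condition.
    """
    mu_spk = A_spk @ mu + b_spk                    # speaker factor
    mu_y = np.log(np.exp(mu_spk) + np.exp(mu_n))   # noise factor
    return mu_y
```

Swapping in a different `mu_n` while keeping `A_spk` and `b_spk` fixed is the factorised-adaptation use case the abstract describes: only the noise factor is re-estimated for the new condition.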
Experimental results on artificially corrupted corpora demonstrate the flexibility of these schemes and also illustrate the importance of the independence attribute to factorisation.
- Introduction
- Speech recognition systems
- Acoustic model adaptation and robustness
- Robustness to reverberant environments
- The acoustic factorisation framework
- Speaker and noise factorisation
- Experiments on reverberation robustness
- Experiments on acoustic factorisation
- Conclusions
Appendix A Derivation of the mismatch function
Appendix B Implications of the max assumption
Appendix C ML estimation of fCAT