Speech Recognition: A First Look

2023-11-06 08:38:46

ASRT
https://blog.ailemon.net/2018/08/29/asrt-a-chinese-speech-recognition-system/
ASR (Automatic Speech Recognition) &&&&&&&&&& PaddleSpeech
Datasets involved: AISHELL, WenetSpeech, LibriSpeech…
Methods involved:
① DeepSpeech2: End-to-End Speech Recognition in English and Mandarin;
② U2: Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition;
&&&&&&&&&&&&&&&
Conformer, Transformer, chunk-conformer
① SpeedySpeech: Efficient Neural Speech Synthesis (conformer);
&&&&&&&&&&&&&&&
The decoding methods also include Attention, … and so on.
Different decoding methods give different Character Error Rates (CER).
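Since CER is the metric used to compare decoding methods, here is a minimal sketch of how it is commonly computed: the character-level edit distance between the reference and the hypothesis, divided by the reference length. The function names are my own, and stripping spaces before scoring is an assumption; real toolkits may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length."""
    # Assumption: spaces are ignored when scoring characters.
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of six.
    print(cer("今天天气很好", "今天天气真好"))
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.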

About End to End:
E2E models combine the acoustic, pronunciation and language models into a single neural network, showing competitive results compared to conventional ASR systems.
There are mainly three popular E2E approaches, namely CTC, recurrent neural network transducer (RNN-T) and attention based encoder-decoder (AED).

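The model has three parts: a shared Encoder, a CTC decoder, and an Attention decoder.

The shared Encoder consists of multiple transformer or conformer layers;
The CTC decoder is one fully connected layer followed by a softmax layer;
The Attention decoder consists of multiple transformer layers.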
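To make the CTC branch more concrete: after the softmax, each frame has a probability distribution over the vocabulary plus a blank symbol. The simplest way to read out a result is greedy decoding (take the best label per frame, merge repeats, drop blanks). This is only a sketch under my own assumptions: blank has id 0, and the input is a plain list of per-frame probabilities. It is not necessarily the decoding used by the models above.

```python
def ctc_greedy_decode(frame_probs, blank_id=0):
    """Greedy CTC decoding over per-frame probability lists.

    frame_probs: list of frames, each a list of probabilities over the vocabulary.
    blank_id: index of the CTC blank symbol (assumed to be 0 here).
    """
    # Step 1: pick the most likely label for every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    # Step 2: merge consecutive repeats, then remove blanks.
    decoded = []
    prev = None
    for label in best_path:
        if label != prev and label != blank_id:
            decoded.append(label)
        prev = label
    return decoded


if __name__ == "__main__":
    # Toy softmax output, vocabulary = {0: blank, 1, 2}.
    probs = [
        [0.1, 0.8, 0.1],    # 1
        [0.2, 0.7, 0.1],    # 1 (repeat, merged)
        [0.9, 0.05, 0.05],  # blank
        [0.1, 0.6, 0.3],    # 1 (new token, separated by blank)
        [0.1, 0.2, 0.7],    # 2
        [0.1, 0.1, 0.8],    # 2 (repeat, merged)
        [0.8, 0.1, 0.1],    # blank
    ]
    print(ctc_greedy_decode(probs))
```

The blank between the two runs of label 1 is what lets CTC output the same token twice in a row.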

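The training loss has two parts: the CTC loss and the AED loss. In the combined loss formula, x is the input acoustic feature sequence and y is the transcript label sequence; the first term is the CTC loss and the second term is the AED loss.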
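For reference, the combined loss in the U2 paper is, to my understanding, a weighted sum of the two terms. The weight λ is a hyperparameter that balances them; it is not mentioned in the text above, so treat the exact form here as my reconstruction rather than a quote.

```latex
L_{\text{combined}}(x, y) = \lambda \, L_{\text{CTC}}(x, y) + (1 - \lambda) \, L_{\text{AED}}(x, y)
```

With λ = 1 the model trains on CTC alone, and with λ = 0 on the attention decoder alone.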

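To support streaming speech recognition, the paper proposes Dynamic Chunk Training. For the model to work in streaming mode, the shared Encoder must be restricted in how much future information it can see.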

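In the figure, (a) is standard self attention, where every input time step t depends on the whole utterance. The simplest streaming fix is to let time step t see only history and no future information, as in (b), but this hurts recognition accuracy badly. Another common approach is to let time step t see a limited amount of future information (for example, the next C frames), as in (c).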
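The three cases (a), (b), (c) differ only in which positions each frame is allowed to attend to, so they can be written as boolean masks. Below is a small sketch in plain Python; the function name, the mode strings, and the `lookahead` parameter (standing in for C) are my own naming, not taken from any toolkit.

```python
def attention_mask(num_frames, mode="full", lookahead=0):
    """Return mask[t][j] == True if frame t may attend to frame j.

    mode="full":      (a) every frame sees the whole utterance
    mode="causal":    (b) frame t sees only frames 0..t
    mode="lookahead": (c) frame t sees frames 0..t+lookahead
    """
    mask = []
    for t in range(num_frames):
        if mode == "full":
            last_visible = num_frames - 1
        elif mode == "causal":
            last_visible = t
        elif mode == "lookahead":
            last_visible = min(t + lookahead, num_frames - 1)
        else:
            raise ValueError(f"unknown mode: {mode}")
        mask.append([j <= last_visible for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    for mode in ("full", "causal", "lookahead"):
        print(mode)
        for row in attention_mask(4, mode, lookahead=1):
            print("".join("1" if v else "0" for v in row))
```

In a real model the mask would be applied to the attention scores before the softmax, so masked positions get zero weight.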

During training, the chunk size can be fixed or adjusted dynamically.
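A rough sketch of what fixed versus dynamic chunk size could look like: frames are grouped into chunks, each frame sees all earlier chunks plus its own chunk, and for dynamic training a new chunk size is drawn for each batch. The uniform sampling and the `max_chunk=25` cap are my assumptions for illustration; the actual sampling strategy in the paper or in a given toolkit may differ.

```python
import random


def chunk_attention_mask(num_frames, chunk_size):
    """mask[t][j] is True if frame t may attend to frame j.

    Frame t belongs to chunk t // chunk_size and sees every frame
    up to the end of its own chunk.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    mask = []
    for t in range(num_frames):
        chunk_end = min((t // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, max_chunk=25, rng=random):
    """Draw a chunk size for one training batch (assumed uniform sampling)."""
    return rng.randint(1, min(max_chunk, num_frames))


if __name__ == "__main__":
    # Fixed chunk size.
    for row in chunk_attention_mask(6, chunk_size=2):
        print("".join("1" if v else "0" for v in row))

    # Dynamic chunk size: a different value may be drawn for each batch.
    rng = random.Random(0)
    for _ in range(3):
        size = sample_chunk_size(6, rng=rng)
        print("batch chunk size:", size)
```

Setting the chunk size to the full utterance length gives back full attention, and a chunk size of 1 gives the purely causal case.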

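Last one:
SoX (Sound eXchange) is a cross-platform (Windows, Linux, macOS, etc.) command-line utility that converts audio files between a wide range of formats.
SoX can also apply various effects to the input audio, and on most platforms it can play and record audio files as well.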
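A typical use in ASR data preparation is resampling audio to 16 kHz mono. Here is a minimal sketch that calls the `sox` command line from Python. The file names are placeholders, the helper names are my own, and 16 kHz mono is just a common choice rather than anything this post prescribes; `-r` and `-c` are SoX's sample rate and channel options for the output file.

```python
import shutil
import subprocess


def build_sox_command(src, dst, sample_rate=16000, channels=1):
    """Build a sox command line; options placed before dst apply to the output file."""
    return ["sox", src, "-r", str(sample_rate), "-c", str(channels), dst]


def convert(src, dst, **kwargs):
    """Run sox to convert src into dst (requires sox on PATH)."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox executable not found on PATH")
    subprocess.run(build_sox_command(src, dst, **kwargs), check=True)


if __name__ == "__main__":
    # Placeholder file names; print the command instead of running it.
    print(" ".join(build_sox_command("input.flac", "output.wav")))
```

Which input formats work depends on how the local SoX was built, so some formats may need an extra handler package.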

Link: https://www.jianshu.com/p/be8977de4a6b
