Overview
WebRTC's noise suppression module consists of three parts: module initialization, noise analysis, and noise suppression. Module initialization sets the module's parameters and initializes some state values. The noise analysis stage performs noise estimation, speech/noise probability computation, and so on. The noise suppression stage then attenuates the noise with a Wiener filter built from the previously computed speech probabilities and noise estimate. The relevant formulas are introduced first, followed by an analysis of the whole algorithm flow.
Analysis of the Relevant Formulas
Let the microphone signal be $y(t)$, the speech signal $x(t)$, and the noise signal $n(t)$, so that $y(t)=x(t)+n(t)$   (1-1), or in the frequency domain $Y(\omega)=X(\omega)+N(\omega)$   (1-2). Noise suppression aims to recover $x(t)$ from $y(t)$ as faithfully as possible, improving both quality and intelligibility. The overall procedure: for each incoming noisy frame, make an initial noise estimate and define a speech probability function; compute several per-frame features from the spectrum; combine the features into a speech probability, dynamically weighted using the features and threshold parameters; use this feature-based speech probability to refine the speech probability function; and finally use the refined function to update each frame's initial noise estimate.
-
SNR Calculation
Two SNRs are computed: the a priori SNR and the a posteriori SNR. Both are used to compute the likelihood ratio (LRT).
A posteriori SNR: the ratio of the observed signal power to the noise power: $\sigma_k(m)=\dfrac{|Y_k(m)|^2}{\langle|N_k(m)|^2\rangle}$   (1-3), where $k$ is the frequency index, $m$ the time (frame) index, and angle brackets denote a statistical average. The numerator makes this an instantaneous quantity.
A priori SNR: the ratio of the speech power to the noise power: $\rho_k(m)=\dfrac{\langle|X_k(m)|^2\rangle}{\langle|N_k(m)|^2\rangle}$   (1-4)
To save computation, WebRTC uses magnitudes instead of squared magnitudes, so the two expressions become $\sigma_k(m)\approx\dfrac{|Y_k(m)|}{\langle|N_k(m)|\rangle}$   (1-5) and $\rho_k(m)\approx\dfrac{\langle|X_k(m)|\rangle}{\langle|N_k(m)|\rangle}$   (1-6)
Since the clean speech signal is unknown, WebRTC obtains the a priori SNR by smoothing the previous frame's a priori SNR with the current frame's estimate: $\rho_k(m)=\gamma_{dd}\,H(k,m-1)\dfrac{|Y_k(m-1)|}{\langle|N_k(m-1)|\rangle}+(1-\gamma_{dd})\max(\sigma_k(m)-1,\,0)$   (1-7)
Here $H(k,m-1)$ is the previous frame's Wiener filter gain; multiplying $H$ by $Y$ is precisely the noise suppression step, i.e. $X_k(m-1)\approx H(k,m-1)Y_k(m-1)$, so the first term is the previous frame's a priori SNR estimate. The term $\sigma_k(m)-1$ is a rough estimate of the current frame's a priori SNR, because $\sigma_k(m)-1=\dfrac{|X_k(m)+N_k(m)|}{\langle|N_k(m)|\rangle}-1\approx\dfrac{|X_k(m)|}{\langle|N_k(m)|\rangle}=\rho_k(m)$. Equation (1-7) is therefore a weighted sum of the previous and current a priori SNR estimates. $\gamma_{dd}$ (decision-directed, DD) is a smoothing factor: the larger it is, the heavier the smoothing, but the slower (more delayed) the SNR update. WebRTC sets it to 0.98.
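To make the decision-directed update concrete, here is a minimal C sketch of (1-7) for a single frequency bin; the function and parameter names are illustrative, not WebRTC's actual identifiers.

```c
#include <math.h>

#define GAMMA_DD 0.98f /* smoothing factor, as stated above */

/* Decision-directed a priori SNR for one bin, eq. (1-7).
 * prev_gain:  Wiener gain H(k, m-1) of the previous frame
 * prev_magn:  |Y_k(m-1)|, previous noisy magnitude
 * noise_magn: <|N_k|>, current noise magnitude estimate
 * curr_magn:  |Y_k(m)|, current noisy magnitude */
static float PriorSnrDd(float prev_gain, float prev_magn,
                        float noise_magn, float curr_magn) {
  const float eps = 1e-10f;
  /* Previous-frame term: H(k,m-1)*|Y(m-1)| approximates |X(m-1)|. */
  float prev_est = prev_gain * prev_magn / (noise_magn + eps);
  /* Instantaneous term: post SNR minus 1, floored at 0. */
  float post_snr = curr_magn / (noise_magn + eps);
  float inst_est = fmaxf(post_snr - 1.0f, 0.0f);
  return GAMMA_DD * prev_est + (1.0f - GAMMA_DD) * inst_est;
}
```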
-
Speech Features Used in WebRTC
The WebRTC code mainly uses three features: spectral flatness, the LRT mean value, and the spectral template difference (similarity to a noise template). Each is introduced below:
-
LRT Mean
The mean of the time-smoothed likelihood ratios is a reliable indicator of the speech/noise state. It is computed as:
$F_1=\log\Big(\prod_k\tilde{\Delta}_k(m)^{1/N}\Big)=\dfrac{1}{N}\sum_{k=1}^{N}\log\tilde{\Delta}_k(m)$   (1-8)
where $\Delta$ is defined later in the speech-probability section. When using the LRT mean feature, a tanh-shaped mapping function is applied: $M(z)=0.5\,(\tanh(w_1 z_1)+1)$ with $z_1=F_1-T_1$, where $F_1$ is the LRT mean feature, $w_1$ controls the smoothness of the 0-to-1 mapping, and the threshold parameter $T_1$ is set from other parameters.
-
Spectral Flatness
Speech is generally assumed to have more harmonic structure than noise: a speech spectrum shows energy peaks at the pitch and its harmonics, whereas a noise spectrum is comparatively flat, so spectral flatness can separate the two. For this feature the spectrum is divided into several sub-bands, each of which may contain a different number of bins. Spectral flatness is computed as:
$F_2=\mathrm{Flatness}=\dfrac{\prod_k|Y_k|^{1/N}}{\frac{1}{N}\sum_k|Y_k|}=\dfrac{\sqrt[N]{\prod_{k=0}^{N-1}|Y_k(m)|}}{\frac{1}{N}\sum_{k=0}^{N-1}|Y_k(m)|}=\dfrac{e^{\frac{1}{N}\sum_{k=0}^{N-1}\ln|Y_k(m)|}}{\frac{1}{N}\sum_{k=0}^{N-1}|Y_k(m)|}$   (1-9)
where $N$ is the number of bins in the sub-band. For noise, $F_2$ is large and varies little; for speech it is small and varies more. (For a perfectly flat spectrum the geometric and arithmetic means coincide, so $F_2=1$; a single dominant spectral peak drives the geometric mean, and hence $F_2$, toward 0.)
-
Spectral Template Similarity (Difference)
Meaning of the feature: define a "noise template" and measure how far the input spectrum deviates from it. A small difference means the input frame is close to the template, i.e. more likely to be noise. This feature generalizes spectral flatness: in some cases the template difference reduces to a flatness measurement.
The spectrum of stationary noise is flatter than that of speech, so the noise spectral envelope is assumed to stay roughly constant. The template is determined from spectral segments that are very likely noise or speech pauses, and the noise is only updated when the speech probability falls below a threshold. The template approach can also handle non-stationary noise, by maintaining templates for it. In WebRTC the template difference is defined as:
$J=\sum_k|Y_k(m)-N_k(m)|^2=\sum_k\big|Y_k(m)-[\alpha\cdot a_k(m)+\mu]\big|^2$   (1-10)
where $Y$ is the input spectrum, $a_k(m)$ the noise template spectrum, and $\alpha$ and $\mu$ are parameters controlling amplitude and linear offset. The normalized template difference is:
$F_3=\dfrac{J}{\frac{1}{W}\sum_{m=0}^{W}\sum_k|Y_k(m)|^2}=\dfrac{\sum_k\big|Y_k(m)-[\alpha\cdot a_k(m)+\mu]\big|^2}{\frac{1}{W}\sum_{m=0}^{W}\sum_k|Y_k(m)|^2}$   (1-11)
i.e. the average is taken over all frequencies within a time window. Different bands can also be given different weights $W_k$:
$J=\sum_k W_k\big|Y_k(m)-[\alpha\cdot a_k(m)+\mu]\big|^2$   (1-12)
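As a concrete reading of (1-10)-(1-12), the sketch below computes the weighted template difference and normalizes it by the average input energy; the function and parameter names are illustrative only.

```c
/* Weighted spectral template difference, eqs. (1-10)-(1-12).
 * y:     input magnitude spectrum |Y_k(m)|, length n
 * tmpl:  noise template a_k(m), length n
 * w:     per-band weights W_k, length n
 * alpha, mu: amplitude and offset of the template fit
 * avg_energy: (1/W) * sum over the time window of sum_k |Y_k|^2 */
static float TemplateDifference(const float *y, const float *tmpl,
                                const float *w, int n,
                                float alpha, float mu, float avg_energy) {
  float j = 0.0f;
  for (int i = 0; i < n; i++) {
    float d = y[i] - (alpha * tmpl[i] + mu);
    j += w[i] * d * d; /* eq. (1-12) */
  }
  return j / (avg_energy + 1e-10f); /* normalization, eq. (1-11) */
}
```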
These features can be combined into a speech/noise probability, also called the "feature-based speech probability":
$q_m(H|F_1,F_2,F_3)=q_m=\gamma_q q_{m-1}+(1-\gamma_q)\big[\tau_1 M_1(F_1-T_1,w_1)+\tau_2 M_2(F_2-T_2,w_2)+\tau_3 M_3(F_3-T_3,w_3)\big]$   (1-13)
$H$ denotes the speech/noise state, $F$ the measured features, $T$ the feature thresholds, and $w$ the shape/width parameters of the mapping functions $M$ (e.g. tanh-based). Each $M$ classifies a frame as speech ($M$ near 1) or noise ($M$ near 0) from the measured feature, its threshold, and its width. The $\tau$ are weights that stabilize the combined result. In this algorithm the mapping function is $M(z)=0.5\,(\tanh(wz)+1)$.
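As an illustration of (1-13), the sketch below maps one feature through the tanh function and folds the three mapped features into the smoothed feature-based probability; all names and constants are placeholders, not the actual WebRTC symbols.

```c
#include <math.h>

/* Map a measured feature F through M(z) = 0.5*(tanh(w*z) + 1),
 * z = F - T, so the output lies in (0, 1): ~1 for speech, ~0 for noise. */
static float MapFeature(float feature, float threshold, float width) {
  return 0.5f * (tanhf(width * (feature - threshold)) + 1.0f);
}

/* One step of eq. (1-13): time-smoothed weighted sum of the three
 * mapped features. tau[] are the feature weights, gamma_q the smoothing. */
static float UpdateFeatureProb(float q_prev, const float F[3],
                               const float T[3], const float w[3],
                               const float tau[3], float gamma_q) {
  float m = tau[0] * MapFeature(F[0], T[0], w[0]) +
            tau[1] * MapFeature(F[1], T[1], w[1]) +
            tau[2] * MapFeature(F[2], T[2], w[2]);
  return gamma_q * q_prev + (1.0f - gamma_q) * m;
}
```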
- Speech/Noise Probability Calculation
Define the speech state as $H^{k,m}=H_1^{k,m}$ and the noise state as $H^{k,m}=H_0^{k,m}$. The speech/noise probability is then $P(H^{k,m}\mid Y_k(m),\{F\})$, i.e. it depends on the input spectrum and on the features introduced above. Bayes' rule gives the familiar identities:
$P(A|B)=\dfrac{P(A,B)}{P(B)}$ and, derived from it, $P(A|B,C)=\dfrac{P(A,B|C)}{P(B|C)}$
$P(B_1|A)=\dfrac{P(A|B_1)P(B_1)}{P(A|B_1)P(B_1)+P(A|B_2)P(B_2)}$, where $B_1$ and $B_2$ are complementary events. The normalized speech probability is then:
$P(H_1|Y_k(m),\{F\})=\dfrac{P(Y_k(m),\{F\}|H_1)P(H_1)}{P(Y_k(m),\{F\}|H_1)P(H_1)+P(Y_k(m),\{F\}|H_0)P(H_0)}$
$=\dfrac{P(Y_k(m)|\{F\},H_1)P(\{F\}|H_1)P(H_1)}{P(Y_k(m)|\{F\},H_1)P(\{F\}|H_1)P(H_1)+P(Y_k(m)|\{F\},H_0)P(\{F\}|H_0)P(H_0)}$
$=\dfrac{P(Y_k(m)|\{F\},H_1)P(\{F\},H_1)}{P(Y_k(m)|\{F\},H_1)P(\{F\},H_1)+P(Y_k(m)|\{F\},H_0)P(\{F\},H_0)}$
$=\dfrac{P(Y_k(m)|\{F\},H_1)\frac{P(\{F\},H_1)}{P(\{F\})}}{P(Y_k(m)|\{F\},H_1)\frac{P(\{F\},H_1)}{P(\{F\})}+P(Y_k(m)|\{F\},H_0)\frac{P(\{F\},H_0)}{P(\{F\})}}$
$=\dfrac{P(Y_k(m)|\{F\},H_1)\,q_{k,m}(H_1|F)}{P(Y_k(m)|\{F\},H_1)\,q_{k,m}(H_1|F)+P(Y_k(m)|\{F\},H_0)\,q_{k,m}(H_0|F)}$   (1-14)
where $q_{k,m}(H|F)=\dfrac{P(\{F\},H)}{P(\{F\})}=P(H|\{F\})$ is the feature-based speech probability mentioned earlier. Writing $q_{k,m}(H_1|F)=P(H_1|\{F\})=q$ and $q_{k,m}(H_0|F)=P(H_0|\{F\})=1-q$, equation (1-14) becomes:
$P(H_1|Y_k(m),\{F\})=\dfrac{q\Delta_k}{q\Delta_k+1-q}$   (1-15)
where the likelihood ratio (LR) is $\Delta_k=\dfrac{P(Y_k(m)|\{F\},H_1)}{P(Y_k(m)|\{F\},H_0)}$   (1-16), the quantity referred to in the LRT-mean section above. The densities $P(Y_k(m)|H_{0,1},\{F\})$ in $\Delta_k$ follow from a linear state model with Gaussian probability density functions (PDFs) for the speech and noise spectral coefficients. In the speech state $Y_k(m)=X_k(m)+N_k(m)$, $H=H_1$; in the noise state $Y_k(m)=N_k(m)$, $H=H_0$. Using the complex coefficients $\{X_k(m),N_k(m)\}$, abbreviated $\{X_k,N_k\}$ for brevity, $P(Y_k(m)|H,\{F\})$ is:
$P(Y_k(m)|H_0,\{F\})=P(Y_k(m)|H_0)\propto\dfrac{1}{\langle|N_k|^2\rangle}\exp\Big(-\dfrac{|Y_k|^2}{\langle|N_k|^2\rangle}\Big)$   (1-17)
$P(Y_k(m)|H_1,\{F\})=P(Y_k(m)|H_1)\propto\dfrac{1}{\langle|N_k|^2\rangle+\langle|X_k|^2\rangle}\exp\Big(-\dfrac{|Y_k|^2}{\langle|N_k|^2\rangle+\langle|X_k|^2\rangle}\Big)$   (1-18)
The likelihood ratio is then:
$\Delta_k=\dfrac{P(Y_k(m)|H_1)}{P(Y_k(m)|H_0)}=\dfrac{\langle|N_k|^2\rangle}{\langle|N_k|^2\rangle+\langle|X_k|^2\rangle}\exp\Big(\dfrac{|Y_k|^2}{\langle|N_k|^2\rangle}-\dfrac{|Y_k|^2}{\langle|N_k|^2\rangle+\langle|X_k|^2\rangle}\Big)$
$=\dfrac{1}{1+\rho_k(m)}\exp\Big(\sigma_k(m)-\dfrac{\sigma_k(m)}{1+\rho_k(m)}\Big)=\dfrac{1}{1+\rho_k(m)}\exp\Big(\dfrac{\sigma_k(m)\rho_k(m)}{1+\rho_k(m)}\Big)$   (1-19)
where $\sigma_k(m)$ and $\rho_k(m)$ are the a posteriori and a priori SNRs defined above. Frame-to-frame spectral fluctuations cause large swings in $\Delta_k$, so a time-smoothed likelihood ratio is used instead: $\log\tilde{\Delta}_k(m)=\gamma_{lrt}\log\tilde{\Delta}_k(m-1)+(1-\gamma_{lrt})\log\Delta_k(m)$   (1-20)
Averaging this over frequency yields the LRT mean feature described in (1-8).
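The following sketch ties (1-15), (1-19), and (1-20) together for one frame: it computes the per-bin log-LRT from the two SNRs, smooths it over time, and converts it, together with the feature-based prior $q$, into the posterior speech probability. All names and the constant GAMMA_LRT are illustrative, not WebRTC's.

```c
#include <math.h>

#define GAMMA_LRT 0.5f /* illustrative smoothing constant */

/* prior_snr: rho_k(m); post_snr: sigma_k(m); both of length n.
 * log_lrt:   smoothed log likelihood ratio, persists across frames.
 * q:         feature-based speech prior of eq. (1-13).
 * speech_prob: output, P(H1 | Y_k(m), {F}) per bin, eq. (1-15).
 * Returns the LRT mean feature F1 of eq. (1-8). */
static float LrtAndSpeechProb(const float *prior_snr, const float *post_snr,
                              float *log_lrt, float q,
                              float *speech_prob, int n) {
  float sum = 0.0f;
  for (int i = 0; i < n; i++) {
    /* log of eq. (1-19) */
    float cur = post_snr[i] * prior_snr[i] / (1.0f + prior_snr[i]) -
                logf(1.0f + prior_snr[i]);
    /* time smoothing, eq. (1-20) */
    log_lrt[i] = GAMMA_LRT * log_lrt[i] + (1.0f - GAMMA_LRT) * cur;
    sum += log_lrt[i];
    /* posterior, eq. (1-15), with delta = exp(log_lrt) */
    float delta = expf(log_lrt[i]);
    speech_prob[i] = q * delta / (q * delta + 1.0f - q);
  }
  return sum / n; /* F1 */
}
```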
- Noise Update
Once the speech/noise probability is available, the noise estimate can be smoothly updated with it:
$|\hat{N}_k(m)|=\gamma_n|\hat{N}_k(m-1)|+(1-\gamma_n)\big[P(H_1|Y_k(m))\hat{N}_k(m-1)+P(H_0|Y_k(m))Y_k(m)\big]$   (1-21)
where $\hat{N}_k(m-1)$ is the previous frame's noise estimate and $P(H_{0,1}|Y_k(m))$ equals $P(H_{0,1}|Y_k(m),\{F\})$ as given in (1-15).
- Noise Suppression
With the noise estimate updated, a Wiener filter can suppress the noise. The standard Wiener filter is:
$H_w(k,m)=\dfrac{\langle|X_k(m)|^2\rangle}{\langle|Y_k(m)|^2\rangle}=1-\dfrac{\langle|N_k(m)|^2\rangle}{\langle|Y_k(m)|^2\rangle}\approx1-\dfrac{|\hat{N}_k(m)|^2}{|Y_k(m)|^2}$   (1-22)
Using magnitudes instead of squared magnitudes, this becomes:
$H_w(k,m)=1-\dfrac{|\hat{N}_k(m)|}{|Y_k(m)|}$   (1-23)
In this algorithm, the Wiener filter is expressed through the a priori SNR:
$H_k(m)=\dfrac{\rho_k(m)}{1+\rho_k(m)}$   (1-24)
Since $\rho_k(m)$ is by definition the smoothed a priori SNR introduced earlier, this Wiener filter is already time-smoothed and needs no separate smoothing. Substituting the estimated noise spectrum, (1-6) becomes $\rho_k(m)=\dfrac{\langle|X_k(m)|\rangle}{|\hat N_k(m)|}$, and (1-7) changes accordingly. With the decision-directed (DD) approach, the final filter is:
$H_{w,dd}(k,m)=\max\Big\{H_{min},\,\dfrac{\rho_k(m)}{\beta+\rho_k(m)}\Big\}$   (1-25)
where $\beta$ controls the amount of noise reduction. Finally, the Wiener filter is applied to the input spectrum to obtain the denoised spectrum:
$\hat{X}_k(m)=H_{w,dd}(k,m)Y_k(m)$   (1-26)
An inverse Fourier transform then yields the denoised time-domain signal.
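A minimal C sketch of (1-24)-(1-26), applying the DD Wiener gain to a half-spectrum; beta, h_min, and all names are illustrative assumptions, not WebRTC identifiers.

```c
/* Apply the decision-directed Wiener gain of eqs. (1-25)/(1-26).
 * beta controls suppression aggressiveness, h_min floors the attenuation. */
static void ApplyWiener(const float *prior_snr, float *real, float *imag,
                        int n, float beta, float h_min) {
  for (int i = 0; i < n; i++) {
    float h = prior_snr[i] / (beta + prior_snr[i]); /* eq. (1-25) */
    if (h < h_min) h = h_min;
    if (h > 1.0f) h = 1.0f;
    real[i] *= h; /* eq. (1-26): X_hat = H * Y */
    imag[i] *= h;
  }
}
```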
From here on, the code-level processing flow is analyzed.
-
Module Initialization
WebRTC's noise suppression code lives mainly in noise_suppression.c. Before denoising starts, some module initialization must be done; this is performed by WebRtcNs_InitCore(NoiseSuppressionC *self, uint32_t fs). The algorithm only supports 8 kHz and 16 kHz sample rates. For 8 kHz audio, each update consumes blockLen = 80 new samples, which are concatenated with previous data to form an analysis frame of anaLen = 128 samples; in my reading, the hop size is 80 and the frame length 128 (see the buffering sketch below). At 16 kHz both lengths double. The anaLen-sample frame is FFT'd and only half the spectrum plus one bin is analyzed, i.e. magnLen = anaLen/2 + 1 (the FFT of a real signal is conjugate-symmetric). This part of the code is already commented, so it is not analyzed further here.
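The article does not show UpdateBuffer itself; the sketch below illustrates the hop/overlap bookkeeping it implies (shift the analysis buffer left by the hop size and append the new block). Treat this as an assumption about the semantics, not the verbatim WebRTC code.

```c
#include <stdint.h>
#include <string.h>

/* Shift the analysis buffer left by block_len samples and append the
 * new block, so buf always holds the most recent ana_len samples
 * (hop = block_len, frame = ana_len). */
static void UpdateAnalysisBuffer(const int16_t *new_block, size_t block_len,
                                 size_t ana_len, float *buf) {
  memmove(buf, buf + block_len, (ana_len - block_len) * sizeof(*buf));
  for (size_t i = 0; i < block_len; i++) {
    buf[ana_len - block_len + i] = (float)new_block[i];
  }
}
```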
Noise Analysis
This stage computes the LRT mean, spectral flatness, and spectral template difference features, uses them to compute the speech probability, and then updates the noise spectrum estimate. It is performed by WebRtcNs_AnalyzeCore(NoiseSuppressionC *self, const int16_t *speechFrame); the overall flow is sketched in the figure below:
[Figure: webrtc ns noise analysis flow diagram]
-
Data Preparation
Each call reads 80 new samples and updates the analysis frame, i.e. the newest 80 points of the analysis buffer are replaced with the incoming data:

```c
UpdateBuffer(speechFrame, self->blockLen, self->anaLen, self->analyzeBuf);
```

Then analyzeBuf is windowed and its energy computed; if the energy is 0, the function returns immediately and the rest of WebRtcNs_AnalyzeCore is skipped:

```c
energy = WindowingEnergy(self->window, self->analyzeBuf, self->anaLen, winData);
if (energy == 0.0) {
  return;
}
```
-
Data Analysis
-
Fourier Transform
```c
FFT(self, winData, self->anaLen, self->magnLen, real, imag, magn, lmagn, 1,
    &signalEnergy, &sumMagn);
```
This step also computes the signal energy, the sum of the squared spectral magnitudes.
-
Computing energy- and magnitude-related quantities

```c
// Normalize the energy by the frame length (self->normMagnLen = 1/magnLen).
signalEnergy *= self->normMagnLen;
self->signalEnergy = signalEnergy;
// Store the sum of magnitudes.
self->sumMagn = sumMagn;
```
-
-
Quantile Noise Estimation
This step makes an initial noise estimate for the input frame, using "quantile noise estimation". The basic idea: for a given frequency bin, collect the magnitudes of all time frames at that bin and sort them; choose a quantile, and treat everything below it as noise and everything above it as speech. In practice, waiting to collect and sort hundreds of frames would break real-time operation, so the WebRTC algorithm uses an adaptive quantile update instead: a log-domain quantile value lquantile serves as the initial log noise level, and whenever the input frame's magnitude exceeds it, the quantile is deemed too low and is increased, otherwise it is decreased:
```c
if (lmagn[i] > self->lquantile[offset + i]) {
  self->lquantile[offset + i] += QUANTILE * delta * norm_counter_weight;
} else {
  self->lquantile[offset + i] -= (1.f - QUANTILE) * delta * norm_counter_weight;
}
```

Clearly delta and norm_counter_weight control the update step size. First, delta:

```c
if (self->density[offset + i] > 1.0) {
  delta = FACTOR / self->density[offset + i];
} else {
  delta = FACTOR;
}
```

density can be read as a probability density that measures how accurate the quantile estimate is at each bin. It is updated as follows:

```c
if (fabsf(lmagn[i] - self->lquantile[offset + i]) < WIDTH) {
  self->density[offset + i] = self->counter[s] * self->density[offset + i] *
                              norm_counter_weight + density_plus_weight;
}
```

That is, when lmagn and lquantile are close enough, the noise estimate lquantile for this bin is considered accurate, so the bin's density is increased. Looking back at delta: once density exceeds 1, delta shrinks. In other words, when a bin's noise estimate is already fairly accurate (density > 1), the update step is reduced to refine the estimate (similar to decaying the learning rate when training a neural network).

Next, norm_counter_weight, which is simply the reciprocal of counter. counter is an array of length 3 with initial values [67, 134, 1]. The code actually maintains three quantile estimates (lquantile) per bin, and the three counter[i] are their counters: each is incremented on every noise-estimation pass and reset to zero on reaching 200. So the more frames have accumulated, the smaller norm_counter_weight, and the smaller the lquantile update step. The meaning is the same as before: search coarsely at first, then shrink the step as data accumulates so the estimate converges.

The update cycle: the function keeps a member self->updates that is incremented once per noise estimate during the first two hundred frames:

```c
if (self->updates < END_STARTUP_LONG) {
  self->updates++;
}
```

While self->updates is below 200, i.e. within the first two hundred frames, the system quantile (the struct member self->quantile) is refreshed on every frame:

```c
if (self->updates < END_STARTUP_LONG) {
  for (i = 0; i < self->magnLen; i++) {
    self->quantile[i] = expf(self->lquantile[offset + i]);
  }
  memcpy(noise, self->quantile, self->magnLen * sizeof(*noise));
} else {
  memcpy(noise, self->quantile, self->magnLen * sizeof(*noise));
}
```

Once self->updates has reached 200, the system quantile is refreshed only every 66 frames:

```c
if (self->counter[s] >= END_STARTUP_LONG) {
  self->counter[s] = 0;
  if (self->updates >= END_STARTUP_LONG) {
    for (i = 0; i < self->magnLen; i++) {
      self->quantile[i] = expf(self->lquantile[offset + i]);
    }
  }
}
```

Where 66 comes from: counter is the length-3 array initialized to [67, 134, 1], and all three values are incremented after each frame. So every 66 frames one counter[i] reaches END_STARTUP_LONG (= 200); that counter[i] is reset to zero and the system quantile is refreshed.

The output noise estimate noise: as the startup-phase code above shows, the output noise estimate is always the system quantile values. A few points worth noting:
- The "quantile" above always refers to the noise magnitude level at the quantile point, not a fractional ratio. lquantile and quantile are related by log/exp.
- The variable offset merely avoids making the quantile arrays three-dimensional: the quantile tracked by counter[1], for example, lives in lquantile[128:256], so adjusting offset selects the quantile values belonging to each counter.
- The output noise is only an initial noise estimate; it is refined later.
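Pulling the fragments above together, here is a minimal self-contained sketch of the adaptive log-quantile tracker for a single bin and a single quantile. The constants mirror the ones quoted (QUANTILE, FACTOR, WIDTH), but the structure is simplified: no triple-quantile staggering, and density_plus_weight is assumed to be norm_counter_weight/(2*WIDTH).

```c
#include <math.h>

#define QUANTILE 0.25f /* target quantile (assumed value) */
#define FACTOR   40.0f /* base update step (assumed value) */
#define WIDTH    0.01f /* closeness window for the density update */

typedef struct {
  float lquantile; /* log-domain quantile (noise level) estimate */
  float density;   /* local density of lmagn around the quantile */
  int counter;     /* frames accumulated since the last reset */
} QuantileBin;

/* One update with the current log magnitude lmagn of this bin. */
static void QuantileUpdate(QuantileBin *q, float lmagn) {
  q->counter++;
  float w = 1.0f / (float)q->counter; /* norm_counter_weight */
  float delta = (q->density > 1.0f) ? FACTOR / q->density : FACTOR;

  /* Move the quantile toward the data: up if the sample is above it. */
  if (lmagn > q->lquantile) {
    q->lquantile += QUANTILE * delta * w;
  } else {
    q->lquantile -= (1.0f - QUANTILE) * delta * w;
  }
  /* Sharpen the density when the sample lands near the quantile
   * (running average of a 1/(2*WIDTH) kernel). */
  if (fabsf(lmagn - q->lquantile) < WIDTH) {
    q->density = (q->counter - 1) * q->density * w + w / (2.0f * WIDTH);
  }
}
```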
-
Computing a Simplified Noise Model
While fewer than 50 frames have been processed, WebRTC builds a simple noise model from the frames analyzed so far (their count is recorded as num_analyzed_frames) to estimate the noise more accurately. During these startup frames the quantile estimate is blended with this model, weighted by num_analyzed_frames against (50 - num_analyzed_frames). The flow is as follows:
During the first fifty frames, some required quantities are accumulated:
```c
// END_STARTUP_SHORT = 50 here.
if (self->blockInd < END_STARTUP_SHORT) {
  // Skip the first kStartBand bins, i.e. the lowest-frequency bins
  // do not take part in the computation.
  for (i = kStartBand; i < self->magnLen; i++) {
    sum_log_i += self->log_lut[i];
    sum_log_i_square += self->log_lut_sqr[i];
    sum_log_magn += lmagn[i];
    sum_log_i_log_magn += self->log_lut[i] * lmagn[i];
  }
}
```
Then the white-noise level is computed:

```c
// overdrive measures the suppression level and depends on the
// configured noise-reduction mode.
self->whiteNoiseLevel += sumMagn * self->normMagnLen * self->overdrive;
```
Pink-noise parameters are computed next. A numerator parameter, with $i$ the current frame index, is:
$pink\_noise\_numerator=\sum_{0}^{i}\max\Big(\dfrac{sum\_log\_i\_square\cdot sum\_log\_magn-sum\_log\_i\cdot sum\_log\_i\_log\_magn}{sum\_log\_i\_square\cdot(129-5)-sum\_log\_i\cdot sum\_log\_i},\,0\Big)$
where $(129-5)$ reflects skipping the first 5 bins (likewise below). An exponent parameter is:
$pink\_noise\_exp=\sum_{0}^{i}\max\Big(\min\Big(\dfrac{sum\_log\_i\cdot sum\_log\_magn-(129-5)\cdot sum\_log\_i\_log\_magn}{sum\_log\_i\_square\cdot(129-5)-sum\_log\_i\cdot sum\_log\_i},\,1\Big),\,0\Big)$
From these two accumulators the pink-noise parameters are computed as:
$parametric\_num=(num\_analyzed\_frames+1.0)\times\exp\Big(\dfrac{pink\_noise\_numerator}{num\_analyzed\_frames+1}\Big)$
$parametric\_exp=\dfrac{pink\_noise\_exp}{num\_analyzed\_frames+1}$
The background noise is then estimated from the pink-noise and white-noise parameters:
$parametric\_noise\_spectrum[i]=\begin{cases}white\_noise\_level,&pink\_noise\_exp=0\\\dfrac{parametric\_num}{2^{parametric\_exp\times\log_2(use\_band)}},&\text{else}\end{cases}$
where:
$use\_band=\begin{cases}5,&i<5\\i,&5\le i<129\end{cases}$
The noise model is then combined with the quantile estimate to produce the noise estimate. Note that this only happens during the first 50 frames; beyond 50 frames, only the quantile noise estimation described above is performed.
The code:
```c
if (self->pinkNoiseExp == 0.f) {
  for (i = 0; i < self->magnLen; i++) {
    // If the pink-noise exponent is 0, estimate the background noise
    // from the white-noise parameter.
    self->parametricNoise[i] = self->whiteNoiseLevel;
    // Weight quantile noise with modeled noise.
    noise[i] *= (self->blockInd);
    tmpFloat2 = self->parametricNoise[i] * (END_STARTUP_SHORT - self->blockInd);
    noise[i] += tmpFloat2 * norm;
    noise[i] *= norm_end;
  }
} else {
  // Pink-noise parameter computation.
  parametric_num = expf(self->pinkNoiseNumerator * norm);
  parametric_num *= (float) (self->blockInd + 1);
  parametric_exp = self->pinkNoiseExp * norm;
  for (i = 0; i < self->magnLen; i++) {
    // Otherwise estimate the background noise from the pink-noise
    // parameters.
    float use_band = (float) (i < kStartBand ? kStartBand : i);
    self->parametricNoise[i] = parametric_num / powf(use_band, parametric_exp);
    // Weight quantile noise with modeled noise.
    noise[i] *= (self->blockInd);
    tmpFloat2 = self->parametricNoise[i] * (END_STARTUP_SHORT - self->blockInd);
    noise[i] += tmpFloat2 * norm;
    noise[i] *= norm_end;
  }
}
```
-
Computing the SNR
ComputeSnr() computes the a priori and a posteriori SNRs, which feed the LRT quantity described earlier and, from it, the speech presence probability:

```c
static void ComputeSnr(const NoiseSuppressionC *self,
                       const float *magn,
                       const float *noise,
                       float *snrLocPrior,
                       float *logSnrLocPrior,
                       float *snrLocPost) {
  size_t i;
  for (i = 0; i < self->magnLen; i++) {
    // Previous estimate: based on the previous frame with gain filter.
    float previousEstimateStsa = (self->magnPrevAnalyze[i] * self->smooth[i]) /
                                 (self->noisePrev[i] + epsilon);
    // Post SNR.
    snrLocPost[i] = 0.f;
    if (magn[i] > noise[i]) {
      snrLocPost[i] = (magn[i] - noise[i]) / (noise[i] + epsilon);
    }
    // DD estimate is the sum of two terms: current and previous estimates.
    // Directed decision update of snrPrior.
    snrLocPrior[i] = 2.f * (DD_PR_SNR * previousEstimateStsa +
                            (1.f - DD_PR_SNR) * snrLocPost[i]);
    logSnrLocPrior[i] = log1pf(snrLocPrior[i]);
  }  // End of loop over frequencies.
}
```
-
Feature Update
First, FeatureUpdate() updates the three features (spectral flatness, LRT, and spectral template difference). Two cases are distinguished:
a. Fewer than 200 frames: still in the noise (startup) phase, where the key task is the initial noise estimate; only the template difference is updated.
b. More than 200 frames: speech may be present and the initial noise estimate is complete, so all three features are updated.
-
Spectral Flatness Computation
The principle is equation (1-9). FeatureUpdate() calls ComputeSpectralFlatness() to compute the flatness:
$spectral\_tmp=\dfrac{e^{avg\_spec\_flatness\_num}}{avg\_spec\_flatness\_denom}=\dfrac{e^{\frac{1}{129}\sum_{i=0}^{128}\ln(signal\_spectrum[i])}}{\frac{1}{129}\big(signal\_spectral\_sum-signal\_spectrum[0]\big)}=\dfrac{\prod_k|Y_k|^{\frac{1}{129}}}{\frac{1}{129}\sum_{k=0}^{128}|Y_k|}$
The result is then time-smoothed and stored in self->featureData[0]:

```c
static void ComputeSpectralFlatness(NoiseSuppressionC *self,
                                    const float *magnIn,
                                    const float *logmagnIn) {
  size_t i;
  size_t shiftLP = 1;  // Option to remove first bin(s) from spectral measures.
  float avgSpectralFlatnessNum, avgSpectralFlatnessDen, spectralTmp;

  // Compute spectral measures (for flatness).
  avgSpectralFlatnessNum = 0;
  avgSpectralFlatnessDen = self->sumMagn;
  for (i = 0; i < shiftLP; i++) {
    avgSpectralFlatnessDen -= magnIn[i];
  }
  // Compute the log of the ratio of geometric to arithmetic mean;
  // check for the log(0) case.
  for (i = shiftLP; i < self->magnLen; i++) {
    if (magnIn[i] > 0.0) {
      avgSpectralFlatnessNum += logmagnIn[i];
    } else {
      self->featureData[0] -= SPECT_FL_TAVG * self->featureData[0];
      return;
    }
  }
  // Normalize.
  avgSpectralFlatnessDen = avgSpectralFlatnessDen * self->normMagnLen;
  avgSpectralFlatnessNum = avgSpectralFlatnessNum * self->normMagnLen;

  // Ratio and inverse log: check for case of log(0).
  spectralTmp = expf(avgSpectralFlatnessNum) / avgSpectralFlatnessDen;

  // Time-avg update of the spectral flatness feature.
  self->featureData[0] += SPECT_FL_TAVG * (spectralTmp - self->featureData[0]);
  // Done with flatness feature.
}
```
-
Spectral Template Difference Computation
This feature measures the difference between the input spectrum and the noise the algorithm has learned; the principle is (1-10). The code first computes the means and variances of the noise and input spectra, then their covariance, and finally:
$spectral\_diff=signal\_variance-\dfrac{covariance^2}{noise\_variance+0.0001}=\mathrm{Var}(Y)-\dfrac{\mathrm{Cov}(Y,N)^2}{\mathrm{Var}(N)}$
It is then normalized and, as before, time-smoothed, and finally stored in self->featureData[4]:

```c
static void ComputeSpectralDifference(NoiseSuppressionC *self,
                                      const float *magnIn) {
  // avgDiffNormMagn = var(magnIn) - cov(magnIn, magnAvgPause)^2 /
  //                   var(magnAvgPause)
  size_t i;
  float avgPause, avgMagn, covMagnPause, varPause, varMagn, avgDiffNormMagn;

  avgPause = 0;
  avgMagn = self->sumMagn;
  // Compute average quantities.
  for (i = 0; i < self->magnLen; i++) {
    // Conservative smooth noise spectrum from pause frames.
    avgPause += self->magnAvgPause[i];
  }
  avgPause *= self->normMagnLen;
  avgMagn *= self->normMagnLen;

  covMagnPause = 0;
  varPause = 0;
  varMagn = 0;
  // Compute variance and covariance quantities.
  for (i = 0; i < self->magnLen; i++) {
    const float avgPauseDiff = self->magnAvgPause[i] - avgPause;
    const float avgMagnDiff = magnIn[i] - avgMagn;
    covMagnPause += avgMagnDiff * avgPauseDiff;
    varPause += avgPauseDiff * avgPauseDiff;
    varMagn += avgMagnDiff * avgMagnDiff;
  }
  covMagnPause *= self->normMagnLen;
  varPause *= self->normMagnLen;
  varMagn *= self->normMagnLen;

  // Update of average magnitude spectrum.
  self->featureData[6] += self->signalEnergy;

  avgDiffNormMagn =
      varMagn - (covMagnPause * covMagnPause) / (varPause + epsilon);
  // Normalize and compute time-avg update of the difference feature.
  avgDiffNormMagn = avgDiffNormMagn / (self->featureData[5] + epsilon);
  self->featureData[4] +=
      SPECT_DIFF_TAVG * (avgDiffNormMagn - self->featureData[4]);
}
```
-
Updating the Histograms
The histograms hold the features' thresholds and weights and are harvested every 500 frames to update the model parameters. The computation splits into two parts: when the frame count is not a multiple of 500 (self->modelUpdatePars[3] > 0 in the code), only the histograms themselves are updated; when it is a multiple of 500 (self->modelUpdatePars[3] == 0), the whole model's parameters are updated and the histograms are reset. The histogram update covers the three features: spectral flatness, LRT, and template difference.
The parameter update is done by FeatureParameterExtraction(), which takes two arguments: the main struct and a flag saying whether the 500-frame mark has been reached. It proceeds as follows:
a. When flag is 0, only the histograms themselves are updated, i.e. the LRT, flatness, and difference feature values stored in self->featureData[3], self->featureData[0], and self->featureData[4] respectively.
b. When flag is 1, the whole model's parameters are updated:
i) LRT feature, using averaging:
$bin\_mid=(i+0.5)\times kBinSizeLrt$
$average=\dfrac{\sum_{i=0}^{9}\big(lrt\_histogram[i]\times(i+0.5)\times kBinSizeLrt\big)}{\sum_{i=0}^{9}lrt\_histogram[i]}$
$average\_squared=\dfrac{\sum_{i=0}^{999}\big(lrt\_histogram[i]\times[(i+0.5)\times kBinSizeLrt]^2\big)}{kFeatureUpdateWindowSize}$
$average\_compl=\dfrac{\sum_{i=0}^{999}\big(lrt\_histogram[i]\times(i+0.5)\times kBinSizeLrt\big)}{kFeatureUpdateWindowSize}$
The LRT fluctuation is then computed:
$lrt\_fluctuation=average\_squared-average\times average\_compl$
and the model's LRT threshold is updated from it:
$prior\_model\_lrt=\begin{cases}kMaxLrt=1.0,&\text{fluctuation very low (noise-like)}\\1.2\times average,&\text{else (clamped so that }1.2\times average\in[0.2,\,1]\text{)}\end{cases}$
ii) Spectral flatness feature: first the two main peaks of its histogram and their positions are found. If the second-highest peak is adjacent to the highest and its height exceeds half of the highest, the peak position is taken between the two and the peak height as the sum of the two. Then the weight of flatness among the three features is computed, and the model's flatness parameter is updated.
iii) Spectral difference feature: analogous to flatness, again based on the histogram peaks.
Finally the weight of each feature is computed and the histograms are reset.
The FeatureParameterExtraction() code:

```c
static void FeatureParameterExtraction(NoiseSuppressionC *self, int flag) {
  int i, useFeatureSpecFlat, useFeatureSpecDiff, numHistLrt;
  int maxPeak1, maxPeak2;
  int weightPeak1SpecFlat, weightPeak2SpecFlat, weightPeak1SpecDiff,
      weightPeak2SpecDiff;
  float binMid, featureSum;
  float posPeak1SpecFlat, posPeak2SpecFlat, posPeak1SpecDiff, posPeak2SpecDiff;
  float fluctLrt, avgHistLrt, avgSquareHistLrt, avgHistLrtCompl;

  // 3 features: LRT, flatness, difference.
  // lrt_feature = self->featureData[3];
  // flat_feature = self->featureData[0];
  // diff_feature = self->featureData[4];

  // Update histograms.
  if (flag == 0) {
    // LRT
    if ((self->featureData[3] <
         HIST_PAR_EST * self->featureExtractionParams.binSizeLrt) &&
        (self->featureData[3] >= 0.0)) {
      i = (int) (self->featureData[3] /
                 self->featureExtractionParams.binSizeLrt);
      self->histLrt[i]++;
    }
    // Spectral flatness.
    if ((self->featureData[0] <
         HIST_PAR_EST * self->featureExtractionParams.binSizeSpecFlat) &&
        (self->featureData[0] >= 0.0)) {
      i = (int) (self->featureData[0] /
                 self->featureExtractionParams.binSizeSpecFlat);
      self->histSpecFlat[i]++;
    }
    // Spectral difference.
    if ((self->featureData[4] <
         HIST_PAR_EST * self->featureExtractionParams.binSizeSpecDiff) &&
        (self->featureData[4] >= 0.0)) {
      i = (int) (self->featureData[4] /
                 self->featureExtractionParams.binSizeSpecDiff);
      self->histSpecDiff[i]++;
    }
  }

  // Extract parameters for speech/noise probability.
  if (flag == 1) {
    // LRT feature: compute the average over
    // self->featureExtractionParams.rangeAvgHistLrt.
    avgHistLrt = 0;
    avgHistLrtCompl = 0;
    avgSquareHistLrt = 0;
    numHistLrt = 0;
    for (i = 0; i < HIST_PAR_EST; i++) {
      binMid = ((float) i + 0.5f) * self->featureExtractionParams.binSizeLrt;
      if (binMid <= self->featureExtractionParams.rangeAvgHistLrt) {
        avgHistLrt += self->histLrt[i] * binMid;
        numHistLrt += self->histLrt[i];
      }
      avgSquareHistLrt += self->histLrt[i] * binMid * binMid;
      avgHistLrtCompl += self->histLrt[i] * binMid;
    }
    if (numHistLrt > 0) {
      avgHistLrt = avgHistLrt / ((float) numHistLrt);
    }
    avgHistLrtCompl = avgHistLrtCompl / ((float) self->modelUpdatePars[1]);
    avgSquareHistLrt = avgSquareHistLrt / ((float) self->modelUpdatePars[1]);
    fluctLrt = avgSquareHistLrt - avgHistLrt * avgHistLrtCompl;
    // Get threshold for LRT feature.
    if (fluctLrt < self->featureExtractionParams.thresFluctLrt) {
      // Very low fluctuation, so likely noise.
      self->priorModelPars[0] = self->featureExtractionParams.maxLrt;
    } else {
      self->priorModelPars[0] =
          self->featureExtractionParams.factor1ModelPars * avgHistLrt;
      // Check if value is within min/max range.
      if (self->priorModelPars[0] < self->featureExtractionParams.minLrt) {
        self->priorModelPars[0] = self->featureExtractionParams.minLrt;
      }
      if (self->priorModelPars[0] > self->featureExtractionParams.maxLrt) {
        self->priorModelPars[0] = self->featureExtractionParams.maxLrt;
      }
    }
    // Done with LRT feature.

    // For spectral flatness and spectral difference: compute the main
    // peaks of the histograms.
    maxPeak1 = 0;
    maxPeak2 = 0;
    posPeak1SpecFlat = 0;
    posPeak2SpecFlat = 0;
    weightPeak1SpecFlat = 0;
    weightPeak2SpecFlat = 0;
    // Peaks for flatness.
    for (i = 0; i < HIST_PAR_EST; i++) {
      binMid = (i + 0.5f) * self->featureExtractionParams.binSizeSpecFlat;
      if (self->histSpecFlat[i] > maxPeak1) {
        // Found new "first" peak.
        maxPeak2 = maxPeak1;
        weightPeak2SpecFlat = weightPeak1SpecFlat;
        posPeak2SpecFlat = posPeak1SpecFlat;
        maxPeak1 = self->histSpecFlat[i];
        weightPeak1SpecFlat = self->histSpecFlat[i];
        posPeak1SpecFlat = binMid;
      } else if (self->histSpecFlat[i] > maxPeak2) {
        // Found new "second" peak.
        maxPeak2 = self->histSpecFlat[i];
        weightPeak2SpecFlat = self->histSpecFlat[i];
        posPeak2SpecFlat = binMid;
      }
    }

    // Compute two peaks for spectral difference.
    maxPeak1 = 0;
    maxPeak2 = 0;
    posPeak1SpecDiff = 0;
    posPeak2SpecDiff = 0;
    weightPeak1SpecDiff = 0;
    weightPeak2SpecDiff = 0;
    // Peaks for spectral difference.
    for (i = 0; i < HIST_PAR_EST; i++) {
      binMid =
          ((float) i + 0.5f) * self->featureExtractionParams.binSizeSpecDiff;
      if (self->histSpecDiff[i] > maxPeak1) {
        // Found new "first" peak.
        maxPeak2 = maxPeak1;
        weightPeak2SpecDiff = weightPeak1SpecDiff;
        posPeak2SpecDiff = posPeak1SpecDiff;
        maxPeak1 = self->histSpecDiff[i];
        weightPeak1SpecDiff = self->histSpecDiff[i];
        posPeak1SpecDiff = binMid;
      } else if (self->histSpecDiff[i] > maxPeak2) {
        // Found new "second" peak.
        maxPeak2 = self->histSpecDiff[i];
        weightPeak2SpecDiff = self->histSpecDiff[i];
        posPeak2SpecDiff = binMid;
      }
    }

    // For spectrum flatness feature.
    useFeatureSpecFlat = 1;
    // Merge the two peaks if they are close.
    if ((fabsf(posPeak2SpecFlat - posPeak1SpecFlat) <
         self->featureExtractionParams.limitPeakSpacingSpecFlat) &&
        (weightPeak2SpecFlat >
         self->featureExtractionParams.limitPeakWeightsSpecFlat *
             weightPeak1SpecFlat)) {
      weightPeak1SpecFlat += weightPeak2SpecFlat;
      posPeak1SpecFlat = 0.5f * (posPeak1SpecFlat + posPeak2SpecFlat);
    }
    // Reject if weight of peaks is not large enough, or peak value too small.
    if (weightPeak1SpecFlat <
            self->featureExtractionParams.thresWeightSpecFlat ||
        posPeak1SpecFlat < self->featureExtractionParams.thresPosSpecFlat) {
      useFeatureSpecFlat = 0;
    }
    // If selected, get the threshold.
    if (useFeatureSpecFlat == 1) {
      // Compute the threshold.
      self->priorModelPars[1] =
          self->featureExtractionParams.factor2ModelPars * posPeak1SpecFlat;
      // Check if value is within min/max range.
      if (self->priorModelPars[1] < self->featureExtractionParams.minSpecFlat) {
        self->priorModelPars[1] = self->featureExtractionParams.minSpecFlat;
      }
      if (self->priorModelPars[1] > self->featureExtractionParams.maxSpecFlat) {
        self->priorModelPars[1] = self->featureExtractionParams.maxSpecFlat;
      }
    }
    // Done with flatness feature.

    // For template feature.
    useFeatureSpecDiff = 1;
    // Merge the two peaks if they are close.
    if ((fabsf(posPeak2SpecDiff - posPeak1SpecDiff) <
         self->featureExtractionParams.limitPeakSpacingSpecDiff) &&
        (weightPeak2SpecDiff >
         self->featureExtractionParams.limitPeakWeightsSpecDiff *
             weightPeak1SpecDiff)) {
      weightPeak1SpecDiff += weightPeak2SpecDiff;
      posPeak1SpecDiff = 0.5f * (posPeak1SpecDiff + posPeak2SpecDiff);
    }
    // Get the threshold value.
    self->priorModelPars[3] =
        self->featureExtractionParams.factor1ModelPars * posPeak1SpecDiff;
    // Reject if weight of peaks is not large enough.
    if (weightPeak1SpecDiff <
        self->featureExtractionParams.thresWeightSpecDiff) {
      useFeatureSpecDiff = 0;
    }
    // Check if value is within min/max range.
    if (self->priorModelPars[3] < self->featureExtractionParams.minSpecDiff) {
      self->priorModelPars[3] = self->featureExtractionParams.minSpecDiff;
    }
    if (self->priorModelPars[3] > self->featureExtractionParams.maxSpecDiff) {
      self->priorModelPars[3] = self->featureExtractionParams.maxSpecDiff;
    }
    // Done with spectral difference feature.

    // Don't use template feature if fluctuation of LRT feature is very low:
    // most likely just noise state.
    if (fluctLrt < self->featureExtractionParams.thresFluctLrt) {
      useFeatureSpecDiff = 0;
    }

    // Select the weights between the features.
    // self->priorModelPars[4] is weight for LRT: always selected.
    // self->priorModelPars[5] is weight for spectral flatness.
    // self->priorModelPars[6] is weight for spectral difference.
    featureSum = (float) (1 + useFeatureSpecFlat + useFeatureSpecDiff);
    self->priorModelPars[4] = 1.f / featureSum;
    self->priorModelPars[5] =
        ((float) useFeatureSpecFlat) * self->priorModelPars[4];
    self->priorModelPars[6] =
        ((float) useFeatureSpecDiff) * self->priorModelPars[4];

    // Set hists to zero for next update.
    if (self->modelUpdatePars[0] >= 1) {
      for (i = 0; i < HIST_PAR_EST; i++) {
        self->histLrt[i] = 0;
        self->histSpecFlat[i] = 0;
        self->histSpecDiff[i] = 0;
      }
    }
  }  // End of flag == 1.
}
```
-
-
Computing the Mapping Functions and the Speech Presence Probability
This is done by SpeechNoiseProb(): the mapping values are computed first, and then the speech presence probability.
-
Mapping
The speech presence probability is obtained by passing the features through mapping functions. Typical mapping functions include:
$\sinh x=\dfrac{e^x-e^{-x}}{2},\quad\cosh x=\dfrac{e^x+e^{-x}}{2},\quad\tanh x=\dfrac{\sinh x}{\cosh x}=\dfrac{e^x-e^{-x}}{e^x+e^{-x}}$
The LRT mapping is:
$indicator0=0.5\times[\tanh(width\_prior\times(model.lrt-prior\_model.lrt))+1]$
The flatness mapping is:
$indicator1=0.5\times[\tanh(1\times width\_prior\times(prior\_model.flatness\_threshold-model.spectral\_flatness))+1]$
The spectral difference mapping is:
$indicator2=0.5\times[\tanh(width\_prior\times(model.spectral\_diff-prior\_model.template\_diff\_threshold))+1]$
Finally the mappings are combined as a weighted sum:
$ind\_prior=prior\_model.lrt\_weighting\times indicator0+prior\_model.flatness\_weighting\times indicator1+prior\_model.difference\_weighting\times indicator2$
$=1\times indicator0+0\times indicator1+0\times indicator2$
The flatness and difference weights turn out to be 0 in practice, which means the speech presence decision relies entirely on the LRT.
-
Computing the Speech Presence Probability
$prior\_speech\_prob\_=prior\_speech\_prob\_+0.1\times(ind\_prior-prior\_speech\_prob\_)$
$prior\_speech\_prob\_=\max(\min(prior\_speech\_prob\_,\,1),\,0.01)$
$gain\_prior=\dfrac{1.0-prior\_speech\_prob\_}{prior\_speech\_prob\_}$
$inv\_lrt[i]=e^{-model.avg\_log\_lrt[i]}$
$speech\_probability\_[i]=\dfrac{1}{1.0+gain\_prior\times inv\_lrt[i]}$
(Note that $inv\_lrt$ carries a negative exponent, i.e. $1/\Delta_k$, so that the last line matches eq. (1-15).)
-
-
Noise estimation: the noise is smoothly updated according to the speech presence probability. The computation is done by UpdateNoiseEstimate() and follows the formula introduced in the "Relevant Formulas" section, i.e.:
$|\hat{N}_k(m)|=\gamma_n|\hat{N}_k(m-1)|+(1-\gamma_n)\big[P(H_1|Y_k(m))\hat{N}_k(m-1)+P(H_0|Y_k(m))Y_k(m)\big]$
Concretely:
$noise\_update\_tmp=\gamma\times prev\_noise\_spectrum\_[i]+(1-\gamma)\times\big\{(1-speech\_probability[i])\times signal\_spectrum[i]+speech\_probability[i]\times prev\_noise\_spectrum\_[i]\big\}$
WebRTC takes $\gamma=0.9$. The expression above shows that when the current frame is speech, the current noise estimate is driven mainly by past frames' noise.
Then $\gamma$ is updated:
$\gamma_{old}=kNoiseUpdate=0.9$
$\gamma=\begin{cases}0.99,&prob\_speech>kProbRange\\kNoiseUpdate,&\text{else}\end{cases}$
When the speech probability is below 0.2:
$conservative\_noise\_spectrum\_[i]\mathrel{+}=0.05\times(signal\_spectrum[i]-conservative\_noise\_spectrum\_[i])$
The noise spectrum is then updated. When $\gamma=\gamma_{old}=0.9$:
$noise\_spectrum\_[i]=noise\_update\_tmp=\gamma_{old}\times prev\_noise\_spectrum\_[i]+(1-\gamma_{old})\times\big\{(1-speech\_probability[i])\times signal\_spectrum[i]+speech\_probability[i]\times prev\_noise\_spectrum\_[i]\big\}$
When $\gamma=0.99\neq\gamma_{old}$:
$noise\_spectrum\_[i]=\gamma\times prev\_noise\_spectrum\_[i]+(1-\gamma)\times\big\{(1-speech\_probability[i])\times signal\_spectrum[i]+speech\_probability[i]\times prev\_noise\_spectrum\_[i]\big\}$
with $noise\_spectrum\_[i]=\min(noise\_spectrum\_[i],\,noise\_update\_tmp)$
The code:
```c
static void UpdateNoiseEstimate(NoiseSuppressionC *self,
                                const float *magn,
                                float *noise) {
  size_t i;
  float probSpeech, probNonSpeech;
  // Time-avg parameter for noise update.
  float gammaNoiseTmp = NOISE_UPDATE;
  float gammaNoiseOld;
  float noiseUpdateTmp;

  for (i = 0; i < self->magnLen; i++) {
    probSpeech = self->speechProb[i];
    probNonSpeech = 1.f - probSpeech;
    // Temporary noise update:
    // Use it for speech frames if update value is less than previous.
    noiseUpdateTmp = gammaNoiseTmp * self->noisePrev[i] +
                     (1.f - gammaNoiseTmp) *
                         (probNonSpeech * magn[i] +
                          probSpeech * self->noisePrev[i]);
    // Time-constant based on speech/noise state.
    gammaNoiseOld = gammaNoiseTmp;
    gammaNoiseTmp = NOISE_UPDATE;
    // Increase gamma (i.e., less noise update) for frames likely to be speech.
    if (probSpeech > PROB_RANGE) {
      gammaNoiseTmp = SPEECH_UPDATE;
    }
    // Conservative noise update.
    if (probSpeech < PROB_RANGE) {
      self->magnAvgPause[i] += GAMMA_PAUSE * (magn[i] - self->magnAvgPause[i]);
    }
    // Noise update.
    if (gammaNoiseTmp == gammaNoiseOld) {
      noise[i] = noiseUpdateTmp;
    } else {
      noise[i] = gammaNoiseTmp * self->noisePrev[i] +
                 (1.f - gammaNoiseTmp) *
                     (probNonSpeech * magn[i] +
                      probSpeech * self->noisePrev[i]);
      // Allow for noise update downwards:
      // if the update decreases the noise, it is safe, so allow it.
      if (noiseUpdateTmp < noise[i]) {
        noise[i] = noiseUpdateTmp;
      }
    }
  }  // End of freq loop.
}
```
Noise Suppression
Once the noise estimate is complete, Wiener filtering can suppress the noise and so denoise the signal. This is implemented mainly by WebRtcNs_Process(), which calls WebRtcNs_ProcessCore(). The main steps are:
-
Frame assembly
Each 10 ms block of 160 samples is concatenated with the last 96 samples of the previous frame to form a current frame of 256 samples (a convenient length for the FFT).
-
Windowing, and energy computation before filtering
-
FFT
-
Compute the signal magnitude spectrum
-
Compute the Wiener filter gain. The first three sub-steps below are mainly implemented by ComputeDdBasedWienerFilter():
-
Contribution from the past frame:
$prev\_tsa=\dfrac{spectrum\_prev\_process\_[i]}{prev\_noise\_spectrum[i]}\times filter\_[i]$
(consistent with previousEstimateStsa in ComputeSnr() above, where the previous gain multiplies the ratio)
-
Contribution from the current frame: when the signal magnitude exceeds the noise, a rough a priori SNR is computed:
$current\_tsa=\begin{cases}\dfrac{signal\_spectrum[i]}{noise\_spectrum[i]}-1,&signal\_spectrum[i]>noise\_spectrum[i]\\0,&\text{else}\end{cases}$
-
Compute the filter gain from the past and current contributions:
$filter\_[i]=\dfrac{snr\_prior}{suppression\_params\_.over\_subtraction\_factor+snr\_prior}$
$filter\_[i]=\max(\min(filter\_[i],\,1.0),\,suppression\_params\_.minimum\_attenuating\_gain)$
-
The first 50 frames: while fewer than 50 frames have been processed, the signal is assumed to still be in the noise (startup) phase. The Wiener filter is then blended with one computed from the current frame data and the estimated noise model (the sums below run over the startup frames; a condensed sketch of the whole gain computation follows after this list):
$filter\_initial=\dfrac{\sum_0^{50}signal\_spectrum[i]-suppression\_params\_.over\_subtraction\_factor\times parametric\_noise\_spectrum[i]}{\sum_0^{50}signal\_spectrum[i]}$
$filter\_[i]=\dfrac{filter\_[i]\times num\_analyzed\_frames+filter\_initial\times(50-num\_analyzed\_frames)}{50}$
-
-
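Condensing the three sub-steps above into one per-bin routine (a sketch: the 0.98/0.02 DD weighting is assumed from eq. (1-7), and the names follow the article's pseudocode, not the exact WebRTC identifiers):

```c
/* Decision-directed Wiener gain for one bin.
 * beta: over-subtraction factor; gain_floor: minimum attenuating gain. */
static float DdWienerGain(float prev_magn, float prev_noise, float prev_gain,
                          float magn, float noise,
                          float beta, float gain_floor) {
  const float eps = 1e-10f;
  /* Previous-frame term (prev_tsa). */
  float prev_tsa = prev_magn / (prev_noise + eps) * prev_gain;
  /* Current-frame term (current_tsa): rough prior SNR, floored at 0. */
  float curr_tsa = (magn > noise) ? magn / (noise + eps) - 1.0f : 0.0f;
  /* DD combination, then the Wiener gain with clamping. */
  float snr_prior = 0.98f * prev_tsa + (1.0f - 0.98f) * curr_tsa;
  float h = snr_prior / (beta + snr_prior);
  if (h > 1.0f) h = 1.0f;
  if (h < gain_floor) h = gain_floor;
  return h;
}
```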
Wiener filtering
Multiply the spectrum by the Wiener filter:

```c
self->smooth[i] = theFilter[i];
real[i] *= self->smooth[i];
imag[i] *= self->smooth[i];
```
-
Time-Domain Processing
-
Transform the data back to the time domain with an IFFT
-
Compute a scale factor; this is only done after the first 200 frames
First the energy after filtering is computed:
$energy\_after\_filtering=\sum_{i=0}^{255}extended\_frame[i]^2$
Then the synthesis window for the IFFT (the inverse of the analysis window) is applied, and the scale factor is computed from the denoising result:
For the first 200 frames, $gain\_adjustment=1.0$; otherwise:
$gain=\sqrt{\dfrac{energy\_after\_filtering}{energy\_before\_filtering+1.0}}$
$scale\_factor1=\begin{cases}1.0,&gain\le kBLim=0.5\\\dfrac{1}{gain},&gain\times[1+1.3\times(gain-kBLim)]>1\\1+1.3\times(gain-kBLim),&\text{else}\end{cases}$
$scale\_factor2=\begin{cases}1-0.3\times[kBLim-\max(gain,\,suppression\_params\_.minimum\_attenuating\_gain)],&gain<kBLim\\1.0,&\text{else}\end{cases}$
$gain\_adjustment=prior\_speech\_probability\times scale\_factor1+(1-prior\_speech\_probability)\times scale\_factor2$
gain_adjustment corresponds to factor in the code below:
```c
// Scale factor: only do it after END_STARTUP_LONG time.
factor = 1.f;
if (self->gainmap == 1 && self->blockInd > END_STARTUP_LONG) {
  factor1 = 1.f;
  factor2 = 1.f;

  energy2 = Energy(winData, self->anaLen);
  gain = sqrtf(energy2 / (energy1 + epsilon) + epsilon_squ);

  // Scaling for new version.
  if (gain > B_LIM) {
    factor1 = 1.f + 1.3f * (gain - B_LIM);
    if (gain * factor1 > 1.f) {
      factor1 = 1.f / gain;
    }
  }
  if (gain < B_LIM) {
    // Don't reduce scale too much for pause regions:
    // attenuation here should be controlled by flooring.
    if (gain <= self->denoiseBound) {
      gain = self->denoiseBound;
    }
    factor2 = 1.f - 0.3f * (B_LIM - gain);
  }
  // Combine both scales with the speech/noise prob:
  // note the prior (priorSpeechProb) is not frequency dependent.
  factor = self->priorSpeechProb * factor1 +
           (1.f - self->priorSpeechProb) * factor2;
}  // Out of self->gainmap == 1.
```
-
Synthesis: the factor is applied in this step

```c
for (i = 0; i < self->anaLen; i++) {
  self->syntBuf[i] += factor * winData[i] * self->window[i];
}
```
-
High-band gain handling
Usually unnecessary; omitted.
-