Python audio signal classification with MFCC features and a neural network

I am trying to classify speech by emotion. To do this I extract the MFCC features of the audio signal and feed them into a simple neural network (a FeedForwardNetwork trained with PyBrain's BackpropTrainer). Unfortunately the results are very bad: of the 5 classes, the network almost always predicts the same one.

I have 5 emotion classes and roughly 7000 labeled audio files, which I split per class into 80% for training the network and 20% for testing it.

The idea is to extract MFCC features from small windows of the signal, which generates many training examples per file. For evaluation, all windows of a file are classified and a majority vote decides the predicted label (a sketch of this voting step follows below).
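For illustration, a minimal sketch of the majority-vote step; classify_window here is a hypothetical stand-in for the trained network's per-window prediction:

from collections import Counter

def predictFile(windows, classify_window):
    # classify_window: callable mapping one window's MFCC feature
    # vector to a class label 0..4 (hypothetical helper)
    votes = Counter(classify_window(w) for w in windows)
    # the label predicted for the most windows wins
    return votes.most_common(1)[0][0]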

Training examples per class: 
{0: 81310, 1: 60809, 2: 58262, 3: 105907, 4: 73182}

Example of scaled MFCC features:
[ -6.03465056e-01   8.28665733e-01  -7.25728303e-01   2.88611116e-05
   1.18677218e-02  -1.65316583e-01   5.67322809e-01  -4.92335095e-01
   3.29816126e-01  -2.52946780e-01  -2.26147779e-01   5.27210979e-01
  -7.36851560e-01]

Layers________________________:  13 20 5 (also tried 13 50 5 and 13 100 5)
Learning Rate_________________:  0.01 (also tried 0.1 and 0.3)
Training epochs_______________:  10  (error rate does not improve at all during training)
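For reference, that setup corresponds roughly to the following PyBrain sketch (the dataset filling is elided; the calls are the standard ClassificationDataSet/buildNetwork/BackpropTrainer API):

from pybrain.datasets import ClassificationDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer

# one sample per window: 13 MFCC coefficients in, 1 class label out
trndata = ClassificationDataSet(13, 1, nb_classes=5)
# ... trndata.appendLinked(features, [label]) for every window ...
trndata._convertToOneOfMany()

fnn = buildNetwork(13, 20, 5, outclass=SoftmaxLayer)  # layers: 13 20 5
trainer = BackpropTrainer(fnn, dataset=trndata, learningrate=0.01)
trainer.trainEpochs(10)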

Confusion matrix on the test set (rows: true class, columns: predicted class):
[[   0.    4.    0.  239.   99.]
 [   0.   41.    0.  157.   23.]
 [   0.   18.    0.  173.   18.]
 [   0.   12.    0.  299.   59.]
 [   0.    0.    0.   85.  132.]]

Success rate overall [%]:  34.7314201619
Success rate Class 0 [%]:  0.0
Success rate Class 1 [%]:  18.5520361991
Success rate Class 2 [%]:  0.0
Success rate Class 3 [%]:  80.8108108108
Success rate Class 4 [%]:  60.8294930876
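These rates are just the diagonal of the confusion matrix divided by the row sums (e.g. class 3: 299 / (12 + 299 + 59) ≈ 80.81%); a quick sketch to reproduce them:

import numpy

conf = numpy.array([[  0.,   4.,   0., 239.,  99.],
                    [  0.,  41.,   0., 157.,  23.],
                    [  0.,  18.,   0., 173.,  18.],
                    [  0.,  12.,   0., 299.,  59.],
                    [  0.,   0.,   0.,  85., 132.]])

# overall accuracy: correctly classified test files over all test files
print "Overall [%]:", 100.0 * numpy.trace(conf) / conf.sum()
# per-class success rate: diagonal entry over the row total
for label, rate in enumerate(100.0 * conf.diagonal() / conf.sum(axis=1)):
    print "Class %d [%%]:" % label, rate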

As you can see, the distribution of the results over the classes is very bad. Classes 0 and 2 are never predicted. I assume this hints at a problem with my network, or more likely with my data.

I could post a lot of code here, but I think it makes more sense to show all the steps I take to get the MFCC features in the following image. Please note that I use the whole signal without windowing just for illustration. Does this look okay? The MFCC values are very large, shouldn't they be much smaller? (I scale them down with a MinMaxScaler over all the data to [-2, 2] before feeding them into the network; I also tried [0, 1]. A sketch of the scaling step follows after the image.)

[Image: plots of the processing steps from the raw audio signal to the MFCC features]
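The scaling step mentioned above might look like this with scikit-learn's MinMaxScaler (a sketch; the feature matrices are random placeholders):

import numpy
from sklearn.preprocessing import MinMaxScaler

train_feats = numpy.random.randn(1000, 13) * 100  # placeholder MFCC matrix
test_feats = numpy.random.randn(200, 13) * 100

# fit the scaler on the training data only and reuse the same transform
# on the test data, mapping each coefficient into [-2, 2] on the train set
scaler = MinMaxScaler(feature_range=(-2, 2))
train_scaled = scaler.fit_transform(train_feats)
test_scaled = scaler.transform(test_feats)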

This is the code I use for the mel filter bank, which I apply directly before the discrete cosine transform to extract the MFCC features (I got it from here: *):

import math
import numpy

def freqToMel(freq):
  '''
  Calculate the mel frequency for a given frequency:
  mel = 1127.01048 * ln(1 + freq / 700)
  '''
  return 1127.01048 * math.log(1 + freq / 700.0)

def melToFreq(mel):
  '''
  Calculate the frequency for a given mel frequency (inverse of freqToMel)
  '''
  return 700.0 * (math.exp(mel / 1127.01048) - 1)

def melFilterBank(blockSize):
  # mfccFeatures, maxHz and minHz are globals defined elsewhere
  numBands = int(mfccFeatures)
  maxMel = int(freqToMel(maxHz))
  minMel = int(freqToMel(minHz))

  # Create a matrix for triangular filters, one row per filter
  filterMatrix = numpy.zeros((numBands, blockSize))

  melRange = numpy.array(xrange(numBands + 2))

  melCenterFilters = melRange * (maxMel - minMel) / (numBands + 1) + minMel

  # each entry is the FFT bin index of one triangular filter's center;
  # note that the sample rate (22050 Hz) is hardcoded below
  aux = numpy.log(1 + 1000.0 / 700.0) / 1000.0
  aux = (numpy.exp(melCenterFilters * aux) - 1) / 22050
  aux = 0.5 + 700 * blockSize * aux
  aux = numpy.floor(aux)  # round down
  centerIndex = numpy.array(aux, int)  # get int values

  for i in xrange(numBands):
    start, centre, end = centerIndex[i:i + 3]
    k1 = numpy.float32(centre - start)
    k2 = numpy.float32(end - centre)
    up = (numpy.array(xrange(start, centre)) - start) / k1
    down = (end - numpy.array(xrange(centre, end))) / k2

    filterMatrix[i][start:centre] = up
    filterMatrix[i][centre:end] = down

  return filterMatrix.transpose()
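For context, a minimal sketch of how such a filter bank is typically applied to one windowed frame, assuming the usual power spectrum → mel filter bank → log → DCT pipeline (the epsilon guarding log(0) is my addition):

from scipy.fftpack import dct

def extractMFCC(frame):
  # frame: one windowed block of audio samples
  powerSpectrum = numpy.abs(numpy.fft.fft(frame)) ** 2
  # mel-warp: one energy value per triangular filter
  filteredSpectrum = numpy.dot(powerSpectrum, melFilterBank(len(frame)))
  # log compression, then the DCT decorrelates the filter outputs
  logSpectrum = numpy.log(filteredSpectrum + 1e-10)
  return dct(logSpectrum, type=2, norm='ortho')  # mfccFeatures coefficients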

What can I do to get better prediction results?

Solution:

Here I put together an example of recognizing gender from speech, using the Hyke dataset¹. It is just a quickly-made example; if one wanted to do serious gender recognition one could probably do much better. But in general the error rate goes down:

Build up data...
Train network...
Number of training patterns:  94956
Number of test patterns:      31651
Input and output dimensions:  13 2
Train network...
epoch:    0   train error: 62.24%   test error: 61.84%
epoch:    1   train error: 34.11%   test error: 34.25%
epoch:    2   train error: 31.11%   test error: 31.20%
epoch:    3   train error: 30.34%   test error: 30.22%
epoch:    4   train error: 30.76%   test error: 30.75%
epoch:    5   train error: 30.65%   test error: 30.72%
epoch:    6   train error: 30.81%   test error: 30.79%
epoch:    7   train error: 29.38%   test error: 29.45%
epoch:    8   train error: 31.92%   test error: 31.92%
epoch:    9   train error: 29.14%   test error: 29.23%

I used the MFCC implementation from scikits.talkbox. Maybe the code below helps you. (Gender recognition is surely much easier than emotion detection; maybe you need more and different features.)

import glob

from scipy.io.wavfile import read as wavread
from scikits.talkbox.features import mfcc

from pybrain.datasets            import ClassificationDataSet
from pybrain.utilities           import percentError
from pybrain.tools.shortcuts     import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules   import SoftmaxLayer

def report_error(trainer, trndata, tstdata):
    trnresult = percentError(trainer.testOnClassData(), trndata['class'])
    tstresult = percentError(trainer.testOnClassData(dataset=tstdata), tstdata['class'])
    print "epoch: %4d" % trainer.totalepochs, "  train error: %5.2f%%" % trnresult, "  test error: %5.2f%%" % tstresult  

def main(audio_path, coeffs=13):
    dataset = ClassificationDataSet(coeffs, 1, nb_classes=2, class_labels=['male', 'female'])
    male_files = glob.glob("%s/male_audio/*/*_1.wav" % audio_path)
    female_files = glob.glob("%s/female_audio/*/*_1.wav" % audio_path)

    print "Build up data..."
    for sex, files in enumerate([male_files, female_files]):
        for f in files:
            sr, signal = wavread(f)
            ceps, mspec, spec = mfcc(signal, nwin=2048, nfft=2048, fs=sr, nceps=coeffs)
            for i in range(ceps.shape[0]):
                dataset.appendLinked(ceps[i], [sex])

    tstdata, trndata = dataset.splitWithProportion(0.25)
    trndata._convertToOneOfMany()
    tstdata._convertToOneOfMany()

    print "Number of training patterns: ", len(trndata)
    print "Number of test patterns:     ", len(tstdata)
    print "Input and output dimensions: ", trndata.indim, trndata.outdim

    print "Train network..."
    fnn = buildNetwork(coeffs, int(coeffs*1.5), 2, outclass=SoftmaxLayer, fast=True)
    trainer = BackpropTrainer(fnn, dataset=trndata, learningrate=0.005)

    report_error(trainer, trndata, tstdata)
    for i in range(100):
        trainer.trainEpochs(1)
        report_error(trainer, trndata, tstdata)

if __name__ == '__main__':
    main("/path/to/hyke/audio_data")

¹ Azarias Reda, Saurabh Panjwani and Edward Cutrell: Hyke: A Low-cost Remote Attendance Tracking System for Developing Regions, The 5th ACM Workshop on Networked Systems for Developing Regions (NSDR).
