我正在尝试将语音信号从语音到情感进行分类.为此,我要提取音频信号的MFCC特征并将其馈入一个简单的神经网络(由PyBrain的BackpropTrainer训练的FeedForwardNetwork).不幸的是,结果非常糟糕.因此,从5个班级中,网络似乎几乎总是提出相同的班级.
我有5种情感类别和大约7000个带有标签的音频文件,我将其划分为每种情感类别的80%用于训练网络,而20%用于测试网络.
想法是使用小窗口并从中提取MFCC功能,以生成许多训练示例.在评估中,将评估来自一个文件的所有窗口,并由多数投票决定预测标签.
Training examples per class:
{0: 81310, 1: 60809, 2: 58262, 3: 105907, 4: 73182}
Example of scaled MFCC features:
[ -6.03465056e-01 8.28665733e-01 -7.25728303e-01 2.88611116e-05
1.18677218e-02 -1.65316583e-01 5.67322809e-01 -4.92335095e-01
3.29816126e-01 -2.52946780e-01 -2.26147779e-01 5.27210979e-01
-7.36851560e-01]
Layers________________________: 13 20 5 (also tried 13 50 5 and 13 100 5)
Learning Rate_________________: 0.01 (also tried 0.1 and 0.3)
Training epochs_______________: 10 (error rate does not improve at all during training)
Truth table on test set:
[[ 0. 4. 0. 239. 99.]
[ 0. 41. 0. 157. 23.]
[ 0. 18. 0. 173. 18.]
[ 0. 12. 0. 299. 59.]
[ 0. 0. 0. 85. 132.]]
Success rate overall [%]: 34.7314201619
Success rate Class 0 [%]: 0.0
Success rate Class 1 [%]: 18.5520361991
Success rate Class 2 [%]: 0.0
Success rate Class 3 [%]: 80.8108108108
Success rate Class 4 [%]: 60.8294930876
好的,现在,您可以看到结果在各个类中的分布非常糟糕.永远不会预测0级和2级.我认为,这暗示了我的网络或更可能是我的数据存在问题.
我可以在这里发布很多代码,但是我认为在下图中显示我为获得MFCC功能所采取的所有步骤更加有意义.请注意,我使用整个信号时没有开窗只是为了说明.这样看起来还好吗? MFCC值非常大,不是应该小很多吗? (在将它们使用minmaxscaler将它们馈送到网络之前,我将它们按比例缩小,然后将所有数据传输到[-2,2],也尝试了[0,1])
这是我用于Melfilter库的代码,我在离散余弦变换之前直接应用此代码以提取MFCC特征(我从这里获得了它:*):
def freqToMel(freq):
'''
Calculate the Mel frequency for a given frequency
'''
return 1127.01048 * math.log(1 + freq / 700.0)
def melToFreq(mel):
'''
Calculate the frequency for a given Mel frequency
'''
return 700 * (math.exp(freq / 1127.01048 - 1))
def melFilterBank(blockSize):
numBands = int(mfccFeatures)
maxMel = int(freqToMel(maxHz))
minMel = int(freqToMel(minHz))
# Create a matrix for triangular filters, one row per filter
filterMatrix = numpy.zeros((numBands, blockSize))
melRange = numpy.array(xrange(numBands + 2))
melCenterFilters = melRange * (maxMel - minMel) / (numBands + 1) + minMel
# each array index represent the center of each triangular filter
aux = numpy.log(1 + 1000.0 / 700.0) / 1000.0
aux = (numpy.exp(melCenterFilters * aux) - 1) / 22050
aux = 0.5 + 700 * blockSize * aux
aux = numpy.floor(aux) # Arredonda pra baixo
centerIndex = numpy.array(aux, int) # Get int values
for i in xrange(numBands):
start, centre, end = centerIndex[i:i + 3]
k1 = numpy.float32(centre - start)
k2 = numpy.float32(end - centre)
up = (numpy.array(xrange(start, centre)) - start) / k1
down = (end - numpy.array(xrange(centre, end))) / k2
filterMatrix[i][start:centre] = up
filterMatrix[i][centre:end] = down
return filterMatrix.transpose()
我该怎么做才能获得更好的预测结果?
解决方法:
在这里,我举了一个通过语音识别性别的例子.在此示例中,我使用了Hyke-dataset1.这只是一个快速实例.如果一个人想认真地进行性别认同,那么可能会做得更好.但总的来说,错误率会降低:
Build up data...
Train network...
Number of training patterns: 94956
Number of test patterns: 31651
Input and output dimensions: 13 2
Train network...
epoch: 0 train error: 62.24% test error: 61.84%
epoch: 1 train error: 34.11% test error: 34.25%
epoch: 2 train error: 31.11% test error: 31.20%
epoch: 3 train error: 30.34% test error: 30.22%
epoch: 4 train error: 30.76% test error: 30.75%
epoch: 5 train error: 30.65% test error: 30.72%
epoch: 6 train error: 30.81% test error: 30.79%
epoch: 7 train error: 29.38% test error: 29.45%
epoch: 8 train error: 31.92% test error: 31.92%
epoch: 9 train error: 29.14% test error: 29.23%
我使用了scikits.talkbox的MFCC实现.也许下面的代码对您有所帮助. (性别识别无疑比情感检测要容易得多.也许您需要更多不同的功能.)
import glob
from scipy.io.wavfile import read as wavread
from scikits.talkbox.features import mfcc
from pybrain.datasets import ClassificationDataSet
from pybrain.utilities import percentError
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer
def report_error(trainer, trndata, tstdata):
trnresult = percentError(trainer.testOnClassData(), trndata['class'])
tstresult = percentError(trainer.testOnClassData(dataset=tstdata), tstdata['class'])
print "epoch: %4d" % trainer.totalepochs, " train error: %5.2f%%" % trnresult, " test error: %5.2f%%" % tstresult
def main(auido_path, coeffs=13):
dataset = ClassificationDataSet(coeffs, 1, nb_classes=2, class_labels=['male', 'female'])
male_files = glob.glob("%s/male_audio/*/*_1.wav" % auido_path)
female_files = glob.glob("%s/female_audio/*/*_1.wav" % auido_path)
print "Build up data..."
for sex, files in enumerate([male_files, female_files]):
for f in files:
sr, signal = wavread(f)
ceps, mspec, spec = mfcc(signal, nwin=2048, nfft=2048, fs=sr, nceps=coeffs)
for i in range(ceps.shape[0]):
dataset.appendLinked(ceps[i], [sex])
tstdata, trndata = dataset.splitWithProportion(0.25)
trndata._convertToOneOfMany()
tstdata._convertToOneOfMany()
print "Number of training patterns: ", len(trndata)
print "Number of test patterns: ", len(tstdata)
print "Input and output dimensions: ", trndata.indim, trndata.outdim
print "Train network..."
fnn = buildNetwork(coeffs, int(coeffs*1.5), 2, outclass=SoftmaxLayer, fast=True)
trainer = BackpropTrainer(fnn, dataset=trndata, learningrate=0.005)
report_error(trainer, trndata, tstdata)
for i in range(100):
trainer.trainEpochs(1)
report_error(trainer, trndata, tstdata)
if __name__ == '__main__':
main("/path/to/hyke/audio_data")
1 Azarias Reda,Saurabh Panjwani和Edward Cutrell:Hyke:面向发展中地区的低成本远程出勤跟踪系统,第五届ACM发展中地区网络系统研讨会(NSDR).