Convolutional Neural Networks for Speech Recognition

Emily Hua, Kaili Chen

Feb 15, 2017

Abstract

In this talk, we will review GMMs and DNNs for speech recognition systems and present:

  • Convolutional Neural Network (CNN)

Some related experimental results will also be shown to demonstrate the effectiveness of using a CNN as the acoustic model.

General Architecture of a Speech Recognition System

DNNs can outperform GMMs at acoustic modeling for speech recognition on a variety of datasets including large datasets with large vocabularies.

  • DNNs use their parameters more efficiently than GMMs
  • DNNs handle highly correlated input features well

CNN Architecture

MFSC Features

MFSC (mel-frequency spectral coefficients) are MFCC features without the final DCT step.
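To make the relationship concrete, here is a minimal numpy sketch (the random filterbank energies and the 13-coefficient truncation are illustrative assumptions, not from the slides):

```python
import numpy as np

def mfsc_from_mel_energies(mel_energies):
    """MFSC: log mel filterbank energies (no DCT applied)."""
    return np.log(mel_energies + 1e-10)

def mfcc_from_mfsc(mfsc, n_ceps=13):
    """MFCC: apply a type-II DCT to the MFSC features and keep the
    first n_ceps coefficients (DCT written out directly in numpy)."""
    n = mfsc.shape[-1]
    k = np.arange(n_ceps)[:, None]   # cepstral index
    m = np.arange(n)[None, :]        # filterbank index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return mfsc @ basis.T

# Hypothetical 40-band mel energies for one frame
energies = np.random.rand(40) + 0.1
mfsc = mfsc_from_mel_energies(energies)     # shape (40,)
mfcc = mfcc_from_mfsc(mfsc[None, :])        # shape (1, 13)
```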

static, delta, delta-delta

INPUT DATA TO THE CNN

Speech input features can be organized for a CNN in two different ways. The above example assumes 40 MFSC features plus first and second derivatives, with a context window of 15 frames for each speech frame.
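The two organizations can be sketched in numpy (all shapes and variable names here are assumptions for illustration):

```python
import numpy as np

# Hypothetical shapes: 40 MFSC bands, 3 feature types (static, delta,
# delta-delta), and a 15-frame context window around each speech frame.
utterance = np.random.randn(100, 40, 3)   # (frames, bands, feature types)

def context_window(utt, center, width=15):
    half = width // 2
    return utt[center - half:center + half + 1]   # (15, 40, 3)

win = context_window(utterance, 50)

# Organization 1: three 2-D feature maps (one per feature type),
# each of shape (frequency bands, frames).
maps_2d = np.transpose(win, (2, 1, 0))            # (3, 40, 15)

# Organization 2: forty-five 1-D feature maps along frequency
# (15 frames x 3 feature types), each of length 40.
maps_1d = np.transpose(win, (0, 2, 1)).reshape(45, 40)
```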

convolution operation

The convolution in a simplified form is defined by:

s(t) = \int x(a)\, w(t - a)\, da

and it is usually denoted by:

s(t) = (x * w)(t)
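In the discrete case the integral becomes a sum. A small numpy check that the library convolution matches the definition (the signal and kernel values are arbitrary examples):

```python
import numpy as np

# Discrete form of s(t) = (x * w)(t): sum over a of x[a] * w[t - a]
x = np.array([1.0, 2.0, 3.0, 4.0])   # input signal
w = np.array([0.25, 0.5, 0.25])      # kernel (weighting function)

s = np.convolve(x, w)                # full convolution, length 4 + 3 - 1 = 6

# The same result computed directly from the definition
s_manual = np.array([
    sum(x[a] * w[t - a]
        for a in range(len(x))
        if 0 <= t - a < len(w))
    for t in range(len(x) + len(w) - 1)
])
```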

2-D convolution example

Convolution layer

The convolution layer computes each output feature map as:

Q_j = \sigma\left( \sum_{i=1}^{I} O_i * w_{i,j} \right)

where O_i represents the i-th input feature map, Q_j represents the j-th output convolution feature map, and w_{i,j} represents each local weight matrix.
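A direct numpy sketch of this convolution ply (the shapes, the sigmoid activation, and the cross-correlation indexing are assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_ply(O, W):
    """O: (I, B) -- I input feature maps over B frequency bands.
    W: (I, J, F) -- a length-F filter from input map i to output map j.
    Returns Q: (J, B - F + 1) with Q_j = sigmoid(sum_i O_i * w_ij)."""
    I, B = O.shape
    _, J, F = W.shape
    Q = np.zeros((J, B - F + 1))
    for j in range(J):
        for i in range(I):
            for b in range(B - F + 1):
                # slide the filter along the frequency axis
                Q[j, b] += np.dot(O[i, b:b + F], W[i, j])
    return sigmoid(Q)

O = np.random.randn(3, 40)     # e.g. static/delta/delta-delta over 40 bands
W = np.random.randn(3, 5, 8)   # 5 output maps, filter size 8
Q = conv_ply(O, W)             # shape (5, 33)
```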

Fully-connected layer

Convolution layer

pooling layer

MAX POOLING

The purpose of the pooling layer is to reduce the resolution of the feature maps.
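A minimal 1-D sketch of max pooling along the frequency axis (the pooling size and shift values here are illustrative, not the deck's configuration):

```python
import numpy as np

def max_pool_1d(q, pool_size, shift):
    """Max-pool one feature map along the frequency axis,
    with overlapping pools when shift < pool_size."""
    starts = range(0, len(q) - pool_size + 1, shift)
    return np.array([q[s:s + pool_size].max() for s in starts])

q = np.array([0.1, 0.9, 0.3, 0.4, 0.8, 0.2, 0.5, 0.7])
p = max_pool_1d(q, pool_size=4, shift=2)   # 8 bands -> 3 pooled outputs
```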

back propagation

The convolution layer weights can be updated using the back-propagation algorithm;

the pooling layer has no weights, so no learning is needed there.

Benefits of CNN for ASR

  • Locality

Locality in the units of the convolution layer allows more robustness against non-white noise.

  • Weight Sharing

A weight sharing strategy leads to a significant reduction in the number of parameters that have to be learned.

  • Pooling

This leads to minimal differences in the features extracted by the pooling layer when the input patterns are slightly shifted along the frequency dimension.
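A tiny numpy check of this shift robustness (sizes chosen for illustration; pooling here is non-overlapping for simplicity):

```python
import numpy as np

def max_pool_1d(q, pool_size):
    # non-overlapping max pooling along the frequency axis
    n = len(q) // pool_size
    return q[:n * pool_size].reshape(n, pool_size).max(axis=1)

q = np.zeros(12)
q[4] = 1.0                  # a spectral peak at band 4
shifted = np.roll(q, 1)     # the same peak shifted one band up

pooled_orig = max_pool_1d(q, pool_size=4)
pooled_shift = max_pool_1d(shifted, pool_size=4)
# both are [0, 1, 0]: the peak stays inside the same pool,
# so the pooled features are identical despite the shift
```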

Learning Weights in CNN

full weight sharing

limited weight sharing

(*only the convolution units that are attached to the same pooling unit share the same convolution weights)

1-D convolution is applied here

what is special about LWS?

different frequency bands -> different weights

Weights in matrix form

full weight sharing

limited weight sharing

different convolution sections, each of which ends up in the same pooling unit

Limited Weight Sharing

  • allows for detection of distinct feature patterns in different filter bands along the frequency axis
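A schematic numpy sketch of limited weight sharing (the section boundaries, filter sizes, and shapes are all assumptions for illustration):

```python
import numpy as np

def conv_and_pool(section, w):
    """Convolve one frequency section with that section's own filter,
    then max-pool over the whole section -- convolution units feeding
    one pooling unit share weights only within that section (LWS)."""
    F = len(w)
    acts = np.array([np.dot(section[b:b + F], w)
                     for b in range(len(section) - F + 1)])
    return acts.max()

O = np.random.randn(40)                            # one input map, 40 bands
sections = [O[s:s + 10] for s in (0, 10, 20, 30)]  # 4 non-overlapping sections
filters = [np.random.randn(4) for _ in sections]   # a distinct filter per section
pooled = np.array([conv_and_pool(s, w) for s, w in zip(sections, filters)])
# one pooled value per section; no filter is shared across sections
```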

Note

  • we can't add further convolution plies on top of the pooling ply. WHY?


The different sections in the pooling ply of LWS are unrelated and cannot be convolved locally!

AGAIN, WHY GOOD?

CONFIGURATION

1 convolution ply + 1 pooling ply

pooling size of 6

shift size of 2

filter size of 8

150 feature maps for FWS

80 feature maps for LWS

2 fully connected hidden layers

(each with 1000 units)

Experiment results

CNN parameters: sub-sampling, # of feature maps, filter size

report the average of 3 runs

(results table: PER vs. pooling size, shift size, filter size, and number of feature maps)

Experiment results

       No Energy   Energy
LWS    20.61%      20.39%
FWS    21.19%      20.55%

<------- works better on FWS than LWS

why better?

the energy feature adds a way to compare each local frequency band with the overall spectrum

          Average   Max
Dev Set   19.63%    18.56%
Test Set  21.60%    20.39%

These results are consistent with image recognition applications

Overall performance: DNN vs. CNN on the TIMIT dataset

yay!  ----->

DNN-HMM     +       Kaldi 

Replace Kaldi's built-in NN models with Keras models

<-------------------- Recipes are in egs

<-------------------------   ./run.sh boom! 

  1.  Data Prep & Language Model
  2.  MFCC Feature Extraction
  3.  Monophone Training & Decoding                             <
  4.  Deltas + Delta-Deltas Training & Decoding
  5.  DNN Training (*) & Decoding

includes

DNN-HMM Kaldi  +  Keras

<--------------------------- configure this file to run

<------------------------- contains NN model 

$ git clone https://github.com/dspavankumar/keras-kaldi.git

(a) structure

(b) what is required

-------------------------------->

CNN-HMM KALDI  +  KERAS

import keras
from keras.optimizers import SGD
import keras.backend as K
from dataGenerator import dataGenerator
import sys
import os
if __name__ != '__main__':
    raise ImportError ('This script can only be run, and can\'t be imported')
if len(sys.argv) != 7:
    raise TypeError ('USAGE: train.py data_cv ali_cv data_tr ali_tr gmm_dir dnn_dir')
data_cv = sys.argv[1]
ali_cv  = sys.argv[2]
data_tr = sys.argv[3]
ali_tr  = sys.argv[4]
gmm     = sys.argv[5]
exp     = sys.argv[6]
## Learning parameters
learning = {'rate' : 0.1,
            'batchSize' : 256,
            'minEpoch' : 5,
            'lrScale' : 0.5,
            'lrScaleCount' : 18,
            'minValError' : 0.002}
os.makedirs (exp, exist_ok=True)
trGen = dataGenerator (data_tr, ali_tr, gmm, learning['batchSize'])
cvGen = dataGenerator (data_cv, ali_cv, gmm, learning['batchSize'])
## Initialise learning parameters and models
s = SGD(lr=learning['rate'], decay=0, momentum=0.5, nesterov=True)
m = keras.models.Sequential([
                keras.layers.Dense(1024, input_dim=trGen.inputFeatDim, activation='relu'),
                keras.layers.Dense(1024, activation='relu'),
                keras.layers.Dense(1024, activation='relu'),
                keras.layers.Dense(trGen.outputFeatDim, activation='softmax')])
## Initial training
m.compile(loss='categorical_crossentropy', optimizer=s, metrics=['accuracy'])
print ('Learning rate: %f' % learning['rate'])
h = [m.fit_generator (trGen, samples_per_epoch=trGen.numFeats,
        validation_data=cvGen, nb_val_samples=cvGen.numFeats,
        nb_epoch=learning['minEpoch']-1, verbose=1)]
m.save (exp + '/dnn.nnet.h5', overwrite=True)

train.py


"""
CNN
"""

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Activation

model2 = Sequential()

model2.add(Convolution2D(150, 8, 8, input_shape=trGen.inputFeatDim))
"""
150 filters, each 8x8 (filter size of 8); the input shape could be (3, 40, 45): \
3 channels (static, delta, delta-delta), where the 40x45 dimensions depend on the \
context window -- in this example assuming 40 frequency bands and \
45 1-D input feature maps (15 frames x 3 feature types)
"""

model2.add(MaxPooling2D(pool_size=(6, 6)))  # pooling size of 6

model2.add(Flatten())

model2.add(Dense(1024))
model2.add(Activation('relu'))
model2.add(Dense(output_dim=trGen.outputFeatDim))
model2.add(Activation('softmax'))

23.7% PER on the TIMIT dataset

using this DNN structure

Resources

Keras-kaldi interface:               https://github.com/dspavankumar/keras-kaldi

Kaldi-ASR:                                   https://github.com/kaldi-asr/kaldi

How to Train DNN with Kaldi: http://jrmeyer.github.io/kaldi/2016/12/15/DNN-AM-Kaldi.html

 

References

Q&A

Thank You ;)