A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition

Journal of Intelligent Systems 29 (1):1261-1274 (2019)

Abstract

Deep neural networks (DNNs) play a significant role in acoustic modeling. Convolutional neural networks (CNNs) are a more advanced variant of DNNs that achieve a 4–12% relative gain in word error rate (WER) over DNNs. The spectral variations and local correlations present in the speech signal make CNNs well suited to speech recognition. Recently, bidirectional long short-term memory (BLSTM) networks have been shown to produce higher recognition rates in acoustic modeling because they reinforce higher-level representations of acoustic data. Because both the spatial and the temporal properties of the speech signal are essential for a high recognition rate, this paper combines the two networks: a hybrid CNN-BLSTM architecture is proposed to exploit these properties and to improve continuous speech recognition. We further explore weight sharing, the appropriate number of hidden units, and the ideal pooling strategy for the CNN, and we examine how many BLSTM layers are effective. The paper also attempts to overcome another shortcoming of CNNs, namely that speaker-adapted features cannot be modeled directly in a CNN. Finally, various non-linearities, with and without dropout, are analyzed for speech tasks. Experiments indicate that the proposed hybrid architecture, with speaker-adapted features and maxout non-linearity with dropout, shows 5.8% and 10% relative decreases in WER over the CNN and DNN systems, respectively.
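
The abstract reports its results as relative changes in WER. As a minimal sketch of how such figures are usually computed, the Python below derives WER from a word-level edit distance and then the relative reduction between two systems. The function names, the example sentences, and the 20.0% and 18.0% values are assumptions made for illustration; they are not taken from the paper, which does not describe its scoring tool here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")

    # Hypothetical absolute WERs (in percent) for a baseline and a new system.
    print(f"Relative reduction: {relative_wer_reduction(20.0, 18.0):.1%}")
```

Under these assumed numbers, a drop from 20.0% to 18.0% absolute WER is a 10% relative reduction, which is the sense in which the abstract's 5.8% and 10% figures should be read: they compare systems against each other and do not state the absolute error rates.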

Similar books and articles

Recognition of continuous speech requires top-down processing. Kenneth N. Stevens - 2000 - Behavioral and Brain Sciences 23 (3):348-348.
II—Jane Heal: Illocution, Recognition and Cooperation. Jane Heal - 2013 - Aristotelian Society Supplementary Volume 87 (1):137-154.
Perceptual units in speech recognition. Dominic W. Massaro - 1974 - Journal of Experimental Psychology 102 (2):199.
Merging information versus speech recognition. Irene Appelbaum - 2000 - Behavioral and Brain Sciences 23 (3):325-326.
An Optimized Face Recognition System Using Cuckoo Search. Preeti Malhotra & Dinesh Kumar - 2019 - Journal of Intelligent Systems 28 (2):321-332.
