An Integrated Deep Learning Approach to Acoustic Signal Pre-processing and Acoustic Modeling with Applications to Robust Automatic Speech Recognition
Chin-Hui Lee
School of ECE, Georgia Tech, USA
We cast the classical speech processing problem into a new nonlinear regression setting by mapping the log power spectral features of noisy speech to those of clean speech using deep neural networks (DNNs). DNN-enhanced speech obtained with the proposed approach demonstrates better quality and intelligibility than speech obtained with conventional state-of-the-art algorithms. Furthermore, this new paradigm facilitates an integrated deep learning framework that trains the three key modules of an automatic speech recognition (ASR) system, namely signal conditioning, feature extraction and acoustic phone models, jointly in a unified manner. The proposed framework was tested on recent challenging ASR tasks in CHiME-2, CHiME-4 and REVERB, which are designed to evaluate ASR robustness in mixed-speaker, multi-channel, and reverberant conditions, respectively. Using this new approach, our team scored the lowest word error rates in all three tasks with acoustic pre-processing algorithms for speech separation, microphone-array-based speech enhancement and speech dereverberation.
Chin-Hui Lee is a professor at the School of Electrical and Computer Engineering, Georgia Institute of Technology. Before joining academia in 2001, he had accumulated 20 years of industrial experience, ending at Bell Laboratories, Murray Hill, as a Distinguished Member of Technical Staff and Director of the Dialogue Systems Research Department. Dr. Lee is a Fellow of the IEEE and a Fellow of ISCA. He has published over 450 papers and 30 patents, with more than 30,000 citations and an h-index of 70 on Google Scholar. He has received numerous awards, including the Bell Labs President’s Gold Award in 1998. He won the SPS’s 2006 Technical Achievement Award for “Exceptional Contributions to the Field of Automatic Speech Recognition”. In 2012 he gave an ICASSP plenary talk on the future of automatic speech recognition. In the same year he was awarded the ISCA Medal in scientific achievement for “pioneering and seminal contributions to the principles and practice of automatic speech and speaker recognition”.
Audio Equalization and Reverberation
Vesa Välimäki
Aalto University, Finland
This talk will review advances in two audio signal processing topics, equalization and artificial reverberation, both of which are needed in augmented and virtual reality audio. The graphic equalizer is a standard tool in music and audio production that allows free adjustment of the gain in several frequency bands. The gains can be controlled manually or automatically, depending on the application. The underlying signal processing structure is either a parallel or a cascade IIR filter. Only in the past few years have we learned how to design such filters accurately. Example applications of automatic audio equalization will be discussed in this talk. Artificial reverberation has a long history, but exciting new ideas are introduced continuously. Whereas a large proportion of artificial reverberation research has focused on the imitation of concert hall acoustics, the modeling of outdoor acoustic environments has become important for gaming, virtual reality, and simulation of noise propagation. The use of velvet noise, a sparse pseudo-random sequence, will be described for creating computationally efficient reverberation effects.
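The velvet noise mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common formulation, in which each grid period of fs/density samples holds a single pulse of value +1 or -1 at a random position; the function names, parameters, and default values below are hypothetical and are not taken from the talk.

```python
import random


def velvet_noise(length, fs=44100, density=2000, seed=0):
    """Return (positions, signs) of the nonzero samples of a velvet-noise sequence.

    Assumed formulation: one pulse per grid period of fs/density samples,
    at a random offset within the period and with a random sign.
    """
    rng = random.Random(seed)
    grid = fs / density  # average spacing between pulses, in samples
    positions, signs = [], []
    m = 0
    while True:
        pos = int(round(m * grid + rng.random() * (grid - 1)))
        if pos >= length:
            break
        positions.append(pos)
        signs.append(1 if rng.random() < 0.5 else -1)
        m += 1
    return positions, signs


def sparse_convolve(x, positions, signs):
    """Convolve x with the sparse +1/-1 sequence using only additions and subtractions."""
    y = [0.0] * (len(x) + max(positions, default=0))
    for pos, sign in zip(positions, signs):
        for n, sample in enumerate(x):
            y[n + pos] += sign * sample
    return y
```

Because the sequence contains only a few nonzero samples, each equal to +1 or -1, filtering with it needs no multiplications, which is where the computational saving comes from in this sketch.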
Prof. Vesa Välimäki is the Vice Dean for research at the Aalto University School of Electrical Engineering, Espoo, Finland. He is a Full Professor of audio signal processing at Aalto University. He received the Master of Science in Technology and the Doctor of Science in Technology degrees, both in electrical engineering, from the Helsinki University of Technology, Espoo, Finland, in 1992 and 1995, respectively. In 1996, he was a Postdoctoral Research Fellow at the University of Westminster, London, UK. In 2008-2009, he was a Visiting Scholar at the Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, Stanford, CA, USA. He is a Fellow of the AES (Audio Engineering Society), a Fellow of the IEEE, and a Life Member of the Acoustical Society of Finland. He is a Senior Area Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing. In 2016, he was the Guest Editor of the special issue of Applied Sciences on audio signal processing. He was the Chairman of the International Conference on Digital Audio Effects, DAFx-08, in 2008, and was the Chairman of the Sound and Music Computing Conference, SMC-17, in 2017.
From Fourier, Wavelet and Sparse Signal Representations to CNNs
C.-C. Jay Kuo
University of Southern California, USA
The convolutional neural network (CNN) is a powerful tool for image and video processing and understanding. In this talk, I will build a bridge between traditional single-layer signal representation methods, such as the Fourier, wavelet and sparse representations, and the modern multi-layer signal analysis approach based on CNNs. To begin with, I introduce the RECOS transform as a basic building block of CNNs, where “RECOS” is an acronym for “REctified-COrrelations on a Sphere”. It consists of two main concepts: data clustering on a sphere and rectification. Then, I interpret a CNN as a network that implements the guided multi-layer RECOS transform. Besides offering a full explanation of the operating principle of CNNs, I discuss how guidance is provided by labels through backpropagation (BP) in training and show that CNNs can support a full spectrum of learning paradigms, from unsupervised and weakly supervised to fully supervised learning.
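As a rough illustration of the two concepts named in the abstract, the sketch below projects an input and a set of filter weight vectors onto the unit sphere, computes their correlations, and rectifies the result. The function names, the term "anchors" for the weight vectors, and the choice to normalize both sides are assumptions of this sketch, not a statement of the speaker's exact definition.

```python
import math


def normalize(v):
    """Scale v to unit length (a point on the unit sphere); zero vectors are returned unchanged."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0:
        return list(v)
    return [c / norm for c in v]


def recos(x, anchors):
    """Rectified correlations between a unit-normalized input and unit-normalized anchors.

    Each output is max(0, cosine similarity), so anchors pointing away
    from the input contribute nothing.
    """
    u = normalize(x)
    outputs = []
    for w in anchors:
        correlation = sum(a * b for a, b in zip(normalize(w), u))
        outputs.append(max(0.0, correlation))  # rectification (ReLU)
    return outputs
```

In this reading, an input that lies close to an anchor on the sphere yields a response near 1, while an input on the opposite side is clipped to 0, which is one way to see the rectified correlation as a soft cluster assignment.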
Dr. C.-C. Jay Kuo received his Ph.D. degree from the Massachusetts Institute of Technology in 1987. He is now with the University of Southern California (USC) as Director of the Media Communications Laboratory and Dean’s Professor in Electrical Engineering-Systems. His research interests are in the areas of digital media processing, compression, communication and networking technologies. Dr. Kuo was the Editor-in-Chief for the IEEE Trans. on Information Forensics and Security in 2012-2014. He was the Editor-in-Chief for the Journal of Visual Communication and Image Representation in 1997-2011, and served as Editor for 10 other international journals. Dr. Kuo received the 1992 National Science Foundation Young Investigator (NYI) Award, the 1993 National Science Foundation Presidential Faculty Fellow (PFF) Award, the 2010 Electronic Imaging Scientist of the Year Award, the 2010-11 Fulbright-Nokia Distinguished Chair in Information and Communications Technologies, the 2011 Pan Wen-Yuan Outstanding Research Award, the 2014 USC Northrop Grumman Excellence in Teaching Award, the 2016 USC Associates Award for Excellence in Teaching, the 2016 IEEE Computer Society Taylor L. Booth Education Award, the 2016 IEEE Circuits and Systems Society John Choma Education Award, the 2016 IS&T Raymond C. Bowman Award, and the 2017 IEEE Leon K. Kirchmayer Graduate Teaching Award. Dr. Kuo is a Fellow of AAAS, IEEE and SPIE. He has guided 140 students to their Ph.D. degrees and supervised 25 postdoctoral research fellows. Dr. Kuo is a co-author of about 250 journal papers, 900 conference papers and 14 books.