ICASSP 2010 - 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing - March 14 - 19, 2010 - Dallas, Texas, USA

Tutorial 5: Single and Multi Channel Feature Enhancement for Distant Speech Recognition

Presented by

John McDonough, Matthias Wölfel

Abstract

A complete system for distant speech recognition (DSR) typically consists of several distinct components, including those for speaker position tracking, beamforming, postfiltering, and extraction of word hypotheses from the enhanced waveforms. While it is tempting to isolate and optimize each component individually, experience has shown that such an approach does not lead to optimal performance. For example, optimizing the signal-to-noise ratio, as is typically done in the speech enhancement and acoustic array processing fields, does not necessarily reduce the word error rate of a DSR system. Optimization methods are therefore needed that are better suited to enhancing the acoustic features used by the speech recognition engine. Moreover, it is necessary to compensate for those channel effects which cannot be handled by the standard adaptation techniques used in current state-of-the-art recognition engines, namely non-stationary additive noise and reverberation.

In this tutorial, we will discuss several examples of the interactions between the individual components of a DSR system. In addition, we will describe the synergies that become possible as soon as each component is no longer treated as a "black box". That is, instead of treating each component as having solely an input and an output, it is necessary to peel back the lid and look inside. Only then does it become apparent how the individual components of a DSR system can be viewed not as separate entities, but as the various organs of a complete body, and how optimal performance of such a system can be obtained. Among the topics covered by the tutorial will be single- and multi-channel speech enhancement techniques based on optimization criteria such as maximum likelihood, maximum negentropy, and minimum mutual information.
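For concreteness, the classical multi-channel baseline on top of which criteria such as maximum negentropy or minimum mutual information are usually built is the delay-and-sum beamformer. The following minimal Python (NumPy) sketch is not taken from the tutorial materials; the function name, far-field plane-wave assumption, and array geometry are illustrative choices.

import numpy as np

def delay_and_sum(frames, mic_positions, look_dir, fs, c=343.0):
    """Frequency-domain delay-and-sum beamformer (far-field, plane-wave model).

    frames        : (num_mics, num_samples) snapshot captured by the array
    mic_positions : (num_mics, 3) microphone coordinates in metres
    look_dir      : unit vector pointing from the array towards the desired source
    fs            : sampling rate in Hz
    c             : speed of sound in metres per second
    """
    num_mics, num_samples = frames.shape
    # A wavefront from look_dir reaches microphone m early by (p_m . look_dir) / c
    # relative to the array origin, so that is the delay to re-apply per channel.
    delays = mic_positions @ look_dir / c                 # seconds, shape (num_mics,)
    spectra = np.fft.rfft(frames, axis=1)                 # (num_mics, num_bins)
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)      # (num_bins,)
    # Phase shifts that time-align every channel to the look direction.
    steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * steering
    # Averaging the aligned channels suppresses signals from other directions.
    return np.fft.irfft(aligned.mean(axis=0), n=num_samples)

Roughly speaking, the adaptive methods touched on above replace the simple averaging step with weights chosen according to an optimization criterion, while keeping the look direction undistorted.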

The single-channel techniques we plan to present operate in the dimension-reduced logarithmic domain and are thus much closer to the final features used in speech recognition than enhancement techniques operating in the time or frequency domain. As previously mentioned, it is necessary to compensate for those distortions which cannot be handled by the recognition system itself. We will therefore present techniques which compensate for non-stationary additive distortions, for late reflections, or for both jointly. We will additionally present the results of experiments conducted on data captured in real acoustic environments, which demonstrate the effectiveness of the techniques discussed during the tutorial.
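As background (this is the standard interaction model used in feature-domain compensation, not a formula quoted from the tutorial itself), additive distortion in the logarithmic filterbank domain is commonly written as

\[
  y \;\approx\; x + \log\!\bigl(1 + e^{\,n - x}\bigr),
\]

where \(x\), \(n\), and \(y\) denote the clean-speech, noise, and observed log filterbank energies in a given band. The nonlinearity of this relation, together with the dependence of late reflections on past frames, is what places such distortions beyond the reach of the linear adaptation schemes built into current recognizers.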

Biographies

John McDonough received a Ph.D. in electrical and computer engineering from the Johns Hopkins University in April of 2000. John's Ph.D. dissertation was supervised by Prof. Fred Jelinek. John taught the courses Man-Machine Communication and Microphone Arrays: Gateway to Hands-Free Automatic Speech Recognition at the University of Karlsruhe for five years. Since moving to Saarland University in February of 2007, John has developed and taught the courses Distant Speech Recognition and Weighted Finite-State Transducers in Speech and Natural Language Processing.

John has published dozens of conference and journal articles on all aspects of automatic speech recognition with distant microphones, including speaker tracking, beamforming, hidden Markov model parameter estimation, and word hypothesis search. In 2006, he co-authored a paper which won an ICASSP best student paper award. Having written complete software toolkits for all the component technologies mentioned above, John is intimately acquainted with all details of the related algorithms. John led the team that collected all audio and video data for the European Union project CHIL, Computers in the Human Interaction Loop. While at Karlsruhe, John was responsible for the development of audio technologies during the CHIL project. At the annual CHIL technology evaluations, John's team was dominant in the acoustic speaker tracking task. John's team also supplied beamformed speech material to all CHIL partners for the annual technology evaluations. After moving to Saarbrücken, John also led the team that won the AMI Speech Separation Challenge in June of 2007. In May of 2009, John Wiley & Sons published the book Distant Speech Recognition, which John wrote together with Matthias Wölfel. John uses this book as a primary text in his lectures at Saarland University.

Matthias Wölfel received the Diploma in electrical engineering and information technology from the Universität Karlsruhe (TH), Karlsruhe, Germany, in 2003 and his doctorate in computer science from the same university in 2009. From September 2000 until June 2001, he participated in an exchange program with the University of Massachusetts, Dartmouth. In September 2002, he joined Carnegie Mellon University, Pittsburgh, PA, as a Visiting Researcher for a period of four months. His research interests include feature extraction, adaptation, and enhancement based on single- and multi-microphone input. He now works at the Center for Art and Media in Karlsruhe, Germany.

