Tutorial 9: Speech Segregation

Presented by

DeLiang Wang

Abstract

The acoustic environment typically contains multiple simultaneous events, and the target speech usually occurs with other interfering sounds. This creates a problem of speech segregation, popularly known as the cocktail party problem. Speech segregation has a variety of practical applications, including robust speech and speaker recognition, hearing prostheses, and audio information retrieval (or audio data mining). As a result, a large number of studies in signal processing have been devoted to speech segregation, whose importance has become especially acute in recent years as single-source processing (e.g., automatic speech recognition) continues to mature.

This tutorial is designed to introduce the latest developments in speech segregation, particularly for monaural (single-microphone) mixtures. The two primary approaches to monaural speech segregation are computational auditory scene analysis (CASA) and speech enhancement. CASA is motivated by human auditory scene analysis (ASA) and aims at sound separation based on ASA principles, including harmonicity, amplitude and frequency modulation, and onset and offset. Speech enhancement, on the other hand, is based on statistical analysis of speech and noise, followed by the estimation of clean speech from noisy speech. Classical methods of speech enhancement include spectral subtraction, Wiener filtering, and minimum mean-square error estimation. This tutorial will systematically introduce the underlying principles and algorithms of CASA and speech enhancement. A common treatment of CASA and speech enhancement, which were developed from very different perspectives, will be a novel feature of this tutorial.
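As a rough illustration of the two simplest classical enhancement methods named above, the following is a minimal sketch of power spectral subtraction and a per-bin Wiener gain for a single frame. It assumes the noise power per frequency bin has already been estimated (for example, from a speech-free segment). The function names, the `floor` parameter, and the toy spectra are hypothetical choices made for illustration; they are not taken from the tutorial.

```python
def spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Estimate clean-speech power per bin by power spectral subtraction.

    noisy_power: power spectrum of one noisy frame (list of floats)
    noise_power: estimated noise power per bin (same length)
    floor: hypothetical spectral floor, as a fraction of the noisy power,
           which keeps the estimate from becoming negative
    """
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    return [max(y - n, floor * y) for y, n in zip(noisy_power, noise_power)]


def wiener_gain(noisy_power, noise_power):
    """Per-bin Wiener gain S / (S + N), with S estimated as max(Y - N, 0)."""
    gains = []
    for y, n in zip(noisy_power, noise_power):
        s = max(y - n, 0.0)  # crude speech power estimate (an assumption)
        gains.append(s / (s + n) if (s + n) > 0 else 0.0)
    return gains


# Toy three-bin example (made-up numbers, not real speech data).
noisy = [4.0, 1.0, 9.0]
noise = [1.0, 2.0, 1.0]
clean_est = spectral_subtraction(noisy, noise)
gains = wiener_gain(noisy, noise)
```

In the second bin the noise estimate exceeds the noisy power, so spectral subtraction falls back to the floor and the Wiener gain drops to zero. Practical systems apply such gains frame by frame to a short-time spectrum and reuse the noisy phase for resynthesis.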

The tutorial aims to give participants a solid understanding of speech segregation, with three emphases. First, it explains the mathematical and computational foundations behind speech segregation systems, in conjunction with real-world applications. Second, it examines speech segregation from both perceptual and computational perspectives. Third, it draws comparisons between CASA and speech enhancement.

Speaker Biography

DeLiang Wang received the B.S. degree in 1983 and the M.S. degree in 1986 from Peking (Beijing) University, China, and the Ph.D. degree in 1991 from the University of Southern California, all in computer science. Since 1991, he has been with the Department of Computer Science & Engineering and the Center for Cognitive Science at the Ohio State University, where he is a Professor. He was a visiting scholar at Harvard University in 1998-1999 and at Oticon A/S in 2006-2007.

His main research interest is in machine perception. He has published more than 70 papers in leading scientific journals, as well as numerous conference papers and book chapters. His honors include the U.S. Office of Naval Research Young Investigator Award (1996), IEEE Fellow (2004), and the Helmholtz Award from the International Neural Network Society (2008).

He currently serves on the editorial boards of five journals. He has served as an organizer for, or on the program committee of, many scientific conferences, including ICASSP, the World Congress on Computational Intelligence, and the International Joint Conference on Neural Networks. He is a member of the IEEE Signal Processing Society Speech and Language Processing Technical Committee, the IEEE Computational Intelligence Society Fellows Committee, and the Governing Board of the International Neural Network Society.