ICASSP 2010 - 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing - March 14 - 19, 2010 - Dallas, Texas, USA

Tutorial 6: Speech Mashups: A framework for multimodal mobile services and voice search

Presented by

Giuseppe Di Fabbrizio


Speech is becoming a more attractive interface for mobile devices since it can overcome the input limitations of these small devices. Moreover, speech is a direct, intuitive interface that requires no learning and is safer for multitasking users. And with the proliferation of web content - from business search to mapping services and games - it makes sense to combine, or mash up, speech interfaces with web services.

However, mobile devices have limited computational capability for the speech processing tasks that speech interfaces require, including automatic speech recognition and text-to-speech conversion, especially when large vocabularies or high-quality synthesis are involved. One popular solution is to move the speech processing resources into the network, concentrating the heavy computational load in server farms. Some successful services exploit this approach, but to date each performs a single specific task; it is unclear how easily such services can expand to other tasks, nor whether they can scale to accommodate large deployments.

This tutorial addresses the real-world challenges of speech-enabling mobile applications by introducing a novel approach that leverages web services and cloud computing, making it easier to combine web content with a speech interface. We show how to integrate automatic speech recognition, text-to-speech synthesis, natural language understanding, and multimodal understanding technologies into multimodal mobile services. We also provide an overview of multimodal user interfaces, an introduction to speech recognition and speech synthesis, some elements of gesture recognition, and a detailed description of how to integrate multimodal interfaces and web services with the most popular mobile application environments, with applications for voice search. The tutorial examples are based on a publicly available speech mashup portal and will allow students, researchers, and speech practitioners to experiment with a variety of mobile multimodal services by exploiting industrial-strength speech processing technology.
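At its core, the speech mashup model reduces to a simple client-server exchange: the mobile client records audio, POSTs it to a network-based recognizer over a REST-style API, and parses the returned hypotheses. The following is a minimal Python sketch of that exchange; the endpoint URL, parameter names, and JSON result shape are illustrative assumptions, not the actual AT&T Speech Mashup API.

```python
# Hedged sketch of a REST-style speech mashup client. The endpoint URL,
# query parameters, and JSON reply format below are hypothetical.
import json

MASHUP_URL = "https://speech.example.com/rec"  # hypothetical ASR endpoint


def build_recognition_request(audio_bytes, grammar="generic",
                              content_type="audio/amr"):
    """Assemble the pieces of an HTTP POST for a network ASR call."""
    return {
        "url": f"{MASHUP_URL}?grammar={grammar}&resultFormat=json",
        "method": "POST",
        "headers": {"Content-Type": content_type},
        "body": audio_bytes,  # raw encoded audio captured on the device
    }


def parse_recognition_result(json_text):
    """Pick the highest-confidence hypothesis from an illustrative reply."""
    result = json.loads(json_text)
    best = max(result["hypotheses"], key=lambda h: h["confidence"])
    return best["transcript"], best["confidence"]


# Example: the client would send `req` with any HTTP library, then parse
# the JSON reply (shown here as a canned string).
req = build_recognition_request(b"\x00" * 160, grammar="citysearch")
reply = '{"hypotheses": [{"transcript": "pizza near dallas", "confidence": 0.91}]}'
text, score = parse_recognition_result(reply)
print(text, score)
```

Because the heavy computation stays in the server farm, the client only needs an audio capture path and an HTTP stack, which is what makes the approach portable across iPhone, BlackBerry, and other platforms.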


  1. Introduction
    1. Definition of multimodality
    2. Multimodal interfaces
    3. Multimodal integration and understanding
    4. Overview of speech recognition, text-to-speech synthesis, and gesture recognition techniques
    5. Multimodal mobile examples
  2. The web as platform
    1. Web services
    2. Web mashups
    3. Cloud computing
  3. Merging speech and web services
    1. Speech mashups
    2. Speech over the data network
    3. REST-based APIs
    4. Speech mashups portal
    5. Examples of iPhone-based multimodal applications
  4. Applications
    1. Multimodal browser for iPhone
    2. Step-by-step example
  5. Voice search
    1. Survey of recent work on voice search
    2. Language modeling for voice search
    3. Example of voice search service with speech mashup
  6. Challenges and future directions
    1. Cloud computing for speech processing
    2. Personalization and adaptation
    3. Multimodal output generation

Material to be Distributed

The material will include: tutorial slides, downloadable code examples for iPhone, BlackBerry and other mobile phones, multimodal browser prototype for iPhone, and accounts for the AT&T Speech Mashup portal to test the tutorial examples.


Giuseppe Di Fabbrizio is a Lead Member of Research Staff in the IP & Voice Services Research Laboratory at AT&T Labs - Research in Florham Park, NJ. During his career, he has conducted research on multimodal and spoken dialog systems, conversational agents, natural language generation, and multimodal and speech system architectures, platforms, and services, publishing several conference and journal papers on these subjects. He was instrumental in the development and deployment of the AT&T VoiceTone® Dialog Automation product for AT&T business enterprise customers and was the recipient of the 2008 AT&T Science and Technology Medal Award for outstanding technical innovation and leadership in the advancement of spoken language technologies, architectures, and services. Di Fabbrizio is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), an elected member (2009-2011) of the IEEE Signal Processing Society's Speech and Language Processing Technical Committee (SLTC) in the area of dialog systems, serves as editor of the SLTC's quarterly newsletter, and contributes as a program committee member and technical reviewer for numerous international conferences, journals, and workshops. Prior to joining AT&T, he worked as a Senior Researcher at Telecom Italia Lab (formerly CSELT, now mostly Loquendo).
