ICASSP 2010 - 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing - March 14 - 19, 2010 - Dallas, Texas, USA

Tutorial 6: Speech Mashups: A framework for multimodal mobile services and voice search

Presented by

Giuseppe Di Fabbrizio


Speech is becoming a more attractive interface for mobile devices since it can overcome the input limitations of these small devices. Moreover, speech is a direct, intuitive interface that requires no learning and is safer for multitasking users. And with the proliferation of web content - from business search to mapping services and games - it makes sense to combine, or mash up, speech interfaces with web services.

However, mobile devices have limited computational capability for the speech processing tasks that speech interfaces require, including automatic speech recognition and text-to-speech conversion, especially when large vocabularies or high-quality synthesis are involved. One popular solution is to move the speech processing resources into the network, concentrating the heavy computational load in server farms. Some successful services exploit this approach, but to date each performs a single specific task; it is unclear how easily such services can expand to other tasks, nor whether they can scale to accommodate large deployments.

This tutorial addresses the real-world challenges of speech-enabling mobile applications by introducing a novel approach that leverages web services and cloud computing, making it easier to combine web content with a speech interface. We show how to integrate automatic speech recognition, text-to-speech synthesis, natural language understanding, and multimodal understanding technologies into multimodal mobile services. We also provide an overview of multimodal user interfaces, an introduction to speech recognition and speech synthesis, some elements of gesture recognition, and a detailed description of how to integrate multimodal interfaces and web services with the most popular mobile application environments, with applications for voice search. The tutorial examples are based on a publicly available speech mashup portal and will allow students, researchers, and speech practitioners to experiment with a variety of mobile multimodal services by exploiting industrial-strength speech processing technology.
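At its core, the speech mashup model reduces to a simple client-server exchange: the mobile client records audio, POSTs it to a network-based recognizer over a REST-style API, and parses the returned hypotheses. The following is a minimal Python sketch of that exchange; the endpoint URL, parameter names, and JSON result shape are illustrative assumptions, not the actual AT&T Speech Mashup API.

```python
# Hedged sketch of a REST-style speech mashup client. The endpoint URL,
# query parameters, and JSON reply format below are hypothetical.
import json

MASHUP_URL = "https://speech.example.com/rec"  # hypothetical ASR endpoint


def build_recognition_request(audio_bytes, grammar="generic",
                              content_type="audio/amr"):
    """Assemble the pieces of an HTTP POST for a network ASR call."""
    return {
        "url": f"{MASHUP_URL}?grammar={grammar}&resultFormat=json",
        "method": "POST",
        "headers": {"Content-Type": content_type},
        "body": audio_bytes,  # raw encoded audio captured on the device
    }


def parse_recognition_result(json_text):
    """Pick the highest-confidence hypothesis from an illustrative reply."""
    result = json.loads(json_text)
    best = max(result["hypotheses"], key=lambda h: h["confidence"])
    return best["transcript"], best["confidence"]


# Example: the client would send `req` with any HTTP library, then parse
# the JSON reply (shown here as a canned string).
req = build_recognition_request(b"\x00" * 160, grammar="citysearch")
reply = '{"hypotheses": [{"transcript": "pizza near dallas", "confidence": 0.91}]}'
text, score = parse_recognition_result(reply)
print(text, score)
```

Because the heavy computation stays in the server farm, the client only needs an audio capture path and an HTTP stack, which is what makes the approach portable across iPhone, BlackBerry, and other platforms.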


  1. Introduction
    1. Definition of multimodality
    2. Multimodal interfaces
    3. Multimodal integration and understanding
    4. Overview of speech recognition, text-to-speech synthesis, and gesture recognition techniques
    5. Multimodal mobile examples
  2. The web as platform
    1. Web services
    2. Web mashups
    3. Cloud computing
  3. Merging speech and web services
    1. Speech mashups
    2. Speech over the data network
    3. REST-based APIs
    4. Speech mashups portal
    5. Examples of iPhone-based multimodal applications
  4. Applications
    1. Multimodal browser for iPhone
    2. Step-by-step example
  5. Voice search
    1. Survey of recent work on voice search
    2. Language modeling for voice search
    3. Example of voice search service with speech mashup
  6. Challenges and future directions
    1. Cloud computing for speech processing
    2. Personalization and adaptation
    3. Multimodal output generation

Material to be Distributed

The material will include: tutorial slides, downloadable code examples for iPhone, BlackBerry and other mobile phones, multimodal browser prototype for iPhone, and accounts for the AT&T Speech Mashup portal to test the tutorial examples.


Giuseppe Di Fabbrizio is a Lead Member of Research Staff in the IP & Voice Services Research Laboratory at AT&T Labs - Research in Florham Park, NJ. During his career, he has conducted research on multimodal and spoken dialog systems, conversational agents, natural language generation, and multimodal and speech system architectures, platforms, and services, publishing several conference and journal papers on these subjects. He was instrumental in the development and deployment of the AT&T VoiceTone® Dialog Automation product for AT&T business enterprise customers and was the recipient of the 2008 AT&T Science and Technology Medal Award for outstanding technical innovation and leadership in the advancement of spoken language technologies, architectures, and services. Di Fabbrizio is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), an elected member (2009-2011) of the IEEE Signal Processing Society's Speech and Language Processing Technical Committee (SLTC) in the area of dialog systems, serves as editor of the SLTC's quarterly newsletter, and contributes as a program committee member and technical reviewer for numerous international conferences, journals, and workshops. Prior to joining AT&T, he worked as a Senior Researcher at Telecom Italia Lab (formerly CSELT, now mostly Loquendo).
