Speech has become a common and useful way to interact with our devices. However, the way our devices talk to us is often unexpressive and, to some extent, unnatural, which can make the intent behind a sentence hard to understand. AT&T Labs researchers have approached this problem from a unique angle: they’ve developed an innovative text-to-speech app called StorEbook that reads children’s stories expressively in character-appropriate voices. The app is currently demonstrated with children’s books such as Goldilocks and the Three Bears. In StorEbook’s rendition of the Goldilocks story, Papa Bear’s voice is deep, while Baby Bear’s is high. The app is also designed to use the right vocal inflection for the intent of what is being spoken: a question, exclamation, authority, surprise, fear, sadness, etc.
The backbone of this project is AT&T’s Natural Voices™. Natural Voices is AT&T’s state-of-the-art text-to-speech product, which converts text into natural-sounding, synthesized speech in a variety of voices and languages. Computers digitally slice hours of recorded speech into thousands upon thousands of short segments called phonemes (the simple sounds that make up all speech) and store them in a database, along with information such as the pitch, duration and amplitude of each slice.
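To make the idea of such a database concrete, here is a minimal Python sketch. It is not AT&T’s actual schema or selection method: the `SpeechUnit` fields, the `UnitDatabase` class and the equal-weight cost function are all assumptions made for illustration. The sketch stores each slice with its pitch, duration and amplitude, then looks up the stored slice that best matches a target pitch and duration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechUnit:
    """One recorded slice of speech plus the features stored alongside it."""
    phoneme: str        # label of the sound, e.g. "AA" (hypothetical labels)
    pitch_hz: float     # average pitch of the slice
    duration_ms: float  # length of the slice
    amplitude: float    # relative loudness, 0.0 to 1.0


class UnitDatabase:
    """Hypothetical in-memory index of speech units, keyed by phoneme."""

    def __init__(self):
        self._units = defaultdict(list)

    def add(self, unit):
        self._units[unit.phoneme].append(unit)

    def closest(self, phoneme, target_pitch_hz, target_duration_ms):
        """Return the stored unit that best matches the target pitch and duration.

        The cost is an illustrative assumption: relative pitch error plus
        relative duration error, weighted equally.
        """
        candidates = self._units.get(phoneme)
        if not candidates:
            raise KeyError(f"no recorded units for phoneme {phoneme!r}")

        def cost(unit):
            pitch_error = abs(unit.pitch_hz - target_pitch_hz) / target_pitch_hz
            duration_error = (abs(unit.duration_ms - target_duration_ms)
                              / target_duration_ms)
            return pitch_error + duration_error

        return min(candidates, key=cost)


# Three made-up recordings of the same sound at different pitches.
db = UnitDatabase()
db.add(SpeechUnit("AA", pitch_hz=110.0, duration_ms=90.0, amplitude=0.8))
db.add(SpeechUnit("AA", pitch_hz=220.0, duration_ms=85.0, amplitude=0.7))
db.add(SpeechUnit("AA", pitch_hz=320.0, duration_ms=70.0, amplitude=0.6))

# A deep target picks the low-pitched slice; a high target picks the high one.
papa = db.closest("AA", target_pitch_hz=100.0, target_duration_ms=95.0)
baby = db.closest("AA", target_pitch_hz=330.0, target_duration_ms=70.0)
print(papa.pitch_hz, baby.pitch_hz)
```

A production system would weigh many more features and would also consider how well neighboring slices join together; this sketch only shows why storing pitch, duration and amplitude next to each slice makes that kind of lookup possible.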
Labs researcher Taniya Mishra, who speaks five languages and is the mother of two small children, spends a lot of time reading to her children as well as working on speech technology. One night, while reading to her young daughter, Mishra realized that enhancing a text-to-speech system so that it could read children’s stories, even simple ones, had the potential to significantly advance work in this field. “Current text-to-speech technology doesn’t do a good job of expressing the complex and often strong emotions in children’s stories,” Mishra says. “My three-year-old would walk away from anything that sounded like a computer reading a story. She’d just think it was boring. Kids are the toughest customers sometimes.”
StorEbook Reader is designed as a prototype that lets AT&T Labs researchers understand and solve the complex technical challenges behind next-gen text-to-speech technologies. In the future, this work could lead to:
- Automatic Character Identification. Researchers envision a system that can automatically recognize the multiple characters in any story and their salient personality traits, so that the system can render the story using character-appropriate synthetic voices.
- Advanced Affect Generation. Researchers are exploring how to make voices embody a character’s emotions so that a wolf sounds scary and a teacup sounds cute.
- Personalized Voices. Imagine hearing a story in grandma’s voice, even though she is miles away. Researchers are working to create synthetic voices that sound like a familiar person using just a couple hundred sentences recorded by that person. Besides the delight of having a story read in a familiar voice, personalized voices allow a person’s voice to live on for future generations.
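As a rough illustration of character-appropriate voices, the sketch below tags each line of a story with a pitch setting for its speaker. This is not how StorEbook works internally. It assumes the characters have already been identified, uses the `<prosody>` element from the W3C Speech Synthesis Markup Language (SSML) as one possible way to express pitch, and relies on a hypothetical `CHARACTER_PITCH` table and `to_ssml` helper.

```python
from xml.sax.saxutils import escape

# Hypothetical per-character settings. The labels are standard SSML
# prosody pitch values, but the mapping itself is made up for illustration.
CHARACTER_PITCH = {
    "Papa Bear": "x-low",
    "Baby Bear": "x-high",
}


def to_ssml(lines, default_pitch="medium"):
    """Wrap each (character, text) pair in an SSML <prosody> element.

    Characters missing from CHARACTER_PITCH fall back to default_pitch.
    The output is a simplified fragment: a full SSML document would also
    carry version and namespace attributes on <speak>.
    """
    parts = ["<speak>"]
    for character, text in lines:
        pitch = CHARACTER_PITCH.get(character, default_pitch)
        parts.append(f'  <prosody pitch="{pitch}">{escape(text)}</prosody>')
    parts.append("</speak>")
    return "\n".join(parts)


story = [
    ("Narrator", "The three bears came home."),
    ("Papa Bear", "Someone's been eating my porridge!"),
    ("Baby Bear", "Someone's been eating my porridge, and it's all gone!"),
]
ssml = to_ssml(story)
print(ssml)
```

Pitch alone is a crude stand-in for what the researchers describe: conveying emotion and personality would also involve speaking rate, emphasis and voice quality.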
Taniya Mishra is a senior member of technical staff at AT&T Labs in the Speech Algorithms and Engines Research Department. Mishra received her Ph.D. in Computer Science from the OGI School of Science & Engineering at OHSU in 2008. Her thesis introduced a new algorithm for intonation analysis and applied it to text-to-speech synthesis. Today, Mishra works on speech synthesis, voice-enabled search, prosody modeling and voice signatures, and on their application to speaker recognition, speech synthesis and other speech technologies.