The idea of a “universal translator,” a device that can instantly translate words from one language into any other, is a popular science fiction conceit.
But the technology, involving speech recognition and machine learning, may be moving into the realm of reality.
Rick Rashid, Microsoft’s chief research officer, recently demonstrated near-instantaneous translation of spoken English to Mandarin speech – with software that maintained the sound of the speaker’s voice. The technology was developed in part by U of T grad students working in the labs of computer science professors Geoffrey Hinton and Gerald Penn.
The software has an error rate of about one in seven words, compared with one in four in earlier systems. It relies on “deep neural networks,” simplified mathematical models of neural circuits in the brain, which enable computers to better recognize phonemes, the small units of sound that make up speech.
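To put the one-in-seven and one-in-four figures in perspective, here is a minimal sketch in Python. It assumes the figures are word error rates in the usual sense: the number of word substitutions, insertions and deletions needed to turn the recognizer's output into the correct transcript, divided by the length of the correct transcript. The article does not say exactly how the errors were counted, and the function names and the sample sentence are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from output
                curr[j - 1] + 1,     # extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction of the old error rate that the new system removes."""
    return (old_rate - new_rate) / old_rate


if __name__ == "__main__":
    # Hypothetical transcript: one wrong word out of seven.
    rate = word_error_rate(
        "the cat sat on the mat today",
        "the cat sat on a mat today",
    )
    print(f"word error rate: {rate:.3f}")
    print(f"relative reduction vs. one in four: "
          f"{relative_reduction(1 / 4, 1 / 7):.1%}")
```

Under that assumption, going from one error in four words to one in seven removes roughly 43 per cent of the mistakes, even though the absolute rate falls by only about 11 percentage points.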
Graduate students Abdel-rahman Mohamed and George Dahl began applying deep neural networks to speech recognition in 2009. They presented their research at an academic workshop that year, which drew the attention of Microsoft and yielded invitations for both students to intern at Microsoft Research in Redmond, Washington. There, Mohamed and Dahl successfully applied their methods to speech tasks involving much larger vocabularies.
Another computer science graduate student involved in the research, Navdeep Jaitly, helped implement voice search in the Android “Jelly Bean” operating system, a feature comparable to the iPhone’s Siri.
Today, most top speech labs are embracing deep neural networks. Among them is IBM, a longtime leader in speech recognition research, where Mohamed has also worked. Penn’s speech lab has since developed an alternative neural network model in collaboration with York University professor Hui Jiang and graduate student Ossama Abdel-Hamid.
The U of T researchers say the new business opportunities they’ve helped create are just the beginning. Hinton’s lab has already used deep neural networks to win several pattern-recognition competitions, including recognizing objects in images and predicting how well a potential drug molecule will bind to a target site. And Penn’s speech lab is in the process of digitizing the last 23 years of CBC Newsworld video to develop search algorithms for large collections of speech. Unlike Google Voice Search, which uses voice queries for hunting through web pages of text, this work uses text queries to search through large volumes of speech.
“This is important not just for speech researchers,” says Penn, “but for journalists, historians and anyone else who is interested in documenting the Canadian perspective on world affairs. Having all of this data around is great, but it’s of limited application if we can’t somehow navigate or search through it for topics of interest.”