Google's WaveNet is bringing computerized speech into the uncanny valley.
Image: Shaye Anderson
The choppy, cybernetic voices of digital assistants like Siri may not sound so mechanical for much longer, thanks to a significant breakthrough in using artificial intelligence to generate realistic human speech.
In a new paper, scientists at Google-owned AI shop DeepMind have unveiled WaveNet, a neural network that generates audio waveforms by predicting each new piece of audio from its own previous output. The result is dramatically more natural-sounding computerized speech, which the researchers say reduces the perceived gap between human and computer voices by over 50 percent in both English and Mandarin Chinese.
The system's predictive model is a far cry from the synthesized speech systems used by "digital assistant" apps like Siri. Instead of using a "concatenative" speech system that pieces utterances together from a library of speech fragments recorded by one speaker (in Siri's case, voice actress Susan Bennett), WaveNet is trained on a massive database, then generates raw waveforms one audio sample at a time using what's known as an "autoregressive" model, meaning each individual sample of the waveform is predicted based on the samples that preceded it. The neural net was developed from a similar model called PixelCNN, which does the same for images by generating them one pixel at a time.
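To make "autoregressive" concrete, here is a minimal Python sketch of the generation loop. It is not DeepMind's code: `toy_next_sample_probs` is a hypothetical stand-in for the trained network, and the 256 quantized amplitude levels are an assumption made for illustration. The point is only the shape of the loop, in which every new sample is drawn from a distribution that depends on the samples already generated.

```python
import math
import random

LEVELS = 256  # assumed number of quantized amplitude values (illustrative)


def toy_next_sample_probs(history, levels=LEVELS):
    """Hypothetical stand-in for a trained network.

    Returns a probability distribution over the next quantized sample,
    given the samples generated so far.
    """
    # Toy rule: favour values close to the previous sample.
    prev = history[-1] if history else levels // 2
    weights = [math.exp(-((v - prev) ** 2) / 32.0) for v in range(levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Generate a waveform one sample at a time (autoregressively)."""
    rng = random.Random(seed)
    waveform = []
    for _ in range(num_samples):
        probs = toy_next_sample_probs(waveform)
        # Sample the next value, then feed it back in as context.
        next_value = rng.choices(range(LEVELS), weights=probs, k=1)[0]
        waveform.append(next_value)
    return waveform
```

Calling `generate(16000)` would produce one second of toy "audio" at an assumed 16 kHz sample rate. Because each sample has to wait for the one before it, the loop cannot be trivially parallelized, which hints at why generating speech this way is computationally expensive.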
"To make sure it knew which voice to use for any given utterance, we conditioned the network on the identity of the speaker," the DeepMind researchers wrote in a blog post. "Interestingly, we found that training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning."
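Conditioning on the speaker can be pictured as handing the model an extra input alongside the audio history. The Python sketch below is a toy illustration under stated assumptions, not the researchers' architecture: `SPEAKER_OFFSET` is a hypothetical lookup table standing in for a learned speaker representation, and the prediction rule is invented for the example.

```python
import math
import random

LEVELS = 256  # assumed number of quantized amplitude values (illustrative)

# Hypothetical per-speaker values standing in for a learned speaker embedding.
SPEAKER_OFFSET = {"speaker_a": -20, "speaker_b": 20}


def conditioned_probs(history, speaker, levels=LEVELS):
    """Toy next-sample distribution that depends on history AND speaker."""
    prev = history[-1] if history else levels // 2
    # The speaker identity nudges the prediction toward a speaker-specific level.
    target = levels // 2 + SPEAKER_OFFSET[speaker]
    centre = 0.9 * prev + 0.1 * target
    weights = [math.exp(-((v - centre) ** 2) / 32.0) for v in range(levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate_for_speaker(speaker, num_samples, seed=0):
    """Autoregressive generation with the speaker held fixed throughout."""
    rng = random.Random(seed)
    waveform = []
    for _ in range(num_samples):
        probs = conditioned_probs(waveform, speaker)
        waveform.append(rng.choices(range(LEVELS), weights=probs, k=1)[0])
    return waveform
```

The same generation loop yields different output depending on which speaker key is passed in, which is the sense in which a single network can serve several voices.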
WaveNet isn't just for speech either: it can also generate some styles of music. Training the network on classical piano, for example, yielded some uncannily cohesive chord progressions in the researchers' testing.
But even weirder is what happens when the system isn't told what to do. Since WaveNet is autoregressive, it can still generate a voice even if it isn't given any text input, resulting in a predictive "babbling" that sounds like Siri practicing her glossolalia. The researchers also found that the system is eerily adept at picking up on non-verbal speech characteristics, like breathing and mouth movements.
To be sure, the voices and music generated by WaveNet still sound slightly off to a trained ear, and composing speech in this way still requires a massive amount of computing power. But when compared with current text-to-speech methods, the system makes a pretty compelling case that we're fast approaching the uncanny valley of computerized speech.