Thursday, January 4, 2018

AI now has a human style voice.


Google Develops Voice AI That Is Indistinguishable From Humans | Tacotron 2

  • Google develops Tacotron 2, which makes machine-generated speech sound less robotic and more like a human.
  • They used neural networks trained on text transcripts and speech examples.
  • The system synthesizes speech with WaveNet-level audio quality and Tacotron-level prosody.
Research on generating natural speech from a given text (text-to-speech synthesis, TTS) has been going on for decades. In the last few years, there has been impressive progress.
You are probably familiar with Google’s voice service, which is available in both male and female voices. The robotic voice is a staple of our culture, as with Microsoft’s Cortana or Apple’s Siri. Over the years, Google’s AI voice has started to sound less robotic and more like a human. And now, it is almost indistinguishable from a human voice.
Google engineers incorporated ideas from past work like WaveNet and Tacotron, and enhanced the techniques to end up with a new system, Tacotron 2. In order to achieve human-like speech, they used neural networks trained only on text transcripts and speech examples, rather than using any complicated linguistic and acoustic features as input.

Model Architecture

The system contains two main components:
  1. A recurrent sequence-to-sequence feature prediction network, optimized for TTS, that maps a sequence of letters to a sequence of features encoding the audio.
  2. An improved version of WaveNet that produces time-domain waveform samples based on the predicted spectrogram frames.
[Figure: Tacotron 2’s model architecture]
The sequence-to-sequence model predicts an 80-dimensional audio spectrogram (with frames computed every 12.5 milliseconds) that captures words, speed, volume and intonation. These features are then converted into a 24 kHz waveform of 16-bit samples by an enhanced version of WaveNet.
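To get a feel for these numbers, the short sketch below works out how many waveform samples the vocoder has to produce for each spectrogram frame. It uses only the figures quoted above (80 features per frame, one frame every 12.5 ms, 24 kHz output); the function names and the 2-second utterance are made up for illustration and are not taken from Google's implementation.

```python
import math

# Figures quoted in the article.
SAMPLE_RATE_HZ = 24_000   # output waveform sample rate
FRAME_HOP_MS = 12.5       # one spectrogram frame every 12.5 ms
N_FEATURES = 80           # dimensions per spectrogram frame


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, hop_ms=FRAME_HOP_MS):
    """Waveform samples that correspond to one spectrogram frame."""
    return round(sample_rate_hz * hop_ms / 1000)


def frames_for_duration(seconds, hop_ms=FRAME_HOP_MS):
    """Spectrogram frames needed to cover an utterance of the given length."""
    return math.ceil(seconds * 1000 / hop_ms)


# Hypothetical 2-second utterance, chosen only as an example.
n_frames = frames_for_duration(2.0)
print("samples per frame:", samples_per_frame())          # 300
print("frames:", n_frames)                                # 160
print("feature values:", n_frames * N_FEATURES)           # 12800
print("waveform samples:", n_frames * samples_per_frame())  # 48000
```

In other words, each 80-value frame stands in for 300 audio samples, which is why the spectrogram is a much more compact target for the first network than the raw waveform would be.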
The resulting system synthesizes speech with WaveNet-level audio quality and Tacotron-level prosody. It can be trained on data without relying on any complicated feature engineering, and achieves state-of-the-art sound quality very close to that of a natural human voice.
Unlike much of the other core artificial intelligence research the company does, this technology is immediately useful to Google. For instance, WaveNet, which first appeared in 2016, is now used in Google Assistant. Tacotron 2 would be a more powerful addition to the service.
Reference: arXiv | 1712.05884

Audio Samples

Below, we have attached some samples. For each sentence, one recording is generated by the artificial intelligence program and the other is spoken by a human. Can you figure out which one is AI?
“That girl did a video about Star Wars lipstick.”
“George Washington was the first President of the United States.”
“She earned a doctorate in sociology at Columbia University.”
In an evaluation, Google asked human listeners to rate the naturalness of the speech. The model achieved a Mean Opinion Score (MOS) of 4.53, comparable to the 4.58 MOS for professionally recorded speech.
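A Mean Opinion Score is simply the average of the listeners' ratings, conventionally given on a scale from 1 (bad) to 5 (excellent). The sketch below shows the calculation with a rough 95% confidence interval based on a normal approximation. The ratings are invented for illustration and are not data from Google's study, and the helper name is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Invented ratings from eight hypothetical listeners.
example_ratings = [4.5, 5.0, 4.0, 4.5, 5.0, 4.5, 4.0, 5.0]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of listeners the interval is wide, so a gap as small as 4.53 versus 4.58 is only meaningful when many ratings are collected.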
More Samples: Google.Github.io

Additional Capabilities of Tacotron 2

It can pronounce complex and out-of-context words.
“Basilar membrane and otolaryngology are not auto-correlations.”
It takes care of spelling errors. 
“This is really awesome!”
It learns stress and intonation (capitalizing words changes the overall intonation).
“The buses aren’t the problem, they actually provide a solution.”
“The buses aren’t the PROBLEM, they actually provide a SOLUTION.”
It is good at tongue twisters.
“Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?”

Limitations

The samples sound great, but there are still a few problems to be solved. The system has trouble pronouncing complicated words like “merlot” and “decorum”. In extreme cases, it randomly produces strange noises.
For now, the system can’t generate audio in real time, and the generated speech can’t be controlled, for example by directing it to sound sad or happy. Furthermore, it is only trained to mimic a single female voice; to speak like another female or like a male, developers would need to train the system again.
