Thursday, January 4, 2018

AI now has a human style voice.


Google Develops Voice AI That Is Indistinguishable From Humans | Tacotron 2

in Technology January 3, 2018


Google develops Tacotron 2, which makes machine-generated speech sound less robotic and more human.
They used neural networks trained on text transcripts and speech examples.
The system synthesizes speech with WaveNet-level audio quality and Tacotron-level prosody.

Research on generating natural speech from a given text (text-to-speech synthesis, TTS) has been going on for decades. In the last few years, there has been impressive progress.

You are probably familiar with Google’s voice service, which is available in both male and female voices. The robotic voice, like that of Microsoft’s Cortana or Apple’s Siri, is a staple of our culture. Over the years, Google’s AI voice has come to sound less robotic and more human. And now, it is almost indistinguishable from a human voice.

Google engineers incorporated ideas from past work like WaveNet and Tacotron, and enhanced the techniques to end up with a new system, Tacotron 2. To achieve human-like speech, they used neural networks trained only on text transcripts and speech examples, rather than using any complicated linguistic and acoustic features as input.

Model Architecture

The system contains two main components:

A recurrent sequence-to-sequence feature prediction network, optimized for TTS, that maps a sequence of letters to a sequence of features encoding the audio.
An improved version of WaveNet that produces time-domain waveform samples based on the predicted spectrogram frames.

Tacotron 2’s model architecture
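The two-stage data flow described above can be sketched in a few lines of Python. This is only an illustration of how the pieces connect, not Google's implementation: the names `synthesize`, `dummy_predictor` and `dummy_vocoder` are hypothetical, the two "networks" are placeholder stubs that emit zeros, and the frame size and samples-per-frame values are illustrative defaults. In the real system the number of frames is decided by the network itself, not fixed at one per character.

```python
from typing import Callable, List

Spectrogram = List[List[float]]  # one inner list of features per frame
Waveform = List[int]             # time-domain audio samples


def synthesize(
    text: str,
    feature_predictor: Callable[[str], Spectrogram],
    vocoder: Callable[[Spectrogram], Waveform],
) -> Waveform:
    """Stage 1 maps characters to spectrogram frames; stage 2 maps frames to samples."""
    frames = feature_predictor(text)
    return vocoder(frames)


# Placeholder stand-ins so the sketch runs; they are NOT neural networks.
def dummy_predictor(text: str, dims: int = 80) -> Spectrogram:
    # Hypothetical simplification: one all-zero frame per input character.
    return [[0.0] * dims for _ in text]


def dummy_vocoder(frames: Spectrogram, samples_per_frame: int = 300) -> Waveform:
    # Hypothetical simplification: silence, a fixed number of samples per frame.
    return [0] * (len(frames) * samples_per_frame)


audio = synthesize("hello", dummy_predictor, dummy_vocoder)
print(len(audio))  # 5 frames * 300 samples per frame = 1500
```

The point of the split is that either stage can be swapped out independently, as long as both agree on the spectrogram format that passes between them.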

The sequence-to-sequence model outputs an 80-dimensional audio spectrogram (with frames computed every 12.5 milliseconds) that captures not only the words but also speed, volume and intonation. These features are then converted into a 24 kHz waveform of 16-bit samples by the enhanced version of WaveNet.
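To get a feel for these numbers, the short calculation below works out how many waveform samples correspond to each spectrogram frame. It uses only the figures quoted in this article (80 dimensions, 12.5 ms frame spacing, 24 kHz, 16-bit samples); the function and constant names are made up for this example, and the byte count assumes uncompressed single-channel audio.

```python
SAMPLE_RATE_HZ = 24_000      # output waveform rate
FRAME_SHIFT_MS = 12.5        # spacing between spectrogram frames
SPECTROGRAM_DIMS = 80        # features per spectrogram frame
BITS_PER_SAMPLE = 16         # size of each waveform sample


def samples_per_frame() -> int:
    """Waveform samples covered by one spectrogram frame."""
    return round(SAMPLE_RATE_HZ * FRAME_SHIFT_MS / 1000)


def frames_for_duration(seconds: float) -> int:
    """Number of whole spectrogram frames in the given duration."""
    return int(seconds * 1000 / FRAME_SHIFT_MS)


def summarize(seconds: float) -> dict:
    """Sizes of the intermediate features and the final audio."""
    frames = frames_for_duration(seconds)
    samples = frames * samples_per_frame()
    return {
        "frames": frames,
        "feature_values": frames * SPECTROGRAM_DIMS,
        "waveform_samples": samples,
        # Assumes raw, single-channel audio with no compression.
        "waveform_bytes": samples * BITS_PER_SAMPLE // 8,
    }


print(samples_per_frame())  # 300
print(summarize(1.0))
```

For one second of speech this gives 80 frames (6,400 feature values) and 24,000 waveform samples, so the WaveNet stage has to produce 300 audio samples for every frame it receives.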

The resulting system synthesizes speech with WaveNet-level audio quality and Tacotron-level prosody. It can be trained on data without relying on any complicated feature engineering, and achieves state-of-the-art sound quality very close to that of a natural human voice.

Unlike other core artificial intelligence research the company does, this technology is immediately useful to Google. For instance, WaveNet, which first appeared in 2016, is now used in Google Assistant. Tacotron 2 would be a more powerful addition to the service.

Reference: arXiv | 1712.05884

Audio Samples

Below, we have attached some samples. For each sentence, one recording is generated by the artificial intelligence program and the other is spoken by a human. Can you figure out which one is the AI?

“That girl did a video about Star Wars lipstick.”
