Text-to-Speech (TTS) using Tacotron
- January 15, 2020
- Posted by: vsinghal
- Category: Natural Language Processing Speech Applications
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars
Last Saturday (11th Jan 2020), CellStrat AI Lab Team Lead Indrajit Singh presented a superb workshop on “Text-to-Speech (TTS) using Tacotron and Tacotron 2”.
Here is a summary of the Tacotron algorithms :-
For a fan of the Marvel Cinematic Universe, the voice of J.A.R.V.I.S, the AI managing Tony Stark’s superhero affairs, would be the go-to example for an AI producing nuanced human speech. What if this possibility of human sounding speech by an AI were a reality? Researchers at Google claim to have managed to accomplish a similar feat through Tacotron 2.
In a paper titled, Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions, a group of researchers from Google claim that their new AI-based system, Tacotron 2, can produce near-human speech from textual content.
In recent years, end-to-end neural networks have become the state of the art for speech recognition tasks and they are now widely deployed in industry (Amodei et al., 2016). Naturally, this has led to the creation of systems to do the opposite – end-to-end speech synthesis from raw text. Very recently, neural TTS systems have become highly competitive with their conventional counterparts, showing high naturalness scores in a variety of incarnations.
What Is Tacotron 2 ?
Tacotron is a state-of-the-art end-to-end speech synthesis model, which can generate speech directly from graphemes or phonemes. Tacotron 2 is an improved version of Tacotron that converts text to speech with a neural network.
How Does It Work?
Tacotron 2’s neural network architecture synthesises speech directly from text. It combines convolutional neural network (CNN) and recurrent neural network (RNN) layers.
Tacotron 2 is said to be an amalgamation of the best features of Google’s WaveNet, a deep generative model of raw audio waveforms, and Tacotron, its earlier speech synthesis project. The sequence-to-sequence model that generates mel spectrograms has been borrowed from Tacotron, while the generative model synthesising time domain waveforms from the generated spectrograms has been borrowed from WaveNet. WaveNet is an audio generative model. It takes a sequence of audio samples as input and predicts the most likely following audio sample. However, WaveNet on its own is not an end-to-end TTS model.
To build a mel spectrogram, the waveform is first transformed with the STFT (Short Time Fourier Transform), and the resulting magnitudes are mapped onto the mel frequency scale and stored in a matrix. In other words, the one-dimensional speech signal becomes a two-dimensional time-frequency representation. It is easy to think of it as the voice being converted into a photo-like picture.
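As a rough illustration of that waveform-to-matrix step, here is a minimal sketch in plain Python. The frame size, hop size, number of mel bands and the HTK-style mel formula are all assumptions chosen to keep the example small; they are not the settings used in the Tacotron papers, and a real pipeline would use an FFT library rather than this naive DFT.

```python
import cmath
import math


def hz_to_mel(hz):
    # One common (HTK-style) mel formula; other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def stft_magnitude(signal, n_fft=64, hop=16):
    """Hann-windowed frames -> list of magnitude spectra (n_fft // 2 + 1 bins each)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    hz_points = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_freqs])
    return bank


def mel_spectrogram(signal, sample_rate, n_fft=64, hop=16, n_mels=8):
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    return [[sum(w * m for w, m in zip(row, mags)) for row in bank]
            for mags in stft_magnitude(signal, n_fft, hop)]


# A 1 kHz tone sampled at 8 kHz: a 1-D list of samples becomes a 2-D matrix.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
linear = stft_magnitude(tone)
mel = mel_spectrogram(tone, sample_rate)
```

Each row of `mel` is one time frame and each column one mel band, which is the "picture" that the spectrogram prediction network learns to produce.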
The Tacotron 2 is made up as follows :-
Tacotron 2 = Tacotron-made mel-spectrogram + WaveNet Vocoder – Griffin-Lim Algorithm
The Griffin-Lim algorithm was the waveform synthesis step in Tacotron 1. It estimates the phase information that is discarded when the STFT is reduced to a magnitude spectrogram. It iteratively attempts to find the waveform whose STFT magnitude is closest to the generated spectrogram.
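The iteration can be sketched in a few lines of plain Python. This is only a toy version under stated assumptions: the frame size, hop size, random initial phase and iteration count are illustrative choices, the naive DFT is far too slow for real audio, and it works on a linear (not mel) magnitude spectrogram.

```python
import cmath
import math
import random

N_FFT, HOP = 32, 8  # toy sizes so the naive DFT stays fast
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * n / N_FFT) for n in range(N_FFT)]
TWIDDLE = [cmath.exp(-2j * math.pi * i / N_FFT) for i in range(N_FFT)]


def dft(frame):
    return [sum(frame[t] * TWIDDLE[(k * t) % N_FFT] for t in range(N_FFT))
            for k in range(N_FFT)]


def idft_real(spectrum):
    return [sum(spectrum[k] * TWIDDLE[(k * t) % N_FFT].conjugate()
                for k in range(N_FFT)).real / N_FFT
            for t in range(N_FFT)]


def stft(signal):
    return [dft([signal[s + n] * WINDOW[n] for n in range(N_FFT)])
            for s in range(0, len(signal) - N_FFT + 1, HOP)]


def istft(frames, length):
    """Least-squares inverse: windowed overlap-add divided by summed squared windows."""
    out, norm = [0.0] * length, [0.0] * length
    for i, frame in enumerate(frames):
        chunk = idft_real(frame)
        for n in range(N_FFT):
            out[i * HOP + n] += chunk[n] * WINDOW[n]
            norm[i * HOP + n] += WINDOW[n] ** 2
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


def spectral_error(signal, magnitudes):
    return sum((abs(c) - m) ** 2
               for cf, mf in zip(stft(signal), magnitudes)
               for c, m in zip(cf, mf))


def griffin_lim(magnitudes, length, n_iter=20, seed=0):
    rng = random.Random(seed)
    # Start from the target magnitudes with a random phase.
    spec = [[m * cmath.exp(2j * math.pi * rng.random()) for m in frame]
            for frame in magnitudes]
    for _ in range(n_iter):
        reprojected = stft(istft(spec, length))
        # Keep the target magnitude, adopt the phase of the re-analysed signal.
        spec = [[m * (c / abs(c) if abs(c) > 1e-12 else 1.0)
                 for m, c in zip(mf, cf)]
                for mf, cf in zip(magnitudes, reprojected)]
    return istft(spec, length)


tone = [math.sin(2 * math.pi * 4 * n / N_FFT) for n in range(128)]
target = [[abs(c) for c in frame] for frame in stft(tone)]  # phase thrown away
rough = griffin_lim(target, len(tone), n_iter=1)
refined = griffin_lim(target, len(tone), n_iter=20)
```

Comparing `spectral_error(rough, target)` with `spectral_error(refined, target)` shows how extra iterations pull the STFT magnitude of the reconstructed waveform towards the target spectrogram.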
Let’s discuss Tacotron 1 architecture. It looks like this :-
The model takes characters as input and outputs the corresponding raw spectrogram, which is then fed to the Griffin-Lim reconstruction algorithm to synthesize speech.
The Encoder works on a character embedding of the input text. This is passed to a Pre-net layer and CBHG layer. Pre-net is a fully-connected neural network with Dropout for regularization. The Tacotron Encoder then uses a module called CBHG on top of the pre-net. The name of this module comes from its building blocks: a 1-D convolution bank (CB), followed by a highway network (H) and a bidirectional GRU (G). The GRU learns long-term dependencies in the sequence.
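To make the "H" in CBHG a little more concrete, here is a minimal sketch of a single highway layer in plain Python. The weights, biases and helper names below are hypothetical values made up for illustration; a trained CBHG module stacks several such layers between the convolution bank and the bidirectional GRU.

```python
import math


def affine(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with H a ReLU layer and T a sigmoid gate."""
    h = [max(0.0, a) for a in affine(w_h, b_h, x)]
    t = [1.0 / (1.0 + math.exp(-a)) for a in affine(w_t, b_t, x)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, -2.0]

# Gate almost closed: the input is carried through nearly unchanged.
carried = highway_layer(x, identity, [0.0, 0.0], identity, [-50.0, -50.0])
# Gate almost open: the output is essentially the transformed input H(x).
transformed = highway_layer(x, identity, [0.0, 0.0], identity, [50.0, 50.0])
```

The gate lets each unit choose between passing its input through and replacing it with a transformed value, which is what makes deep stacks of these layers easy to train.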
The decoder uses an attention mechanism over the encoder’s output to produce mel-spectrogram frames. The decoder also has a Pre-net layer, which is followed by a one-layer GRU (the attention RNN). The output of this GRU is used as the query against the encoder’s output to produce the context vector through the attention mechanism. The GRU output is then concatenated with the context vector to form the input of the decoder RNN block. At each step the decoder RNN module produces r mel-spectrogram frames (r is the reduction factor), and only the last of them is fed to the pre-net at the next time-step.
Now let’s look at Tacotron 2 architecture.
The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
The model is an encoder-attention-decoder setup that uses ‘location sensitive attention’. The first part is an Encoder which converts the character sequence into a hidden feature representation, starting from learned character embeddings. This representation is later consumed by the Decoder to predict spectrograms. The decoder is an autoregressive recurrent neural network which predicts a mel spectrogram from the encoded input sequence one frame at a time.
When it comes to differences between the two Tacotron models, Tacotron 1 had problems predicting the end of sequence token and tended to get stuck. Tacotron 2 discards the reduction factor, but adds location sensitive attention as in Chorowski et al.’s ASR work to help the attention move forward.
Regarding location sensitive attention, the only difference between location-based attention and content-based attention is the way the network does the attention scoring. Location-based attention doesn’t care at all about the content of the input tokens; it only cares about their locations and the distances between them (Ref : https://github.com/Rayhane-mamah/Tacotron-2/wiki/Spectrogram-Feature-prediction-network and https://arxiv.org/pdf/1308.0850.pdf).
Rayhane Mamah’s Tacotron 2 implementation uses Hybrid Attention. As the name suggests, it is a mix of the two previously discussed attention mechanisms. Hybrid attention takes into consideration both the content and the location of the input tokens, with presumably better results.
In addition to the architectural differences, the important bit is that Tacotron 2 uses WaveNet instead of Griffin-Lim to get back the audio signal, which makes for very realistic sounding speech.
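A minimal sketch of hybrid scoring in plain Python follows. It assumes the commonly cited form e_j = v · tanh(W q + V h_j + u f_j), where f is the previous alignment passed through a 1-D convolution. All weights and the kernel are hypothetical toy values, and a real implementation uses several learned location filters rather than the single one used here.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def conv1d_same(signal, kernel):
    """Zero-padded 1-D filter that keeps the input length."""
    half = len(kernel) // 2
    return [sum(w * signal[j + k - half]
                for k, w in enumerate(kernel)
                if 0 <= j + k - half < len(signal))
            for j in range(len(signal))]


def hybrid_attention(query, memory, prev_align, w_q, w_m, u, kernel, v):
    """Scores mix content (query, memory) with location (previous alignment)."""
    location = conv1d_same(prev_align, kernel)
    projected_query = matvec(w_q, query)
    energies = []
    for h_j, f_j in zip(memory, location):
        projected_memory = matvec(w_m, h_j)
        hidden = [math.tanh(a + b + c * f_j)
                  for a, b, c in zip(projected_query, projected_memory, u)]
        energies.append(dot(v, hidden))
    return softmax(energies)


# Toy setup: content weights are zero, so only the location term matters.
zeros = [[0.0, 0.0], [0.0, 0.0]]
memory = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
prev_align = [1.0, 0.0, 0.0, 0.0]   # the last step attended to position 0
shift_kernel = [1.0, 0.0, 0.0]      # copies the weight from position j - 1
align = hybrid_attention([1.0, 1.0], memory, prev_align,
                         zeros, zeros, [2.0, 2.0], shift_kernel, [1.0, 1.0])
```

With the content weights zeroed out, the new alignment peaks at position 1, one step after the previous peak, which illustrates how the location term can help the attention move forward. Setting `u` to zeros instead would leave a purely content-based score.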
With additional references to https://medium.com/@rajanieprabha/tacotron-2-implementation-and-experiments-832695b1c86e
The loss function is the summed mean squared error (MSE). Whereas Tacotron 1 uses a simple summed L1 loss function (or MAE), Tacotron 2 uses a summed L2 loss function (or MSE) :-
(h(x_i) stands for the model estimation and y_i for the ground truth)
$$L_1 = \sum_i |y_i - h(x_i)|$$
$$L_2 = \sum_i (y_i - h(x_i))^2$$
The L1 loss takes the residual between the model’s prediction and the ground truth for each sample and sums the absolute values. The L2 loss instead squares each residual before summing, so large errors are penalised much more heavily than small ones.
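The difference between the two losses is easy to see on a handful of numbers. The sketch below uses made-up target and prediction values; in the actual models the sums run over every bin of every spectrogram frame.

```python
def summed_l1(targets, predictions):
    """Sum of absolute residuals."""
    return sum(abs(y - p) for y, p in zip(targets, predictions))


def summed_l2(targets, predictions):
    """Sum of squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions))


targets = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 0.0]   # residuals: -0.5, 0.0, 3.0

l1 = summed_l1(targets, predictions)   # 0.5 + 0.0 + 3.0 = 3.5
l2 = summed_l2(targets, predictions)   # 0.25 + 0.0 + 9.0 = 9.25
```

The single large residual of 3.0 contributes 3.0 to the L1 total but 9.0 to the L2 total, so the L2 loss is dominated by the worst predictions.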
As per the paper https://arxiv.org/pdf/1803.09017.pdf, Google researchers have made attempts at exercising finer control over such factors by creating embeddings for prosodic style (intonation, stress and rhythm) and for speakers (also in Baidu’s works). They call the style embeddings Global Style Tokens (GST). The Style Tokens are randomly initialized and compared against a ‘reference’ encoding (which is just a training audio example ingested by the reference encoder module) by means of attention, so that the style of the audio example is represented as a weighted sum of all the style tokens.
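The "weighted sum of style tokens" idea can be sketched in plain Python as follows. This is a simplification under stated assumptions: it uses single-head dot-product attention and hand-picked two-dimensional tokens, whereas the paper learns the tokens and uses a more elaborate attention module.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(reference, tokens):
    """Attend from the reference encoding over the tokens, then mix the tokens."""
    scores = [sum(r * t for r, t in zip(reference, token)) for token in tokens]
    weights = softmax(scores)
    mixed = [sum(w * token[d] for w, token in zip(weights, tokens))
             for d in range(len(tokens[0]))]
    return weights, mixed


# Hypothetical tokens and reference encoding, chosen only for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
reference = [2.0, 0.5]
weights, embedding = style_embedding(reference, tokens)
```

Because the style embedding is just a mixture of tokens, the weights can also be set by hand at inference time instead of being computed from a reference clip, which is how this kind of model allows style control.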
The image below shows full Tacotron architecture for prosody control. The autoregressive decoder is conditioned on the result of the reference encoder, transcript encoder, and speaker embedding via an attention module.
As per the Tacotron 2 testing done by Rajanie Prabha (https://medium.com/@rajanieprabha/tacotron-2-implementation-and-experiments-832695b1c86e), the predicted and target mel spectrograms looked like these :-
Comparing the upper regions of the predicted image with the target image shows that the prediction has a lot of gaps, so the model still needs a lot more training. The solid green area on the right is just padding within the batch.
Wish to be part of India’s No 1 AI Lab ? If yes, attend our meetups this Saturday (18 Jan ’20) in BLR / Gurugram. RSVP below :-
BLR AI Lab meetup :-
Register : https://www.meetup.com/Disrupt-4-0/events/qqmxlrybccbxb/
Topic : NER-NLP, 3D Reconstruction, Boosting
Presenters : Natarajan Lalgudi, Bhanumathi K., Ashwin Ravishankar
Gurugram AI Lab meetup :-
Register : https://www.meetup.com/Disrupt-4-0/events/267510048/
Topic : BERT with PyTorch, Word Embeddings, Image Classification
Presenters : Bhavesh Laddagiri, Akarsh Verma, Sonal Kukreja
See you this Saturday for the AI Lab meetup ! Let’s disrupt the world with AI, together !
Questions ? Call me at +91-9742800566 !
Best Regards,
Vivek Singhal
Co-Founder & Chief Data Scientist, CellStrat
+91-9742800566