This repository provides a script and recipe to train Tacotron 2 and WaveGlow v1.6 models to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
This text-to-speech (TTS) system is a combination of two neural network models:
- a modified Tacotron 2 model from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper
- a flow-based neural network model from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper
The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural sounding speech from raw transcripts without any additional information such as patterns and/or rhythms of speech.
Our implementation of Tacotron 2 models differs from the model described in the paper. Our implementation uses Dropout instead of Zoneout to regularize the LSTM layers. Also, the original text-to-speech system proposed in the paper uses the WaveNet model to synthesize waveforms. In our implementation, we use the WaveGlow model for this purpose.
The Tacotron 2 and WaveGlow model enables you to efficiently synthesize high quality speech from text.
Both models are trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2.0x faster for Tacotron 2 and 3.1x faster for WaveGlow than training without Tensor Cores, while experiencing the benefits of mixed precision training. The models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time.