NVIDIA Magpie Multilingual (also known as Magpie TTS Multilingual) is a generative text-to-speech (TTS) model designed for natural, multilingual speech synthesis. It works as a two-stage pipeline: typically, a transformer encoder-decoder model predicts discrete acoustic tokens from text, and a neural audio codec, such as NVIDIA's NanoCodec, then reconstructs a high-fidelity waveform from those tokens. The model supports several languages, including English, Spanish, French, German, Vietnamese, Italian, and Mandarin.
To keep speech generation robust, the architecture uses monotonic alignment techniques, such as a CTC loss and attention priors, which encourage the attention between text and audio to advance in order through the input. This helps the model adhere closely to the text and avoid failures such as repeated or skipped words. It also incorporates Classifier-Free Guidance (CFG) to balance speaker similarity and audio quality, alongside preference-alignment methods such as Group Relative Policy Optimization (GRPO), a reinforcement-learning technique, to improve output quality across diverse linguistic contexts.
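As a rough illustration of two of the techniques named above, the following Python sketch shows the standard CFG logit blend and the group-relative advantage used in GRPO. It is a minimal sketch under stated assumptions, not Magpie's implementation: the function names, the toy logits, the guidance scale, and the reward values are all hypothetical, and the real model applies these ideas to acoustic-token distributions and learned reward signals.

```python
import math

def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance: move from the unconditional logits toward
    the conditional ones. scale=1.0 returns the conditional logits unchanged;
    scale>1.0 exaggerates the effect of the conditioning."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each sample's reward is normalized by the
    mean and standard deviation of its own group of samples."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical logits over a two-entry acoustic-token vocabulary.
cond = [2.0, 0.0]    # with text/speaker conditioning
uncond = [1.0, 0.0]  # with conditioning dropped
guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)

# Hypothetical rewards for three candidate utterances of the same text.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

In this toy setup a larger guidance scale puts more probability on the token favored by the conditioning, and the candidates that score above their group's mean receive positive advantages while those below receive negative ones.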
The system is optimized for real-time and streaming applications and supports multiple voices and emotional tones. For most supported languages, it uses the International Phonetic Alphabet (IPA) for internal tokenization and training, which lets it handle varied phonetic structures while keeping pronunciation and prosody consistent.
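To make the idea of IPA-based tokenization concrete, the sketch below builds one symbol vocabulary from IPA transcriptions in two languages and encodes each transcription as integer IDs. It is only an illustration: the helpers `build_vocab` and `encode`, the special tokens, and the per-character split are assumptions for this example and do not describe Magpie's actual tokenizer, which would also need a grapheme-to-phoneme step to produce the IPA strings.

```python
def build_vocab(transcriptions):
    """Build a shared symbol-to-ID map from IPA strings.
    IDs 0 and 1 are reserved for padding and unknown symbols."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for symbol in sorted(set("".join(transcriptions))):
        vocab[symbol] = len(vocab)
    return vocab

def encode(ipa, vocab):
    """Map each IPA symbol to its ID, falling back to <unk>."""
    return [vocab.get(symbol, vocab["<unk>"]) for symbol in ipa]

# Illustrative transcriptions: English "hello" and Spanish "hola".
english = "həˈloʊ"
spanish = "ˈola"

vocab = build_vocab([english, spanish])
english_ids = encode(english, vocab)
spanish_ids = encode(spanish, vocab)
```

Because both languages draw on the same symbol inventory, a sound such as "l" or the stress mark receives the same ID in either transcription, which is one reason a phonetic alphabet is convenient for a multilingual model.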