Magpie-Multilingual 357M by NVIDIA: Benchmarks, Rankings & Model Details

Magpie-Multilingual 357M (Multilingual Aligned Generative Phoneme-to-audio Inference Engine) is a text-to-speech (TTS) model developed by NVIDIA for generating natural and expressive speech across multiple languages. The model uses a two-stage pipeline: a language model generates discrete acoustic tokens, and a neural audio codec, such as NanoCodec, decodes those tokens into high-fidelity audio waveforms.

The first stage uses a Transformer encoder-decoder architecture (specifically the T5-TTS framework) to predict acoustic tokens autoregressively from input text. The model supports seven languages: English, Spanish, German, French, Vietnamese, Italian, and Mandarin. It offers five distinct voices (Sofia, Aria, Jason, Leo, and John Van Stan) and can produce speech with varying emotional tones and gender characteristics for several supported locales.
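The two stages described above can be illustrated with a toy sketch. This is not NVIDIA's implementation or the NeMo API: `generate_tokens`, `decode_audio`, `toy_next_token`, `TOY_CODEBOOK`, and the `EOS` id are all hypothetical stand-ins, and a real neural codec typically uses several codebooks per frame and a learned decoder rather than a lookup table. The sketch only shows the control flow: tokens are predicted one at a time, each conditioned on the text and on the tokens produced so far, and a separate decoder then turns the token sequence into samples.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id (assumption for this sketch)


def generate_tokens(
    text: str,
    next_token: Callable[[str, List[int]], int],
    max_steps: int = 50,
) -> List[int]:
    """Stage 1: predict acoustic tokens autoregressively, one per step."""
    tokens: List[int] = []
    for _ in range(max_steps):
        tok = next_token(text, tokens)  # conditioned on text and history
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


def decode_audio(tokens: List[int], codebook: List[List[float]]) -> List[float]:
    """Stage 2: map each token to a chunk of samples and concatenate them."""
    samples: List[float] = []
    for tok in tokens:
        samples.extend(codebook[tok])
    return samples


def toy_next_token(text: str, tokens: List[int]) -> int:
    """Stand-in for the language model: one token per character, then EOS."""
    if len(tokens) >= len(text):
        return EOS
    return 1 + (ord(text[len(tokens)]) % 3)


# Stand-in for the codec decoder: each token id indexes two fake samples.
TOY_CODEBOOK = [[0.0, 0.0], [0.1, -0.1], [0.2, -0.2], [0.3, -0.3]]

tokens = generate_tokens("hi", toy_next_token)
audio = decode_audio(tokens, TOY_CODEBOOK)
```

Keeping the token generator and the decoder behind separate functions mirrors the separation the text describes, where the codec can be swapped without changing the token-prediction stage.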

To improve audio quality and text adherence, the model uses several training and inference techniques, including Classifier-Free Guidance (CFG), attention priors, and Group Relative Policy Optimization (GRPO) for preference alignment. The system is designed for low-latency performance in real-time applications such as voice AI agents, digital humans, and interactive voice response (IVR) systems. It is integrated into the NVIDIA NeMo framework and governed by the NVIDIA Community Model License.
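Two of the techniques named above have compact, widely used formulations that a short sketch can make concrete. The source does not say how Magpie applies them, so the code below shows only the generic textbook forms, and the function names `cfg_logits` and `group_relative_advantages` are illustrative. For CFG, one common formulation mixes conditional and unconditional logits as `uncond + scale * (cond - uncond)`. For GRPO, the advantage of each sampled output is its reward normalized by the mean and standard deviation of the rewards in its group.

```python
import math
from typing import List


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance: push logits away from the unconditional ones.

    scale = 0 returns the unconditional logits, scale = 1 the conditional
    logits, and scale > 1 extrapolates past the conditional logits.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: normalize each reward within its sample group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Guidance with scale 2 doubles the gap between conditional and unconditional.
guided = cfg_logits(cond=[2.0, 0.0], uncond=[1.0, 1.0], scale=2.0)

# Rewards for three candidate outputs sampled for the same prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

In this form the best-scoring candidate in a group receives a positive advantage and the worst a negative one, so no separate value model is needed to provide a baseline.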

