NVIDIA

Magpie-Multilingual 357M (Feb 2026)

Released Feb 2026

AA Arena: #31
Parameters: 357M

NVIDIA Magpie-Multilingual 357M is an end-to-end neural text-to-speech (TTS) model designed to generate natural and expressive speech across multiple languages. Part of the Magpie series, this model focuses on efficient cross-lingual instruction following at a compact scale, utilizing approximately 357 million parameters. It is built to support a wide range of use cases, from streaming voice agents to offline speech generation.

Architecture and Design

The model uses a transformer encoder-decoder architecture consisting of a 6-layer causal transformer encoder and a 12-layer causal transformer decoder. It autoregressively predicts discrete audio codec tokens across eight parallel codebooks. A frozen, pretrained neural audio codec called NanoCodec, which runs at 22 kHz, then converts these tokens into high-fidelity speech waveforms. To improve generation quality and text-audio alignment, the model applies Classifier-Free Guidance (CFG) at inference time and is trained with Group Relative Policy Optimization (GRPO), a reinforcement-learning preference-optimization method.
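As a rough illustration of how classifier-free guidance can be applied when one token is predicted per codebook, the sketch below blends conditional and unconditional logits and then picks a token for each of the eight codebooks in a single audio frame. This is a minimal sketch, not NVIDIA's implementation: the function names, the guidance scale, the four-token vocabulary, and the logit values are all hypothetical, and greedy selection stands in for whatever sampling strategy the model actually uses.

```python
# Hypothetical sketch of classifier-free guidance over parallel codebooks.
# Names, vocabulary size, and logit values are illustrative assumptions.

def cfg_combine(cond_logits, uncond_logits, scale):
    """Blend logits for one codebook: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def greedy_frame(cond, uncond, scale):
    """Pick one token per codebook for a single audio frame.

    cond and uncond each hold one logit vector per codebook.
    """
    frame = []
    for c, u in zip(cond, uncond):
        guided = cfg_combine(c, u, scale)
        # Greedy choice: index of the largest guided logit.
        frame.append(max(range(len(guided)), key=guided.__getitem__))
    return frame


NUM_CODEBOOKS = 8  # the eight parallel codebooks described above

# Toy logits: the text-conditioned pass slightly prefers token 2, but
# token 1 gains the most from conditioning relative to the unconditional pass.
cond = [[0.0, 1.0, 1.1, 0.0] for _ in range(NUM_CODEBOOKS)]
uncond = [[0.0, 0.2, 1.0, 0.0] for _ in range(NUM_CODEBOOKS)]

unguided_tokens = greedy_frame(cond, uncond, scale=1.0)  # equals plain conditional decoding
guided_tokens = greedy_frame(cond, uncond, scale=2.0)    # guidance amplifies the text signal
```

With a scale of 1.0 the blend reduces to the conditional logits, so guidance has no effect. Larger scales push the choice toward tokens whose scores rise most when the text is present, which is the general intuition behind using CFG to keep generated audio faithful to the input.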

Capabilities and Multilingual Support

Magpie-Multilingual 357M supports nine distinct languages: English (US), Spanish (European), German, French, Italian, Vietnamese, Mandarin Chinese, Hindi, and Japanese. It provides multiple voice profiles, typically including at least one male and one female speaker per language, with options for various emotional tones. The model includes built-in text normalization for most supported languages to handle numbers, abbreviations, and special characters correctly.

A key focus of the architecture is the mitigation of common artifacts found in LLM-based speech models, such as hallucinations, skipped words, or repeated phrases. By employing CTC (Connectionist Temporal Classification) loss and attention priors, the model enforces monotonic cross-attention between the input text and the generated audio. Additionally, while the model is optimized for complete utterances, it supports a sliding window mechanism for stable long-form inference on extended text inputs.
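To make the idea of an attention prior more concrete, the sketch below builds a prior that concentrates each audio frame's attention on text positions near the diagonal and adds its logarithm to raw cross-attention scores before the softmax. The description above does not say what form the model's prior takes, so the Gaussian band, the `sigma` width, and the function names here are assumptions chosen for illustration only.

```python
import math

# Hypothetical sketch of a diagonal attention prior for monotonic alignment.
# The Gaussian shape and all names are illustrative assumptions.

def diagonal_prior(num_frames, num_tokens, sigma=0.1):
    """Return a row-normalised prior over text tokens for each audio frame."""
    prior = []
    for t in range(num_frames):
        frame_pos = (t + 0.5) / num_frames  # relative position of the frame
        row = []
        for n in range(num_tokens):
            token_pos = (n + 0.5) / num_tokens  # relative position of the token
            row.append(math.exp(-((token_pos - frame_pos) ** 2) / (2 * sigma ** 2)))
        total = sum(row)
        prior.append([w / total for w in row])
    return prior


def apply_prior(attn_scores, prior, eps=1e-8):
    """Add the log-prior to raw attention scores, then softmax each row."""
    out = []
    for scores, p_row in zip(attn_scores, prior):
        biased = [s + math.log(p + eps) for s, p in zip(scores, p_row)]
        peak = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(b - peak) for b in biased]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


prior = diagonal_prior(num_frames=6, num_tokens=3)
# With uninformative (all-zero) scores, attention simply follows the prior.
attn = apply_prior([[0.0] * 3 for _ in range(6)], prior)
# Most-attended token per frame moves left to right and never goes backwards.
peaks = [max(range(3), key=row.__getitem__) for row in attn]
```

Because the prior penalises attention far from the diagonal, skipping ahead or jumping back in the text becomes unlikely, which is how a prior of this kind discourages skipped or repeated words. A CTC-based alignment loss, as mentioned above, would act as a separate training signal and is not shown here.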

Rankings & Comparison