Voxtral TTS is an open-weight text-to-speech model developed by Mistral AI, serving as the generative output component of the Voxtral speech stack. Released in March 2026, the model features a 4 billion parameter architecture designed for low-latency, high-fidelity audio generation. It supports nine major languages—English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic—and is capable of producing emotionally expressive speech that captures natural pauses, rhythm, and intonation.

The model utilizes a hybrid architecture consisting of three core modules: a 3.4B parameter transformer decoder (built on the Ministral 3B backbone) for semantic text interpretation, a 390M parameter flow-matching acoustic transformer, and a 300M parameter neural audio codec for final waveform synthesis. This modular design allows the model to separate the linguistic content from the acoustic texture of the voice, enabling zero-shot voice cloning with as little as three seconds of reference audio.

Voxtral TTS is optimized for real-time applications and edge deployment. It achieves a model latency of approximately 70ms to 90ms for typical inputs and a real-time factor of roughly 9.7x, making it fast enough for streaming interactive voice agents. A notable capability is its zero-shot cross-lingual adaptation, which allows it to generate speech in one language using the accent and vocal characteristics of a reference prompt from a different language.

The weights for Voxtral TTS are released under the CC BY-NC 4.0 license for research and non-commercial use, marking a shift from the Apache 2.0 license used for the transcription-focused Voxtral models. In human preference benchmarks reported by Mistral, the model achieved a 68.4% win rate against proprietary low-latency competitors in zero-shot voice cloning tasks.
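To make the latency figures above concrete, the sketch below estimates wall-clock synthesis time from the reported numbers: a real-time factor of roughly 9.7x (audio seconds produced per second of compute) plus the ~70–90 ms model latency, taken here as 80 ms. The helper function and its parameters are illustrative assumptions for back-of-envelope budgeting, not part of any Voxtral API.

```python
def generation_time(audio_seconds: float, rtf: float = 9.7,
                    model_latency_ms: float = 80.0) -> float:
    """Estimate wall-clock seconds to synthesize `audio_seconds` of speech.

    Assumes a fixed startup cost (the reported ~70-90 ms model latency,
    approximated as 80 ms) plus steady-state throughput governed by the
    real-time factor: rtf audio seconds generated per compute second.
    """
    return model_latency_ms / 1000.0 + audio_seconds / rtf


# Under these assumptions, a 10-second utterance takes about
# 0.08 + 10 / 9.7 ≈ 1.11 s of compute, well under real time.
print(round(generation_time(10.0), 2))
```

Because estimated generation time stays far below the audio duration, a streaming agent can begin playback almost immediately and stay ahead of the listener, which is what makes the model viable for interactive voice use.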