MiMo-V2.5-TTS by Xiaomi: Benchmarks, Rankings & Model Details

MiMo-V2.5-TTS is a high-fidelity text-to-speech (TTS) model developed by Xiaomi and released as part of the MiMo-V2.5 foundation model series. It is designed for expressive, natural-sounding speech synthesis with advanced control over emotional state, prosody, and vocal style. The model is built on a proprietary Audio Tokenizer with multi-codebook joint modeling, which lets it handle complex vocal tasks, including multi-dialect synthesis (such as Wu, Cantonese, and Minnan) and singing voice synthesis.

The model targets "full-link" voice interaction and is often deployed alongside the MiMo-V2.5-ASR model to build voice-driven agent pipelines. A notable technical capability is natural language style control: users adjust the synthesized output with free-form descriptive instructions rather than choosing from a fixed set of categories. This allows nuanced transitions and mixed emotions, such as "gentle but tired" or "repressed anger."
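To make the pipeline idea concrete, here is a minimal Python sketch of one voice-agent turn: speech is transcribed, a reply is produced, and the reply is synthesized with a free-form style instruction. Everything named here (`TTSRequest`, `build_tts_request`, `voice_turn`, and the three callables passed in) is hypothetical and chosen for illustration; it does not reflect Xiaomi's actual API, whose request schema is not described in this page. The ASR, agent, and TTS steps are plain callables so that real clients can be swapped in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTSRequest:
    """Hypothetical request shape; NOT the real MiMo-V2.5-TTS schema."""
    text: str   # what to say
    style: str  # free-form style description, e.g. "gentle but tired"


def build_tts_request(text: str, style: str) -> TTSRequest:
    """Normalize whitespace and reject empty text before synthesis."""
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    return TTSRequest(text=text, style=style.strip())


def voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],        # stand-in for an ASR client
    respond: Callable[[str], str],             # stand-in for the agent / LLM
    synthesize: Callable[[TTSRequest], bytes], # stand-in for a TTS client
    style: str,
) -> bytes:
    """Run one ASR -> agent -> TTS turn and return synthesized audio bytes."""
    transcript = transcribe(audio_in)
    reply = respond(transcript)
    return synthesize(build_tts_request(reply, style))


if __name__ == "__main__":
    # Stub components so the sketch runs without any network access.
    audio_out = voice_turn(
        b"\x00\x01",
        transcribe=lambda audio: "turn on the lights",
        respond=lambda text: f"Okay, I will {text}.",
        synthesize=lambda req: f"[{req.style}] {req.text}".encode("utf-8"),
        style="gentle but tired",
    )
    print(audio_out.decode("utf-8"))
```

The style is passed as an ordinary string rather than an enum, which mirrors the descriptive-instruction approach described above: a mixed emotion needs no new category, only a different description.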

Xiaomi has also introduced a Voice Design variant of the model, which generates entirely new custom voices from text descriptions alone. The model is optimized for high-speed inference and is a core component of Xiaomi's "Human x Car x Home" ecosystem, where it targets real-time interactive agents and virtual assistants. While the core MiMo-V2.5 language models were released under an open-source license, the TTS checkpoints are available primarily through official enterprise APIs and proprietary platform integrations.

