MiniMax

Speech 2.8 Turbo

Released Jan 2026

MiniMax Speech 2.8 Turbo is a text-to-speech (TTS) model designed for low-latency, emotionally expressive audio generation. It is the speed-optimized variant of the Speech 2.8 family, prioritizing fast inference for real-time applications such as interactive voice agents, gaming, and live content previews. The model maintains high audio fidelity and natural prosody at a processing latency of under 250 milliseconds.

The model's architecture is based on an autoregressive Transformer combined with a learnable speaker encoder. Unlike traditional TTS systems that rely on mel-spectrogram vocoders, Speech 2.8 Turbo uses a hybrid Flow-VAE decoder to model speech within a learned latent space. This design lets the model produce audio with human-like cadence and tonal nuance. It also supports zero-shot voice cloning by extracting timbre features from reference audio samples as short as 10 seconds.

Key Capabilities

MiniMax Speech 2.8 Turbo supports over 40 languages and provides access to a library of more than 300 system voices. This version adds support for natural interjections: non-verbal sounds such as (laughs), (sighs), (gasps), and (chuckle) can be inserted directly into the text prompt for more lifelike delivery. Users can also control speech parameters, including playback speed (0.5x to 2.0x), volume, pitch, and emotional tone presets such as happy, sad, angry, or surprised.
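
As a rough illustration of how these controls might be bundled on the client side, the Python sketch below validates the documented speed range and assembles a request dictionary. The field names (`text`, `voice_settings`, `speed`, `volume`, `pitch`, `emotion`), the default values, and the `build_request` helper are assumptions made for illustration and are not taken from the MiniMax API. Only the 0.5x to 2.0x speed range, the interjection tags, and the emotion names come from the description above.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Speed range stated in the model description.
MIN_SPEED, MAX_SPEED = 0.5, 2.0


@dataclass
class VoiceSettings:
    """Hypothetical container for the controls listed above."""
    speed: float = 1.0              # playback speed multiplier
    volume: float = 1.0             # assumed scale; not specified in the source
    pitch: int = 0                  # assumed scale; not specified in the source
    emotion: Optional[str] = None   # e.g. "happy", "sad", "angry", "surprised"

    def __post_init__(self) -> None:
        if not MIN_SPEED <= self.speed <= MAX_SPEED:
            raise ValueError(
                f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {self.speed}"
            )


def build_request(text: str, settings: VoiceSettings) -> dict:
    """Assemble a request dictionary (hypothetical shape), omitting unset fields."""
    voice = {k: v for k, v in asdict(settings).items() if v is not None}
    return {"text": text, "voice_settings": voice}


# Interjection tags are written inline in the prompt text.
request = build_request(
    "That was unexpected (laughs) but I'll take it.",
    VoiceSettings(speed=1.25, emotion="happy"),
)
```

Validating the speed range before sending keeps an out-of-range value from costing a network round trip. The actual request schema should be taken from the official API reference.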

Prompting and Best Practices

To achieve the most natural results, users are encouraged to write out numbers and dates fully (e.g., "March fifteenth" rather than "3/15"). The model supports specialized markers for pause control, such as <#0.5#> for a half-second silence, and includes a pronunciation dictionary to handle brand names or technical acronyms with precise phonetic control. For streaming applications, the model supports WebSocket interfaces to minimize time-to-first-byte during generation.
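
The text-preparation advice above can be sketched as two small Python helpers: one builds the `<#0.5#>`-style pause marker, and one flags numeric tokens that should probably be written out before synthesis. The helper names (`pause`, `join_with_pauses`, `find_unspelled`) and the choice to surround markers with spaces are assumptions for illustration. Only the marker syntax itself comes from the description above.

```python
import re
from typing import Iterable, List


def pause(seconds: float) -> str:
    """Return a pause marker such as <#0.5#> for the given duration."""
    if seconds <= 0:
        raise ValueError("pause duration must be positive")
    return f"<#{seconds:g}#>"


def join_with_pauses(sentences: Iterable[str], seconds: float = 0.5) -> str:
    """Join sentences with a pause marker between them.

    Spaces around the marker are an assumption made for readability.
    """
    return f" {pause(seconds)} ".join(s.strip() for s in sentences)


def find_unspelled(text: str) -> List[str]:
    """List digit-based tokens (e.g. '3/15') that may read better spelled out."""
    return re.findall(r"\d[\d/.,:-]*\d|\d", text)


prompt = join_with_pauses(["Your order has shipped.", "It arrives March fifteenth."])
warnings = find_unspelled("It arrives 3/15 at 5 pm.")
```

Here `prompt` contains a half-second pause between the two sentences, and `warnings` lists the tokens `3/15` and `5` so they can be rewritten as words before the text is sent for synthesis.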

Rankings & Comparison