StepFun

StepAudio 2.5 TTS

Released Apr 2026

StepAudio 2.5 TTS is a contextual text-to-speech model developed by StepFun, designed to provide high-fidelity, performance-oriented vocal control through natural language commands. Released in April 2026, the model moves away from traditional tag-based systems: users describe nuanced vocal deliveries, such as "restrained sadness" or "slight trembling", in plain text. This lets it generate expressive speech that conveys subtext and psychological states rather than narrating the text flatly.

The model features a dual-layer control architecture consisting of global context and inline context. Global context establishes the overall emotional tone and scene atmosphere for a full audio segment, ensuring coherence across multi-turn interactions. Inline context provides granular control over individual sentences, allowing the system to adjust rhythm, pauses, emphasis, and breathing patterns dynamically. The model also supports zero-shot timbre replication, enabling high-quality voice cloning from a brief reference recording without retraining.
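As a rough illustration of the two control layers, the sketch below models a request in which one global description applies to the whole segment and optional inline descriptions refine individual sentences. The class and field names (`SpeechRequest`, `Sentence`, `global_context`, `inline_context`) and the way the two layers are joined are assumptions made for this example; they are not StepFun's actual API or request format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sentence:
    text: str
    # Hypothetical per-sentence direction (rhythm, pauses, emphasis, breathing).
    inline_context: str = ""


@dataclass
class SpeechRequest:
    # Hypothetical segment-wide description of tone and scene.
    global_context: str
    sentences: List[Sentence] = field(default_factory=list)

    def resolve(self) -> List[Tuple[str, str]]:
        """Pair each sentence with the full direction that applies to it."""
        resolved = []
        for sentence in self.sentences:
            direction = self.global_context
            if sentence.inline_context:
                # The inline layer refines the global layer; it does not replace it.
                direction = f"{direction}; {sentence.inline_context}"
            resolved.append((direction, sentence.text))
        return resolved


request = SpeechRequest(
    global_context="quiet hospital corridor, restrained sadness",
    sentences=[
        Sentence("I'm fine."),
        Sentence("Really.", inline_context="slight trembling, pause before speaking"),
    ],
)
for direction, text in request.resolve():
    print(f"[{direction}] {text}")
```

Keeping the global description separate from the per-sentence ones means the scene only has to be stated once, while each sentence can still carry its own delivery notes.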

Technically, StepAudio 2.5 TTS is built on an end-to-end architecture that eliminates the need for separate ASR and LLM pipelines. By processing raw waveforms and using discrete acoustic tokens, it achieves low-latency synthesis while maintaining high stability. The model uses an interleaved modality alignment strategy, which interleaves text and speech tokens in a single sequence to maximize the cognitive performance of the underlying language model during synthesis.
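To make the idea of interleaving concrete, here is a minimal sketch that merges a text-token sequence and a speech-token sequence into one stream by alternating fixed-size chunks. The function name `interleave`, the chunk sizes, and the chunk-alternation scheme itself are illustrative assumptions; the description above does not say which ratio or scheme StepAudio 2.5 TTS actually uses.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, int]  # (modality, token id)


def interleave(
    text_ids: Sequence[int],
    speech_ids: Sequence[int],
    text_chunk: int = 2,
    speech_chunk: int = 3,
) -> List[Token]:
    """Alternate chunks of text and speech tokens until both are exhausted."""
    if text_chunk < 1 or speech_chunk < 1:
        raise ValueError("chunk sizes must be positive")

    merged: List[Token] = []
    t = s = 0
    while t < len(text_ids) or s < len(speech_ids):
        # Emit the next chunk of text tokens (may be empty near the end).
        merged.extend(("text", tid) for tid in text_ids[t:t + text_chunk])
        t += text_chunk
        # Then emit the next chunk of speech tokens.
        merged.extend(("speech", sid) for sid in speech_ids[s:s + speech_chunk])
        s += speech_chunk
    return merged


stream = interleave([1, 2, 3], [10, 11, 12, 13, 14])
print(stream)
```

Tagging each token with its modality keeps the merged stream unambiguous, so a downstream consumer can split it back into its text and speech parts.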

According to official documentation, the model is optimized for several industrial applications, including emotional companionship, customer service, and professional video dubbing. It supports a variety of pre-configured voice profiles and offers real-time voice modulation capabilities, allowing for voice-command-triggered tone switching and high-emotion dialogue generation.

Rankings & Comparison