Step TTS 2 is a generative text-to-speech model developed by StepFun, released in March 2026. It serves as the dedicated speech synthesis component of the Step-Audio 2 family, featuring a unified architecture that integrates audio tokenization directly into the language modeling process. This end-to-end approach allows the model to maintain semantic and prosodic alignment across diverse vocal outputs, eliminating the need for traditional cascaded pipelines in which speech recognition and speech synthesis are handled by separate models.
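The unified-token idea described above can be illustrated with a minimal sketch. This is not StepFun's implementation: the vocabulary sizes, the offset scheme, and the helper names below are all assumptions chosen to show how discrete audio codec tokens and text tokens can share a single vocabulary, so that one language model predicts both in an interleaved stream.

```python
# Hypothetical sizes, for illustration only.
TEXT_VOCAB_SIZE = 32_000      # assumed text vocabulary size
AUDIO_CODEBOOK_SIZE = 4_096   # assumed audio codec codebook size


def audio_token_id(codec_index: int) -> int:
    """Map an audio codec index into the shared vocabulary,
    offset past all text token ids."""
    if not 0 <= codec_index < AUDIO_CODEBOOK_SIZE:
        raise ValueError("codec index out of range")
    return TEXT_VOCAB_SIZE + codec_index


def is_audio_token(token_id: int) -> bool:
    """True if a token in the unified stream denotes audio."""
    return token_id >= TEXT_VOCAB_SIZE


# A single model output can freely interleave text and audio tokens;
# the decoder later strips the offset to recover codec indices.
stream = [101, 2045, audio_token_id(7), audio_token_id(990), 102]
codec_indices = [t - TEXT_VOCAB_SIZE for t in stream if is_audio_token(t)]
```

Because both modalities live in one vocabulary, a single autoregressive pass can emit a spoken answer and its textual content together, which is what removes the cascaded hand-off between stages.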
The model is capable of generating high-fidelity, context-aware speech across multiple languages and dialects, including Chinese, English, Japanese, Cantonese, and Sichuanese. It supports a wide range of paralinguistic controls, enabling users to generate voices with specific emotional states such as joy, sadness, or excitement, as well as complex vocal styles like whispering, rap, and a cappella humming.
A significant feature of Step TTS 2 is its support for Multimodal Retrieval-Augmented Generation (RAG). This technology enables the model to perform "Audio Search," allowing it to mimic specific speaking styles or switch vocal timbres dynamically by retrieving and fusing voice features from a library of over 50,000 speaker samples. This provides advanced capabilities for zero-shot voice cloning and stylistic imitation during inference.
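The retrieval step behind "Audio Search" can be sketched in miniature. The code below is an assumption-laden illustration, not the model's actual pipeline: it represents each library speaker as an embedding vector, ranks speakers by cosine similarity to a query style embedding, and fuses the top matches by a similarity-weighted average. The embedding dimension, the fusion rule, and the speaker names are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def retrieve_and_fuse(query, library, k=2):
    """Rank library speakers against the query embedding and return
    the top-k names plus a similarity-weighted fused embedding."""
    ranked = sorted(library.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    top = ranked[:k]
    weights = [cosine(query, emb) for _, emb in top]
    total = sum(weights)
    dim = len(query)
    fused = [sum(w * emb[i] for w, (_, emb) in zip(weights, top)) / total
             for i in range(dim)]
    return [name for name, _ in top], fused


# Toy 3-dimensional "voice library" (real libraries would hold
# tens of thousands of high-dimensional speaker embeddings).
library = {
    "speaker_a": [1.0, 0.0, 0.0],
    "speaker_b": [0.9, 0.1, 0.0],
    "speaker_c": [0.0, 0.0, 1.0],
}
names, fused = retrieve_and_fuse([1.0, 0.05, 0.0], library, k=2)
```

Conditioning synthesis on the fused embedding, rather than on a single stored voice, is one plausible way such a system could imitate a style that no single library speaker matches exactly.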
Based on the Step-Audio 2 framework, the model utilizes a latent space audio encoder and a flow-matching-based speech decoder. It is designed for low-latency, industrial deployment, supporting robust speech interaction and understanding. The model weights for the 8B "mini" variant are released under an open-source license, while larger versions support deeper reasoning and complex conversational contexts.
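Flow-matching decoding, mentioned above, can be sketched as numerically integrating a learned velocity field that transports Gaussian noise to an acoustic latent. The sketch below is illustrative only: the velocity field is a closed-form stand-in that points along a straight (rectified) path to a known target, standing in for the trained network, and the latent is a toy 3-vector rather than a mel-spectrogram frame.

```python
import random


def make_velocity_field(target):
    """Stand-in for a learned velocity network. For the linear path
    x_t = (1 - t) * x0 + t * target, the exact velocity toward the
    target from the current point is (target - x) / (1 - t)."""
    def velocity(x, t):
        return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
    return velocity


def flow_match_decode(velocity, x0, steps=100):
    """Fixed-step Euler integration of dx/dt = v(x, t) from t=0 to t=1."""
    dt = 1.0 / steps
    x, t = list(x0), 0.0
    for _ in range(steps):
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


random.seed(0)
target = [0.5, -1.0, 2.0]                     # toy acoustic latent
noise = [random.gauss(0, 1) for _ in range(3)]  # Gaussian starting point
decoded = flow_match_decode(make_velocity_field(target), noise)
```

With this exact stand-in field, Euler integration lands on the target; a trained model approximates such a field from data, and the small, fixed number of solver steps is what makes flow-matching decoders attractive for low-latency synthesis.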