StepFun
Open Weights

Step TTS Mini

Released Feb 2025

AA Arena rank: #58
Parameters: 3B

Step TTS Mini (also known as Step-Audio-TTS-3B) is a high-efficiency text-to-speech (TTS) model developed by the Chinese AI startup StepFun (阶跃星辰). Designed as a resource-optimized component of the broader Step-Audio ecosystem, the model provides expressive speech synthesis with low latency, making it suitable for real-time interaction and high-volume content generation. It aims to bridge the gap between high-fidelity audio quality and the efficiency required for edge or localized deployment.

The model is characterized by its emotion perception and voice cloning capabilities, which allow it to generate audio that mimics specific speaker timbres and emotional nuances with minimal reference samples. It natively supports mixed-language synthesis, specifically handling seamless transitions between Chinese and English with natural prosody. The system utilizes a hybrid decoding architecture that combines flow matching with neural vocoding to ensure stable and realistic waveform generation.
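The flow-matching stage mentioned above can be illustrated with a toy one-dimensional sketch: a velocity field is integrated from a noise sample toward a data point with a simple Euler solver. This is a conceptual illustration only, not StepFun's decoder; the straight-line (rectified-flow) velocity field and the names `velocity` and `sample` are assumptions made for the example.

```python
import random

def velocity(x: float, t: float, target: float) -> float:
    # For a straight-line flow that reaches `target` at t = 1,
    # the velocity at position x and time t points at the target:
    #   v(x, t) = (target - x) / (1 - t)
    # In a real TTS decoder, a neural network predicts this field
    # conditioned on text and speaker embeddings.
    return (target - x) / (1.0 - t)

def sample(target: float, steps: int = 100) -> float:
    # Start from Gaussian noise and integrate the ODE dx/dt = v(x, t)
    # from t = 0 to t = 1 with fixed-step Euler updates.
    x = random.gauss(0.0, 1.0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x += velocity(x, t, target) * dt
    return x
```

With this particular field the Euler trajectory lands on the target regardless of the starting noise, which is why flow matching yields stable generation; in practice the learned field transports noise to acoustic features, which a neural vocoder then renders as a waveform.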

Technically, Step TTS Mini has approximately 3 billion parameters and was trained using a generative data engine that leverages StepFun's larger foundational models to produce high-quality synthetic training data. This approach allows the compact model to inherit the stylistic nuances of larger multimodal systems while maintaining a manageable computational footprint. The model was released alongside an open-source inference framework under the Apache 2.0 license.

Rankings & Comparison