Step TTS 2 is a high-quality text-to-speech model developed by StepFun, designed for expressive, natural voice synthesis. It is part of the Step-Audio 2 family, a suite of multimodal models that unify audio understanding and generation in a single architecture. The model targets highly realistic, emotionally expressive speech, providing granular control over vocal characteristics such as pitch, rhythm, and style.
Departing from traditional cascaded pipelines that chain ASR, a language model, and TTS as separate modules, Step TTS 2 uses a large language model (LLM) framework to generate speech directly. This architecture lets the model handle complex paralinguistic features, including emotional tones such as joy or sadness, regional dialects, and distinctive vocalizations such as humming or rapping. It supports multiple languages, including Chinese and English, while maintaining low-latency performance suitable for real-time interaction.
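The direct-generation idea can be pictured as an LLM-style decoding loop that emits discrete audio tokens conditioned on the input text, rather than handing text to a separate synthesis stage. The sketch below is purely illustrative: the function names, the token id scheme, and the toy next-token predictor are hypothetical and do not reflect the actual Step TTS 2 interface.

```python
def generate_speech_tokens(text_tokens, step_fn, max_tokens=64, eos=-1):
    """Autoregressively emit discrete audio tokens conditioned on text,
    mimicking how an LLM-style TTS decodes speech directly instead of
    delegating to a separate cascaded TTS stage.

    `step_fn(text_tokens, audio_so_far)` stands in for the model's
    next-token prediction; in a real system its output would be fed
    to an audio detokenizer/vocoder to produce a waveform.
    """
    audio_tokens = []
    while len(audio_tokens) < max_tokens:
        nxt = step_fn(text_tokens, audio_tokens)  # predict next audio token
        if nxt == eos:  # model signals end of speech
            break
        audio_tokens.append(nxt)
    return audio_tokens


def toy_step(text_tokens, audio_so_far):
    """Toy predictor: maps each text token into a fake 'audio' id space
    (offset by 1000), then emits end-of-speech. For illustration only."""
    if len(audio_so_far) >= len(text_tokens):
        return -1  # end of speech
    return 1000 + text_tokens[len(audio_so_far)]
```

Because the whole loop runs inside one model, paralinguistic instructions (emotion, dialect, speaking style) can condition the same decoding pass instead of being bolted onto a downstream synthesizer.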
Capabilities and Architecture
The model incorporates instruction-driven fine control, allowing users to adjust speech rate and emotional intensity dynamically through textual prompts. Tokenization uses a dual-codebook framework that processes semantic and acoustic information in parallel, which helps keep the synthesized waveforms stable and clear. On industry benchmarks such as SEED-TTS, Step TTS 2 has demonstrated high speaker similarity and low error rates, particularly in capturing the nuances of varied conversational scenarios.
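One common way to realize a dual-codebook scheme is to interleave the two token streams so the decoder sees paired semantic and acoustic context at every step. The snippet below is a minimal sketch of that interleaving under assumed 1:1 alignment; the function name and the fixed ratio are illustrative assumptions, not details confirmed for Step TTS 2.

```python
def interleave_codebooks(semantic_tokens, acoustic_tokens):
    """Interleave parallel semantic and acoustic token streams into a
    single sequence (s0, a0, s1, a1, ...), a common layout for
    dual-codebook speech tokenizers.

    Assumes a hypothetical 1:1 alignment between the two streams;
    real systems may use other ratios or packing schemes.
    """
    if len(semantic_tokens) != len(acoustic_tokens):
        raise ValueError("streams must be aligned 1:1 in this sketch")
    merged = []
    for sem, acu in zip(semantic_tokens, acoustic_tokens):
        merged.extend([sem, acu])  # pair semantic with acoustic token
    return merged
```

Keeping semantic tokens (what is being said) adjacent to acoustic tokens (how it sounds) gives the decoder both kinds of context at each position, which is one intuition for why such schemes improve stability and clarity.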