Step Audio EditX by StepFun: Benchmarks, Rankings & Model Details

Step-Audio-EditX is an open-source, 3-billion-parameter audio model developed by StepFun, designed for expressive speech synthesis and iterative audio editing. Unlike traditional signal-processing tools that manipulate waveforms directly, this model treats speech as discrete tokens, allowing users to perform audio edits through high-level text-like operations. It supports a variety of tasks, including zero-shot text-to-speech (TTS), voice cloning, and fine-grained control over emotional tone, speaking style, and paralinguistic cues such as laughter or breathing.

The architecture uses a dual-codebook tokenizer that decomposes speech into two synchronized token streams: a linguistic stream at 16.7 Hz and a semantic stream at 25 Hz. The tokenizer preserves prosodic and emotional information, which the 3B-parameter audio LLM then processes. The LLM is initialized from a pre-trained text model and trained on a blended corpus of text and audio tokens to bridge the gap between linguistic instructions and acoustic output.
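To make the two token rates concrete, here is a minimal Python sketch of the arithmetic they imply. It assumes that 16.7 Hz is a rounded 50/3 Hz, so the two streams stand in an exact 2:3 ratio. The `interleave` helper is a hypothetical illustration of how two such streams could be merged into one sequence. It is not taken from the StepFun code base.

```python
from fractions import Fraction
import math

# Assumption: 16.7 Hz is treated as exactly 50/3 Hz so the ratio is exact.
LINGUISTIC_HZ = Fraction(50, 3)
SEMANTIC_HZ = Fraction(25)


def token_counts(duration_s):
    """Return (linguistic, semantic) token counts for a clip of duration_s seconds."""
    d = Fraction(duration_s)
    return math.floor(d * LINGUISTIC_HZ), math.floor(d * SEMANTIC_HZ)


def stream_ratio():
    """Ratio of linguistic to semantic tokens per unit of time."""
    return LINGUISTIC_HZ / SEMANTIC_HZ


def interleave(linguistic, semantic):
    """Hypothetical merge: 2 linguistic tokens followed by 3 semantic tokens per chunk."""
    chunks = min(len(linguistic) // 2, len(semantic) // 3)
    merged = []
    for i in range(chunks):
        merged.extend(linguistic[2 * i:2 * i + 2])
        merged.extend(semantic[3 * i:3 * i + 3])
    return merged


if __name__ == "__main__":
    print(token_counts(6))   # token counts for a 6-second clip
    print(stream_ratio())    # linguistic:semantic ratio as a fraction
```

Under these assumptions, every 120 ms of audio corresponds to two linguistic tokens and three semantic tokens, which is why the sketch groups tokens in chunks of five.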

StepFun employs a large-margin learning strategy and reinforcement learning techniques, including PPO and GRPO, to refine the model's ability to follow natural language editing prompts. This approach allows the model to achieve high expressivity and iterative control without requiring complex disentangling encoders. Evaluations on benchmarks such as Step-Audio-Edit-Test demonstrate its capacity for stable, multi-round editing while maintaining the original speaker's timbre.
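For readers unfamiliar with GRPO, the following minimal sketch shows the group-relative advantage computation that gives the method its name: several outputs are sampled for the same prompt, each is scored, and each score is normalized against the group's mean and standard deviation. This illustrates the general technique only, not StepFun's training code. The function name, the `eps` parameter, and the example reward values are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    rewards: scores for several sampled outputs of the same prompt.
    eps: small constant (an assumption here) to avoid division by zero
         when every sample receives the same reward.
    """
    mean = statistics.fmean(rewards)
    # Population standard deviation is one possible choice; implementations differ.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical reward-model scores for three candidate edits of one clip.
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Samples that score above the group average receive a positive advantage and are reinforced, while below-average samples are discouraged, so no separate value network is needed for the baseline.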
