Step-Audio-EditX is an open-source 3-billion-parameter audio editing model developed by StepFun. It uses a large language model (LLM) framework to recast traditional audio signal processing as high-level token operations, making speech editing as intuitive as text editing. The model is designed for expressive, iterative modifications, letting users control vocal attributes such as emotion, speaking style, and paralinguistic cues without complex waveform manipulation.
The architecture is built upon a dual-codebook tokenizer that decomposes speech into two distinct streams: a linguistic stream (16.7 Hz) and a semantic stream (25 Hz). This system enables the 3B LLM, which is initialized from a text-based foundation, to process speech as discrete tokens. The model employs a large-margin learning objective during training, which avoids the limitations of traditional representation-level disentanglement and allows for precise control over nuances like laughter, breathing, and specific accents.
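The two token rates imply a fixed 2:3 ratio (16.7 Hz vs. 25 Hz), so the streams can be merged into a single sequence for the LLM by pairing two linguistic tokens with three semantic tokens over each 120 ms window. The merge pattern below is an illustrative assumption, not the published tokenizer implementation:

```python
def interleave_dual_codebook(linguistic, semantic):
    """Merge linguistic (16.7 Hz) and semantic (25 Hz) token streams.

    Assumes a fixed 2:3 interleave: 2 linguistic tokens and 3 semantic
    tokens cover the same 120 ms of audio. The real model's merge
    pattern may differ; this only illustrates the rate relationship.
    """
    if 3 * len(linguistic) != 2 * len(semantic):
        raise ValueError("streams must cover the same duration (2:3 ratio)")
    merged = []
    for i in range(len(linguistic) // 2):
        merged.extend(linguistic[2 * i : 2 * i + 2])  # 2 linguistic tokens
        merged.extend(semantic[3 * i : 3 * i + 3])    # 3 semantic tokens
    return merged
```

For example, two windows of audio (4 linguistic and 6 semantic tokens) flatten into one 10-token sequence that the LLM can consume like ordinary text.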
Key capabilities of Step-Audio-EditX include high-fidelity zero-shot voice cloning and multi-round iterative editing. Users can apply consecutive adjustments—such as changing a speaker's emotion from neutral to fearful and then adjusting the speaking pace—while maintaining the original voice's identity. The model supports several languages, including Mandarin, English, Japanese, and Korean, and can also be used as a post-processing layer to enhance the emotional expressivity of other text-to-speech systems.
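Multi-round editing can be viewed as a loop that feeds each round's output back in as the next round's input, so consecutive instructions accumulate. A minimal sketch of that loop, with a toy stand-in for the model call (the real inference API is not shown here and the function names are hypothetical):

```python
def iterative_edit(edit_fn, audio_tokens, instructions):
    """Apply edit instructions one round at a time.

    edit_fn(tokens, instruction) -> tokens stands in for a call to the
    editing model. Each round conditions on the previous round's output,
    which is what lets consecutive edits (e.g. emotion, then pace)
    stack while the cloned voice identity is carried along.
    """
    history = [audio_tokens]
    for instruction in instructions:
        audio_tokens = edit_fn(audio_tokens, instruction)
        history.append(audio_tokens)
    return audio_tokens, history

# Toy stand-in: an "edit" just tags the token sequence.
tag_edit = lambda tokens, instr: tokens + [f"<{instr}>"]
final, rounds = iterative_edit(tag_edit, ["tok0", "tok1"],
                               ["emotion:fearful", "pace:slower"])
```

The `history` list mirrors the iterative workflow described above: round zero is the cloned utterance, and each later entry is one edit applied on top of the last.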
Training involved a combination of supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), specifically utilizing Proximal Policy Optimization (PPO) with token-level reward models. To benchmark its performance, the developers introduced Step-Audio-Edit-Test, an evaluation framework that uses frontier LLMs as judges to measure the accuracy and naturalness of stylistic and emotional edits across diverse speaker profiles.
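A token-level reward model scores every generated token rather than assigning one score per utterance; PPO then converts those per-token rewards into per-token advantages. A generic sketch using generalized advantage estimation (GAE) follows; the actual reward model, discount factors, and PPO hyperparameters used for Step-Audio-EditX are assumptions here, not published values:

```python
def token_level_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over per-token rewards.

    rewards[t] is the token-level reward for token t and values[t] is
    the critic's value estimate at t. Returns one advantage per token,
    which PPO uses to weight its clipped policy-gradient objective.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae                      # discounted sum
        advantages[t] = gae
    return advantages
```

Because the reward is dense (one signal per token), credit assignment is sharper than with a single utterance-level score, which plausibly matters for fine-grained targets like a single breath or laugh.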