Step-Video-T2V is a text-to-video foundation model developed by StepFun, featuring 30 billion parameters. It can generate videos up to 204 frames long, typically at a resolution of 544x992. The model is built on a Diffusion Transformer (DiT) architecture with 48 layers and 3D full attention, and is trained with a flow-matching objective; the full spatiotemporal attention helps maintain temporal consistency and high visual quality.
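To make the flow-matching objective concrete, here is a minimal sketch of the standard rectified-flow training loss: noise and the clean latent are linearly interpolated, and the network regresses onto the constant velocity of that path. This is an illustrative toy, not StepFun's actual training code; the function names and tensor shapes are assumptions.

```python
import numpy as np

def flow_matching_loss(predict_velocity, x1, rng):
    """Toy rectified-flow loss for one batch of clean latents x1.

    predict_velocity is a stand-in for the DiT (which in the real model
    also conditions on text embeddings); this sketch omits conditioning.
    """
    x0 = rng.standard_normal(x1.shape)                       # Gaussian noise endpoint
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))  # per-sample timestep
    xt = (1 - t) * x0 + t * x1                               # point on the straight path
    v_target = x1 - x0                                       # constant path velocity
    v_pred = predict_velocity(xt, t)                         # model's velocity estimate
    return float(np.mean((v_pred - v_target) ** 2))          # mean-squared regression
```

At inference, a model trained this way generates video by integrating the predicted velocity field from pure noise toward the data distribution.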
A core component of the system is the Video-VAE, which provides deep compression ratios of 16x16 spatially and 8x temporally to optimize both training and inference efficiency. To ensure the model accurately follows complex instructions, it integrates two bilingual text encoders—Hunyuan-CLIP and Step-LLM—enabling robust support for both English and Chinese prompts.
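The stated compression ratios translate directly into latent sizes. The arithmetic below is a small sketch; how the real Video-VAE rounds non-divisible frame counts (plain ceiling division here) is an assumption.

```python
import math

def latent_shape(frames, height, width, t_stride=8, s_stride=16):
    """Apply the stated Video-VAE compression (8x temporal, 16x16 spatial)
    to a raw video shape. Ceiling division for non-divisible sizes is an
    assumption about the actual implementation."""
    return (math.ceil(frames / t_stride),
            math.ceil(height / s_stride),
            math.ceil(width / s_stride))

# A full 204-frame clip at 544x992 compresses to:
print(latent_shape(204, 544, 992))  # (26, 34, 62)
```

Shrinking each frame by 16x16 and the time axis by 8x cuts the token count the DiT must attend over by roughly three orders of magnitude, which is what makes 3D full attention over a 204-frame clip tractable.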
The model underwent Direct Preference Optimization (DPO) to refine its visual output, reducing artifacts and improving overall realism. Alongside the standard model, StepFun released Step-Video-T2V-Turbo, a distilled variant that generates videos in significantly fewer inference steps while maintaining competitive quality.
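For reference, the core of standard DPO is a single pairwise loss: the policy is pushed to prefer the human-chosen sample over the rejected one, relative to a frozen reference model. The sketch below shows that generic form; how StepFun adapts it to video diffusion is not detailed here, so treat the signature and the log-probability inputs as assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic per-pair DPO loss.

    logp_w / logp_l: policy log-likelihoods of the preferred (w) and
    rejected (l) samples; ref_* are the same under a frozen reference
    model. beta scales how hard the policy is pulled toward preferences.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy matches the reference (margin 0) the loss is log 2; raising the preferred sample's likelihood relative to the reference drives the loss toward zero, which is the mechanism by which preference data suppresses artifacts.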