Wan 2.5 Preview is a video generation model developed by Alibaba's Wan team and a significant update to the earlier Wan 2.1 suite. Its key feature is native audio-visual synchronization: the model generates video together with time-aligned audio (speech with lip-sync, ambient sounds, and background music) in a single inference pass. It supports both text-to-video and image-to-video generation, with a focus on cinematic quality and realistic motion.
Wan 2.5 Preview uses a Diffusion Transformer (DiT) architecture and can produce videos at resolutions up to 1080p with durations of 5 to 10 seconds. Compared with earlier Wan releases, it shows marked improvements in semantic understanding and prompt following, which gives finer control over camera movements, lighting consistency, and complex physical interactions. The model is aimed at professional production workflows, where generating audio and video together reduces the need for manual synchronization in post-production.
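The limits stated above (two generation modes, resolution up to 1080p, clips of 5 to 10 seconds) can be expressed as a small pre-flight check before a request is sent to any hosted endpoint. The following is a minimal sketch only: `VideoRequest`, `validate`, and every field name are hypothetical and do not correspond to the DashScope API or any official SDK, and the numeric limits are taken from this description rather than from provider documentation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits taken from the description above; all names here are hypothetical.
MAX_HEIGHT = 1080       # "resolutions up to 1080p"
MIN_DURATION_S = 5      # "durations of 5 to 10 seconds"
MAX_DURATION_S = 10
MODES = {"text-to-video", "image-to-video"}


@dataclass
class VideoRequest:
    """Illustrative request shape, not a real API payload."""
    prompt: str
    mode: str = "text-to-video"
    height: int = 1080
    duration_s: int = 5
    image_path: Optional[str] = None  # only used for image-to-video


def validate(req: VideoRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits the stated limits."""
    problems: List[str] = []
    if not req.prompt.strip():
        problems.append("prompt is empty")
    if req.mode not in MODES:
        problems.append(f"unsupported mode: {req.mode}")
    if req.mode == "image-to-video" and not req.image_path:
        problems.append("image-to-video needs an input image")
    if req.height > MAX_HEIGHT:
        problems.append(f"height {req.height} exceeds {MAX_HEIGHT}")
    if not MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S:
        problems.append(
            f"duration {req.duration_s}s is outside {MIN_DURATION_S}-{MAX_DURATION_S}s"
        )
    return problems


if __name__ == "__main__":
    ok = VideoRequest(prompt="A slow dolly shot through a rainy street at night")
    too_big = VideoRequest(prompt="Same scene", height=2160, duration_s=12)
    print(validate(ok))
    print(validate(too_big))
```

A real provider may accept only specific resolutions or durations within these ranges, so the constants should be replaced with the values from that provider's documentation.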
Initially launched as a cloud-based preview on Alibaba's DashScope platform and through partner API providers, the model is positioned to compete with high-end proprietary systems on visual fidelity and temporal stability. It retains the character and scene consistency of previous iterations while improving overall detail and motion fluidity.