Wan 2.5 Preview is a multimodal generative model family developed by Alibaba for high-fidelity visual synthesis. A unified architecture integrates text-to-image, text-to-video, and image-to-video generation in a single framework. Its core innovation is native audio-visual synchronization: in one generation pass, the model produces visuals together with aligned audio, including lip-synced dialogue, music, and sound effects.
For image generation, Wan 2.5 emphasizes instruction following and multilingual text rendering, producing photorealistic images, artistic styles, and detailed charts from complex prompts. The model uses a Mixture-of-Experts (MoE) architecture and has been refined with Reinforcement Learning from Human Feedback (RLHF) to improve visual coherence, temporal stability, and adherence to cinematic shot instructions.
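As a rough illustration of how a prompt-driven request to such a model might be assembled, the sketch below builds a hypothetical payload. The model identifier, field names, and parameters are assumptions for illustration, not a documented Wan 2.5 API.

```python
# Illustrative sketch only: the model identifier, payload shape, and
# parameter names below are assumptions, not a documented Wan 2.5 API.
def build_image_request(prompt: str, style: str = "photorealistic",
                        size: str = "1024*1024") -> dict:
    """Assemble a hypothetical text-to-image request payload."""
    return {
        "model": "wan2.5-preview",        # assumed model identifier
        "input": {"prompt": prompt},      # complex, multilingual prompts supported
        "parameters": {"style": style, "size": size},
    }

# A prompt that exercises multilingual text rendering:
req = build_image_request("A storefront sign reading '咖啡' in neon letters")
```

The point of the sketch is the separation of the free-form prompt from structured generation parameters, which is how most hosted image-generation APIs are organized.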
The model also introduces advanced image editing and fusion features. It supports single-image editing, in which a subject's identity is preserved while text prompts modify the scene, and multi-image fusion, which combines elements from up to three reference images into a new composition. For video, it generates clips of up to 10 seconds at resolutions up to 1080p and 24fps, with support for cinematic camera movements such as dolly and crane shots.
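The generation limits stated above can be encoded as simple client-side validation. The helper below is a minimal sketch: the function and constant names are invented, and the set of supported resolutions below 1080p is an assumption; only the 1080p/24fps/10-second/three-image limits come from the description above.

```python
# Hypothetical client-side checks for the stated Wan 2.5 Preview limits:
# up to 1080p at 24fps, clips up to 10 seconds, at most three fusion
# reference images. Names and the lower resolution tiers are assumptions.
MAX_DURATION_S = 10
MAX_REF_IMAGES = 3
ALLOWED_RESOLUTIONS = {"480p", "720p", "1080p"}  # assumed tiers

def validate_video_request(resolution: str, fps: int, duration_s: int,
                           ref_images: list[str]) -> None:
    """Raise ValueError if a request exceeds the stated limits."""
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if fps != 24:
        raise ValueError("video is generated at 24fps")
    if not 0 < duration_s <= MAX_DURATION_S:
        raise ValueError("clips are limited to 10 seconds")
    if len(ref_images) > MAX_REF_IMAGES:
        raise ValueError("fusion accepts at most three reference images")

# A request within the stated limits passes silently:
validate_video_request("1080p", 24, 10, ["subject.png", "background.png"])
```

Validating these bounds before submission avoids wasting a round trip on requests the service would reject.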