Wan 2.1 14B is a large-scale video generation model developed by Alibaba's Wan team, designed to produce high-quality cinematic content from text and image prompts. As the flagship variant of the Wan 2.1 suite, it features 14 billion parameters and is optimized for generating videos at resolutions up to 720p with high-fidelity motion dynamics and temporal consistency.
The model uses a Diffusion Transformer (DiT) architecture trained within a Flow Matching framework. It incorporates a novel 3D Causal Variational Autoencoder (Wan-VAE), which compresses video along both the spatial and temporal axes so that long videos can be handled efficiently without losing spatial detail. For text understanding, it employs a UMT5 encoder, which lets the model follow complex instructions and render legible Chinese and English text embedded naturally within generated frames.
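To make the spatio-temporal compression concrete, the following is a minimal sketch of how a causal 3D VAE maps a pixel-space clip to a latent grid. The strides (4x temporal, 8x spatial) and the 16 latent channels are assumptions chosen for illustration and are not stated in the text above; `latent_shape` and `LatentShape` are hypothetical names, not part of any Wan API. The sketch assumes the first frame is encoded on its own and every later group of `temporal_stride` frames shares one latent frame, which is why the frame count must have the form 4k + 1 under these defaults.

```python
from typing import NamedTuple


class LatentShape(NamedTuple):
    channels: int
    frames: int
    height: int
    width: int


def latent_shape(
    num_frames: int,
    height: int,
    width: int,
    temporal_stride: int = 4,   # assumed temporal compression factor
    spatial_stride: int = 8,    # assumed spatial compression factor per axis
    latent_channels: int = 16,  # assumed latent channel count
) -> LatentShape:
    """Return the latent grid size for a clip under a causal 3D VAE.

    Causal layout assumption: frame 0 gets its own latent frame, and each
    following block of `temporal_stride` frames is folded into one more.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError("num_frames must equal temporal_stride * k + 1")
    if height % spatial_stride != 0 or width % spatial_stride != 0:
        raise ValueError("height and width must be multiples of spatial_stride")

    return LatentShape(
        channels=latent_channels,
        frames=1 + (num_frames - 1) // temporal_stride,
        height=height // spatial_stride,
        width=width // spatial_stride,
    )


if __name__ == "__main__":
    # An 81-frame 720p clip, used here purely as an example input.
    print(latent_shape(81, 720, 1280))
```

Under these assumed factors the diffusion transformer operates on a grid far smaller than the raw video, which is what makes denoising at 720p tractable.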
Wan 2.1 14B performs strongly on benchmarks such as VBench, particularly in motion smoothness, spatial accuracy, and multi-object interaction. It supports multiple generative tasks, including text-to-video and image-to-video, and generates physically plausible motion and detailed textures at a native 16 frames per second.
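The native frame rate fixes the relationship between clip length and frame count. The sketch below works through that arithmetic at 16 frames per second. The helper names are hypothetical, and the rounding to a frame count of the form 4k + 1 is an assumption about a causal video VAE with a temporal stride of 4 rather than something stated in the text above.

```python
NATIVE_FPS = 16  # native frame rate stated for the model


def clip_duration_seconds(num_frames: int, fps: int = NATIVE_FPS) -> float:
    """Playback length in seconds of a clip with `num_frames` frames."""
    return num_frames / fps


def frames_for_duration(
    seconds: float,
    fps: int = NATIVE_FPS,
    temporal_stride: int = 4,  # assumed; not taken from the source text
) -> int:
    """Nearest frame count of the form temporal_stride * k + 1 for a target length."""
    raw_frames = round(seconds * fps)
    k = max(0, round((raw_frames - 1) / temporal_stride))
    return temporal_stride * k + 1


if __name__ == "__main__":
    frames = frames_for_duration(5.0)
    print(frames, clip_duration_seconds(frames))
```

A request for a five-second clip therefore resolves to 81 frames under these assumptions, which plays for slightly more than five seconds at the native rate.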