Qwen-Image-Edit is a 20-billion-parameter image editing foundation model developed by Alibaba’s Qwen team. Built on the Qwen-Image base model, it is specifically optimized for high-fidelity manipulation of existing images using natural language instructions, and it is designed to bridge the gap between simple generation and precise, instruction-based visual editing.
The model utilizes a Multimodal Diffusion Transformer (MMDiT) architecture featuring a dual-path input system. It processes images simultaneously through Qwen2.5-VL for visual semantic control and a VAE encoder for visual appearance control. This architecture allows the model to distinguish between high-level meaning and low-level pixel details, enabling it to perform complex tasks like object rotation, identity-consistent modification, and style transfer without losing original image fidelity.
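The dual-path idea above can be sketched in a few lines. This is a toy illustration, not the actual Qwen-Image-Edit code: the two encoder functions are stand-ins for Qwen2.5-VL and the VAE, and the dimensions and names are invented for clarity. The point is only the structure — one path produces a few high-level semantic tokens, the other produces many low-level appearance latents, and both are concatenated into the conditioning sequence a diffusion transformer would attend over.

```python
import random

def semantic_encoder(image, dim=8, n_tokens=4):
    """Stand-in for Qwen2.5-VL: a short sequence of high-level meaning tokens."""
    rng = random.Random(hash(tuple(image)) & 0xFFFF)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]

def vae_encoder(image, dim=8, n_latents=16):
    """Stand-in for the VAE: a longer sequence of low-level appearance latents."""
    rng = random.Random((hash(tuple(image)) ^ 0xBEEF) & 0xFFFF)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_latents)]

def build_condition_tokens(image):
    """Concatenate both paths into one conditioning sequence for the MMDiT."""
    return semantic_encoder(image) + vae_encoder(image)

image = [0.1, 0.5, 0.9]          # placeholder "image"
tokens = build_condition_tokens(image)
print(len(tokens))               # 4 semantic + 16 appearance = 20 tokens
```

Because both token types share one sequence, the transformer can trade them off per task: semantic tokens dominate when the edit changes meaning (a pose or environment change), while appearance latents anchor the output to the original pixels.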
Qwen-Image-Edit supports two primary editing paradigms: Appearance Editing and Semantic Editing. Appearance editing focuses on pixel-perfect changes such as adding, removing, or modifying specific elements while keeping the rest of the image unchanged. Semantic editing handles more creative transformations that require maintaining conceptual consistency, such as changing an object's pose or environment. Additionally, it features advanced bilingual text rendering, allowing users to precisely add, delete, or modify text in both Chinese and English while matching the original typography and layout.
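The contract behind the two paradigms can be made concrete with a toy sketch. This is illustrative only, not model code, and the function names are hypothetical: an appearance edit must leave every pixel outside the edit region untouched, while a semantic edit is free to regenerate the whole image under some global transformation.

```python
def appearance_edit(image, mask, new_values):
    """Replace only the masked pixels; everything else stays byte-identical."""
    return [new if m else old for old, m, new in zip(image, mask, new_values)]

def semantic_edit(image, transform):
    """Regenerate every pixel under a global transform (e.g. a style change)."""
    return [transform(p) for p in image]

img = [10, 20, 30, 40]
mask = [False, True, False, False]            # edit only the second pixel
edited = appearance_edit(img, mask, [0, 99, 0, 0])
print(edited)                                  # [10, 99, 30, 40]
restyled = semantic_edit(img, lambda p: p * 2)
print(restyled)                                # [20, 40, 60, 80]
```

In the real model neither path is a literal pixel copy, but the evaluation criterion is the same: appearance edits are judged on how strictly the unedited region is preserved, semantic edits on whether identity and concept survive a global change.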
Since its initial release, the model has undergone iterative updates, including versions such as 2509 and 2511, which introduced multi-image editing capabilities and improved character consistency. It has demonstrated competitive performance on benchmarks like GEdit and ImgEdit, particularly excelling in tasks requiring complex text-image composition and Chinese language rendering.