Qwen Image Edit Plus 2511 is an instruction-driven image editing model developed by Alibaba's Qwen team. It applies precise edits to existing images from natural-language instructions, supporting complex modifications while staying faithful to the original content. The 2511 release updates earlier versions of the model, with a particular focus on identity preservation and localized control.
The model is built on a 20-billion-parameter Multi-Modal Diffusion Transformer (MMDiT) architecture and encodes the input image along two paths. Qwen2.5-VL provides visual semantic control, and a VAE encoder provides control over visual appearance. This dual encoding lets the model interpret nuanced textual instructions while keeping edited regions blended seamlessly with the untouched parts of the image.
A primary focus of the 2511 release is better multi-person consistency and identity preservation. The model can combine separate portraits into a coherent group shot and apply imaginative edits to subjects without losing their distinct visual characteristics. It also reduces image drift, meaning unwanted changes or quality degradation in areas the user's prompt does not mention.
In addition to general editing, the model integrates community LoRAs for specialized tasks such as viewpoint generation and lighting adjustment. It also handles bilingual text editing (English and Chinese) well: it can add, delete, or modify text in an image while matching the original typography and style. Further improvements include stronger geometric reasoning and support for industrial design generation.
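The multi-image editing and LoRA support described above could be exercised roughly as follows. This is a minimal sketch, not an official recipe: it assumes the model is served through the `QwenImageEditPlusPipeline` class in Hugging Face Diffusers, and the repository id, the default step count and guidance value, and the `EditRequest`, `build_pipeline_kwargs`, and `run_edit` names are all illustrative assumptions that should be checked against the model's own documentation.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence

# Assumption: Hugging Face repository id for this release; verify before use.
ASSUMED_MODEL_ID = "Qwen/Qwen-Image-Edit-2511"


@dataclass
class EditRequest:
    """Hypothetical container describing one instruction-driven edit."""
    prompt: str
    image_paths: Sequence[str]
    negative_prompt: str = " "
    num_inference_steps: int = 40   # placeholder default, not a tuned value
    true_cfg_scale: float = 4.0     # placeholder default, not a tuned value
    seed: int = 0
    lora_path: Optional[str] = None  # optional community LoRA weights

    def validate(self) -> None:
        if not self.prompt.strip():
            raise ValueError("prompt must be a non-empty instruction")
        if not self.image_paths:
            raise ValueError("at least one input image is required")


def build_pipeline_kwargs(request: EditRequest, images: Sequence[Any]) -> dict:
    """Translate a request plus its loaded images into pipeline arguments."""
    request.validate()
    if len(images) != len(request.image_paths):
        raise ValueError("one loaded image is expected per image path")
    return {
        "image": list(images),  # several images, e.g. two separate portraits
        "prompt": request.prompt,
        "negative_prompt": request.negative_prompt,
        "num_inference_steps": request.num_inference_steps,
        "true_cfg_scale": request.true_cfg_scale,
    }


def run_edit(request: EditRequest, model_id: str = ASSUMED_MODEL_ID):
    """Run one edit. Needs torch, diffusers, Pillow, and a CUDA GPU."""
    # Heavy imports are deferred so the helpers above work without them.
    import torch
    from diffusers import QwenImageEditPlusPipeline
    from PIL import Image

    request.validate()  # fail early, before downloading any weights
    pipe = QwenImageEditPlusPipeline.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")
    if request.lora_path is not None:
        # Assumes the LoRA was trained for this pipeline family.
        pipe.load_lora_weights(request.lora_path)

    images = [Image.open(p).convert("RGB") for p in request.image_paths]
    kwargs = build_pipeline_kwargs(request, images)
    kwargs["generator"] = torch.Generator(device="cpu").manual_seed(request.seed)
    with torch.inference_mode():
        return pipe(**kwargs).images[0]
```

A group-shot edit would then pass two portrait paths in `image_paths` with a prompt such as "Place both people side by side in a park", and call `run_edit(request).save("group.png")`. Keeping argument construction in `build_pipeline_kwargs` separate from model loading makes the request logic checkable without a GPU.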