HiDream-E1.1 is an open-source, instruction-based image editing model developed by HiDream.ai. An iterative update to HiDream-E1 built on the HiDream-I1 foundation, it lets users perform precise modifications on images using natural language instructions. Unlike traditional inpainting models, which typically require manual masking, HiDream-E1.1 interprets direct commands to add, remove, or transform elements while preserving the unedited regions of the original image.
The model utilizes a Sparse Diffusion Transformer (DiT) architecture featuring a dynamic Mixture of Experts (MoE) mechanism. The backbone is conditioned by four distinct text encoders that together enhance semantic understanding: OpenAI CLIP-L, OpenCLIP-bigG, T5-XXL, and Llama-3.1-8B-Instruct. Combining these encoders lets the model handle both tag-based prompts and complex natural language instructions with close alignment to user intent.
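As a rough illustration of this multi-encoder setup, the sketch below assembles the four text encoders with Hugging Face `transformers`. The checkpoint IDs are assumptions based on the publicly known components (the gated Llama weights require access approval on the Hub), not necessarily the exact repositories HiDream ships with.

```python
# Minimal sketch, assuming standard Hugging Face checkpoints for each
# of the four encoders named above. IDs are illustrative assumptions.
from transformers import (
    CLIPTextModelWithProjection,
    T5EncoderModel,
    LlamaForCausalLM,
)

# OpenAI CLIP-L: compact contrastive prompt embedding.
clip_l = CLIPTextModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
)

# OpenCLIP-bigG: larger contrastive encoder (same class, different weights).
clip_bigg = CLIPTextModelWithProjection.from_pretrained(
    "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
)

# T5-XXL: sequence-level conditioning for longer prompts.
t5_xxl = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

# Llama-3.1-8B-Instruct: decoder LLM used here as an instruction-aware
# encoder; its hidden states, not its generated text, feed the DiT.
llama = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    output_hidden_states=True,
)
```

The design intuition is that the CLIP encoders supply compact global embeddings suited to tag-style prompts, while T5-XXL and the Llama hidden states carry the fine-grained token-level semantics needed for complex edit instructions.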
Key enhancements in version 1.1 include support for dynamic resolutions up to 1 megapixel and improved editing accuracy. The model introduces a refine_strength parameter, which lets users balance the initial editing operation against an image-to-image refinement stage powered by the HiDream-I1-Full model. Prompts work best as direct instructions, such as "convert the image into a Ghibli style," and the model has demonstrated leading performance in global edits, style transfer, and object manipulation on benchmarks such as EmuEdit and ReasonEdit.
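To make the refine_strength behavior concrete, here is a hedged invocation sketch. It assumes the model exposes a diffusers-compatible editing pipeline; the repository ID, the argument names other than refine_strength, and the numeric values are assumptions modeled on diffusers conventions, so the official model card remains the authority on exact usage.

```python
# Hypothetical usage sketch, assuming a diffusers-compatible pipeline
# layout for the HiDream-E1.1 checkpoint. Names and values below that
# are not confirmed by the text above are illustrative assumptions.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Assumed repo ID; dtype and device choices are illustrative.
pipe = DiffusionPipeline.from_pretrained(
    "HiDream-ai/HiDream-E1-1", torch_dtype=torch.bfloat16
).to("cuda")

source = Image.open("portrait.png").convert("RGB")

# refine_strength splits the denoising schedule: 0.3 reserves roughly the
# final 30% of steps for the HiDream-I1-Full image-to-image refinement
# stage, while 0.0 would run the edit with no refinement pass.
edited = pipe(
    prompt="Convert the image into a Ghibli style.",  # direct instruction, as recommended
    image=source,
    guidance_scale=3.0,       # assumed illustrative value
    num_inference_steps=28,   # assumed illustrative value
    refine_strength=0.3,
).images[0]

edited.save("portrait_ghibli.png")
```

Raising refine_strength trades editing steps for refinement steps, which tends to favor detail and coherence in the final image at the cost of how aggressively the instruction is applied.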