Step-1X-Edit-v1p2-preview is an advanced, instruction-based image editing model developed by StepFun (阶跃星辰). It is designed to perform precise, semantic-level image manipulations using natural language descriptions, enabling users to add, remove, or modify elements within an image while maintaining high style consistency and subject identity. The model represents an effort to provide open-source tools capable of competing with proprietary multimodal systems in complex image editing tasks.
The model's architecture is a hybrid framework that couples a Multimodal Large Language Model (MLLM) with a Diffusion Transformer (DiT) decoder. The MLLM component parses the semantic nuances of the user instruction and the reference image, producing latent embeddings that condition the DiT-based generation process. This structure enables superior spatial control and finer-grained modifications than standard text-to-image models, which lack deep instruction-following capabilities.
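The MLLM-to-DiT data flow can be sketched in miniature. All names, shapes, and numeric details below are hypothetical illustrations of the conditioning pattern, not the model's real API:

```python
# Conceptual sketch of the hybrid MLLM + DiT pipeline: the "MLLM" fuses
# the instruction with the reference image into guidance embeddings, and
# the "DiT" iteratively denoises a latent conditioned on that guidance.
# Every class and function here is a hypothetical stand-in.

from dataclasses import dataclass


@dataclass
class EditRequest:
    instruction: str           # natural-language edit instruction
    image_tokens: list[float]  # toy stand-in for an encoded reference image


def mllm_encode(request: EditRequest) -> list[float]:
    """Hypothetical MLLM step: blend instruction semantics into the
    image representation to form latent guidance embeddings."""
    text_signal = float(len(request.instruction))
    return [t + text_signal for t in request.image_tokens]


def dit_decode(guidance: list[float], steps: int = 4) -> list[float]:
    """Hypothetical DiT decoder: each denoising step moves the latent
    a fraction of the way toward the guidance target."""
    latent = [0.0] * len(guidance)
    for _ in range(steps):
        latent = [l + 0.5 * (g - l) for l, g in zip(latent, guidance)]
    return latent


request = EditRequest("remove the lamppost", [0.1, 0.2, 0.3])
edited_latent = dit_decode(mllm_encode(request))
```

The point of the sketch is the separation of roles: the language model decides *what* the edit means, while the diffusion decoder decides *how* pixels change under that guidance.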
A defining technical feature of the v1p2-preview release is its thinking-editing-reflection loop. In the thinking phase, the model uses its reasoning capabilities to interpret abstract instructions and plan the edit; the editing phase then executes that plan through the diffusion decoder. Finally, the reflection stage reviews the generated output to identify and correct unintended distortions or inaccuracies. This iterative reasoning mechanism significantly improves the model's performance on benchmarks such as KRIS-Bench and GEdit-Bench.
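The three-phase loop can be illustrated as a plan-apply-check cycle. The functions, the string-based "editing", and the pass/fail heuristic below are toy assumptions chosen only to show the control flow, not StepFun's implementation:

```python
# Toy sketch of a thinking-editing-reflection loop: plan the edit,
# apply it, self-check the result, and revise the plan on failure.
# All names and the scoring heuristic are illustrative assumptions.

def think(instruction: str) -> dict:
    """Thinking phase: turn an abstract instruction into a concrete plan."""
    return {"target": instruction.lower(), "strength": 1.0}


def apply_edit(image: str, plan: dict) -> str:
    """Editing phase: apply the planned modification (toy string edit)."""
    return f"{image}+edited({plan['target']}, s={plan['strength']:.1f})"


def reflect(candidate: str, instruction: str) -> bool:
    """Reflection phase: verify the output actually satisfies the
    instruction (here, a trivial containment check)."""
    return instruction.lower() in candidate


def run_edit_loop(image: str, instruction: str, max_rounds: int = 3) -> str:
    plan = think(instruction)
    candidate = apply_edit(image, plan)
    for _ in range(max_rounds):
        if reflect(candidate, instruction):
            return candidate           # output passes the self-check
        plan["strength"] += 0.5        # otherwise revise the plan and retry
        candidate = apply_edit(image, plan)
    return candidate
```

The design choice worth noting is that reflection feeds back into planning rather than into raw pixels: a failed check revises the plan, and the whole edit is regenerated, which is what makes the loop iterative.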