Step1X-Edit-v1p2 (also known as ReasonEdit-S) is a reasoning-enhanced, open-source image editing model developed by StepFun. It is designed to provide image manipulation capabilities comparable to those of advanced proprietary models by integrating native reasoning into the editing workflow. The model marks a notable step forward in instruction-based image editing, with a particular focus on understanding complex or abstract user prompts.
The model uses a hybrid architecture that pairs a Multimodal Large Language Model (MLLM) with a Diffusion Transformer (DiT). This structure enables a thinking–editing–reflection loop: in the thinking stage, the MLLM draws on internal world knowledge to interpret and reformulate the instruction; the DiT then performs the edit; finally, a reflection stage reviews the output to correct unintended artifacts and ensure high-fidelity results. This loop lets the model handle multi-step reasoning tasks that standard diffusion models often struggle with.
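Conceptually, the thinking–editing–reflection loop is a simple control flow around the two sub-models. The sketch below is illustrative only: every name in it (mllm_think, dit_edit, mllm_reflect, edit_with_reasoning) is a hypothetical stand-in rather than the repository's actual API, and the stubs merely mark where the MLLM and DiT would be invoked.

```python
from dataclasses import dataclass

@dataclass
class EditPlan:
    # What the thinking stage produces: an explicit, reformulated instruction.
    reformulated_instruction: str

def mllm_think(image, instruction: str) -> EditPlan:
    """Thinking stage: the MLLM interprets the raw prompt using world knowledge
    and rewrites it into an explicit edit plan (illustrative stub)."""
    return EditPlan(reformulated_instruction=f"apply: {instruction}")

def dit_edit(image, plan: EditPlan):
    """Editing stage: the DiT renders the edit described by the plan (stub)."""
    return image  # placeholder for the edited image

def mllm_reflect(original, edited, plan: EditPlan) -> tuple[bool, str]:
    """Reflection stage: the MLLM compares input and output, flagging
    unintended artifacts and suggesting a correction (stub)."""
    return True, "ok"

def edit_with_reasoning(image, instruction: str, max_rounds: int = 2):
    """Run thinking -> editing -> reflection, retrying until the reflection
    stage accepts the result or the round budget is exhausted."""
    plan = mllm_think(image, instruction)
    edited = image
    for _ in range(max_rounds):
        edited = dit_edit(image, plan)
        accepted, feedback = mllm_reflect(image, edited, plan)
        if accepted:
            break
        # Fold the reflection feedback back into the plan for another attempt.
        plan.reformulated_instruction += f" | fix: {feedback}"
    return edited

if __name__ == "__main__":
    edit_with_reasoning(image="input.png", instruction="make the sky look like dusk")
```

The key design point this sketch captures is that reflection is not a one-shot quality check: its feedback is folded back into the edit plan, so a failed pass triggers another, better-specified editing round.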
With approximately 19 billion parameters, Step1X-Edit-v1p2 supports a wide range of tasks, including object addition and removal, background modification, style transfer, and maintaining character consistency. It has demonstrated state-of-the-art performance on benchmarks such as GEdit-Bench and KRIS-Bench, particularly in scenarios requiring precise regional control and logical consistency. At inference time, users can typically toggle the thinking and reflection stages to trade speed against editing accuracy.
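A minimal sketch of how such toggles might look in practice follows. The function name run_edit and the enable_thinking / enable_reflection keyword arguments are hypothetical stand-ins, not the released inference code's actual flags; the point is only the speed-versus-accuracy trade-off they control.

```python
def run_edit(image_path: str, instruction: str,
             enable_thinking: bool = True,
             enable_reflection: bool = True) -> str:
    """Hypothetical inference entry point; returns a path to the edited image."""
    stages = ["edit"]
    if enable_thinking:
        stages.insert(0, "think")
    if enable_reflection:
        stages.append("reflect")
    print(f"Editing {image_path!r} with {instruction!r}; stages: {' -> '.join(stages)}")
    return "edited.png"

# Fast preview: skip both reasoning stages.
run_edit("room.jpg", "remove the chair in the corner",
         enable_thinking=False, enable_reflection=False)

# Highest fidelity: full thinking -> editing -> reflection loop.
run_edit("room.jpg", "remove the chair in the corner")
```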