Step1X-Edit by StepFun: Benchmarks, Rankings & Model Details

Step1X-Edit is an open-source image editing framework developed by the Chinese AI startup StepFun (阶跃星辰). It is a general-purpose, instruction-based editing model: users describe the change they want in natural language instead of drawing masks or using manual selection tools. StepFun positions it as comparable to proprietary systems such as GPT-4o and Gemini 2 Flash on both localized and global image manipulation tasks.

Step1X-Edit pairs a Multimodal Large Language Model (MLLM) with a Diffusion Transformer (DiT) decoder. A Qwen-VL based model encodes the reference image and the text instruction together into latent embeddings. A connector module, called the token refiner, processes these embeddings and passes them to the DiT, which generates the edited target image. This two-stage design is intended to preserve the structure of the original image while following the semantics of the instruction.
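The data flow described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than StepFun's code or API: `mllm_encode`, `token_refiner`, and `dit_decode` are made-up names, images are flat lists of floats, and the arithmetic only mimics the shape of each stage (encode, refine, iteratively decode), not the real models.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class EditRequest:
    image: List[float]   # stand-in for pixel data (assumption: flat list)
    instruction: str     # natural-language edit command


def mllm_encode(req: EditRequest, dim: int = 4) -> List[Vector]:
    """Toy stand-in for the Qwen-VL based encoder.

    Emits one embedding per instruction word, mixed with an image statistic
    so that both inputs influence the result.
    """
    img_mean = sum(req.image) / len(req.image)
    return [[img_mean + 0.1 * len(word)] * dim for word in req.instruction.split()]


def token_refiner(tokens: List[Vector]) -> Vector:
    """Toy stand-in for the connector: mean-pools tokens into one vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def dit_decode(source: List[float], cond: Vector, steps: int = 3) -> List[float]:
    """Toy stand-in for the DiT: moves each pixel halfway toward a
    conditioning-derived value on every step."""
    target = sum(cond) / len(cond)
    out = list(source)
    for _ in range(steps):
        out = [p + 0.5 * (target - p) for p in out]
    return out


def edit(req: EditRequest) -> List[float]:
    """Chains the three stages in the order the text describes."""
    tokens = mllm_encode(req)
    cond = token_refiner(tokens)
    return dit_decode(req.image, cond)


if __name__ == "__main__":
    request = EditRequest(image=[0.0, 1.0], instruction="make sky blue")
    print(edit(request))
```

The point of the sketch is the ordering: the decoder never sees the raw instruction, only the refined embeddings, and it always starts from the source image.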

Step1X-Edit targets eleven editing categories, including subject addition or removal, background replacement, color and material modification, text modification, and complex motion changes. It was trained on a large-scale dataset of more than one million high-quality triplets, each consisting of a source image, an editing instruction, and a target image, so that the model generalizes across diverse real-world editing requests. Alongside the model, the team introduced GEdit-Bench, a benchmark built from authentic user instructions to evaluate image editing performance more comprehensively.
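As a rough illustration of the triplet format, one training record could be represented as follows. This is a sketch under assumptions: the class and field names are hypothetical, images are referenced by file path, the enum lists only the categories named above (not all eleven), and the filtering rule is an invented example rather than StepFun's actual data pipeline.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class EditCategory(Enum):
    # Only the categories named in the text; the full set has eleven.
    SUBJECT_ADDITION = "subject_addition"
    SUBJECT_REMOVAL = "subject_removal"
    BACKGROUND_REPLACEMENT = "background_replacement"
    COLOR_MODIFICATION = "color_modification"
    MATERIAL_MODIFICATION = "material_modification"
    TEXT_MODIFICATION = "text_modification"
    MOTION_CHANGE = "motion_change"


@dataclass(frozen=True)
class EditTriplet:
    source_image: str      # path to the original image (assumed layout)
    instruction: str       # natural-language editing instruction
    target_image: str      # path to the edited result
    category: EditCategory


def keep_usable(triplets: List[EditTriplet]) -> List[EditTriplet]:
    """Hypothetical quality filter: drops records with a blank instruction
    or with identical source and target paths."""
    return [
        t for t in triplets
        if t.instruction.strip() and t.source_image != t.target_image
    ]


if __name__ == "__main__":
    data = [
        EditTriplet("a.png", "remove the car", "a_edit.png", EditCategory.SUBJECT_REMOVAL),
        EditTriplet("b.png", "   ", "b_edit.png", EditCategory.COLOR_MODIFICATION),
        EditTriplet("c.png", "make it red", "c.png", EditCategory.COLOR_MODIFICATION),
    ]
    for triplet in keep_usable(data):
        print(triplet.category.value, "->", triplet.instruction)
```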
