Stability.ai
Open Weights

Stable Diffusion 3 Medium

Released Jun 2024

Stable Diffusion 3 Medium is a Multimodal Diffusion Transformer (MMDiT) text-to-image model developed by Stability AI. Released in June 2024, it is the first open-weights release of the Stable Diffusion 3 family. It is designed to balance output quality with resource efficiency so that it can run on consumer-grade hardware. The model was pre-trained on a dataset of approximately 1 billion images and then fine-tuned on high-quality aesthetic and preference data.

The core transformer blocks have approximately 2 billion parameters. Unlike previous Stable Diffusion versions, which used a U-Net backbone, this model employs the MMDiT architecture, which keeps separate sets of weights for image and text representations but joins the two token sequences in the attention operation. The resulting bidirectional information flow between modalities improves the model's ability to interpret complex prompts and handle intricate spatial relationships. The model is also trained with a flow-matching (rectified flow) formulation, which facilitates the generation of high-quality images with fewer sampling steps than traditional noise-prediction methods.
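The link between flow matching and fewer sampling steps can be illustrated with a toy example. The sketch below is not the model's sampler: it assumes a hypothetical "dataset" consisting of a single point, for which the ideal velocity field is known in closed form, and integrates it with plain Euler steps from noise (t = 1) to data (t = 0). The names `toy_velocity` and `euler_sample` are illustrative only.

```python
import random


def toy_velocity(x, t, target):
    """Ideal rectified-flow velocity for a one-point 'dataset' (an assumption).

    With x_t = (1 - t) * target + t * noise, the velocity is
    noise - target, which equals (x_t - target) / t.
    """
    return [(xi - gi) / t for xi, gi in zip(x, target)]


def euler_sample(noise, target, num_steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt  # current time; never reaches 0 inside the loop
        v = toy_velocity(x, t, target)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target = [0.25, -0.5, 1.0]  # stand-in for a "clean image"
noise = [rng.gauss(0.0, 1.0) for _ in target]

one_step = euler_sample(noise, target, num_steps=1)
four_step = euler_sample(noise, target, num_steps=4)
print(one_step)
print(four_step)
```

Because the path from noise to data is a straight line in this toy setting, one Euler step and four Euler steps land on the same point. A learned velocity field is only approximately straight, so a real sampler still needs multiple steps, but typically fewer than a curved noise-prediction trajectory would.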

Key capabilities of Stable Diffusion 3 Medium include significantly improved typography and text rendering, which allows legible text to be generated within images. For prompt understanding, the model uses three text encoders: two CLIP models and a T5-XXL model. This configuration helps the model reduce common generative artifacts, particularly in hands and faces, while maintaining high fidelity to specific prompt instructions. The model natively supports multiple aspect ratios and is optimized for 1024x1024 resolution.
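When requesting a non-square image, one common approach is to keep the total pixel count close to that of 1024x1024 and snap each side to a fixed multiple. The helper below is a minimal sketch of that arithmetic; the function name `dims_for_aspect` and the choice of snapping to multiples of 64 are assumptions made for illustration, not documented requirements of the model.

```python
import math


def dims_for_aspect(aspect_w, aspect_h, target_pixels=1024 * 1024, multiple=64):
    """Return (width, height) near `target_pixels` for the given aspect ratio.

    Both sides are rounded to the nearest `multiple` (an assumed, conservative
    value), so the pixel count is only approximately preserved.
    """
    ratio = aspect_w / aspect_h
    height = math.sqrt(target_pixels / ratio)
    width = height * ratio

    def snap(value):
        # Round to the nearest multiple, never going below one multiple.
        return max(multiple, int(round(value / multiple)) * multiple)

    return snap(width), snap(height)


square = dims_for_aspect(1, 1)
widescreen = dims_for_aspect(16, 9)
portrait = dims_for_aspect(9, 16)
print(square, widescreen, portrait)
```

The resulting width and height could then be passed to whatever generation interface is in use; the appropriate rounding multiple depends on that interface and should be checked against its documentation.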

Rankings & Comparison