Stable Diffusion 3.5 Medium is a generative text-to-image model developed by Stability AI, designed to provide a balance between high-quality visual output and computational efficiency. Released as part of the Stable Diffusion 3.5 family, this model is specifically optimized to run on consumer-grade hardware, making it more accessible to hobbyists and researchers than its larger counterparts while maintaining competitive performance in prompt adherence and aesthetic detail.
Architecture and Capabilities
The model is built on an improved Multimodal Diffusion Transformer (MMDiT-X) architecture. It utilizes three fixed, pretrained text encoders—OpenCLIP-ViT/G, CLIP-ViT/L, and T5-XXL—to process textual information. Key architectural refinements include QK-normalization, which improves training stability, and dual attention blocks in the early transformer layers. These enhancements allow the model to handle complex prompt instructions, spatial relationships, and typography better than previous iterations in the series.
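QK-normalization is a small change to standard attention: the query and key vectors are normalized (typically with an RMS norm) before the attention logits are computed, which bounds logit magnitude and keeps softmax gradients well behaved during training. The following is a minimal NumPy sketch of the idea, not the model's actual implementation; shapes, the RMS-norm choice, and the toy data are illustrative.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS-normalize the last axis (the per-head feature dimension)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    """Scaled dot-product attention with QK-normalization:
    Q and K are RMS-normalized before the logits are formed,
    which bounds logit magnitude and stabilizes training."""
    d = q.shape[-1]
    q, k = rms_norm(q), rms_norm(k)
    logits = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    # numerically stable softmax over the key axis
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = qk_norm_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Because normalized Q and K have unit RMS, no single token pair can produce an extreme logit, which is the stabilizing effect the refinement targets.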
With approximately 2.5 billion parameters, the Medium variant can generate images at resolutions ranging from 0.25 to 2 megapixels. Improved training strategies such as mixed-resolution training, together with the self-attention modules in the first 13 transformer layers, contribute to better multi-resolution generation and overall image coherence.
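The consumer-hardware claim can be grounded with a back-of-envelope calculation: at 16-bit precision, 2.5 billion parameters occupy under 5 GiB of VRAM for the transformer weights alone. The sketch below is illustrative arithmetic only; it excludes the text encoders, the VAE, and activation memory, which add to the real footprint.

```python
# Back-of-envelope VRAM estimate for ~2.5B transformer parameters.
# Weights only: text encoders, VAE, and activations are excluded.
params = 2.5e9

bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "fp8": 1}
for dtype, nbytes in bytes_per_param.items():
    gib = params * nbytes / 2**30
    print(f"{dtype}: ~{gib:.1f} GiB of weights")
```

At fp16/bf16 this lands at roughly 4.7 GiB, which is why the Medium variant fits on common consumer GPUs, whereas a model several times larger would not.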
Prompting and Usage Tips
Stability AI recommends using natural language descriptions for prompts rather than disconnected keywords to fully leverage the model's semantic understanding. While the model can process long prompts, users are advised to stay within a 256-token limit for the T5 encoder to avoid artifacts. For improved structural integrity and anatomical accuracy, the use of Skip Layer Guidance is recommended during sampling. The model is released under the Stability AI Community License, which allows for free commercial and non-commercial use by individuals and organizations with annual revenue below $1M.
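The 256-token T5 budget can be checked before generation. The sketch below uses a crude words-to-tokens heuristic (~1.3 tokens per word) as a stand-in for the real T5 tokenizer, so both the `TOKENS_PER_WORD` ratio and the `check_prompt` helper are illustrative assumptions, not part of any official tooling; in practice you would count tokens with the actual T5 tokenizer.

```python
# Rough prompt-length check against the recommended 256-token T5 budget.
# TOKENS_PER_WORD is a crude illustrative heuristic, not a T5 constant;
# use the real T5 tokenizer for an exact count.
T5_TOKEN_LIMIT = 256
TOKENS_PER_WORD = 1.3  # assumption for this sketch

def estimate_t5_tokens(prompt: str) -> int:
    """Very rough token estimate from the whitespace word count."""
    return int(len(prompt.split()) * TOKENS_PER_WORD)

def check_prompt(prompt: str) -> bool:
    """Return True if the prompt likely fits the 256-token budget."""
    est = estimate_t5_tokens(prompt)
    if est > T5_TOKEN_LIMIT:
        print(f"Warning: ~{est} tokens; trim below {T5_TOKEN_LIMIT} "
              "to avoid artifacts")
        return False
    return True

print(check_prompt(
    "A watercolor painting of a lighthouse at dawn, soft mist over the sea"
))
```

Keeping prompts as flowing natural-language descriptions, as recommended above, usually stays well inside this budget; the limit mainly matters for very long, comma-chained prompts.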