Stable Diffusion 2.1 is a latent text-to-image diffusion model developed by Stability AI as an iterative improvement over the version 2.0 release. It is designed to generate high-resolution images with improved color rendering and structural consistency. The model was trained on a filtered subset of the LAION-5B dataset; version 2.1 fine-tuned the 2.0 checkpoints with a less aggressive NSFW filter, which had over-pruned the 2.0 training data, improving the rendering of people and the overall aesthetic quality of outputs.
Technically, the model uses the OpenCLIP-ViT/H text encoder (a ViT-H/14 architecture), which offers a different semantic understanding than the OpenAI CLIP encoder used in version 1.5, so prompts that worked well in 1.5 often need rewording. The release ships as two checkpoints supporting native generation at 512x512 and 768x768 pixels, respectively. Negative prompting is also notably effective with this encoder, allowing users to explicitly list elements to exclude from the final image, such as distortions or unwanted objects.
Model Variations and Capabilities
In addition to the standard text-to-image model, the Stable Diffusion 2.x line includes specialized checkpoints, introduced with the 2.0 release, for specific tasks. These include an inpainting model optimized for editing and filling in masked regions of images, and a depth-to-image model that uses a depth map estimated from an input image to guide the spatial structure of the generation. A 4x latent upscaler model was also released to enhance the resolution of generated or existing images while preserving fine detail.