Stable Diffusion 1.5 is a latent text-to-image diffusion model that generates images from text descriptions. It was developed as a collaborative effort involving Runway, CompVis at LMU Munich, and Stability AI. The model is part of the v1 series and serves as a foundational weights release for many derivative models in the open-source community. Because it denoises in a compressed latent space rather than directly in pixel space, it performs image generation far more efficiently than standard pixel-space diffusion models.

## Architecture and Training

The architecture is centered on a U-Net with roughly 860 million parameters, conditioned on text through a frozen CLIP ViT-L/14 text encoder, which lets the model align textual concepts with visual representations effectively. Stable Diffusion 1.5 was trained on a subset of the LAION-5B dataset and then fine-tuned for additional steps on an aesthetics-filtered split, improving image quality and stability relative to version 1.4. A variational autoencoder (VAE) decodes the denoised latent representations into 512x512-pixel images.

## Capabilities and Prompting

The model supports several tasks, including text-to-image generation, image-to-image translation, and inpainting. The best results typically come from specific, descriptive prompts rather than conversational sentences, and negative prompts are standard practice for filtering out undesirable features. The model's open weights have enabled extensive fine-tuning and the development of tools like ControlNet, which provide more granular control over the generated visual structures.
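
The efficiency gain from latent diffusion can be sketched with simple arithmetic. The v1 models use a VAE that downsamples each spatial dimension by a factor of 8 and encodes into 4 latent channels, so the U-Net operates on a much smaller tensor than the output image:

```python
# Number of values in a pixel-space tensor for one 512x512 RGB image.
pixel_elements = 512 * 512 * 3

# The v1 VAE downsamples 8x per spatial dimension and uses 4 latent
# channels, so the U-Net denoises a 64x64x4 latent tensor instead.
latent_elements = (512 // 8) * (512 // 8) * 4

# The latent tensor is 48x smaller than the pixel tensor,
# which is the core reason latent diffusion is cheaper per step.
print(pixel_elements, latent_elements, pixel_elements // latent_elements)
```

This is only a rough proxy for compute cost (per-step FLOPs depend on the U-Net itself), but it illustrates why denoising in latent space is so much cheaper than pixel-space diffusion at the same output resolution.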
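
Negative prompts work through classifier-free guidance: at each denoising step the model predicts noise both for the conditional prompt and for an unconditional (or negative) prompt, then extrapolates away from the latter. A minimal numeric sketch of the guidance formula, using toy lists in place of real noise tensors (the default scale of 7.5 is a common convention, not stated in the text above):

```python
def cfg_noise(uncond_pred, cond_pred, guidance_scale=7.5):
    """Classifier-free guidance: push the final noise prediction away
    from the unconditional (or negative-prompt) prediction and toward
    the text-conditioned one, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(uncond_pred, cond_pred)]

# Toy three-element "noise predictions" standing in for full tensors.
uncond = [0.0, 1.0, -0.5]
cond = [0.2, 1.0, 0.5]
print(cfg_noise(uncond, cond))
```

Supplying a negative prompt simply replaces the empty unconditional prompt in `uncond_pred`, so the guidance step actively steers the sample away from the features it describes.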