OpenGVLab
Open Weights

Lumina Image v2

Released Jan 2025

AA Text→Image
#101
Parameters: 2.6B

Lumina-Image-2 is a high-resolution text-to-image generation model developed by OpenGVLab (Alpha-VLLM). Designed as a successor to the Lumina-Next series, it uses a 2.6-billion-parameter architecture to generate high-fidelity images at resolutions up to 1024x1024. It is built to balance aesthetic quality and prompt adherence against computational cost, with results that rival those of significantly larger diffusion models.

Architecture and Technical Details

The model is based on the Unified Next-DiT framework, a flow-based Large Diffusion Transformer that treats text and image tokens as one joint sequence within a single-stream architecture. Because the same set of parameters processes both modalities, cross-modal interaction happens inside every layer, rather than across the separate text and image streams used in traditional architectures. For text processing, Lumina-Image-2 uses Gemma-2-2B as its text encoder, while FLUX-VAE-16CH serves as the variational autoencoder that maps images to and from the 16-channel latent space. Training relies on Flow Matching and on captions produced by a specialized Unified Captioner (UniCap) system, which is intended to align visual outputs closely with complex natural language descriptions.
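The Flow Matching objective mentioned above can be illustrated with a minimal sketch. It assumes the common linear (rectified-flow style) path between noise and data, which the description here does not confirm for Lumina-Image-2; the function names, the time convention (t = 0 is noise, t = 1 is the clean latent), and the toy 4-element "latent" are all hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """For a linear path the velocity is constant: d/dt x_t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]   # stand-in for Gaussian noise
latent = [0.5, -1.0, 2.0, 0.0]                    # stand-in for a VAE latent
t = rng.random()                                  # random time in [0, 1)

# A stand-in "model" that already knows the answer, so its loss is zero.
oracle = lambda x_t, t: target_velocity(noise, latent)
# A stand-in "model" that always predicts zero velocity.
zero_model = lambda x_t, t: [0.0] * len(x_t)

oracle_loss = flow_matching_loss(oracle, noise, latent, t)
zero_loss = flow_matching_loss(zero_model, noise, latent, t)
```

In a real training loop the stand-in models would be replaced by the transformer, conditioned on the text tokens, and the loss would be averaged over a batch of sampled times and noise draws.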

Key Capabilities

Lumina-Image-2 supports "any-resolution" generation and flexible aspect ratios, allowing for diverse creative applications ranging from photorealistic portraits to complex artistic illustrations. It performs notably well at rendering typography and at following long, descriptive prompts that require strong logical reasoning. The model also targets inference efficiency: it supports multiple solvers, such as Midpoint, Euler, and DPM, so users can trade generation speed against image quality.
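To make the solver trade-off concrete, here is a minimal sketch comparing an Euler step with a Midpoint step on a toy one-dimensional ODE. It is not the model's sampler: `toy_velocity` is a hypothetical stand-in for the learned velocity field, chosen because its exact solution is known (dx/dt = x with x(0) = 1 gives x(1) = e).

```python
import math

def euler_step(v, x, t, h):
    """First-order update: one velocity evaluation per step."""
    return x + h * v(x, t)

def midpoint_step(v, x, t, h):
    """Second-order update: two velocity evaluations per step."""
    x_half = x + 0.5 * h * v(x, t)           # provisional half step
    return x + h * v(x_half, t + 0.5 * h)    # full step using midpoint slope

def integrate(step, v, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with a fixed step size."""
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = step(v, x, i * h, h)
    return x

def toy_velocity(x, t):
    """Hypothetical velocity field with known solution x(t) = x0 * exp(t)."""
    return x

exact = math.e
euler_error = abs(integrate(euler_step, toy_velocity, 1.0, 8) - exact)
midpoint_error = abs(integrate(midpoint_step, toy_velocity, 1.0, 8) - exact)
print(f"Euler error:    {euler_error:.4f}")
print(f"Midpoint error: {midpoint_error:.4f}")
```

With the same number of steps, Midpoint is more accurate on this toy problem but evaluates the velocity twice per step. In a diffusion transformer each evaluation is a full forward pass, which is the speed-versus-quality trade-off a choice of solver controls.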

Prompting and Best Practices

For optimal results, users are encouraged to provide detailed, natural language prompts that specify style, lighting, and composition. The model is particularly responsive to system-level instructions that define a specific artistic persona, such as "professional photographer" or "impressionist painter." Including clear descriptors for textures and environmental details helps the model leverage its high-resolution synthesis capabilities effectively.
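The advice above can be sketched as a small prompt-building helper. The function `build_prompt`, its parameters, and the "You are a ..." wording are hypothetical illustrations; the exact system-prompt format the model expects is not specified here and should be taken from the official model documentation.

```python
def build_prompt(persona, subject, style="", lighting="", composition="", details=()):
    """Compose a persona instruction plus a detailed natural-language description.

    Empty fields are skipped so the prompt contains only what was specified.
    """
    sentences = [f"You are a {persona}.", f"{subject[0].upper()}{subject[1:]}."]
    for label, value in (("Style", style), ("Lighting", lighting),
                         ("Composition", composition)):
        if value:
            sentences.append(f"{label}: {value}.")
    if details:
        sentences.append("Details: " + ", ".join(details) + ".")
    return " ".join(sentences)

prompt = build_prompt(
    persona="professional photographer",
    subject="a weathered fisherman mending nets on a wooden pier",
    style="documentary portrait",
    lighting="soft golden-hour backlight",
    composition="close-up with shallow depth of field",
    details=["coarse wet rope", "mist over the harbor"],
)
print(prompt)
```

Keeping style, lighting, composition, and texture details as separate fields makes it easy to vary one aspect at a time and compare the resulting images.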

Rankings & Comparison