Z-Image Base is an open-source text-to-image foundation model developed by Alibaba's Tongyi Lab. It is a 6-billion-parameter model designed to provide high visual fidelity, strong prompt adherence, and creative flexibility. As the non-distilled core of the Z-Image family, it serves as a high-capacity checkpoint for advanced generation tasks and as a foundation for community-driven fine-tuning.
The model utilizes a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. In this design, text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to create a unified input stream, which enhances parameter efficiency. It incorporates a Qwen-based text encoder and uses the same Variational Autoencoder (VAE) as the Flux.1 model family.
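The single-stream idea can be illustrated with a minimal sketch: rather than routing text and image features through separate branches, all token types are projected into one hidden dimension and concatenated along the sequence axis before entering the transformer. The token counts and hidden size below are illustrative placeholders, not Z-Image's actual configuration.

```python
import numpy as np

# Hypothetical token counts and hidden size, for illustration only.
d_model = 64
text_tokens = np.random.randn(77, d_model)       # from the Qwen-based text encoder
semantic_tokens = np.random.randn(256, d_model)  # visual semantic tokens
vae_tokens = np.random.randn(1024, d_model)      # image latent tokens from the VAE

# Single-stream design: all modalities share one sequence, so a single set of
# transformer weights attends jointly over text and image tokens.
stream = np.concatenate([text_tokens, semantic_tokens, vae_tokens], axis=0)
print(stream.shape)  # (1357, 64)
```

Because every layer sees the full mixed sequence, no parameters are duplicated across modality-specific branches, which is the source of the parameter efficiency mentioned above.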
Key Capabilities
Unlike distilled variants optimized for speed, Z-Image Base supports full Classifier-Free Guidance (CFG) and is highly responsive to negative prompts. This enables precise control over composition and style, from hyper-realistic photography to intricate digital art. The model is notably strong in bilingual text rendering, accurately displaying complex English and Chinese characters within generated images.
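Classifier-free guidance follows a standard formulation: at each denoising step the model produces an unconditional and a conditional noise prediction, and the final prediction extrapolates from the former toward the latter. A negative prompt replaces the plain unconditional branch, so the extrapolation also pushes away from the negated content. A minimal numeric sketch (toy values, not model outputs):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    (or negative-prompt) prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy per-element noise predictions for illustration.
eps_uncond = np.array([0.2, -0.1])
eps_cond = np.array([0.5, 0.3])

print(cfg_combine(eps_uncond, eps_cond, 1.0))  # scale 1.0 reproduces eps_cond
print(cfg_combine(eps_uncond, eps_cond, 4.0))  # higher scales push further from eps_uncond
```

Scales above 1.0 amplify whatever distinguishes the conditional prediction from the unconditional one, which is why higher guidance tightens prompt adherence at the cost of some diversity.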
Usage and Development
Z-Image Base is optimized for resolutions up to 2048x2048 and generally requires between 30 and 50 sampling steps with a recommended guidance scale of 3.0 to 5.0 for high-quality results. Its non-distilled nature makes it the preferred starting point for training specialized LoRAs, performing style transfers, or developing structural conditioning through ControlNet pipelines.
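The recommended settings above can be captured in a small configuration check. This is a hypothetical helper for illustration, not part of any Z-Image API; the constants simply encode the ranges stated in this section.

```python
# Recommended sampling settings for Z-Image Base, as stated above.
RECOMMENDED = {
    "max_resolution": (2048, 2048),
    "steps_range": (30, 50),
    "guidance_range": (3.0, 5.0),
}

def validate_settings(width, height, steps, guidance_scale):
    """Hypothetical pre-flight check: return True if the requested generation
    parameters fall within the recommended envelope."""
    return (
        width <= RECOMMENDED["max_resolution"][0]
        and height <= RECOMMENDED["max_resolution"][1]
        and RECOMMENDED["steps_range"][0] <= steps <= RECOMMENDED["steps_range"][1]
        and RECOMMENDED["guidance_range"][0] <= guidance_scale <= RECOMMENDED["guidance_range"][1]
    )

print(validate_settings(1024, 1024, 40, 4.0))   # True: within all recommended ranges
print(validate_settings(2048, 2048, 20, 4.0))   # False: too few sampling steps
```

A check like this is useful in batch pipelines, where out-of-range guidance or step counts silently degrade quality rather than raising errors.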