GLM-Image is an open-source image generation model developed by Zhipu AI (Z.ai). It employs a hybrid architecture that integrates an autoregressive transformer with a diffusion decoder to address common limitations in instruction following and text rendering. By separating global semantic planning from local detail refinement, the model excels in "knowledge-intensive" tasks such as generating scientific diagrams, commercial posters, and multi-panel comics that require both logical layout and high-fidelity textures.
Architecture and Design
The model consists of 16 billion parameters in total. The primary components are a 9B autoregressive module (initialized from GLM-4-9B) and a 7B diffusion decoder based on a single-stream Diffusion Transformer (DiT) architecture. During generation, the autoregressive stage produces discrete visual tokens that define the image's layout and logic, while the diffusion stage decodes these tokens into high-resolution visual output. To improve text rendering accuracy, the model incorporates a Glyph-ByT5 text encoder, which supplies character-level glyph information.
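The two-stage flow described above can be sketched in miniature. This is a hypothetical illustration only: the class names, codebook size, and toy logic are invented stand-ins, not GLM-Image's actual API or internals. It shows the division of labor — an autoregressive planner emits discrete visual tokens, and a separate decoder turns those tokens into pixels.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LayoutPlan:
    tokens: List[int]  # discrete visual tokens encoding layout and semantics

class AutoregressiveStage:
    """Hypothetical stand-in for the 9B AR module: plans the image as tokens."""
    def plan(self, prompt: str, n_tokens: int = 16) -> LayoutPlan:
        # Toy deterministic "tokenization": hash prompt characters into a
        # small codebook. A real model samples tokens from a transformer.
        codebook_size = 8192
        tokens = [(ord(c) * 31 + i) % codebook_size
                  for i, c in enumerate(prompt)][:n_tokens]
        return LayoutPlan(tokens=tokens)

class DiffusionDecoder:
    """Hypothetical stand-in for the 7B DiT decoder: refines tokens to pixels."""
    def decode(self, plan: LayoutPlan, height: int, width: int) -> list:
        # Toy "image": a height x width grid seeded from the token sequence.
        # A real decoder runs iterative denoising conditioned on the tokens.
        seed = sum(plan.tokens) % 256
        return [[seed for _ in range(width)] for _ in range(height)]

def generate(prompt: str, height: int = 64, width: int = 64):
    plan = AutoregressiveStage().plan(prompt)               # stage 1: global layout
    image = DiffusionDecoder().decode(plan, height, width)  # stage 2: local detail
    return plan, image
```

The key architectural point the sketch preserves is that the planner and decoder are separate modules with a narrow token interface between them, which is what lets global semantic planning be trained and optimized independently of pixel-level refinement.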
Key Capabilities
GLM-Image supports both text-to-image and image-to-image tasks, including style transfer, background replacement, and identity preservation. It is notable for its performance in rendering multi-region and multi-language text, achieving a Word Accuracy score of 0.9116 on the CVTG-2K benchmark. The model natively supports variable aspect ratios and can generate images at resolutions up to 2048px. It was trained with a decoupled reinforcement learning strategy based on GRPO (Group Relative Policy Optimization), optimizing semantic alignment and visual appeal separately.
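Variable-aspect-ratio generation typically requires snapping a requested shape to model-friendly dimensions. The helper below is an illustrative sketch of that kind of preprocessing, not GLM-Image's actual code; the target area, rounding multiple, and the 2048px cap (taken from the resolution limit above) are assumptions.

```python
import math

def pick_resolution(aspect_ratio: float,
                    target_area: int = 1024 * 1024,
                    multiple: int = 64,
                    max_side: int = 2048) -> tuple:
    """Choose (width, height) with roughly target_area pixels matching
    aspect_ratio, rounded to a model-friendly multiple and capped at
    max_side. Hypothetical helper for illustration only."""
    width = math.sqrt(target_area * aspect_ratio)
    height = width / aspect_ratio

    def snap(x: float) -> int:
        # Round to the nearest multiple, then clamp to [multiple, max_side].
        return max(multiple, min(max_side, multiple * round(x / multiple)))

    return snap(width), snap(height)
```

For example, a 16:9 request at the default target area snaps to 1344x768, while a square request yields 1024x1024; keeping the pixel count roughly constant across shapes keeps compute and memory per image predictable.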