Janus-Pro is a unified multimodal understanding and generation model developed by DeepSeek. A significant advancement over the original Janus framework, it is designed to perform both visual reasoning and high-fidelity text-to-image synthesis within a single autoregressive transformer architecture. Unlike many multimodal models that route all tasks through a shared visual encoder, Janus-Pro employs a decoupled visual encoding strategy to prevent performance trade-offs between understanding and generation.
The architecture utilizes a specialized SigLIP-L vision encoder to extract high-dimensional semantic features for image understanding and a discrete VQ tokenizer for image generation. These separate pathways map visual information into a shared input space, where a core transformer backbone processes multimodal sequences. This design allows the model to achieve high performance in diverse tasks, including visual question answering (VQA), chart and document analysis, and complex instruction-following for creative synthesis.
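The decoupled design described above can be sketched in a few lines of toy Python. This is an illustrative stand-in only: the real model uses a SigLIP-L encoder and a learned VQ tokenizer, whereas here both pathways are hash-based fakes, and the dimensions, codebook size, and adaptor functions are assumptions made for the sketch.

```python
# Toy sketch of Janus-Pro-style decoupled visual encoding (illustrative only).
# The understanding path yields continuous features; the generation path yields
# discrete token ids. Separate adaptors map both into one shared input space.
import hashlib
from typing import Callable, List

EMBED_DIM = 8  # toy dimensionality; the real shared space is much larger


def understanding_encoder(image: bytes) -> List[List[float]]:
    """Stand-in for SigLIP-L: continuous semantic features per image patch."""
    digest = hashlib.sha256(image).digest()
    # One fake "patch feature" per EMBED_DIM digest bytes, scaled to [0, 1).
    return [[b / 256.0 for b in digest[i:i + EMBED_DIM]]
            for i in range(0, 32, EMBED_DIM)]


def generation_tokenizer(image: bytes, codebook_size: int = 16384) -> List[int]:
    """Stand-in for the VQ tokenizer: discrete token ids for generation."""
    digest = hashlib.sha256(image).digest()
    return [int.from_bytes(digest[i:i + 2], "big") % codebook_size
            for i in range(0, 32, 2)]


def to_shared_space(items, adaptor: Callable) -> List[List[float]]:
    """Each pathway has its own adaptor projecting into the shared space."""
    return [adaptor(x) for x in items]


image = b"fake image bytes"
# Understanding path: continuous features -> shared space (identity adaptor).
und_inputs = to_shared_space(understanding_encoder(image), lambda f: f)
# Generation path: discrete ids -> toy hash-based embedding lookup.
gen_inputs = to_shared_space(
    generation_tokenizer(image),
    lambda tid: [((tid * 31 + d) % 97) / 97.0 for d in range(EMBED_DIM)],
)
# Both pathways produce sequences of EMBED_DIM-dim vectors, so a single
# transformer backbone can consume either kind of multimodal sequence.
assert all(len(v) == EMBED_DIM for v in und_inputs + gen_inputs)
```

The point of the sketch is structural: because the two encoders never share weights, the understanding features and the generation codebook can each be optimized for their own task, and only the adaptor outputs need to agree on a common input format for the backbone.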
Janus-Pro was trained on significantly expanded datasets, incorporating approximately 90 million samples for multimodal understanding and 72 million high-quality synthetic aesthetic samples for generation. This balanced data approach improves the stability of generated outputs and enhances the model's ability to follow precise spatial and stylistic prompts. The model is released in two sizes: a lightweight 1B version (built on a 1.5B-parameter base language model) and a more capable 7B version.
For text-to-image generation, the model follows an autoregressive process, predicting discrete visual tokens conditioned on the text input. DeepSeek recommends specific prompt formats for optimal results, typically the prefix "Generate an image: " followed by a descriptive prompt. The model performs competitively on benchmarks such as GenEval and DPG-Bench, with particular strength in single-object accuracy and positional alignment.
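The autoregressive generation loop described above can be sketched as follows. This is a hypothetical sketch, not the real inference code: `next_token_logits` stands in for the transformer forward pass, and the codebook size and token-grid dimensions are illustrative assumptions; only the "Generate an image: " prefix comes from the text above, and the exact chat template used in practice may differ.

```python
# Hypothetical sketch of autoregressive text-to-image token generation.
# A real implementation would call the transformer for logits and then pass
# the finished token grid to the VQ decoder to render pixels.
import math
import random
from typing import List

CODEBOOK_SIZE = 16384   # assumed VQ codebook size
GRID = 4                # toy 4x4 token grid; real grids are much larger


def format_prompt(description: str) -> str:
    """Prompt convention noted above; the full chat template may differ."""
    return f"Generate an image: {description}"


def next_token_logits(prompt: str, generated: List[int]) -> List[float]:
    """Stand-in for the model: pseudo-logits derived from the context."""
    seed = hash((prompt, tuple(generated))) & 0xFFFFFFFF
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(CODEBOOK_SIZE)]


def sample(logits: List[float], temperature: float = 1.0) -> int:
    """Softmax sampling over the visual-token codebook."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    r = random.Random(0).random() * sum(probs)
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if acc >= r:
            return i
    return len(probs) - 1


prompt = format_prompt("a red cube on a wooden table")
tokens: List[int] = []
for _ in range(GRID * GRID):          # one discrete token per image patch
    tokens.append(sample(next_token_logits(prompt, tokens)))
# `tokens` would now be handed to the VQ decoder to produce the final image.
```

The loop is the same next-token prediction used for text, just over a visual-token vocabulary, which is why a single transformer backbone can serve both modalities.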