Google logo
Google
Open Weights

Gemma 3 12B Instruct

Released Mar 2025

Gemma 3 12B Instruct is a multimodal open model developed by Google, built using the same research and technology as the Gemini family. It is designed to natively process both text and image inputs within a unified transformer architecture, enabling it to understand spatial relationships and visual details more effectively than models using separate vision encoders. The "Instruct" variant is fine-tuned for conversational interactions, following complex instructions, and cross-modal reasoning. The model features a 128,000-token context window and supports more than 140 languages. Its native multimodal design allows for integrated tasks such as visual question answering, document analysis with embedded images, and sophisticated image-to-text generation. With 12 billion parameters, it is optimized to provide a balance between computational efficiency and high performance in tasks like coding, mathematics, and logic. Gemma 3 12B was trained on a dataset of 12 trillion tokens, incorporating a wide variety of web documents, code, and visual data. It utilizes a SigLIP-based vision encoder and a new tokenizer optimized for multilingual performance, making it suitable for global applications and agentic systems that require both visual and textual understanding.

Rankings & Comparison