Google
Open Weights

Gemma 4 26B A4B (Non-reasoning)

Released Apr 2026

Intelligence: #139
Coding: #99
Context: 256K
Parameters: 26B

Gemma 4 26B A4B is a sparse Mixture-of-Experts (MoE) language model released by Google DeepMind as part of the Gemma 4 family of open-weight models. Built using research derived from the Gemini 3 series, the model is designed to provide high intelligence-per-parameter by balancing a large total parameter count with an efficient active parameter footprint. The "A4B" designation refers to its approximately 4 billion active parameters during inference, which allows the model to deliver performance comparable to a 31B dense model while maintaining the speed and efficiency of a much smaller variant.
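The card does not break down how the 26B total parameters split between expert and non-expert weights, so the component sizes below are hypothetical placeholders, chosen only to illustrate how a sparse MoE with 8 of 128 experts routed (plus 1 shared expert) can hold ~26B parameters while activating only ~4B per token:

```python
# Illustrative arithmetic only: the per-component parameter split for
# Gemma 4 26B A4B is not published in this card, so `expert_params` and
# `non_expert_params` are hypothetical values chosen to land near
# 26B total / 4B active.
TOTAL_EXPERTS = 128
ACTIVE_EXPERTS = 8    # routed experts per token
SHARED_EXPERTS = 1    # always-on shared expert

expert_params = 0.18e9      # assumed parameters per expert (hypothetical)
non_expert_params = 2.3e9   # assumed attention/embeddings/etc. (hypothetical)

total = non_expert_params + TOTAL_EXPERTS * expert_params
active = non_expert_params + (ACTIVE_EXPERTS + SHARED_EXPERTS) * expert_params

print(f"total  = {total / 1e9:.1f}B")   # 25.3B
print(f"active = {active / 1e9:.1f}B")  # 3.9B
```

The point of the exercise: only the routed and shared experts contribute to per-token compute, so the active footprint stays small even as total capacity grows with the expert count.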

Architecture and Performance

The architecture uses 128 total experts, with 8 routed experts plus 1 shared expert active per token. It incorporates a hybrid attention mechanism that alternates between local sliding-window attention and global full-context attention. To support its 256,000-token context window, the model uses Proportional RoPE (p-RoPE) and unified Keys and Values (KV) in its global layers, minimizing quality degradation over long-range sequences. The model is released under the permissive Apache 2.0 license, making it suitable for both commercial and research applications.
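The "8 routed plus 1 shared expert" layout described above can be sketched as a standard top-k softmax-gated MoE layer. This is a toy illustration, not Gemma 4's actual router: the real gating function, expert architecture, and load-balancing details are not given in this card, and the random-projection "experts" here are stand-ins.

```python
import numpy as np

# Toy sketch of top-k MoE routing with one always-on shared expert,
# assuming a softmax router over the selected experts (hypothetical;
# Gemma 4's real routing details are not published here).
rng = np.random.default_rng(0)
d_model, num_experts, top_k = 64, 128, 8

def expert(idx, x):
    # Stand-in expert: a fixed random projection, seeded per expert index.
    w = np.random.default_rng(int(idx)).standard_normal((d_model, d_model))
    return x @ (w / np.sqrt(d_model))

shared_w = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
router_w = rng.standard_normal((d_model, num_experts))

def moe_layer(x):
    logits = x @ router_w                # one router score per expert
    top = np.argsort(logits)[-top_k:]    # indices of the 8 routed experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                 # softmax over the selected experts
    out = x @ shared_w                   # shared expert runs for every token
    for g, idx in zip(gates, top):
        out = out + g * expert(idx, x)   # weighted sum of routed experts
    return out

x = rng.standard_normal(d_model)
y = moe_layer(x)
print(y.shape)  # (64,)
```

Per token, only 9 of the 128 expert blocks execute, which is exactly why the active parameter count stays near 4B despite the 26B total.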

Capabilities and Multimodality

Gemma 4 26B A4B is natively multimodal, supporting text and image inputs with variable aspect ratio handling. It can also process video sequences (up to 60 seconds at 1 frame per second) by treating them as sequences of frames. The model supports over 140 languages and includes native support for system prompts, function calling, and structured outputs. While the Gemma 4 series introduces a built-in "Thinking Mode" for step-by-step reasoning, this is a toggleable feature; in its standard non-reasoning mode, the model provides direct, high-speed responses optimized for agentic workflows and general-purpose text generation.
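The video handling described above (frames sampled at 1 fps, capped at 60 seconds) can be sketched as a small preprocessing helper. The function name and return format are illustrative, not part of any official Gemma client API; only the 1 fps rate and 60-second cap come from the text.

```python
# Sketch of choosing frame timestamps for video input at 1 fps, clipped
# to the 60-second maximum described above. The helper itself is
# hypothetical; actual frame extraction/encoding is client-specific.
def sample_frame_timestamps(duration_s: float, fps: float = 1.0,
                            max_s: float = 60.0) -> list[float]:
    """Return timestamps (seconds) at which to extract frames."""
    clipped = min(duration_s, max_s)   # model accepts at most 60 s of video
    step = 1.0 / fps
    t, stamps = 0.0, []
    while t < clipped:
        stamps.append(round(t, 3))
        t += step
    return stamps

print(sample_frame_timestamps(12.5))       # 13 timestamps: 0.0 through 12.0
print(len(sample_frame_timestamps(300)))   # 300 s clipped to 60 s -> 60 frames
```

Each extracted frame is then passed to the model as an ordinary image input, in temporal order, since the model treats video as a sequence of frames.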

Implementation Notes

For optimal performance, the recommended sampling configuration includes a temperature of 1.0, top_p of 0.95, and top_k of 64. When multimodal inputs are used, users are advised to place image content before text in the prompt. For long-context tasks exceeding 100,000 tokens, increasing the repeat penalty (e.g., to 1.17) and lowering the temperature can help prevent repetitive loops during generation. The unquantized weights are optimized for high-end server hardware, while quantized versions are specifically designed to run on consumer-grade GPUs.
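The recommended settings (temperature 1.0, top_p 0.95, top_k 64) compose as: scale logits by temperature, keep the 64 highest, then take the smallest prefix whose probability mass reaches 0.95, and sample from it. The plain-Python sketch below shows that pipeline on raw logits; it is illustrative, not the actual serving implementation.

```python
import math
import random

# Sketch of the recommended sampling pipeline (temperature=1.0,
# top_p=0.95, top_k=64) applied to a raw logit list. Illustrative only;
# real inference stacks implement this over full vocabulary tensors.
def sample(logits, temperature=1.0, top_p=0.95, top_k=64, seed=None):
    scaled = sorted(((l / temperature, i) for i, l in enumerate(logits)),
                    reverse=True)
    scaled = scaled[:top_k]                     # top-k filter
    m = scaled[0][0]
    probs = [(math.exp(l - m), i) for l, i in scaled]
    z = sum(p for p, _ in probs)
    probs = [(p / z, i) for p, i in probs]      # softmax over survivors
    kept, cum = [], 0.0
    for p, i in probs:                          # nucleus (top-p): smallest
        kept.append((p, i))                     # set with mass >= top_p
        cum += p
        if cum >= top_p:
            break
    z = sum(p for p, _ in kept)
    r, acc = random.Random(seed).random() * z, 0.0
    for p, i in kept:                           # draw from the kept set
        acc += p
        if acc >= r:
            return i
    return kept[-1][1]

print(sample([2.0, 1.0, 0.5, -1.0], seed=0))
```

For the long-context case the card describes, the same loop would simply be run with a lower `temperature` and an added repeat penalty (e.g., 1.17) dividing the logits of recently emitted tokens before this filtering step.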

Rankings & Comparison