NVIDIA
Open Weights

Nemotron Cascade 2 30B A3B

Released Mar 2026

Intelligence
#133
Coding
#124
Context: 262K
Parameters: 30B

Nemotron Cascade 2 30B A3B is an open-weight language model developed by NVIDIA, built on a Mixture-of-Experts (MoE) architecture. It has 30 billion total parameters, of which 3 billion are active during any single inference pass. This design is engineered for high "intelligence density": advanced reasoning and coding capabilities comparable to much larger frontier models, while remaining efficient enough to deploy on local hardware.

The model is particularly noted for its specialized performance in mathematics and competitive programming. It achieved gold-medal-level scores on the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals. This specialization involves a trade-off, however: NVIDIA reports lower performance on general-knowledge-intensive benchmarks relative to the model's reasoning and instruction-following strengths.

Architecture and Training

Nemotron Cascade 2 was developed through a post-training pipeline starting from the Nemotron-3-Nano-30B-A3B-Base model. Training combines Cascade RL, a sequential domain-wise reinforcement learning framework, with Multi-Domain On-Policy Distillation (MOPD), in which the model is distilled from its own strongest domain-specific checkpoints to stabilize training and recover performance across varied tasks such as STEM reasoning, tool calling, and structured output.
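The interaction between the two stages can be illustrated with a toy sketch: sequential per-domain RL causes earlier domains to decay, and distillation from per-domain checkpoints recovers some of that loss. All names, the decay factor, and the "model" itself are illustrative assumptions, not NVIDIA's actual pipeline.

```python
# Toy sketch of Cascade RL (sequential domain-wise RL) followed by
# Multi-Domain On-Policy Distillation (MOPD). Skill scores, decay,
# and update rules are placeholders chosen only to show the flow.
from copy import deepcopy

def rl_step(model, domain):
    # RL update on one domain; other domains decay slightly,
    # a stand-in for the forgetting that MOPD is meant to repair.
    skills = model["skills"]
    for other in skills:
        if other != domain:
            skills[other] *= 0.9
    skills[domain] = skills.get(domain, 0.0) + 1.0
    return model

def cascade_rl(model, domains, steps_per_domain=3):
    """Train domains one after another; snapshot each domain's checkpoint."""
    checkpoints = {}
    for domain in domains:
        for _ in range(steps_per_domain):
            model = rl_step(model, domain)
        checkpoints[domain] = deepcopy(model)  # strongest checkpoint for this domain
    return model, checkpoints

def mopd(student, checkpoints, weight=0.5):
    """Pull the student back toward each domain teacher's skill level."""
    for domain, teacher in checkpoints.items():
        t = teacher["skills"].get(domain, 0.0)
        s = student["skills"].get(domain, 0.0)
        student["skills"][domain] = s + weight * (t - s)
    return student

model = {"skills": {}}
domains = ["stem_reasoning", "tool_calling", "structured_output"]
model, ckpts = cascade_rl(model, domains)
decayed_stem = model["skills"]["stem_reasoning"]  # eroded by later stages
model = mopd(model, ckpts)
```

In this sketch the first-trained domain decays while later domains train, and MOPD restores part of that performance from the saved checkpoint, mirroring the stabilizing role the section describes.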

Capabilities and Features

The model supports two distinct operational modes: Instruct Mode for concise responses and Thinking Mode for complex reasoning. Thinking Mode enables the model to perform extended chain-of-thought processing, which is crucial for solving difficult math and logic problems. The model also offers a substantial 262,144-token (262K) context window, allowing it to process long documents or entire code repositories.

  • Prompting: The model follows the ChatML template. Thinking content is wrapped in <think> and </think> tags.
  • Control: To skip the thinking process and force an immediate direct response, users can prepend <think></think> to the assistant's turn.
  • Agentic Workflows: It is optimized for agentic tasks, particularly in software engineering environments, supporting multi-turn tool use and reasoning traces.
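The prompting conventions above can be sketched as plain string assembly. The role markers follow the standard ChatML format; the example system prompt is an assumption, not NVIDIA's recommended one, and real deployments would typically use a tokenizer's chat template instead.

```python
# Minimal sketch of a ChatML-style prompt for the two modes described above.
# The system message is illustrative; only the ChatML markers and the
# <think></think> prefill trick come from the model's documented conventions.

def chatml_prompt(user_message, system="You are a helpful assistant.",
                  thinking=True):
    prompt = (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    if not thinking:
        # Prefilling an empty think block suppresses extended
        # chain-of-thought and forces a direct (Instruct Mode) answer.
        prompt += "<think></think>"
    return prompt

direct = chatml_prompt("What is 17 * 24?", thinking=False)
reasoned = chatml_prompt("Prove that sqrt(2) is irrational.")
```

With `thinking=True` the assistant turn is left open so the model emits its own <think>…</think> block before the final answer.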

Rankings & Comparison