NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 is a 30-billion-parameter language model built on a hybrid Mixture-of-Experts (MoE) and Mamba-2 architecture. It activates approximately 3.5 billion parameters per token and is designed as a unified solution for both reasoning and general instruction-following tasks. The model is optimized for high-efficiency agentic workflows, long-context processing, and technical tasks such as coding and mathematics.
Architecture and Design
The model is a hybrid design consisting of 52 total layers: 23 Mamba-2 layers, 23 MoE layers, and 6 Grouped-Query Attention (GQA) layers. Each MoE layer contains 128 routed experts plus one shared expert, with six routed experts activated per token. This blend aims to combine the long-range sequence-modeling strengths of Mamba-2 with the precision of transformer-based attention and the computational efficiency of sparse MoE routing.
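The per-token routing described above (top-6 of 128 routed experts, plus one always-active shared expert) can be sketched as follows. This is a minimal illustration of generic top-k softmax routing, not the model's actual router implementation; all shapes, names, and the toy dimensions are assumptions.

```python
import numpy as np

def moe_route(hidden, gate_w, k=6):
    """Select top-k routed experts per token from softmax router scores.

    hidden: (tokens, d_model) activations; gate_w: (d_model, n_experts) router.
    Returns expert indices and renormalized mixing weights per token.
    """
    logits = hidden @ gate_w                                # (tokens, 128)
    probs = np.exp(logits - logits.max(-1, keepdims=True))  # stable softmax
    probs /= probs.sum(-1, keepdims=True)
    topk = np.argsort(probs, axis=-1)[:, -k:]               # 6 selected experts
    weights = np.take_along_axis(probs, topk, axis=-1)
    weights /= weights.sum(-1, keepdims=True)               # renormalize over top-k
    return topk, weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 64))    # 4 tokens, toy hidden size
gate_w = rng.standard_normal((64, 128))  # router over 128 routed experts
topk, weights = moe_route(hidden, gate_w)
# The shared expert's output would be added unconditionally for every token.
```

Only the selected experts run for each token, which is what keeps the active parameter count near 3.5B despite the 30B total.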
Key Capabilities
A primary feature of Nemotron-3 Nano is its configurable reasoning mode, which lets the model generate intermediate reasoning traces before producing a final response. This "thinking" process can be toggled via the chat template to improve performance on complex logic and deep-reasoning tasks. Additionally, the model supports a context window of up to 1,000,000 tokens, enabling the analysis of extensive document collections or entire software repositories.
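One way a chat-template toggle like this typically works is by injecting a mode marker into the system turn. The sketch below is purely illustrative: the control tokens (`/think`, `/no_think`) and the turn delimiters are assumptions, not the model's documented template.

```python
# Hypothetical prompt builder showing how a reasoning toggle could be
# wired into a chat template. Token names are illustrative assumptions.
def build_prompt(messages, thinking=True):
    mode = "/think" if thinking else "/no_think"
    parts = [f"<|system|>{mode}"]          # mode marker in the system turn
    for m in messages:
        parts.append(f"<|{m['role']}|>{m['content']}")
    parts.append("<|assistant|>")          # cue the model to respond
    return "\n".join(parts)

msgs = [{"role": "user", "content": "Is 91 prime?"}]
prompt = build_prompt(msgs, thinking=True)
```

With `thinking=True`, the model would be expected to emit a reasoning trace before its answer; with `thinking=False`, it answers directly. In practice this switching is handled by the tokenizer's chat template rather than hand-built strings.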
Training and Multilingual Support
The model was trained from scratch on a corpus of approximately 25 trillion tokens, incorporating high-quality curated and synthetically generated data. It supports 20 natural languages, including English, German, Spanish, French, Italian, and Japanese, as well as over 40 programming languages. Training used the Megatron-LM framework with a Warmup-Stable-Decay (WSD) learning rate schedule to maintain stability across the large token budget.
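The shape of a WSD schedule is simple to express: a linear warmup, a long flat plateau at the peak learning rate, and a final decay to a floor. The sketch below shows that shape only; the peak rate, warmup and decay fractions, floor value, and linear decay form are illustrative assumptions, not the model's actual training hyperparameters.

```python
def wsd_lr(step, total, peak=3e-4, warmup_frac=0.01, decay_frac=0.1, floor=3e-5):
    """Warmup-Stable-Decay schedule: warmup -> flat plateau -> decay to floor.

    All hyperparameter defaults here are illustrative, not the published config.
    """
    warmup = int(total * warmup_frac)            # linear ramp-up phase
    decay_start = int(total * (1 - decay_frac))  # plateau ends here
    if step < warmup:
        return peak * step / max(warmup, 1)
    if step < decay_start:
        return peak                              # stable phase: constant LR
    frac = (step - decay_start) / max(total - decay_start, 1)
    return peak - (peak - floor) * frac          # linear decay to the floor
```

The long stable phase is what makes WSD attractive for very large token budgets: training can be extended or checkpointed mid-plateau without committing to a decay horizon up front, unlike a cosine schedule.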