NVIDIA Nemotron Nano 9B V2 is a 9-billion-parameter large language model developed by NVIDIA, designed as a unified model for both reasoning and standard instruction-following tasks. It is a pruned and distilled variant of Nemotron-Nano-12B-v2, optimized for high efficiency and low-latency inference on edge devices. A single checkpoint serves both modes: when the system prompt is configured to suppress intermediate reasoning traces, the model operates as a "non-reasoning" variant, returning direct answers for general-purpose language tasks while maintaining high accuracy.
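The mode switch above can be sketched as a difference in the chat request's system message. This is a minimal illustration assuming an OpenAI-style message format; the "/think" and "/no_think" control strings follow NVIDIA's published convention for Nemotron Nano V2, but the exact strings should be verified against the model card for your release.

```python
# Sketch of toggling reasoning via the system prompt. The "/think" and
# "/no_think" directives are assumed from NVIDIA's Nemotron Nano V2
# convention; confirm against the model card before use.

def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Assemble a chat-format request that enables or suppresses
    intermediate reasoning traces."""
    system = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# Reasoning mode: the model emits a trace before its final answer.
thinking = build_messages("Solve 17 * 24 step by step.", reasoning=True)

# Non-reasoning mode: the model answers directly.
direct = build_messages("Summarize this paragraph.", reasoning=False)
```

The resulting message list can be passed to any serving stack (for example, a chat-template-aware tokenizer or an OpenAI-compatible endpoint) without changing the rest of the request.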
Architecture and Capabilities
The model uses a hybrid Mamba-2/Transformer architecture, combining the linear-time sequence processing of Mamba-2 layers with a small number of attention layers to balance inference speed, memory usage, and modeling quality. This design lets the model handle a context window of up to 128,000 tokens on a single GPU. It was trained on approximately 20 trillion pre-training tokens and further refined with 1 trillion post-training tokens, incorporating a significant amount of high-quality synthetic data for alignment.
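A back-of-envelope calculation shows why replacing most attention layers with Mamba-2 matters at long context: attention layers must keep a KV cache that grows linearly with sequence length, while Mamba-2 layers keep a fixed-size state. The layer counts and head dimensions below are illustrative assumptions, not the published Nemotron configuration.

```python
# Rough KV-cache size comparison at a 128k-token context. All
# architectural numbers here (40 vs 4 attention layers, 8 KV heads,
# head_dim 128, fp16 values) are illustrative assumptions.

def kv_cache_bytes(n_attn_layers: int, seq_len: int,
                   n_kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_val: int = 2) -> int:
    # Keys + values (factor of 2), per attention layer, per token.
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_val

seq = 128_000
full_attention = kv_cache_bytes(n_attn_layers=40, seq_len=seq)  # pure Transformer
hybrid = kv_cache_bytes(n_attn_layers=4, seq_len=seq)           # mostly Mamba-2

print(f"full attention: {full_attention / 2**30:.1f} GiB")  # ~19.5 GiB
print(f"hybrid:         {hybrid / 2**30:.1f} GiB")          # ~2.0 GiB
```

Under these assumptions, cutting the attention layer count by 10x cuts cache memory at full context by the same factor, which is what makes a single-GPU 128k-token window plausible.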
Performance and Use Cases
Optimized for deployment in AI agents, chatbots, and retrieval-augmented generation (RAG) applications, Nemotron Nano 9B V2 supports more than 15 languages, including English, German, French, Spanish, Japanese, and Chinese. Its efficiency makes it well suited to on-device applications where computational resources are constrained, and it serves as a versatile tool for text summarization, coding assistance, and creative writing.
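The RAG use case mentioned above can be sketched in a few lines: retrieve the most relevant passage, then assemble a grounded prompt for the model. The keyword-overlap retriever and prompt wording here are deliberately simplistic illustrations, not part of the model or any NVIDIA tooling.

```python
# Minimal RAG sketch: keyword-overlap retrieval plus prompt assembly.
# Both the retriever and the prompt format are illustrative assumptions;
# production systems would use embedding-based retrieval.

def retrieve(query: str, passages: list[str]) -> str:
    """Return the passage sharing the most words with the query."""
    q = set(query.lower().split())
    return max(passages, key=lambda p: len(q & set(p.lower().split())))

def build_prompt(query: str, context: str) -> str:
    return (f"Answer using only the context below.\n"
            f"Context: {context}\n"
            f"Question: {query}")

docs = [
    "The hybrid architecture combines Mamba-2 with a few attention layers.",
    "The model supports a context window of up to 128,000 tokens.",
]
prompt = build_prompt("How long a context does the model support?",
                      retrieve("context window tokens", docs))
```

The assembled prompt would then be sent to the model as the user message; the large context window is what allows many retrieved passages to be packed into a single request.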