Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) is a compact large language model developed by NVIDIA, designed to deliver high-accuracy reasoning within a small parameter footprint. It is a derivative of Llama-3.1-Minitron-4B-Width-Base, which was created by applying NVIDIA's Minitron compression techniques (pruning and knowledge distillation) to the Llama 3.1 8B architecture.
The model is specialized for complex logical tasks, having undergone a multi-phase post-training process that includes Supervised Fine-Tuning (SFT) and Reward-aware Preference Optimization (RPO). These optimizations focus on enhancing performance in mathematics, coding, and tool-calling, allowing the model to function effectively in agentic AI workflows. It supports both "Reasoning On" and "Reasoning Off" modes to balance logical depth with response speed.
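As a minimal sketch of how the two modes might be selected, the snippet below builds a chat message list where the system prompt toggles reasoning. The exact system-prompt strings ("detailed thinking on" / "detailed thinking off") follow NVIDIA's published convention for Nemotron models, but should be verified against the model card for this specific release.

```python
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build a chat message list that selects the reasoning mode.

    The system-prompt strings below are the convention used by NVIDIA's
    Nemotron reasoning models; confirm them against the model card.
    """
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# "Reasoning On": the model emits its chain of thought before the answer.
messages = build_messages("Solve 12 * 17 step by step.", reasoning=True)
```

The resulting list can then be passed to a chat-template-aware tokenizer (for example, `tokenizer.apply_chat_template(messages, ...)` in Hugging Face Transformers) before generation.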
Equipped with a context window of 131,072 tokens, the model can process large datasets for tasks such as retrieval-augmented generation (RAG). Its 4-billion parameter size is specifically tailored for local deployment on edge devices and consumer-grade hardware, facilitating private and low-latency inference without requiring data center-scale resources.
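A back-of-envelope calculation illustrates why a 4-billion-parameter model fits on consumer hardware. The figures below are assumptions (exactly 4.0 billion parameters, 2 bytes per parameter for bfloat16 weights) and exclude the KV cache and activations, which grow with context length.

```python
# Rough weight-memory estimate for local deployment.
# Assumptions: 4.0e9 parameters, bfloat16 (2 bytes each);
# KV cache and activation memory are not included.
PARAMS = 4.0e9
BYTES_PER_PARAM = 2  # bfloat16

weight_gib = PARAMS * BYTES_PER_PARAM / 2**30
print(f"~{weight_gib:.1f} GiB of weights")  # ~7.5 GiB
```

At roughly 7.5 GiB of weights, the model fits in the VRAM of many consumer GPUs, leaving headroom for the KV cache; quantization (e.g., 8-bit or 4-bit) would shrink this further.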