Llama 3.3 Nemotron Super 49B v1 is a large language model developed by NVIDIA as a derivative of Meta's Llama 3.3 70B Instruct. It was designed to provide a balance between accuracy and efficiency, specifically optimized to fit on a single NVIDIA H100 or H200 GPU. The model was created using a novel Neural Architecture Search (NAS) approach, which reduced the parameter count from 70B to approximately 49B while maintaining performance levels close to the original reference model.
The architecture is a dense decoder-only Transformer with non-standard, non-repetitive blocks. Through the NAS process, certain attention layers are skipped or replaced with linear layers, and the expansion ratios in the Feed-Forward Network (FFN) layers vary from block to block. This customization yields significantly higher throughput and a reduced memory footprint compared to the standard Llama 3.3 architecture.
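To make the parameter savings concrete, here is a minimal, purely illustrative sketch (the block kinds, sizes, and ratios below are hypothetical, not NVIDIA's actual NAS configuration) of how skipping attention, replacing it with a linear map, and varying FFN expansion ratios shrink a decoder stack relative to a uniform one:

```python
def block_params(hidden, attn_kind, ffn_ratio):
    """Rough parameter count for one Transformer block (weights only).

    attn_kind: "full"   -> four projection matrices (Q, K, V, O)
               "linear" -> attention replaced by a single linear map
               "skip"   -> attention removed entirely
    ffn_ratio: FFN intermediate size as a multiple of hidden.
    """
    if attn_kind == "full":
        attn = 4 * hidden * hidden
    elif attn_kind == "linear":
        attn = hidden * hidden
    else:  # "skip"
        attn = 0
    # Gated FFN as in Llama: up, gate, and down projections.
    ffn = 3 * hidden * int(ffn_ratio * hidden)
    return attn + ffn

hidden = 1024  # toy size for illustration only

# Uniform baseline: every block has full attention and the same FFN ratio.
uniform = sum(block_params(hidden, "full", 3.5) for _ in range(16))

# Heterogeneous (NAS-style) stack: mixed attention kinds, varying ratios.
configs = [("full", 3.5), ("skip", 5.0), ("linear", 2.0), ("full", 1.0)] * 4
hetero = sum(block_params(hidden, kind, r) for kind, r in configs)

print(f"uniform: {uniform/1e6:.1f}M  heterogeneous: {hetero/1e6:.1f}M")
# -> uniform: 243.3M  heterogeneous: 182.5M
```

The same accounting at Llama-scale hidden sizes is how a search over per-block choices can trade roughly 30% of the parameters for near-reference accuracy.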
The model is a dual-mode system supporting both intensive reasoning and standard instruction-following tasks, toggled via the system prompt. In non-reasoning mode it functions as a high-efficiency general-purpose assistant for chat, Retrieval-Augmented Generation (RAG), and tool calling; NVIDIA recommends greedy decoding in this mode for optimal performance.
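As a sketch of how the toggle works in practice, the helper below builds a chat message list with the mode switch in the system prompt. The control phrase "detailed thinking on"/"detailed thinking off" follows NVIDIA's published model card, but treat the exact wording as an assumption and verify it against the current card before relying on it:

```python
def build_messages(user_prompt, reasoning=False):
    """Build a chat message list toggling Nemotron's reasoning mode.

    Assumes the model card's system-prompt switch: "detailed thinking on"
    enables reasoning mode, "detailed thinking off" disables it.
    """
    mode = "on" if reasoning else "off"
    return [
        {"role": "system", "content": f"detailed thinking {mode}"},
        {"role": "user", "content": user_prompt},
    ]

# Non-reasoning mode: pass do_sample=False (and leave temperature/top_p
# unset) to transformers' generate() to get the recommended greedy decoding.
messages = build_messages("Summarize the plot of Hamlet.", reasoning=False)
print(messages[0]["content"])  # -> detailed thinking off
```

The messages would then be fed through the tokenizer's chat template and the model's `generate()` call as with any other Llama-family checkpoint.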