Llama 3.3 Nemotron Super 49B v1 is a large language model developed by NVIDIA as a derivative of Meta's Llama 3.3 70B Instruct. It was designed to provide a balance between accuracy and efficiency, specifically optimized to fit on a single NVIDIA H100 or H200 GPU. The model was created using a novel Neural Architecture Search (NAS) approach, which reduced the parameter count from 70B to approximately 49B while maintaining performance levels close to the original reference model.
The architecture is a dense decoder-only Transformer with non-standard, non-repetitive blocks. Through the NAS process, certain attention layers are skipped or replaced with linear layers, and the expansion ratios in the Feed-Forward Network (FFN) layers vary from block to block. This customization yields significantly higher throughput and a reduced memory footprint compared to the standard Llama 3.3 architecture.
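To make the parameter savings concrete, here is a minimal, purely illustrative sketch (the block kinds, sizes, and ratios below are hypothetical, not NVIDIA's actual NAS configuration) of how skipping attention, replacing it with a linear map, and varying FFN expansion ratios shrink a decoder stack relative to a uniform one:

```python
def block_params(hidden, attn_kind, ffn_ratio):
    """Rough parameter count for one Transformer block (weights only).

    attn_kind: "full"   -> four projection matrices (Q, K, V, O)
               "linear" -> attention replaced by a single linear map
               "skip"   -> attention removed entirely
    ffn_ratio: FFN intermediate size as a multiple of hidden.
    """
    if attn_kind == "full":
        attn = 4 * hidden * hidden
    elif attn_kind == "linear":
        attn = hidden * hidden
    else:  # "skip"
        attn = 0
    # Gated FFN as in Llama: up, gate, and down projections.
    ffn = 3 * hidden * int(ffn_ratio * hidden)
    return attn + ffn

hidden = 1024  # toy size for illustration only

# Uniform baseline: every block has full attention and the same FFN ratio.
uniform = sum(block_params(hidden, "full", 3.5) for _ in range(16))

# Heterogeneous (NAS-style) stack: mixed attention kinds, varying ratios.
configs = [("full", 3.5), ("skip", 5.0), ("linear", 2.0), ("full", 1.0)] * 4
hetero = sum(block_params(hidden, kind, r) for kind, r in configs)

print(f"uniform: {uniform/1e6:.1f}M  heterogeneous: {hetero/1e6:.1f}M")
# -> uniform: 243.3M  heterogeneous: 182.5M
```

The same accounting at Llama-scale hidden sizes is how a search over per-block choices can trade roughly 30% of the parameters for near-reference accuracy.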
The model is a dual-mode system supporting both intensive reasoning and standard instruction-following tasks, toggled via the system prompt. In non-reasoning mode it functions as a high-efficiency general-purpose assistant for chat, Retrieval-Augmented Generation (RAG), and tool calling; NVIDIA recommends greedy decoding in this mode for optimal performance.
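As a sketch of how the toggle works in practice, the helper below builds a chat message list with the mode switch in the system prompt. The control phrase "detailed thinking on"/"detailed thinking off" follows NVIDIA's published model card, but treat the exact wording as an assumption and verify it against the current card before relying on it:

```python
def build_messages(user_prompt, reasoning=False):
    """Build a chat message list toggling Nemotron's reasoning mode.

    Assumes the model card's system-prompt switch: "detailed thinking on"
    enables reasoning mode, "detailed thinking off" disables it.
    """
    mode = "on" if reasoning else "off"
    return [
        {"role": "system", "content": f"detailed thinking {mode}"},
        {"role": "user", "content": user_prompt},
    ]

# Non-reasoning mode: pass do_sample=False (and leave temperature/top_p
# unset) to transformers' generate() to get the recommended greedy decoding.
messages = build_messages("Summarize the plot of Hamlet.", reasoning=False)
print(messages[0]["content"])  # -> detailed thinking off
```

The messages would then be fed through the tokenizer's chat template and the model's `generate()` call as with any other Llama-family checkpoint.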