Open Weights

Llama Nemotron Super 49B v1.5 (Non-reasoning)

Released Jul 2025

Llama-3.3-Nemotron-Super-49B-v1.5 is a large language model developed by NVIDIA and derived from Meta's Llama 3.3 70B. It was created using Neural Architecture Search (NAS) and block-wise distillation to reduce the memory footprint while preserving the accuracy of a 70B-class model. The resulting architecture contains non-standard blocks, including skipped attention layers and variable Feed-Forward Network (FFN) ratios, chosen so that inference fits on a single GPU such as an NVIDIA H100 or H200.
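The single-GPU claim can be sanity-checked with rough weight-memory arithmetic. The sketch below is illustrative only: the 49B parameter count is taken from the model name, the bytes-per-parameter values are the standard sizes for each precision, and KV-cache and activation overhead are ignored (real requirements are higher).

```python
# Rough weights-only memory estimate for a 49B-parameter model at common
# inference precisions. Ignores KV cache and activations, so actual
# requirements are higher; illustrative arithmetic only.
PARAMS = 49e9

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "fp8": 1.0,
    "int4": 0.5,
}

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weights-only memory in gigabytes (using 1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

for precision, bpp in BYTES_PER_PARAM.items():
    gb = weight_gb(PARAMS, bpp)
    # H100 has 80 GB of HBM; H200 has 141 GB.
    print(f"{precision}: ~{gb:.0f} GB (fits 80 GB H100 weights-only: {gb < 80})")
```

By this estimate the weights alone are ~98 GB at bf16 (H200 territory) and ~49 GB at fp8 (well within an H100), which is where the NAS-driven footprint reduction matters: the 70B parent would need roughly 140 GB at bf16 before any cache or activations.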

This "Non-reasoning" variant is tuned for direct instruction following, chat preferences, and agentic tasks such as tool calling and Retrieval-Augmented Generation (RAG). While the model family is capable of reasoning, this configuration prioritizes high throughput and efficiency for standard conversational AI applications. It underwent multi-phase post-training, including Reinforcement Learning from Human Feedback (RLHF) and Reward-aware Preference Optimization (RPO).
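As a sketch of what a tool-calling request to such a model might look like, the snippet below builds an OpenAI-style chat-completions payload. The field names follow the widely used chat-completions convention that vLLM- and NIM-style servers commonly expose; the model id, endpoint, and `search_docs` tool are hypothetical placeholders, not part of the model card.

```python
import json

# Hypothetical tool definition for a RAG-style retrieval call.
# The name, description, and schema are illustrative only.
tool = {
    "type": "function",
    "function": {
        "name": "search_docs",  # hypothetical tool, not a real API
        "description": "Retrieve relevant passages for RAG.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

# Request body in the common chat-completions format; the model id is
# an illustrative placeholder for whatever the serving endpoint expects.
payload = {
    "model": "llama-3.3-nemotron-super-49b-v1.5",
    "messages": [{"role": "user", "content": "Who wrote the install guide?"}],
    "tools": [tool],
    "tool_choice": "auto",  # let the model decide whether to call the tool
    "max_tokens": 512,
}

print(json.dumps(payload, indent=2))
```

A server supporting this convention would respond either with an assistant message or with a `tool_calls` entry naming `search_docs` and its arguments, which the client executes and feeds back as a `tool` role message.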

The model features a context window of 131,072 tokens, allowing it to process extensive documents and long-form conversations. The NAS-derived design significantly improves the accuracy-efficiency tradeoff, placing the model at the top of benchmarks such as the Artificial Analysis Intelligence Index for its weight class. It is part of the broader Llama Nemotron collection, which aims to provide right-sized, high-performing models for enterprise deployment.
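In practice the 131,072-token window is shared between the prompt and the completion, so applications budget one against the other. A minimal sketch (the helper function is illustrative, not part of any NVIDIA API):

```python
CONTEXT_WINDOW = 131_072  # tokens, per the model card

def completion_budget(prompt_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the completion after the prompt fills part of the window."""
    if prompt_tokens > window:
        raise ValueError("prompt exceeds the context window; truncate or chunk it")
    return window - prompt_tokens

# A ~100k-token document dump still leaves room for a long answer.
print(completion_budget(100_000))  # → 31072
```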

Rankings & Comparison