NVIDIA · Open Weights

llama-3.1-nemotron-51b-instruct

Released Sep 2024 · Arena AI rank #176 · Context: 131K tokens · Parameters: 51B

Llama-3.1-Nemotron-51B-Instruct is a large language model developed by NVIDIA, designed to balance computational efficiency against model accuracy. It is a derivative of Meta's Llama-3.1-70B-Instruct, developed using a novel Neural Architecture Search (NAS) approach combined with knowledge distillation.

The model's architecture was refined through a block-wise distillation process where various configurations were tested to find the optimal throughput-to-accuracy ratio. Key architectural features include Variable Grouped Query Attention (VGQA), which allows for a different number of key-value heads in each block, and skip-attention layers where certain attention mechanisms are replaced with linear layers to reduce complexity. These optimizations allow the 51B model to fit on a single NVIDIA H100 (80GB) GPU even under heavy workloads.
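To make the VGQA idea concrete, here is a minimal NumPy sketch of grouped-query attention for one block. The function name, shapes, and head counts are illustrative assumptions, not NVIDIA's implementation; the point is only that the number of key/value heads is a per-block parameter, which is what "variable" GQA changes from block to block.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """Single-block GQA sketch (illustrative, not NVIDIA's code).

    q: (seq, n_q_heads, d); k, v: (seq, n_kv_heads, d).
    Each group of n_q_heads // n_kv_heads query heads attends
    against one shared key/value head. VGQA lets n_kv_heads
    differ from one transformer block to the next.
    """
    seq, _, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # index of the shared KV head for this query head
        scores = (q[:, h, :] @ k[:, kv, :].T) / np.sqrt(d)
        out[:, h, :] = softmax(scores) @ v[:, kv, :]
    return out
```

Fewer KV heads shrink a block's KV cache by a factor of n_q_heads / n_kv_heads; a "skip-attention" block goes further by replacing the attention call entirely with a plain linear projection of the hidden state.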

Following the architectural pruning, the model underwent knowledge distillation on a 40-billion-token dataset drawn from FineWeb, Buzz-V1.2, and Dolma. This phase focused on aligning the model for English single-turn and multi-turn chat use cases, so that it retains human-preference alignment similar to its larger 70B predecessor while providing significantly higher inference speeds.
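The distillation objective can be sketched as the standard temperature-scaled soft-target loss: the student's output distribution is pushed toward the teacher's via a KL divergence. This is a generic illustration of knowledge distillation, not NVIDIA's actual training recipe; the temperature value and T² scaling follow the common Hinton-style convention.

```python
import numpy as np

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target KL divergence KL(teacher || student), temperature-scaled.

    Illustrative sketch of generic knowledge distillation; the real
    Nemotron recipe is not public in this detail. Both inputs have
    shape (batch, vocab). The T**2 factor keeps gradient magnitudes
    comparable across temperatures (Hinton et al. convention).
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    log_p_t = log_softmax(teacher_logits / T)
    log_p_s = log_softmax(student_logits / T)
    p_t = np.exp(log_p_t)
    return float((p_t * (log_p_t - log_p_s)).sum(axis=-1).mean() * T * T)
```

The loss is zero when student and teacher produce identical logits and grows as their distributions diverge, which is what lets the pruned 51B student recover the 70B teacher's chat behavior on a relatively small (40B-token) distillation corpus.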
