Open Weights

DeepSeek V3 (Dec '24)

DeepSeek-V3 is a large-scale Mixture-of-Experts (MoE) language model released in December 2024. It comprises 671 billion total parameters, with approximately 37 billion activated for each token during inference. The model is designed to achieve performance comparable to leading proprietary models in coding, mathematics, and general reasoning tasks while maintaining high training and inference efficiency.
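The gap between 671B total and ~37B activated parameters comes from sparse expert routing: for each token, the router picks only a few experts, so compute scales with the number of selected experts, not the total. A minimal sketch of top-k routing (toy sizes; DeepSeek-V3's actual config uses 256 routed experts with 8 selected per token, plus a shared expert):

```python
NUM_EXPERTS = 8  # routed experts in this toy layer (V3 itself uses 256)
TOP_K = 2        # experts activated per token (V3 itself uses 8)

def top_k_experts(scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# One token's router scores over the experts (illustrative values).
scores = [0.1, 2.3, -0.5, 1.7, 0.0, 0.9, -1.2, 0.4]
chosen = top_k_experts(scores)  # → [1, 3]
# Only the chosen experts run their FFN for this token, so per-token
# compute tracks TOP_K, not NUM_EXPERTS.
```

Because only the selected experts' feed-forward networks execute, a token "sees" roughly TOP_K/NUM_EXPERTS of the expert parameters, which is how a 671B-parameter model can cost ~37B parameters of compute per token.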

The architecture builds on Multi-head Latent Attention (MLA) and the DeepSeekMoE framework. Key innovations include an auxiliary-loss-free load-balancing strategy, which keeps expert utilization even without the performance trade-offs of traditional auxiliary loss terms, and a Multi-Token Prediction (MTP) objective that densifies the training signal and enables speculative decoding for faster inference.
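The auxiliary-loss-free idea can be paraphrased as follows (a sketch of my understanding, not the actual implementation): each expert carries a bias that is added to its router score only when *selecting* experts, never when weighting their outputs. After each step, biases of overloaded experts are nudged down and those of underloaded experts nudged up, steering future routing toward balance without an auxiliary loss term polluting the gradient.

```python
GAMMA = 0.01  # bias update speed (hyperparameter; value here is illustrative)

def select_experts(scores, bias, k):
    """Pick top-k experts by score + balancing bias (bias affects selection only)."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

def update_bias(bias, loads, target):
    """Nudge each expert's bias against its observed load.

    loads[i]: tokens routed to expert i this step; target: the ideal
    per-expert load under perfect balance.
    """
    return [b - GAMMA if load > target else b + GAMMA
            for b, load in zip(bias, loads)]
```

The key design point is that the bias never touches the gating weights used to combine expert outputs, so balancing pressure does not directly distort the model's predictions the way a weighted auxiliary loss can.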

DeepSeek-V3 was pre-trained on a corpus of 14.8 trillion tokens using an FP8 mixed-precision training framework. This approach allowed for efficient scaling on large GPU clusters by reducing memory and communication overhead. Following pre-training, the model underwent supervised fine-tuning and reinforcement learning, which included knowledge distillation from the DeepSeek-R1 series to enhance its logic and reasoning capabilities.
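FP8 training stores activations and weights in an 8-bit floating-point format with a separate scale factor, cutting memory and communication roughly in half relative to BF16. A rough illustration of the scaled-quantization idea (not DeepSeek's actual framework, which uses fine-grained blockwise scaling and real FP8 hardware casts; the integer rounding below is just a stand-in for the low-precision cast):

```python
E4M3_MAX = 448.0  # largest magnitude representable in the FP8 E4M3 format

def quantize_block(block):
    """Scale a block so its max magnitude maps to E4M3_MAX, then round."""
    amax = max(abs(x) for x in block) or 1.0
    scale = E4M3_MAX / amax
    q = [round(x * scale) for x in block]  # crude stand-in for the FP8 cast
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate original values from quantized data plus scale."""
    return [x / scale for x in q]

q, s = quantize_block([0.5, -1.0, 2.0])
restored = dequantize_block(q, s)  # ≈ [0.5, -1.0, 2.0]
```

Keeping the scale per small block rather than per tensor limits how much a single outlier value degrades the precision of its neighbors, which is the motivation behind fine-grained scaling schemes.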

The model supports a context window of 128K tokens and is provided under a license that permits commercial use. It demonstrates strong capabilities in STEM fields, and its technical report cites a full training cost of roughly 2.788 million H800 GPU hours, notably low for its total parameter scale.
