Kimi Linear 48B A3B Instruct is a large language model developed by Moonshot AI that combines a Mixture-of-Experts (MoE) backbone with a hybrid attention stack. Its distinguishing component is Kimi Delta Attention (KDA), a hardware-efficient linear attention module whose cost scales linearly with input length, interleaved with full-attention Multi-Head Latent Attention (MLA) layers. This combination is designed for extremely long contexts: the model supports windows of up to 1 million tokens while cutting key-value (KV) cache memory requirements by up to 75%.

## Architecture and Efficiency
The model has 48 billion total parameters, of which 3 billion are active per forward pass. By employing a 3:1 ratio of KDA layers to global (full) attention layers, it achieves up to 6x faster decoding throughput at long context lengths than comparable softmax-attention models. This design sidesteps the quadratic scaling bottleneck of standard transformers, making the model efficient for long-horizon reasoning and large-scale document synthesis.

## Capabilities
The Instruct variant is fine-tuned to follow complex user instructions and to handle multi-turn conversations. It is well suited to long-context retrieval and agentic workflows. Reported evaluations indicate that it matches or exceeds full-attention models on both short- and long-context benchmarks while offering substantial speed and memory savings for high-volume inference.
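The 75% KV-cache figure follows directly from the 3:1 layer ratio described above: only the global-attention layers (one in four) keep a per-token KV cache, while KDA layers carry a fixed-size recurrent state instead, so cache memory drops to roughly a quarter of an all-full-attention stack. A minimal sketch of that arithmetic (the layer counts and head dimensions below are illustrative placeholders, not the model's actual configuration):

```python
def kv_cache_bytes(n_caching_layers: int, seq_len: int,
                   n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size: two tensors (K and V) per layer that caches."""
    return 2 * n_caching_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Illustrative configuration (NOT the real Kimi Linear hyperparameters).
total_layers = 48
seq_len = 1_000_000
n_kv_heads, head_dim = 8, 128

# Baseline: every layer is full attention and caches K/V.
baseline = kv_cache_bytes(total_layers, seq_len, n_kv_heads, head_dim)

# Hybrid: with a 3:1 KDA:global ratio, only 1 in 4 layers keeps a KV cache.
hybrid = kv_cache_bytes(total_layers // 4, seq_len, n_kv_heads, head_dim)

print(f"reduction: {1 - hybrid / baseline:.0%}")  # reduction: 75%
```

The reduction depends only on the fraction of caching layers, not on the sequence length or head dimensions, which is why the 3:1 ratio yields 75% regardless of context size.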
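For intuition about how a linear attention layer avoids a growing KV cache, the sketch below implements a plain (ungated) delta-rule recurrence of the kind KDA builds on. It is a pedagogical simplification, not Moonshot AI's implementation: it omits KDA's gating and the hardware-efficient chunked formulation, and processes tokens one at a time to make the constant-size state explicit.

```python
import numpy as np

def delta_rule_attention(q, k, v, beta):
    """Plain delta-rule linear attention, one token at a time.

    q, k, v: (seq_len, d) arrays; beta: (seq_len,) write strengths in [0, 1].
    The only recurrent state is S, a fixed-size (d, d) matrix, so memory
    is constant in sequence length -- unlike a softmax KV cache, which
    grows by one (k, v) pair per token.
    """
    d = q.shape[1]
    S = np.zeros((d, d))  # fast-weight state
    outputs = []
    for t in range(q.shape[0]):
        k_t, v_t, b_t = k[t], v[t], beta[t]
        # Delta rule: replace the value currently stored under key k_t
        # with v_t, at write strength b_t.
        v_old = S @ k_t
        S = S + b_t * np.outer(v_t - v_old, k_t)
        outputs.append(S @ q[t])
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, d = 16, 8
out = delta_rule_attention(rng.standard_normal((T, d)),
                           rng.standard_normal((T, d)),
                           rng.standard_normal((T, d)),
                           np.full(T, 0.5))
print(out.shape)  # (16, 8)
```

Because each step touches only the (d, d) state, decoding cost per token is constant in context length; this is the property the hybrid architecture exploits to reach its reported long-context throughput gains.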