Hunyuan-TurboS is a large-scale hybrid language model developed by Tencent, built on an architecture that combines Transformer and Mamba2 components within a Mixture of Experts (MoE) framework. Designed for both high-performance reasoning and low-latency inference, the model features an adaptive chain-of-thought mechanism that distinguishes between "fast-thinking" for rapid responses and "slow-thinking" for complex problem-solving in areas like mathematics and science.
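To make the fast/slow distinction concrete, here is a minimal sketch of an adaptive routing step. The keyword heuristic below is purely illustrative; Hunyuan-TurboS's actual gating between fast and slow thinking is learned, not rule-based.

```python
# Hypothetical router deciding whether a query gets a quick answer ("fast")
# or an extended chain-of-thought pass ("slow"). The trigger list is an
# illustrative assumption, not part of the published model.
SLOW_TRIGGERS = {"prove", "derive", "integral", "theorem", "solve"}

def choose_mode(query: str) -> str:
    """Return 'slow' for queries that look like multi-step reasoning tasks,
    'fast' for everything else."""
    tokens = set(query.lower().split())
    return "slow" if tokens & SLOW_TRIGGERS else "fast"

print(choose_mode("What is the capital of France?"))   # fast
print(choose_mode("Prove that sqrt(2) is irrational"))  # slow
```

In the real model the decision is made internally per query, so simple factual prompts skip the costly reasoning trace while math and science prompts receive it.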
The model's architecture consists of 128 layers, incorporating Mamba2 for linear sequence complexity and Grouped-Query Attention (GQA) to minimize KV cache overhead. With a total of 560 billion parameters and 56 billion activated parameters per token, Hunyuan-TurboS is optimized for high efficiency, aiming to deliver response times of under one second for many standard queries while maintaining reasoning capabilities comparable to top-tier proprietary models.
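The efficiency claims above come down to two ratios: GQA shrinks the KV cache in proportion to the reduction in key/value heads, and MoE activates only a fraction of the total parameters per token. The head counts and dimensions below are illustrative assumptions, not published Hunyuan-TurboS hyperparameters; only the 560B/56B parameter split and the 128-layer count come from the text.

```python
# Back-of-the-envelope arithmetic for GQA KV-cache savings and the MoE
# activation ratio. Head counts and head_dim are assumed for illustration.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2 covers keys and values; fp16 elements by default.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

layers, head_dim, seq_len = 128, 128, 256_000
mha = kv_cache_bytes(layers, kv_heads=64, head_dim=head_dim, seq_len=seq_len)
gqa = kv_cache_bytes(layers, kv_heads=8, head_dim=head_dim, seq_len=seq_len)
print(f"GQA cache is {mha / gqa:.0f}x smaller than full MHA")  # 8x

# MoE: only a fraction of the total parameters are used per token.
total_params, active_params = 560e9, 56e9
print(f"Activated fraction per token: {active_params / total_params:.0%}")  # 10%
```

With 8 KV heads instead of 64, the cache shrinks 8x at any sequence length, and the 56B-of-560B MoE split means each token touches only 10% of the model's weights.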
Hunyuan-TurboS supports a native context window of 256,000 tokens, allowing it to ingest and process large documents or long meeting transcripts. It was pre-trained on a 16-trillion-token dataset, with post-training strategies including supervised fine-tuning and multi-stage reinforcement learning to enhance its performance in STEM and general instruction-following tasks.
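As a rough sketch of what a 256,000-token window accommodates, the snippet below estimates whether a document fits before sending it to the model. The ~4 characters-per-token ratio is a common rule of thumb for English text, not a property of Hunyuan-TurboS's tokenizer, and the output reserve is an arbitrary example value.

```python
# Hypothetical pre-check: does a long transcript plausibly fit in the
# 256K-token context window? The chars-per-token ratio is a rough
# English-text estimate, not the model's real tokenizer.
CONTEXT_WINDOW = 256_000
CHARS_PER_TOKEN = 4  # rule-of-thumb estimate for English prose

def fits_in_context(text: str, reserved_for_output: int = 4_000) -> bool:
    est_tokens = len(text) // CHARS_PER_TOKEN
    return est_tokens + reserved_for_output <= CONTEXT_WINDOW

transcript = "word " * 200_000  # ~1M characters, roughly 250K tokens
print(fits_in_context(transcript))  # True, just under the window
```

At this ratio the window holds on the order of a million characters, which is why multi-hour meeting transcripts or book-length documents can be processed in a single pass.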