Hunyuan-Large-Vision is a multimodal mixture-of-experts (MoE) model developed by Tencent, designed for high-resolution visual understanding and complex linguistic reasoning. Released in August 2025, it serves as the vision-integrated extension of the Hunyuan-Large model family. The architecture utilizes a sparse MoE structure with a total of 389 billion parameters, activating 52 billion parameters per forward pass to maintain computational efficiency while scaling capability.
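The gap between total and activated parameters comes from sparse expert routing: a router scores all experts per token but only the top-k actually run. The following is a minimal illustrative sketch of top-k MoE routing in NumPy; the expert count, dimensions, and expert functions are invented for illustration and do not reflect the model's actual configuration.

```python
import numpy as np

def moe_forward(x, experts, router_w, top_k=2):
    """Sparse MoE forward pass: route a token to its top_k experts only."""
    logits = x @ router_w                        # router scores, one per expert
    top = np.argsort(logits)[-top_k:]            # indices of the top-k experts
    g = np.exp(logits[top] - logits[top].max())  # softmax over selected experts
    gates = g / g.sum()
    # Only the selected experts execute, so "active" params << total params.
    return sum(w * experts[i](x) for w, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8                             # toy sizes, purely illustrative
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda x, W=W: np.tanh(x @ W) for W in weights]
router_w = rng.standard_normal((d, n_experts))

x = rng.standard_normal(d)
y = moe_forward(x, experts, router_w, top_k=2)
print(y.shape)  # (16,)
```

With top_k=2 of 8 experts, only a quarter of the expert parameters touch any given token, which is the same principle behind activating 52 billion of 389 billion parameters.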
The model integrates a custom 1-billion-parameter Vision Transformer (ViT) for image processing, linked to the primary MoE language backbone via a connector module. This unified framework enables the model to process and interpret images, videos, and 3D spatial data at variable resolutions. Training used a data pipeline covering over 400 billion multimodal samples, with rejection sampling and automated data refinement applied to ensure high-quality instruction following.
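The ViT-plus-connector design can be sketched as a three-stage flow: patchify and embed the image, project the visual features into the language model's embedding space, and splice the result into the text token sequence. The sketch below uses random stand-in weights and toy dimensions; the function names, patch size, and embedding widths are assumptions for illustration, not the model's actual components.

```python
import numpy as np

rng = np.random.default_rng(1)

def vit_encode(image, patch=4, d_vis=32):
    """Toy 'ViT': split the image into patches and linearly embed each.
    (A real ViT stacks transformer layers on top; omitted for brevity.)"""
    H, W, C = image.shape
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    proj = rng.standard_normal((patch * patch * C, d_vis)) / np.sqrt(patch * patch * C)
    return p @ proj                              # (n_patches, d_vis)

def connector(vis_tokens, d_model=64):
    """MLP connector projecting vision features into the LLM embedding space.
    Weights here are random stand-ins for trained parameters."""
    W1 = rng.standard_normal((vis_tokens.shape[1], d_model))
    W2 = rng.standard_normal((d_model, d_model))
    return np.maximum(vis_tokens @ W1, 0) @ W2   # (n_patches, d_model)

image = rng.standard_normal((16, 16, 3))         # stand-in for a variable-resolution input
text_emb = rng.standard_normal((5, 64))          # 5 text-token embeddings

vis_emb = connector(vit_encode(image))           # vision tokens in LLM space
sequence = np.concatenate([vis_emb, text_emb])   # unified multimodal sequence
print(sequence.shape)  # (21, 64)
```

The MoE backbone then attends over this mixed sequence exactly as it would over text alone, which is what lets one framework cover images, video frames, and other visual inputs.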
Hunyuan-Large-Vision demonstrates strong proficiency in multilingual scene understanding, OCR, and mathematical reasoning. At release, it achieved top-tier rankings on multimodal benchmarks such as the LMArena Vision Leaderboard and the OpenCompass Academic Benchmark. The model is closed-source and accessible via API for enterprise-grade applications.