Alibaba
Open Weights

Fun-Realtime-TTS-Preview

Released May 2026

Fun-Realtime-TTS-Preview is a real-time text-to-speech (TTS) model developed by Alibaba's Tongyi Lab as part of the FunAudioLLM project. Designed for low-latency, streaming voice synthesis, it generates natural, human-like speech for interactive applications such as digital assistants, real-time translation, and accessibility tools. In May 2026, it gained international recognition by ranking as the top Chinese-engineered voice system on the Artificial Analysis Speech Arena leaderboard.

The model's main distinguishing feature is its broad language coverage. It supports over 30 languages and is specifically optimized for complex Chinese speech patterns, covering seven major Chinese dialect families and more than 20 regional accents. This addresses a long-standing bottleneck in Asian voice technology: models trained on standard Mandarin often struggle with the nuances of regional speech and vernacular.

The model's performance rests on a bi-streaming architecture, in which both the text input and the audio output are streamed. This design lets the model begin generating audio almost as soon as text tokens arrive, with reported latency as low as 150 ms in optimized environments. The model is integrated into Alibaba's broader ecosystem, including the Qwen assistant, DingTalk, and Gaode Maps, where it powers services such as smart navigation and automated meeting minutes.
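To make the bi-streaming idea concrete, the following is a minimal sketch in Python, not the model's actual API. The names `bistream_tts` and `fake_synthesize`, the `min_chars` threshold, and the 16 kHz sample rate are all hypothetical and chosen only for illustration; the "synthesizer" simply emits silence. What it shows is the control flow: audio chunks are yielded while text tokens are still arriving, rather than after the full text is available.

```python
from typing import Iterable, Iterator

SAMPLE_RATE = 16_000   # assumed sample rate for this toy example
BYTES_PER_SAMPLE = 2   # 16-bit PCM


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a TTS model: 50 ms of silence per character."""
    samples = int(SAMPLE_RATE * 0.05) * len(text)
    return b"\x00" * (samples * BYTES_PER_SAMPLE)


def bistream_tts(tokens: Iterable[str], min_chars: int = 8) -> Iterator[bytes]:
    """Yield audio chunks while text tokens are still arriving (illustrative only)."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush as soon as enough text is buffered, without waiting for end of input.
        if len(buffer) >= min_chars:
            yield fake_synthesize(buffer)
            buffer = ""
    # Synthesize whatever text remains once the input stream ends.
    if buffer:
        yield fake_synthesize(buffer)
```

In this sketch, a caller would iterate over `bistream_tts(token_stream)` and pass each chunk to an audio player as it arrives. A smaller `min_chars` lowers the time to first audio but gives the synthesizer less context per chunk; a real system would have to balance the same trade-off.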

In addition to high-fidelity synthesis, the model supports instruction following for controlling prosody, emotion, and speaker timbre. In benchmark evaluations, it has scored highly on content consistency and prosodic naturalness, outperforming several contemporary Western and domestic alternatives in blind user preference tests.

Rankings & Comparison