Qwen3-TTS is a family of text-to-speech models developed by Alibaba's Qwen team and released in January 2026. The series targets high-fidelity speech synthesis, with native support for zero-shot voice cloning and natural language voice design. A dual-track streaming architecture enables real-time interaction: the models can deliver the first audio packet after processing a single character, with an end-to-end latency as low as 97ms.
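To make the latency figure concrete, the following is a minimal sketch of how time-to-first-packet could be measured for any streaming synthesizer. It does not use the Qwen3-TTS API: `simulated_stream` is a hypothetical stand-in that yields audio packets, and the packet sizes and delays are illustrative assumptions.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def first_packet_latency(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Consume a packet stream; return (seconds until first packet, total bytes).

    The latency is None if the stream yields no packets.
    """
    start = clock()
    latency = None
    total_bytes = 0
    for packet in stream:
        if latency is None:
            # Only the first packet determines the perceived response time.
            latency = clock() - start
        total_bytes += len(packet)
    return latency, total_bytes


def simulated_stream(num_packets: int, delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS backend (not a real API)."""
    for _ in range(num_packets):
        time.sleep(delay_s)      # pretend the model is generating audio
        yield b"\x00" * 320      # arbitrary packet size, for illustration only


if __name__ == "__main__":
    latency, size = first_packet_latency(simulated_stream(5, 0.01))
    print(f"first packet after {latency * 1000:.1f} ms, {size} bytes total")
```

Passing the clock as a parameter keeps the measurement testable; in a real deployment the stream would come from whatever streaming interface the serving stack exposes.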
The architecture is built on a discrete multi-codebook language model framework rather than a traditional diffusion-based system. This design is powered by the Qwen3-TTS-Tokenizer-12Hz, which compresses audio into discrete tokens while preserving paralinguistic details such as intonation, rhythm, and emotional nuance. The series is released in two scales: a 1.7B-parameter flagship variant for peak performance and a 0.6B-parameter lightweight version optimized for efficient deployment.
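As a rough illustration of what a low-rate multi-codebook tokenizer implies for sequence length, the sketch below computes how many token frames and discrete tokens a clip would occupy. It assumes that the "12Hz" in the tokenizer's name denotes about 12 token frames per second of audio; the number of codebooks is not stated in this text, so `num_codebooks` is a caller-supplied, hypothetical parameter.

```python
import math

# Assumption: "12Hz" in the tokenizer name means ~12 token frames per second.
ASSUMED_FRAME_RATE_HZ = 12.0


def frames_for_duration(seconds: float, frame_rate_hz: float = ASSUMED_FRAME_RATE_HZ) -> int:
    """Number of token frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


def tokens_for_duration(
    seconds: float,
    num_codebooks: int,
    frame_rate_hz: float = ASSUMED_FRAME_RATE_HZ,
) -> int:
    """Total discrete tokens if every frame carries one token per codebook."""
    return frames_for_duration(seconds, frame_rate_hz) * num_codebooks


if __name__ == "__main__":
    # Hypothetical codebook count, chosen only for illustration.
    print(frames_for_duration(3.0))                   # frames in a 3 s clip
    print(tokens_for_duration(3.0, num_codebooks=4))  # tokens across 4 codebooks
```

Under these assumptions each frame spans roughly 83 ms of audio, which is why a low frame rate keeps sequences short enough for a language model to generate in real time.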
Key Features and Control
A central capability of Qwen3-TTS is its three-second voice cloning, which replicates a speaker's identity with high fidelity from a reference sample only three seconds long. It also introduces Voice Design, which creates custom personas from descriptive prompts such as "a calm, scholarly voice with a slight accent." The model is natively multilingual, supporting more than 10 languages including Chinese, English, Japanese, Korean, German, and French. It also performs robustly in cross-lingual synthesis, where a cloned voice speaks a different language fluently while keeping its characteristic timbre.
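A cloning pipeline typically validates the reference clip before passing it to the model. The sketch below checks a WAV file's duration with Python's standard `wave` module. Treating three seconds as a minimum length is an assumption drawn from the "three-second" figure above, and `make_silent_wav` is a hypothetical helper that exists only to make the example self-contained.

```python
import io
import wave

# Assumption: the "three-second" figure is treated here as a minimum clip length.
MIN_REFERENCE_SECONDS = 3.0


def wav_duration_seconds(wav_file) -> float:
    """Duration of a WAV file given as a path or a binary file-like object."""
    with wave.open(wav_file, "rb") as reader:
        return reader.getnframes() / reader.getframerate()


def is_long_enough(wav_file, min_seconds: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the reference clip is at least `min_seconds` long."""
    return wav_duration_seconds(wav_file) >= min_seconds


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Hypothetical helper: build an in-memory mono 16-bit silent WAV clip."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)  # rewind so the clip can be read back
    return buffer


if __name__ == "__main__":
    print(is_long_enough(make_silent_wav(3.0)))  # long enough under the assumption
    print(is_long_enough(make_silent_wav(1.5)))  # too short under the assumption
```

A real pipeline would likely also check sample rate, channel count, and signal level, none of which are specified in this text.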
Users can control the generated output through natural language instructions that adjust acoustic attributes such as emotion and prosody. For instance, prompting the model to "speak joyfully" or "whisper softly" lets the semantic understanding of the base LLM shape the final vocal delivery. The models are released under the Apache 2.0 license, making them suitable for both research and commercial integration into conversational agents and digital content workflows.