Gradium

Gradium TTS

Released Mar 2026

AA Arena: #43
Parameters: 100M

Gradium TTS is a suite of generative audio models designed for high-fidelity, ultra-low-latency speech synthesis. Developed by research engineers from Kyutai and Google, the models emphasize natural prosody and expressive delivery, and target conversational AI applications where Time To First Audio (TTFA) is a critical performance metric. The architecture is built to support full-duplex interactions, allowing for natural interruptions and overlapping speech similar to human-to-human dialogue.
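Since TTFA is the metric highlighted above, a minimal sketch of how it could be measured against any streaming response may be useful. Nothing here is Gradium's API: `measure_ttfa` and `fake_stream` are hypothetical names, and the stream is a stand-in that sleeps before yielding silent bytes. The only assumption is that the client exposes audio as an iterable of byte chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttfa(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Return (seconds until the first non-empty chunk, total bytes received).

    The timer starts when this function is called, so pass in a lazy stream
    whose request is only issued once iteration begins.
    """
    start = clock()
    ttfa: Optional[float] = None
    total = 0
    for chunk in chunks:
        # Empty keep-alive chunks do not count as audio.
        if chunk and ttfa is None:
            ttfa = clock() - start
        total += len(chunk)
    return ttfa, total


def fake_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not a real client)."""
    time.sleep(delay_s)  # simulated model and network latency
    for _ in range(n_chunks):
        yield b"\x00" * 320  # 10 ms of 16 kHz, 16-bit mono silence


if __name__ == "__main__":
    ttfa, total = measure_ttfa(fake_stream())
    print(f"TTFA: {ttfa * 1000:.1f} ms, {total} bytes")
```

Injecting the clock keeps the measurement testable without real delays; in production the default `time.perf_counter` is the appropriate monotonic timer.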

The platform provides a range of deployment options, including a managed Cloud API and specialized edge models. A notable variant is Phonon, an on-device model with approximately 100 million parameters, designed to maintain high speaker similarity and low word error rates in resource-constrained environments. Gradium models are natively multilingual, offering consistent voice identity and prosody across multiple languages and handling mid-sentence code-switching seamlessly.

Key Capabilities and Controls

Gradium TTS gives developers several controls over the audio output. It supports word-level timestamps for tight text-audio synchronization in visual applications. Developers can use specialized tags such as <flush> to force immediate audio generation and <break time> to insert precise pauses between 0.1 and 2.0 seconds. The system also supports custom pronunciation dictionaries for brand names, technical jargon, and acronyms.
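The tag controls described above could be wrapped in a small helper that enforces the 0.1 to 2.0 second pause range before text is sent. This is a sketch under stated assumptions: the source only names the tags as <flush> and <break time>, so the SSML-style attribute form `<break time="0.5s"/>` used here is a guess at the syntax, and `break_tag` and `build_prompt` are hypothetical helper names.

```python
from typing import Sequence

FLUSH_TAG = "<flush>"  # tag name taken from the text above
MIN_BREAK_S = 0.1      # documented lower bound for pauses
MAX_BREAK_S = 2.0      # documented upper bound for pauses


def break_tag(seconds: float) -> str:
    """Build a pause tag, rejecting values outside the documented range.

    ASSUMPTION: the attribute syntax (time="...s") is illustrative only.
    """
    if not MIN_BREAK_S <= seconds <= MAX_BREAK_S:
        raise ValueError(
            f"pause must be between {MIN_BREAK_S} and {MAX_BREAK_S} s, got {seconds}"
        )
    return f'<break time="{seconds:g}s"/>'


def build_prompt(
    segments: Sequence[str],
    pause_s: float = 0.5,
    flush_after_first: bool = True,
) -> str:
    """Join text segments with pauses; optionally flush after the first one
    so that audio for the opening segment is generated immediately."""
    pause = break_tag(pause_s)
    parts = []
    for i, segment in enumerate(segments):
        if i > 0:
            parts.append(pause)
        parts.append(segment.strip())
        if i == 0 and flush_after_first:
            parts.append(FLUSH_TAG)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt(["Hello.", "How can I help?"]))
```

Validating the pause on the client side surfaces out-of-range values as a clear error rather than relying on however the service treats them.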

For voice customization, the models enable instant voice cloning from as little as 10 seconds of reference audio. More advanced "Pro Voice Clones" are available through fine-tuning, providing high speaker similarity for professional-grade replicas. The underlying engine is optimized for high-concurrency production workloads, maintaining stable latency and predictable behavior even at scale.
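Before submitting reference audio for an instant clone, it can help to confirm locally that the clip meets the roughly 10 second figure mentioned above. The sketch below uses only Python's standard `wave` module and assumes an uncompressed WAV file; treating 10 seconds as a hard minimum is an assumption drawn from the "as little as 10 seconds" wording, and the function names are hypothetical.

```python
import wave

# ASSUMPTION: "as little as 10 seconds" is treated here as a minimum length.
MIN_REFERENCE_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """Check that a reference clip is at least `minimum` seconds long."""
    return wav_duration_seconds(path) >= minimum
```

A check like this only covers length; it says nothing about noise, clipping, or other qualities that may affect speaker similarity.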

Rankings & Comparison