Inworld

Realtime TTS 1.5 Max

Released Jan 2026

Realtime TTS 1.5 Max is a proprietary text-to-speech (TTS) model developed by Inworld for expressive, low-latency voice synthesis. It is the primary, higher-quality variant in the TTS 1.5 family, alongside the faster "Mini" version, and balances realistic voice generation with the responsiveness that conversational applications require. The model is built to power interactive media, live translation, and voice-enabled AI agents by producing human-like speech with emotional nuance.

The model introduces significant architectural improvements over its predecessors, achieving a 40% reduction in word error rate, which means fewer audio artifacts, mispronunciations, and instances of unnatural pacing. Realtime TTS 1.5 Max provides native-quality synthesis across 15 languages, including English, Spanish, French, German, Mandarin Chinese, Japanese, and Hindi. It also supports professional voice cloning, allowing users to generate custom voices from 5 to 15 seconds of reference audio.
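
To make the "40% reduction in word error rate" figure concrete, here is a minimal sketch of how word error rate (WER) and a relative reduction are commonly computed. The source does not say how Inworld measures WER. A common approach for TTS is to transcribe the synthesized audio with a speech recognizer and compare the transcript to the input text, and that is what this sketch assumes. The function names and the example numbers are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate is lower than old_rate."""
    if old_rate <= 0:
        raise ValueError("old_rate must be positive")
    return (old_rate - new_rate) / old_rate


# Hypothetical example: input text vs. a transcript of the synthesized audio.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")

# Hypothetical rates: going from 5% WER to 3% WER is a 40% relative reduction.
reduction = relative_reduction(0.05, 0.03)
```

Note that a 40% reduction is relative: it describes the ratio between the old and new error rates, not a drop of 40 percentage points.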

Optimized for real-time performance, Realtime TTS 1.5 Max achieves a P90 time-to-first-audio latency of under 250ms (with median latencies around 200ms), enabling uninterrupted bidirectional interactions. The model also provides granular synchronization, outputting timestamp alignments at the character, word, phoneme, and viseme levels, which are often used to animate digital avatars. Users can further tune the generated audio by adjusting synthesis parameters such as temperature, which controls voice expressiveness, and speaking rate. Output is available in standard audio formats, including MP3, OPUS, WAV, and FLAC.
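
The latency figures above distinguish a P90 value from a median. As a rough illustration of the difference, the sketch below computes both from a list of time-to-first-audio measurements. The sample values are invented for illustration and are not Inworld benchmark data, and the nearest-rank percentile method is an assumption, since the source does not state how its P90 is calculated.

```python
import math
import statistics


def percentile_nearest_rank(samples: list[float], p: float) -> float:
    """Return the p-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical time-to-first-audio measurements in milliseconds.
latencies_ms = [180, 190, 195, 200, 200, 205, 210, 220, 240, 310]

p90_ms = percentile_nearest_rank(latencies_ms, 90)  # 9 of 10 requests are at or below this
median_ms = statistics.median(latencies_ms)         # half of the requests are at or below this
```

A P90 figure is the stricter of the two: one slow outlier (310ms in the sample data) does not move it, but it still reflects the slower end of typical requests, which is what a user in a live conversation is likely to notice.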

Rankings & Comparison