Realtime TTS 1.5 Mini by Inworld: Benchmarks, Rankings & Model Details

Realtime TTS 1.5 Mini is a lightweight, low-latency text-to-speech model developed by Inworld. Released as part of the TTS-1.5 generation of voice models, it is optimized for highly latency-sensitive applications such as conversational AI, interactive voice agents, and real-time translation. The model prioritizes fast audio generation while retaining human-like naturalness, with a 90th-percentile (P90) time-to-first-audio latency under 130 milliseconds.
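To make the P90 figure concrete, the following is a minimal sketch of how a 90th-percentile time-to-first-audio value could be computed from a set of measurements. The sample latencies are made up for illustration and are not Inworld benchmark data. The nearest-rank percentile method is also an assumption, since the page does not state how the published figure was calculated.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical time-to-first-audio measurements in milliseconds (illustrative only).
ttfa_ms = [92, 101, 97, 118, 125, 88, 110, 143, 95, 104]

p90 = percentile_nearest_rank(ttfa_ms, 90)
print(f"P90 time-to-first-audio: {p90} ms")
```

A P90 below 130 ms means that nine out of ten requests start returning audio within that budget, while the slowest tenth may take longer.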

The model brings substantial upgrades over Inworld's previous TTS generations, including a reported 40% reduction in word error rate and a reported 30% increase in expressiveness. It supports speech synthesis in 15 languages, including English, Spanish, French, Chinese, Arabic, and Hindi. In addition to a built-in library of more than 130 preset voices, Realtime TTS 1.5 Mini offers instant zero-shot voice cloning, which requires only 5 to 15 seconds of reference audio to replicate a target voice.
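For readers unfamiliar with the metric, word error rate (WER) for a TTS model is typically measured by transcribing the synthesized audio with a speech recognizer and comparing the transcript to the input text. The sketch below computes WER with a word-level edit distance and shows what a 40% reduction would look like if the figure is relative. That reading is an assumption, and the example rates are illustrative rather than reported values.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance computed over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_rate, new_rate):
    """Fractional improvement of new_rate over old_rate."""
    return (old_rate - new_rate) / old_rate


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")

# Illustrative rates only: dropping from 5% to 3% WER is a 40% relative reduction.
reduction = relative_reduction(0.05, 0.03)
```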

The model exposes granular control over generation settings, including speaking rate, temperature (which governs variance in expressiveness), and output sample rates up to 48 kHz. Optional text normalization expands numbers, dates, and abbreviations before synthesis. Realtime TTS 1.5 Mini can also return word-level or character-level timestamps alongside the audio, providing the metadata needed to synchronize lip movements or generate closed captions.
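As an illustration of the closed-captioning use case, the sketch below turns word-level timestamps into SubRip (SRT) caption cues. The input shape, a list of `(word, start_seconds, end_seconds)` tuples, is a hypothetical stand-in and not the documented Inworld response format. The function names and the `max_words` grouping rule are likewise assumptions made for this example.

```python
def format_srt_time(seconds):
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=4):
    """Group (word, start, end) tuples into numbered SRT cues of up to max_words words."""
    cues = []
    for index, offset in enumerate(range(0, len(words), max_words), start=1):
        group = words[offset:offset + max_words]
        text = " ".join(word for word, _, _ in group)
        begin = format_srt_time(group[0][1])   # start of the first word in the cue
        end = format_srt_time(group[-1][2])    # end of the last word in the cue
        cues.append(f"{index}\n{begin} --> {end}\n{text}")
    if not cues:
        return ""
    return "\n\n".join(cues) + "\n"


# Hypothetical word-level timestamps in seconds (illustrative only).
sample_words = [("Hello", 0.0, 0.4), ("world", 0.45, 0.9)]
print(words_to_srt(sample_words))
```

Grouping by a fixed word count keeps the example short. A production captioner would more likely break cues on punctuation, pauses, or a maximum line length.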

Realtime TTS 1.5 Mini

Rankings & Comparison
