Inworld TTS 1.5 Mini by Inworld: Benchmarks, Rankings & Model Details

Inworld TTS 1.5 Mini is a fast, cost-efficient text-to-speech model built by Inworld for ultra-low-latency applications. Released in early 2026 as part of the TTS 1.5 family, it is optimized for real-time interactive experiences such as gaming and conversational AI agents. The model achieves a P90 time-to-first-audio latency of under 130 ms and a median latency of approximately 120 ms, a significant speed improvement over previous generations.
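To make the two latency figures concrete, the sketch below shows one way a median and a P90 could be computed from a set of time-to-first-audio measurements. The function name latency_summary and the sample values are illustrative placeholders, not measurements of the model, and the percentile method (linear interpolation over the observed samples) is one common convention among several.

```python
import statistics

def latency_summary(samples_ms):
    """Return the median and 90th percentile of latency samples (milliseconds)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=10) returns the 9 decile cut points; the last one is P90.
    deciles = statistics.quantiles(samples_ms, n=10, method="inclusive")
    return {
        "median_ms": statistics.median(samples_ms),
        "p90_ms": deciles[-1],
    }

# Placeholder values for illustration only (not real benchmark data).
samples = [112, 118, 120, 121, 119, 125, 128, 117, 122, 124, 120]
summary = latency_summary(samples)
print(summary)
```

A P90 close to the median, as in the figures quoted above, indicates a tight latency distribution: most requests start producing audio at nearly the same speed.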

The model uses a Transformer-based autoregressive architecture, often referred to as a SpeechLM. Compared to its predecessors, the 1.5 version provides roughly 30% greater expressiveness and a 40% reduction in word error rate (WER), leading to more stable and natural-sounding speech with fewer artifacts or hallucinations. It supports 15 languages, including English, Spanish, Japanese, Chinese, Hindi, Arabic, and Hebrew, across a library of over 130 preset voices.
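For readers unfamiliar with the metric, WER for a speech model is commonly estimated by transcribing the synthesized audio and comparing the transcript with the input text. The sketch below is a minimal version of that comparison using word-level edit distance. The helper names are hypothetical, the text normalization is deliberately simple (lowercasing and whitespace splitting), and relative_reduction assumes the 40% figure is a relative reduction, which the text above does not state explicitly.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(old_wer, new_wer):
    """Fractional improvement of new_wer over old_wer (0.4 means 40% lower)."""
    return (old_wer - new_wer) / old_wer

print(word_error_rate("the quick brown fox", "the quick fox"))
print(relative_reduction(0.05, 0.03))
```

Under the relative reading, a model that previously scored a WER of 5% would score about 3% after a 40% reduction; the 5% starting point here is only an example value.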

Key Capabilities

Zero-Shot Voice Cloning: Users can generate a custom voice by providing a 5–15 second reference audio sample.
Precision Timestamps: The model provides word-level, phoneme-level, and viseme-level timestamps, making it suitable for high-fidelity lip-sync and character animation synchronization.
Streaming Support: It is built for real-time WebSocket streaming, allowing audio to be played as soon as the first chunks are synthesized.
Expression and Style Control: The model supports emotional markup and non-verbal vocalizations such as [sigh], [laugh], and [breathe], allowing for more nuanced character personalities.
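As an illustration of how the timestamp capability might be used for lip-sync, the sketch below looks up which timed unit (word, phoneme, or viseme) is active at a given playback time. The TimedUnit structure, its field names, and the viseme labels are assumptions made for this example; the actual format in which the model returns timestamps is not described here.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence

class TimedUnit(NamedTuple):
    """Hypothetical timestamp record: a label with start and end times in seconds."""
    label: str
    start_s: float
    end_s: float

def active_unit(units: Sequence[TimedUnit], t: float) -> Optional[TimedUnit]:
    """Return the unit whose [start_s, end_s) interval contains t, or None.

    Assumes units are sorted by start time and do not overlap.
    """
    starts = [u.start_s for u in units]
    i = bisect_right(starts, t) - 1  # last unit starting at or before t
    if i < 0:
        return None
    unit = units[i]
    return unit if t < unit.end_s else None

# Made-up viseme track for illustration.
track = [
    TimedUnit("sil", 0.00, 0.10),
    TimedUnit("AA", 0.10, 0.25),
    TimedUnit("M", 0.30, 0.40),
]
hit = active_unit(track, 0.12)
print(hit.label if hit else "none")
```

An animation loop could call active_unit once per frame with the current audio playback position and fall back to a neutral mouth shape when it returns None, for example in the gap between 0.25 s and 0.30 s in the track above.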
