Inworld TTS 1.5 Max is a flagship text-to-speech model designed for real-time voice applications. Released as a significant upgrade to Inworld's voice AI stack, the model aims to balance natural expressiveness with low latency. It typically achieves a median time-to-first-audio (TTFA) under 200 milliseconds and a P90 TTFA under 250 milliseconds, making it suitable for interactive conversational agents and gaming NPCs.
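To make the latency figures concrete, the following is a minimal sketch of how a median and P90 could be computed from a set of TTFA measurements and compared against the thresholds quoted above. The function names and the sample values are hypothetical and for illustration only; they are not part of any Inworld SDK and are not measured results.

```python
import statistics


def latency_summary(ttfa_ms):
    """Return the median and 90th percentile of TTFA samples (milliseconds)."""
    if len(ttfa_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=10) returns the nine cut points that split the data into
    # deciles; the last cut point is the 90th percentile (P90).
    deciles = statistics.quantiles(ttfa_ms, n=10, method="inclusive")
    return {"median": statistics.median(ttfa_ms), "p90": deciles[-1]}


def meets_targets(summary, median_target_ms=200.0, p90_target_ms=250.0):
    """Check a summary against the thresholds quoted in the text."""
    return (summary["median"] < median_target_ms
            and summary["p90"] < p90_target_ms)


if __name__ == "__main__":
    # Made-up sample values, not measurements of any real system.
    samples = [150, 160, 170, 180, 190, 200, 210, 220, 230, 240]
    summary = latency_summary(samples)
    print(summary, meets_targets(summary))
```

P90 is reported alongside the median because a real-time application is usually limited by its slowest common responses rather than its typical one.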
The model supports 15 languages, including English, Chinese, Japanese, Korean, and Hindi, and offers high-fidelity instant voice cloning from brief audio references. Compared to previous versions, version 1.5 delivers a 30% improvement in emotional expressiveness and a 40% reduction in word error rate (WER). It is built on a streaming-native architecture that supports delivery over WebSocket, so audio can be sent as it is generated and buffering delays are avoided.
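For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard calculation along with a relative-reduction helper. It is a generic illustration, not Inworld's evaluation pipeline, and it assumes the quoted 40% figure is a relative reduction, which the source does not specify.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer, new_wer):
    """Fractional improvement of new_wer over old_wer (0.4 means 40%)."""
    if old_wer <= 0:
        raise ValueError("old_wer must be positive")
    return (old_wer - new_wer) / old_wer
```

Under that reading, a baseline WER of 5% falling to 3% would be a 40% relative reduction; these numbers are purely illustrative and are not published figures for either model version.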
Technical features include enhanced timestamp metadata that provides phonetic details and visemes for precise lip-sync and captioning. On independent benchmarks such as the Artificial Analysis TTS Leaderboard, which ranks models through blind user evaluations, Inworld TTS 1.5 Max has held top rankings for human-like quality and naturalness.
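As an illustration of how timestamp metadata of this kind is typically consumed, the sketch below looks up which viseme (or caption word) is active at a given playback time. The `TimedEvent` structure and its field names are hypothetical stand-ins; the actual schema returned by the Inworld API is not described here and may differ.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedEvent:
    """Hypothetical timed entry; not the real API schema."""
    start_s: float
    end_s: float
    label: str  # for example a viseme ID or a caption word


def active_event(events, t):
    """Return the event covering playback time t, or None if there is none.

    Assumes events are sorted by start_s and do not overlap.
    """
    starts = [e.start_s for e in events]
    # Index of the last event that starts at or before t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < events[i].end_s:
        return events[i]
    return None
```

A renderer could call `active_event` once per animation frame with the current audio clock to choose a mouth shape, and the same lookup applied to word-level entries would drive caption highlighting.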