Inworld TTS 1 is a generative text-to-speech model family developed by Inworld AI, optimized for real-time interactive applications such as gaming and conversational agents. The model employs a Speech-Language Model (SpeechLM) architecture: an autoregressive Transformer backbone converts text into a stream of audio tokens, which a neural decoder then reconstructs into audio. It is engineered for low latency, targeting a median time of approximately 200 milliseconds to the first audio chunk.
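The streaming design described above can be sketched as follows. This is a minimal illustration, not Inworld's implementation: `toy_lm_step` and `toy_decoder` are hypothetical stand-ins for the Transformer backbone and neural decoder, and the token/frame sizes are assumptions. The point it demonstrates is that audio chunks are emitted as soon as enough tokens accumulate, before generation of the full utterance finishes.

```python
import numpy as np

FRAME_TOKENS = 4         # audio tokens per decoded chunk (assumed)
SAMPLES_PER_TOKEN = 960  # e.g. 20 ms of audio per token at 48 kHz (assumed)

def toy_lm_step(context):
    """Autoregressively predict the next audio token (dummy: hash of context)."""
    return (sum(context) * 31 + len(context)) % 1024

def toy_decoder(tokens):
    """Reconstruct a waveform chunk from discrete audio tokens (dummy sine)."""
    t = np.arange(len(tokens) * SAMPLES_PER_TOKEN) / 48_000.0
    freq = 100 + (sum(tokens) % 400)
    return np.sin(2 * np.pi * freq * t).astype(np.float32)

def stream_tts(text, n_tokens=12):
    """Yield audio chunks as tokens accumulate, rather than after generation ends."""
    context = [ord(c) % 1024 for c in text]  # text tokens seed the context
    buffer = []
    for _ in range(n_tokens):
        token = toy_lm_step(context)
        context.append(token)
        buffer.append(token)
        if len(buffer) == FRAME_TOKENS:
            yield toy_decoder(buffer)  # first chunk leaves early: low latency
            buffer = []
    if buffer:
        yield toy_decoder(buffer)

chunks = list(stream_tts("hello"))
```

Because the decoder runs on fixed-size token frames, time-to-first-audio depends only on how fast the first few tokens are generated, not on utterance length.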
The standard Inworld TTS 1 model features 1.6 billion parameters and is built upon a LLaMA-3.2-1B backbone, while a larger "Max" variant utilizes an 8.8B parameter backbone. The architecture supports high-resolution 48 kHz audio output and offers zero-shot voice cloning, allowing custom voices to be created from short audio samples. At launch, the models supported 11 languages with a focus on prosodic consistency and emotional expression.
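Zero-shot cloning of the kind described above is commonly implemented by summarizing a short reference clip into a fixed speaker embedding that conditions synthesis. The sketch below illustrates that general pattern only; `speaker_embedding` and `synthesize` are hypothetical placeholders (crude signal statistics and a sine generator), not Inworld's components.

```python
import numpy as np

SR = 48_000  # the model outputs 48 kHz audio

def speaker_embedding(reference_wave, dim=8):
    """Collapse a short reference clip into a fixed-size conditioning vector."""
    frames = reference_wave[: len(reference_wave) // dim * dim].reshape(dim, -1)
    return frames.mean(axis=1)  # per-segment statistics as a crude voice signature

def synthesize(text, embedding):
    """Generate a (dummy) waveform conditioned on text length and the speaker vector."""
    duration = 0.05 * len(text)              # naive fixed-rate duration model
    t = np.arange(int(duration * SR)) / SR
    pitch = 120 + 100 * float(np.tanh(embedding.sum()))  # embedding shifts pitch
    return np.sin(2 * np.pi * pitch * t).astype(np.float32)

ref = np.random.default_rng(0).standard_normal(SR)  # stands in for a 1 s sample
emb = speaker_embedding(ref)
wave = synthesize("custom voice", emb)
```

The key property is that no per-voice training occurs: the same synthesis function serves any voice, with the reference clip's embedding supplied at inference time.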
The training of Inworld TTS 1 involved a three-stage framework, including large-scale pre-training on approximately 1 million hours of audio, supervised fine-tuning, and reinforcement learning alignment. The alignment stage utilized Group Relative Policy Optimization (GRPO) to optimize the model against perceptual quality metrics, improving stability and reducing word error rates in generated speech. The model also supports "audio markups," which provide fine-grained control over non-verbal vocalizations and emotional nuances like whispering.
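The core of GRPO is a group-relative advantage: several candidate utterances are sampled for one prompt, each is scored by a reward (here imagined as a perceptual quality score minus a word-error penalty, which is an assumption about the reward design), and each sample's advantage is its reward standardized against the group's mean and standard deviation, with no learned value model.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each sampled utterance's reward
    against the mean and std of its own sampling group (GRPO's key step)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards a zero-spread group

# Hypothetical rewards for 4 utterances sampled from one prompt,
# e.g. quality_score - wer_penalty for each candidate.
rewards = [0.9, 0.7, 0.4, 0.8]
adv = grpo_advantages(rewards)
```

Advantages sum to zero within a group, so the policy gradient pushes probability toward above-average samples (fewer word errors, higher perceived quality) and away from below-average ones.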