Inworld TTS 1 Max is a generative text-to-speech (TTS) model designed for high-resolution, expressive audio synthesis. It is a Transformer-based autoregressive model that uses a LLaMA-3.1-8B backbone as its Speech-Language Model (SpeechLM) component. As the high-performance variant of the Inworld TTS-1 family, it is optimized for contextually aware speech that captures subtle nuances in tone and prosody.
Architecture and Training
The model has 8.8 billion parameters and pairs the SpeechLM with a high-resolution audio codec built on the X-codec2 architecture, which lets it generate 48 kHz audio natively. It was trained in three stages: large-scale pre-training on over 1 million hours of audio-text data, supervised fine-tuning (SFT) on 200,000 hours of high-quality speech, and reinforcement learning alignment using Group Relative Policy Optimization (GRPO). The alignment stage is designed to reduce word error rate and hallucinations while maintaining high speaker similarity.
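The core idea of GRPO, scoring each sampled output relative to the other samples drawn for the same prompt, can be illustrated with a minimal sketch. The normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage. The reward function, its weights, and the sample values below are hypothetical placeholders chosen for illustration, not the rewards actually used to train this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead. eps avoids division by zero when
    every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_reward(wer, speaker_similarity, wer_weight=1.0, sim_weight=1.0):
    """Hypothetical reward: favor high speaker similarity and low WER.

    This combination is an assumption for illustration only.
    """
    return sim_weight * speaker_similarity - wer_weight * wer


# Four hypothetical syntheses of the same text: (WER, speaker similarity).
samples = [(0.02, 0.91), (0.10, 0.88), (0.00, 0.93), (0.25, 0.80)]
rewards = [toy_reward(w, s) for w, s in samples]
advantages = group_relative_advantages(rewards)

for (wer, sim), adv in zip(samples, advantages):
    print(f"WER={wer:.2f} sim={sim:.2f} advantage={adv:+.3f}")
```

Samples with above-average reward in their group get a positive advantage and are reinforced, while below-average samples are discouraged. Because the baseline is the group mean, no separate value network is needed.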
Key Capabilities
Inworld TTS 1 Max supports zero-shot voice cloning from 2 to 15 seconds of reference audio. It provides fine-grained control over vocal delivery through "voice tags," which let users specify speaking styles, non-verbal sounds, and emotions such as whispering, coughing, or surprise. The model supports approximately 12 languages, including English, Chinese, Korean, French, and Spanish, and is engineered for low-latency performance suitable for real-time conversational AI applications.
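To make the voice-tag idea concrete, here is a small sketch of how tagged input text might be split into spoken segments and tags before synthesis. The square-bracket syntax, the tag names, and the `split_voice_tags` helper are assumptions made for illustration; they are not taken from the model's documented input format or API.

```python
import re

# Assumed tag syntax: a lowercase name in square brackets, e.g. "[cough]".
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def split_voice_tags(text):
    """Split text into ("tag", name) and ("text", spoken) segments, in order."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        spoken = text[pos:match.start()].strip()
        if spoken:
            segments.append(("text", spoken))
        segments.append(("tag", match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = "[whispering] Keep your voice down. [cough] Sorry about that."
segments = split_voice_tags(example)
for kind, value in segments:
    print(f"{kind}: {value}")
```

In this sketch a tag simply marks a position in the text. Whether a given tag applies to the following phrase (a style such as whispering) or stands for a sound of its own (such as a cough) is up to the model's interpretation.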