xAI

xAI Text to Speech

Released Mar 2026

xAI Text to Speech is a high-fidelity neural voice synthesis model designed for low-latency, expressive audio generation. Developed as part of xAI's media engine, the model is optimized for real-time conversational applications, providing sub-second latency suitable for full-duplex voice agents. It supports more than 25 languages and offers a diverse set of pre-configured vocal profiles, including the voices Ara, Eve, Leo, Rex, and Sal.

The model supports speech tags, which give granular control over the delivery and prosody of the generated audio. Users can insert inline markers such as [pause] and [laugh] to add natural pauses and laughter, or use wrapping tags like <whisper>, <slow>, and <build-intensity> to modify the emotional tone and pacing of specific text segments. These tags let the model handle complex storytelling and nuanced dialogue beyond plain text recitation.
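As an illustration of how tagged input might be assembled, here is a minimal Python sketch. The tag names come from the description above; the helper functions `inline` and `wrap`, and the closing-tag form `</tag>` for wrapping tags, are assumptions made for this example and are not confirmed by the source.

```python
# Tag names taken from the description above; everything else is hypothetical.
INLINE_TAGS = {"pause", "laugh"}
WRAPPING_TAGS = {"whisper", "slow", "build-intensity"}


def inline(tag: str) -> str:
    """Return an inline marker such as [pause]."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap a text segment in a tag; the </tag> closing form is an assumption."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag}")
    return f"<{tag}>{text}</{tag}>"


script = " ".join([
    "The door creaked open.",
    inline("pause"),
    wrap("whisper", "Is anyone there?"),
])
print(script)
```

Validating tag names before sending text for synthesis is one way to catch typos early, since a misspelled tag would otherwise likely be read aloud or silently ignored.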

The model is designed to operate within a unified speech-to-speech stack. It integrates directly with xAI’s language models and transcription services to minimize the end-to-end latency typical of cascaded AI systems. The engine supports several output formats, including high-fidelity MP3 and WAV, as well as telephony-oriented encodings such as G.711 (μ-law/A-law) and PCM, for compatibility across web, mobile, and telecommunication platforms.
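For readers unfamiliar with the G.711 μ-law option, the following Python sketch shows the continuous μ-law companding curve (μ = 255) that underlies it. This is the textbook formula only, not the bit-exact 8-bit segment encoding defined by G.711, and it says nothing about how xAI produces its output; the function names are chosen for this example.

```python
import math

MU = 255  # companding parameter used by mu-law telephony audio


def mu_law_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the mu-law curve (continuous form)."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("sample must be in [-1, 1]")
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Invert mu_law_compress for a value in [-1, 1]."""
    if not -1.0 <= y <= 1.0:
        raise ValueError("value must be in [-1, 1]")
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


quiet = 0.01
compressed = mu_law_compress(quiet)
restored = mu_law_expand(compressed)
print(compressed, restored)
```

The curve boosts quiet samples before quantization, which is why 8 bits per sample can be adequate for speech on telephone networks.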

While the underlying architecture remains proprietary, the model is engineered for high-volume inference and robust text normalization, which lets it accurately convert abbreviations, dates, and specialized terminology into spoken form. The same technology powers voice features in the Grok assistant and is integrated into broader ecosystems, including Tesla vehicle software and Starlink customer support interfaces.
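To make the idea of text normalization concrete, here is a toy Python sketch that expands a few abbreviations and ISO-style dates before synthesis. It is not xAI's implementation, which is proprietary; the lookup table, the `normalize` function, and the chosen date format are all assumptions for illustration, and a production normalizer would also spell out numbers and handle far more cases.

```python
import re

# Hypothetical lookup table, for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately", "km": "kilometers"}
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Matches ISO dates such as 2026-03-05 (months 01-12 only).
DATE_RE = re.compile(r"\b(\d{4})-(0[1-9]|1[0-2])-(\d{2})\b")


def _spoken_date(match: re.Match) -> str:
    year, month, day = (int(g) for g in match.groups())
    return f"{MONTHS[month - 1]} {day}, {year}"


def normalize(text: str) -> str:
    """Expand ISO dates and known abbreviations into a more speakable form."""
    text = DATE_RE.sub(_spoken_date, text)
    for abbr, spoken in ABBREVIATIONS.items():
        # Only replace the abbreviation when it stands alone as a token.
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, spoken, text)
    return text


print(normalize("Dr. Lee ran 5 km on 2026-03-05."))
```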

Rankings & Comparison