Eleven v3 is a flagship speech foundation model developed by ElevenLabs, engineered for high-fidelity text-to-speech (TTS) and expressive audio generation. Initially released in alpha in June 2025, the model represents a significant architectural evolution from its predecessors, prioritizing emotional depth, conversational nuance, and advanced linguistic accuracy across more than 70 languages. It is designed to handle complex narrative tasks, including professional voiceovers and interactive character dialogue.
Expressive Capabilities and Audio Tags
A defining feature of Eleven v3 is its support for inline audio tags, which allow users to direct the vocal performance by inserting descriptive prompts such as [excited], [whispers], [sighs], or [laughing] directly into the script. This system enables the model to generate non-verbal cues and specific emotional inflections that were difficult to achieve in previous iterations. Furthermore, the model supports multi-speaker dialogue, allowing it to maintain distinct voice identities and manage conversational flow, such as interruptions and reactions, within a single generation.
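The bracketed tag syntax above can be assembled programmatically. The sketch below shows one way to build a multi-speaker, tagged script as a plain string; the helper functions and the "Speaker:" labeling convention are illustrative assumptions, while the inline [tag] format itself comes from the description above.

```python
# Build a multi-speaker script with inline audio tags for Eleven v3.
# The bracketed tag syntax ([excited], [whispers], ...) is the documented
# format; the helpers and speaker-label convention are hypothetical.

def tagged_line(speaker: str, text: str, *tags: str) -> str:
    """Prefix a dialogue line with a speaker label and inline audio tags."""
    parts = [f"[{t}]" for t in tags] + [text]
    return f"{speaker}: " + " ".join(parts)

def build_script(lines: list[tuple]) -> str:
    """Join (speaker, text, *tags) tuples into a single script string."""
    return "\n".join(tagged_line(s, txt, *tags) for s, txt, *tags in lines)

script = build_script([
    ("Alice", "You won't believe what happened.", "excited"),
    ("Bob", "Tell me everything.", "whispers"),
    ("Alice", "Okay, so here it goes.", "laughing"),
])
print(script)
# Alice: [excited] You won't believe what happened.
# Bob: [whispers] Tell me everything.
# Alice: [laughing] Okay, so here it goes.
```

Because the tags ride along inside the text itself, the same script string can be passed to the model unchanged, which is what makes interruptions and reactions expressible within a single generation.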
Technical Performance and Prompting
The model incorporates substantial improvements in text normalization, significantly reducing error rates when vocalizing specialized notation such as chemical formulas, phone numbers, and mathematical expressions. Eleven v3 is also trained for stronger contextual understanding, enabling it to distinguish between different uses of the same symbol (for example, a colon in a sports score versus a time format) and adjust its prosody accordingly. For the best results, users are encouraged to select a base voice that naturally aligns with the desired tone, since the effectiveness of audio tags depends heavily on the vocal characteristics of the underlying voice profile.
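A tagged script is submitted like any other text through the ElevenLabs text-to-speech REST endpoint. The sketch below builds such a request; the `/v1/text-to-speech/{voice_id}` route and `xi-api-key` header are the documented API shape, but the `eleven_v3` model identifier is an assumption to verify against current documentation, and the voice ID is a placeholder.

```python
import json
import os
import urllib.request

# Placeholder voice ID: pick a base voice whose character matches the
# desired tone, since tag effectiveness depends on the voice profile.
VOICE_ID = "YOUR_VOICE_ID"
MODEL_ID = "eleven_v3"  # assumed v3 model identifier; confirm in the docs

payload = {
    # Inline audio tags and mixed symbol usage (score vs. time) in one line.
    "text": "[excited] The final score was 3:1, announced at 3:15 PM.",
    "model_id": MODEL_ID,
}

def build_request(api_key: str) -> urllib.request.Request:
    """Assemble a POST to the documented /v1/text-to-speech/{voice_id} route."""
    return urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Only send the request if a key is actually configured.
    key = os.environ.get("ELEVENLABS_API_KEY")
    if key:
        with urllib.request.urlopen(build_request(key)) as resp:
            with open("output.mp3", "wb") as out:
                out.write(resp.read())
```

Keeping the tags and the prose in a single `text` field means the prosody cues travel with the exact symbols the normalizer must interpret, which is where the contextual disambiguation described above comes into play.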