Rime

Arcana v3

Released Feb 2026

Arcana v3 is a flagship text-to-speech (TTS) model developed by Rime AI, officially launched in February 2026. Designed for real-time conversational AI and enterprise applications, the model focuses on generating expressive, human-like speech with natural cadence, emotional resonance, and paralinguistic nuances such as breathing, pacing, and rhythm. Arcana v3 offers a diverse portfolio of over 90 distinct flagship voices that capture different age groups, regional accents, and tonal energies suited for use cases ranging from customer support to media narration.

A primary capability of Arcana v3 is native multilingual code-switching. The model supports 11 languages, including English, Spanish, French, German, Japanese, Arabic, and Hindi, and lets a single voice switch between them mid-utterance while preserving the speaker's accent, prosody, and natural flow across language boundaries. Because it handles multiple languages natively, Arcana v3 removes the need to route requests through separate language-specific TTS models, which consolidates infrastructure.

Architecture and Performance

Arcana v3 uses a multimodal, autoregressive architecture that generates discrete audio tokens from text input. These tokens are then decoded into high-fidelity speech by a specialized high-resolution audio codec. The system is engineered to capture subtle acoustic details and emergent conversational behaviors while sustaining high generation concurrency.
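The two-stage flow described above (autoregressive generation of discrete audio tokens, then decoding to a waveform) can be illustrated with a toy Python sketch. This is not Rime's implementation: `CODEBOOK_SIZE`, `EOS_TOKEN`, `toy_next_token`, and `decode_tokens` are hypothetical stand-ins, and the "model" here is just a seeded random number generator used to show the shape of the loop.

```python
import random
from typing import Callable, List

CODEBOOK_SIZE = 1024  # assumed size of the discrete audio-token vocabulary
EOS_TOKEN = 0         # assumed end-of-sequence token id


def generate_audio_tokens(
    text: str,
    next_token: Callable[[str, List[int]], int],
    max_tokens: int = 200,
) -> List[int]:
    """Autoregressive loop: each token depends on the text and all prior tokens."""
    tokens: List[int] = []
    for _ in range(max_tokens):
        tok = next_token(text, tokens)
        if tok == EOS_TOKEN:
            break
        tokens.append(tok)
    return tokens


def toy_next_token(text: str, history: List[int]) -> int:
    """Hypothetical stand-in for a trained model (deterministic per text and position)."""
    if len(history) >= 2 * len(text):
        return EOS_TOKEN  # stop after two tokens per input character
    rng = random.Random(f"{text}|{len(history)}")
    return rng.randint(1, CODEBOOK_SIZE - 1)


def decode_tokens(tokens: List[int], samples_per_token: int = 4) -> List[float]:
    """Hypothetical stand-in for a codec decoder: token ids to samples in [-1, 1]."""
    samples: List[float] = []
    for tok in tokens:
        value = (tok / (CODEBOOK_SIZE - 1)) * 2.0 - 1.0
        samples.extend([value] * samples_per_token)
    return samples


if __name__ == "__main__":
    audio_tokens = generate_audio_tokens("hi", toy_next_token)
    waveform = decode_tokens(audio_tokens)
    print(len(audio_tokens), len(waveform))
```

In a real system the token predictor would be a trained neural network and the decoder a neural audio codec. The sketch only shows why generation is sequential (each step consumes the history) while decoding can proceed as tokens become available.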

Built for real-time production environments, the model achieves a time to first byte (TTFB) of approximately 120 milliseconds out of the engine, and roughly 200 milliseconds via standard API endpoints. This low latency supports dynamic conversational interactions, allowing for mid-utterance control and natural user interruptions (barge-in) without awkward silences. The model also provides three supporting features: robust text normalization for complex numerical values and abbreviations, word-level timestamps, and adjustable sampling rates for telephony and web audio requirements.
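To check a TTFB figure like the ones quoted above against your own setup, one option is to time the arrival of the first non-empty audio chunk from a streaming response. The sketch below is a generic measurement helper, not part of any Rime SDK: `measure_ttfb` and `fake_stream` are hypothetical names, and the stream is simulated with `time.sleep` rather than a real network call.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttfb(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received).

    `chunks` should be lazy (for example a generator wrapping a streaming
    response) so that the work starts when iteration begins.
    """
    start = clock()
    ttfb = None
    total = 0
    for chunk in chunks:
        if chunk and ttfb is None:
            ttfb = clock() - start  # first audio bytes have arrived
        total += len(chunk)
    if ttfb is None:
        raise ValueError("stream produced no audio bytes")
    return ttfb, total


def fake_stream(first_delay_s: float, n_chunks: int) -> Iterator[bytes]:
    """Simulated audio stream: waits, then yields fixed-size chunks."""
    time.sleep(first_delay_s)
    for _ in range(n_chunks):
        yield b"\x00" * 320  # 320 bytes = 20 ms of 8 kHz, 16-bit mono PCM


if __name__ == "__main__":
    ttfb, total = measure_ttfb(fake_stream(0.05, 5))
    print(f"TTFB: {ttfb * 1000:.0f} ms, bytes: {total}")
```

Measured this way, TTFB includes network round-trip and client overhead, which is consistent with the API figure being higher than the engine-only figure.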

Rankings & Comparison