Sonic 3 is a foundation text-to-speech (TTS) model developed by Cartesia and optimized for real-time conversational AI. It is built on a State Space Model (SSM) architecture rather than a traditional transformer, which lets it maintain conversational context and emotional state with lower computational overhead.
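To illustrate why a state space model can carry context cheaply, the following minimal sketch runs a scalar discrete linear recurrence, h_t = a*h_(t-1) + b*x_t with output y_t = c*h_t. It is a textbook toy, not Cartesia's architecture: the function name `ssm_scan` and the parameters `a`, `b`, and `c` are assumptions made for illustration only.

```python
from typing import List, Sequence


def ssm_scan(a: float, b: float, c: float, inputs: Sequence[float]) -> List[float]:
    """Run a scalar discrete linear state space recurrence over a sequence.

    Hypothetical toy model, not the Sonic 3 implementation:
        h_t = a * h_(t-1) + b * x_t
        y_t = c * h_t
    """
    h = 0.0  # fixed-size state that summarizes all inputs seen so far
    outputs: List[float] = []
    for x in inputs:
        h = a * h + b * x      # state update: constant work per step
        outputs.append(c * h)  # readout from the current state
    return outputs


if __name__ == "__main__":
    # An impulse followed by silence decays geometrically through the state.
    print(ssm_scan(0.5, 1.0, 2.0, [1.0, 0.0, 0.0]))
```

Each step touches only the fixed-size state `h`, so per-step cost does not grow with the length of the history, whereas standard transformer attention revisits earlier positions.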
The model supports 42 languages and offers granular control over prosody, including volume, speed, and emotion. It can generate human-like nuances, such as natural laughter, which can be triggered via specific API tags or SSML. This makes it suitable for applications ranging from customer service agents to expressive digital characters.
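As a rough illustration of SSML-based prosody control, the sketch below builds a string using the `<speak>` and `<prosody>` elements from the W3C SSML specification. The helper name `prosody_ssml` is hypothetical, and whether Sonic 3 accepts these exact elements and attribute values is an assumption. Vendor-specific tags for emotion or laughter are not shown because the text above does not specify them.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody_ssml(text: str, rate: str = "medium", volume: str = "medium") -> str:
    """Wrap text in W3C-style SSML prosody markup.

    Hypothetical helper: the element and attribute names follow the W3C SSML
    specification, which may differ from the tag set Sonic 3 actually accepts.
    """
    body = escape(text)  # escape &, <, > so the text cannot break the markup
    return (
        "<speak>"
        f"<prosody rate={quoteattr(rate)} volume={quoteattr(volume)}>"
        f"{body}"
        "</prosody>"
        "</speak>"
    )


if __name__ == "__main__":
    print(prosody_ssml("Thanks for calling, how can I help?", rate="fast", volume="loud"))
```

Escaping the text before embedding it matters in practice, since user-supplied input containing `&` or `<` would otherwise produce malformed markup.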
Sonic 3 emphasizes low latency, with a reported time-to-first-audio as low as 90 milliseconds. This responsiveness is intended to support fluid, back-and-forth dialogue in interactive applications. The model also supports voice cloning and includes a library of specialized voices optimized for either stability or emotive performance.
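Time-to-first-audio can be checked on the client side by timing how long a streaming response takes to yield its first non-empty audio chunk. The sketch below assumes the audio arrives as an iterable of byte chunks. Both `time_to_first_audio` and the simulated `fake_stream` are hypothetical stand-ins, not part of any Cartesia SDK, and the 90 ms delay in the demo simply mirrors the figure reported above.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def time_to_first_audio(chunks: Iterable[bytes]) -> Tuple[Optional[float], int]:
    """Return (seconds until the first non-empty chunk, total bytes received).

    The first element is None if the stream never yields any audio.
    """
    start = time.perf_counter()
    first: Optional[float] = None
    total = 0
    for chunk in chunks:
        if chunk and first is None:
            first = time.perf_counter() - start  # latency to first audio
        total += len(chunk)
    return first, total


def fake_stream(delay_s: float = 0.09, n_chunks: int = 3, chunk_size: int = 320) -> Iterator[bytes]:
    """Simulated audio stream (assumption): waits, then yields silent chunks."""
    time.sleep(delay_s)  # stands in for network and model latency
    for _ in range(n_chunks):
        yield b"\x00" * chunk_size


if __name__ == "__main__":
    ttfa, total = time_to_first_audio(fake_stream())
    print(f"time to first audio: {ttfa * 1000:.1f} ms, bytes received: {total}")
```

A measurement taken this way includes network round-trip time, so it will generally be higher than a server-side figure.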