Cartesia

Sonic English (Oct '24)

Released Oct 2024

Sonic English (Oct '24) is a low-latency text-to-speech model developed by Cartesia. It is a specialized version of the company's flagship Sonic engine, optimized for English speech generation. Unlike many contemporary audio models that rely on Transformer architectures, the Sonic family is built using State Space Models (SSMs). This architectural choice is designed to provide greater efficiency and ultra-low latency, enabling real-time conversational interactions with a time-to-first-audio (TTFA) typically under 100 milliseconds.
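
To make the TTFA figure concrete, the sketch below shows one way such a number could be measured for any streaming text-to-speech client. It is not Cartesia's API: `measure_ttfa` and `fake_tts_stream` are hypothetical names introduced here, and the fake stream only simulates a server that waits briefly before emitting audio chunks. The assumption is that the client exposes audio as an iterator of byte chunks and that the request is issued when iteration begins.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttfa(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, bytes]:
    """Return (seconds until the first non-empty chunk, that chunk).

    The clock starts just before iteration begins, so for a lazy stream
    the measurement includes the time the producer takes to start up.
    """
    start = clock()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return clock() - start, chunk
    raise RuntimeError("stream ended without producing audio")


def fake_tts_stream(delay_s: float = 0.02, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not a real client)."""
    time.sleep(delay_s)        # simulated model and network latency
    for _ in range(chunks):
        yield b"\x00" * 320    # placeholder bytes standing in for audio


if __name__ == "__main__":
    ttfa, first_chunk = measure_ttfa(fake_tts_stream())
    print(f"TTFA: {ttfa * 1000:.1f} ms, first chunk: {len(first_chunk)} bytes")
```

In a real measurement the result also depends on network distance and client buffering, so a figure obtained this way is an end-to-end number rather than the model's latency alone.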

Architecture and Performance

The model builds on the efficiency of SSMs (such as Mamba), which scale near-linearly with sequence length. At inference time, each generation step updates a fixed-size state, so the per-step cost stays constant regardless of how much audio has already been generated. This allows the model to generate high-resolution audio while maintaining low computational overhead. The October 2024 version reflects refinements in prosody and naturalness, aimed at producing speech that better captures human-like nuances in tone and rhythm.
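
The constant per-step cost of an SSM can be illustrated with a toy linear recurrence. This is a minimal sketch of the general technique, not Sonic's architecture: the matrices `A`, `B`, `C` and the functions `ssm_step` and `ssm_run` are illustrative assumptions, and production models such as Mamba use learned, input-dependent parameters and far larger states. The recurrence is x_t = A x_{t-1} + B u_t with output y_t = C x_t.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Vector = Sequence[float]


def ssm_step(
    state: Vector, u: float, A: Matrix, B: Vector, C: Vector
) -> Tuple[List[float], float]:
    """Advance the toy SSM by one step.

    The work done here depends only on the state size, not on how many
    steps came before, which is why per-step inference cost is constant.
    """
    n = len(state)
    new_state = [
        sum(A[i][j] * state[j] for j in range(n)) + B[i] * u
        for i in range(n)
    ]
    y = sum(C[i] * new_state[i] for i in range(n))
    return new_state, y


def ssm_run(inputs: Sequence[float], A: Matrix, B: Vector, C: Vector) -> List[float]:
    """Process a whole sequence; total cost grows linearly with its length."""
    state = [0.0] * len(B)  # fixed-size state carried between steps
    outputs = []
    for u in inputs:
        state, y = ssm_step(state, u, A, B, C)
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    A = [[0.5, 0.0], [0.0, 0.25]]
    B = [1.0, 1.0]
    C = [1.0, 1.0]
    print(ssm_run([1.0, 0.0, 0.0], A, B, C))
```

By contrast, a Transformer decoding autoregressively attends over all previous positions at every step, so its per-step cost grows with the length of the sequence generated so far.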

Key Capabilities

Sonic English supports a range of expressive features, including fine-grained controls for speed, volume, and emotion. It is capable of generating non-verbal sounds, such as laughter, and supports high-fidelity voice cloning from short audio samples (as little as 10 seconds). The model is primarily designed for integration into voice agents, interactive assistants, and creative media platforms where responsive, human-like dialogue is required.

Rankings & Comparison