Chatterbox is a family of open-source text-to-speech (TTS) models developed by Resemble AI, released under the MIT license. It is designed for zero-shot voice cloning, enabling the synthesis of a target voice from just a few seconds of reference audio. The model is notable for its emotion exaggeration control, a feature that allows users to adjust the intensity and emotional tone of the generated speech, ranging from monotone to highly dramatic delivery.
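The voice cloning and exaggeration control described above can be sketched in Python. This is a minimal sketch, not an authoritative reference: it assumes the `chatterbox-tts` package is installed and exposes `ChatterboxTTS.from_pretrained` and a `generate` method accepting `audio_prompt_path` and `exaggeration`, as shown in the project's published README. The `CloneRequest` class, the `clamp_exaggeration` helper, and the 0.0 to 2.0 range are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

# Assumed bounds for illustration only; the article states just that the
# control ranges from monotone to highly dramatic delivery.
EXAGGERATION_MIN = 0.0
EXAGGERATION_MAX = 2.0


@dataclass(frozen=True)
class CloneRequest:
    """Hypothetical container for one zero-shot cloning request."""
    text: str                  # sentence to synthesize
    reference_wav: str         # path to a few seconds of target-voice audio
    exaggeration: float = 0.5  # 0.5 is assumed to be a neutral default


def clamp_exaggeration(value: float) -> float:
    """Keep the exaggeration value inside the assumed range."""
    return max(EXAGGERATION_MIN, min(EXAGGERATION_MAX, float(value)))


def synthesize(request: CloneRequest, device: str = "cpu"):
    """Generate speech in the reference voice; returns (waveform, sample_rate).

    The import is deferred so the helpers above work without the package.
    The call names below follow the project's README and are assumptions here.
    """
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(
        request.text,
        audio_prompt_path=request.reference_wav,
        exaggeration=clamp_exaggeration(request.exaggeration),
    )
    return wav, model.sr
```

A caller would build a request such as `CloneRequest("Hello there.", "speaker.wav", exaggeration=0.9)`, pass it to `synthesize`, and write the returned waveform to disk with an audio library such as torchaudio. Lower exaggeration values are expected to give flatter delivery and higher values more dramatic delivery.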
Architecture and Training
The primary Chatterbox model is built on a 500M-parameter Llama backbone and was trained on approximately 500,000 hours of cleaned audio data. It uses an alignment-informed inference process to keep output stable across sentences of varying length and complexity. To promote responsible AI use, all generated audio is marked with Resemble AI’s PerTh (Perceptual Threshold) Watermarker, which embeds an imperceptible neural watermark designed to survive audio compression and common manipulations while remaining detectable for traceability.
Capabilities and Variants
In addition to the base English model, Resemble AI released a Multilingual variant that supports 23 languages, including Spanish, French, German, Chinese, and Hindi. The family also includes Chatterbox Turbo, a more efficient variant with 350M parameters and a distilled decoder that reduces the number of generation steps, enabling real-time, low-latency synthesis. Benchmarks conducted by the creators indicate that the models perform competitively against leading proprietary speech synthesis platforms in blind evaluations.
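Real-time performance of a TTS system is commonly quantified with the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where a value below 1 means speech is generated faster than it plays back. The sketch below shows that calculation. The function names are hypothetical, and the numbers in the usage note are illustrative rather than measurements of any Chatterbox model.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if synthesis_seconds < 0:
        raise ValueError("synthesis_seconds must be non-negative")
    return synthesis_seconds / audio_seconds


def is_real_time(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when audio is produced faster than it would take to play back."""
    return real_time_factor(synthesis_seconds, audio_seconds) < 1.0
```

For example, a system that takes 0.5 seconds to produce 2 seconds of audio has an RTF of 0.25 and qualifies as real-time under this definition, while one that takes 3 seconds for the same clip does not.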