Resemble AI
Open Weights

Chatterbox HD

Released May 2025

AA Arena: #42
Parameters: 500M

Chatterbox HD is a high-fidelity text-to-speech and voice cloning model developed by Resemble AI. As a premium variant of the Chatterbox model family, it is optimized for high-resolution audio production and supports sampling rates up to 48 kHz. The architecture is built on a transformer backbone that uses Llama-derived components for speech synthesis and voice conversion.

The model's primary capability is zero-shot voice cloning: it can replicate a target speaker's voice from as little as five seconds of reference audio. It also introduces an emotion exaggeration parameter that controls the emotional intensity of the generated speech, from flat monotone to highly expressive delivery.

To address safety and authenticity, Chatterbox HD incorporates the PerTh (Perceptual Threshold) Watermarker. This neural watermarking technology embeds imperceptible data into the audio stream so that AI-generated content can be detected, and the watermark is designed to survive common editing processes such as MP3 compression and resampling.
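As a rough illustration of the two user-facing inputs described above (a reference clip of at least five seconds and an emotion exaggeration value), the sketch below prepares a cloning request using only the Python standard library. It does not call Chatterbox HD or any Resemble AI API. `CloneRequest`, `build_request`, and `make_silent_wav` are hypothetical names, and the 0.0 to 1.0 exaggeration range is an assumption, since the description does not state the parameter's bounds.

```python
import io
import wave
from dataclasses import dataclass

# From the description: cloning works from "as little as five seconds" of audio.
MIN_REFERENCE_SECONDS = 5.0


def wav_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV file (path or file-like object)."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


@dataclass
class CloneRequest:
    """Hypothetical container for illustration; not a Resemble AI type."""
    text: str
    reference_seconds: float
    exaggeration: float


def build_request(text, wav_file, exaggeration, lo=0.0, hi=1.0) -> CloneRequest:
    """Validate the reference clip and clamp the exaggeration value.

    The [lo, hi] bounds are assumed defaults, not documented limits.
    """
    duration = wav_duration_seconds(wav_file)
    if duration < MIN_REFERENCE_SECONDS:
        raise ValueError(
            f"reference clip is {duration:.1f}s; "
            f"need at least {MIN_REFERENCE_SECONDS:.0f}s"
        )
    clamped = min(max(exaggeration, lo), hi)
    return CloneRequest(text=text, reference_seconds=duration, exaggeration=clamped)


def make_silent_wav(seconds: float, rate: int = 16_000) -> io.BytesIO:
    """Create an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    request = build_request("Hello there.", make_silent_wav(6.0), exaggeration=0.7)
    print(request)
```

Checking the clip length before synthesis lets a too-short reference fail early with a clear message. The actual request format and parameter range should be taken from the model's own documentation.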

Rankings & Comparison