XTTS v2 is a multilingual generative text-to-speech (TTS) model developed by Coqui AI for speech synthesis and zero-shot voice cloning. It can replicate a target speaker's voice from a reference audio clip as short as six seconds. The model supports 17 languages, including English, Spanish, French, German, Italian, and Chinese, and is capable of cross-lingual cloning, in which a voice sampled in one language is used to generate speech in another.
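Zero-shot cloning of this kind can be sketched with the Coqui TTS Python package. The model identifier below is the published XTTS v2 checkpoint name; `speaker.wav` is a hypothetical reference clip, and the import is deferred so the sketch loads even where the package is not installed:

```python
# Published XTTS v2 checkpoint identifier in the Coqui TTS model registry.
MODEL_ID = "tts_models/multilingual/multi-dataset/xtts_v2"

def clone_voice(text: str, speaker_wav: str, language: str, out_path: str) -> None:
    """Synthesize `text` in `language`, cloning the voice in `speaker_wav`.

    Assumes the `TTS` package is installed; the first call downloads the
    model weights. `speaker_wav` should be a short (~6 s) reference clip.
    """
    from TTS.api import TTS  # deferred: sketch loads without the package

    tts = TTS(MODEL_ID)
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,  # hypothetical reference-clip path
        language=language,        # e.g. "fr" for cross-lingual cloning
        file_path=out_path,
    )

# Example (requires the TTS package and a reference clip on disk):
# clone_voice("Bonjour, ceci est un test.", "speaker.wav", "fr", "out.wav")
```

Passing a `language` different from the reference clip's language is what the text calls cross-lingual cloning: the speaker identity is taken from the clip, the spoken language from the argument.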
The architecture uses an autoregressive GPT-2-style decoder that predicts discrete audio tokens, a Perceiver-based conditioning module that extracts speaker characteristics from reference mel-spectrograms, and a discrete variational autoencoder (VAE) that encodes audio into the token vocabulary the decoder predicts over. Compared with its predecessor, XTTS v2 improves speaker conditioning and prosody, and adds support for multiple speaker references and speaker interpolation.
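Speaker interpolation can be pictured as a blend between two conditioning embeddings. The toy sketch below uses plain Python lists as stand-ins for the model's Perceiver-derived speaker latents, and `interpolate_speakers` is a hypothetical helper for illustration, not a library function:

```python
def interpolate_speakers(emb_a, emb_b, alpha):
    """Linearly blend two speaker embeddings.

    alpha=0.0 returns emb_a's voice, alpha=1.0 returns emb_b's;
    values in between yield an intermediate speaker identity.
    """
    return [(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]

# Toy embeddings standing in for real conditioning latents:
voice_a = [0.2, 0.8, -0.1]
voice_b = [0.6, 0.0, 0.3]
midpoint = interpolate_speakers(voice_a, voice_b, 0.5)  # "halfway" voice
```

Conditioning the decoder on the blended embedding rather than either original is what produces a voice partway between the two reference speakers.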
XTTS v2 supports streaming inference with low latency, often cited as under 200 ms in optimized environments. It was released under the Coqui Public Model License (CPML), which permits only non-commercial use. Although Coqui AI shut down in early 2024, the model remains widely used in the open-source community through its publicly available weights and the associated TTS library.