Zonos-v0.1 is an open-weight text-to-speech (TTS) model developed by Zyphra. It generates natural, expressive speech from text, conditioned on either a speaker embedding or an audio prefix. The model was trained on approximately 200,000 hours of speech. The training data is primarily English, and the model also supports Japanese, Chinese, French, and German.
Architecture
The Zonos-v0.1 suite includes two model variants: a 1.6B-parameter Transformer and a 1.6B-parameter Hybrid-SSM. The hybrid architecture interleaves Mamba2 and transformer layers, applying state space models (SSMs) to audio generation. The pipeline first normalizes and phonemizes the input text with eSpeak, then predicts Descript Audio Codec (DAC) tokens, which are decoded into the output waveform.
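Because the model predicts codec tokens rather than raw samples, the length of the token sequence grows with the duration of the audio. The sketch below estimates that length. The hop length of 512 samples per frame, the nine codebooks per frame, and the reading of "44kHz" as 44.1 kHz are assumptions chosen for illustration and are not stated in this description; the function names are hypothetical.

```python
import math

# All three constants are illustrative assumptions, not confirmed Zonos settings.
SAMPLE_RATE_HZ = 44_100  # assumes "44kHz" means the common 44.1 kHz rate
HOP_LENGTH = 512         # assumed number of audio samples covered by one codec frame
NUM_CODEBOOKS = 9        # assumed number of codebook tokens emitted per frame


def codec_frames(duration_s: float,
                 sample_rate_hz: int = SAMPLE_RATE_HZ,
                 hop_length: int = HOP_LENGTH) -> int:
    """Number of codec frames needed to cover duration_s seconds of audio."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return math.ceil(duration_s * sample_rate_hz / hop_length)


def codec_tokens(duration_s: float, num_codebooks: int = NUM_CODEBOOKS) -> int:
    """Total discrete tokens (frames times codebooks) for duration_s seconds."""
    return codec_frames(duration_s) * num_codebooks


if __name__ == "__main__":
    for seconds in (1.0, 10.0, 30.0):
        print(seconds, codec_frames(seconds), codec_tokens(seconds))
```

Under these assumptions one second of audio corresponds to roughly 86 frames, which is why the backbone operates on a sequence far shorter than the raw waveform.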
Capabilities
The model supports zero-shot voice cloning from a reference audio sample of 5 to 30 seconds. Users can adjust speech parameters such as speaking rate, pitch variation, and audio quality. Emotion conditioning allows the generated speech to reflect states including happiness, anger, sadness, fear, and surprise. Zonos-v0.1 natively generates audio at a 44 kHz sampling rate. It is released under the Apache 2.0 license.
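The controls described above can be pictured as a small bundle of conditioning values that is validated before generation. The following is a minimal sketch of that idea and not the Zonos API: the names Conditioning, normalize_emotion, and build_conditioning are hypothetical, and representing emotion as weights that sum to one is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Emotion names and the 5 to 30 second range come from the description above;
# everything else in this sketch is a hypothetical illustration.
EMOTIONS = ("happiness", "anger", "sadness", "fear", "surprise")
MIN_REFERENCE_S = 5.0
MAX_REFERENCE_S = 30.0


@dataclass(frozen=True)
class Conditioning:
    text: str
    language: str
    reference_seconds: float
    speaking_rate: float
    emotion: Dict[str, float]


def normalize_emotion(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative emotion weights so they sum to 1 over EMOTIONS."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    if any(value < 0 for value in weights.values()):
        raise ValueError("emotion weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one emotion weight must be positive")
    return {name: weights.get(name, 0.0) / total for name in EMOTIONS}


def build_conditioning(text: str,
                       reference_seconds: float,
                       language: str = "en",
                       speaking_rate: float = 1.0,
                       emotion: Optional[Dict[str, float]] = None) -> Conditioning:
    """Validate the inputs and bundle them into a Conditioning record."""
    if not MIN_REFERENCE_S <= reference_seconds <= MAX_REFERENCE_S:
        raise ValueError("reference audio should be 5 to 30 seconds long")
    if speaking_rate <= 0:
        raise ValueError("speaking_rate must be positive")
    # Default to a single neutral-leaning choice when no emotion is given.
    mix = normalize_emotion(emotion or {"happiness": 1.0})
    return Conditioning(text, language, reference_seconds, speaking_rate, mix)


if __name__ == "__main__":
    cond = build_conditioning("Hello there.", reference_seconds=12.0,
                              emotion={"happiness": 3, "sadness": 1})
    print(cond.emotion)
```

Rejecting a reference clip outside the supported range up front gives a clear error instead of a silently degraded voice clone.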