MetaVoice v1, also known as MetaVoice-1B, is an open-source foundation model for human-like, expressive text-to-speech (TTS). It uses a 1.2-billion-parameter transformer-based architecture and was trained on approximately 100,000 hours of speech data. The model is designed to capture natural prosody, rhythm, and emotional tone, with a specific focus on English speech.
Features and Architecture
A primary capability of MetaVoice v1 is zero-shot voice cloning, which allows the model to replicate a target speaker's voice from as little as 30 seconds of reference audio. It also supports cross-lingual voice cloning with minimal fine-tuning. The architecture combines a causal transformer, which autoregressively predicts the coarser hierarchies of audio tokens, with a non-causal transformer, which predicts the remaining finer hierarchies in parallel; a diffusion-based decoder then renders the final waveform from the full token stack.
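The coarse-to-fine dataflow described above can be sketched with dummy stand-ins. This is a hypothetical illustration of the pipeline shape only: the function names, the split of 2 coarse vs. 6 fine hierarchies, the codebook size, and the samples-per-frame ratio are all assumptions, not MetaVoice's real API or configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_COARSE, N_FINE = 2, 6   # assumed: 2 coarse + 6 fine token hierarchies
VOCAB, FRAMES = 1024, 8   # assumed codebook size and frame count

def causal_stage(text_ids, frames=FRAMES):
    """Stand-in for the causal transformer: emit the coarse token
    hierarchies autoregressively, one frame at a time."""
    coarse = np.zeros((N_COARSE, frames), dtype=np.int64)
    for t in range(frames):  # left-to-right generation
        coarse[:, t] = rng.integers(0, VOCAB, size=N_COARSE)
    return coarse

def non_causal_stage(coarse):
    """Stand-in for the non-causal transformer: predict the finer
    hierarchies for every frame in parallel, conditioned on coarse."""
    frames = coarse.shape[1]
    fine = rng.integers(0, VOCAB, size=(N_FINE, frames))
    return np.concatenate([coarse, fine], axis=0)

def diffusion_decoder(tokens, samples_per_frame=320):
    """Stand-in for the diffusion-based decoder: map the full token
    stack to a waveform (here, just noise of the right length)."""
    return rng.standard_normal(tokens.shape[1] * samples_per_frame)

text_ids = [101, 202, 303]  # dummy text token ids
tokens = non_causal_stage(causal_stage(text_ids))
wave = diffusion_decoder(tokens)
print(tokens.shape, wave.shape)  # (8, 8) (2560,)
```

The point of the split is that only the coarse stage pays the cost of sequential, token-by-token generation; the finer hierarchies are filled in for all frames at once.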
The model is released under the Apache 2.0 license, facilitating both research and commercial use. It is designed to handle long-form synthesis and aims to minimize unintended audio artifacts or hallucinations during the generation process.
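Long-form synthesis is commonly handled by splitting a script at sentence boundaries, synthesising each piece, and joining the results with short silences. The sketch below shows that generic pattern, not MetaVoice's actual mechanism; `synthesise_chunk` is a hypothetical placeholder that returns silence scaled to the text length, where a real system would return audio.

```python
import re
import numpy as np

def synthesise_chunk(text, sr=24000):
    """Hypothetical per-chunk TTS call: returns a placeholder waveform
    whose length scales with the text (a real model returns audio here)."""
    return np.zeros(int(0.05 * sr * len(text)))

def synthesise_long_form(text, sr=24000, pause_s=0.3):
    """Generic long-form pattern: split on sentence boundaries,
    synthesise each piece, and join with short silences."""
    chunks = [c.strip() for c in re.split(r"(?<=[.!?])\s+", text) if c.strip()]
    pause = np.zeros(int(pause_s * sr))
    pieces = []
    for c in chunks:
        pieces.append(synthesise_chunk(c, sr))
        pieces.append(pause)
    return np.concatenate(pieces[:-1])  # drop the trailing pause

audio = synthesise_long_form("Hello there. This is a long script! Goodbye?")
print(audio.shape)
```

Chunking keeps each generation pass short, which is one way systems bound drift and the artifact accumulation mentioned above.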