Kokoro 82M v1.0 by Kokoro: Benchmarks, Rankings & Model Details

Kokoro 82M v1.0 is an open-weight text-to-speech (TTS) model designed for efficient audio synthesis. Its compact 82-million-parameter architecture allows fast inference and deployment on consumer-grade hardware. Despite its small size, it produces speech quality comparable to that of much larger models and has ranked highly in benchmarks such as the TTS Spaces Arena.
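To make the size claim concrete, the short sketch below estimates the raw weight size implied by 82 million parameters at a few common numeric precisions. It is back-of-the-envelope arithmetic only: the function name and the choice of formats are illustrative assumptions, and the figures ignore file-format overhead, voice data, and runtime activations.

```python
PARAMS = 82_000_000  # parameter count stated in the model name

# Standard storage sizes for common numeric formats, in bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_mb(params: int, dtype: str) -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weights_size_mb(PARAMS, dtype):.0f} MB")
```

Under these assumptions the weights come to roughly 328 MB in float32, 164 MB in float16, and 82 MB in int8, which is consistent with the model fitting comfortably on consumer hardware.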

Architecture and Design

The model is built on the StyleTTS 2 architecture and uses an ISTFTNet vocoder. It employs a streamlined decoder-only structure that operates without diffusion or a separate encoder, which contributes to its speed and low memory footprint. Text processing is handled by external grapheme-to-phoneme (G2P) libraries, such as espeak-ng and misaki, which convert input text into International Phonetic Alphabet (IPA) tokens before synthesis.
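The text-to-token step described above can be illustrated with a toy sketch. This is not the API of espeak-ng or misaki, and it is not Kokoro's actual vocabulary: the two-word lexicon, the transcriptions, the function names, and the ID assignment are all hypothetical, chosen only to show how text becomes IPA symbols and then integer tokens.

```python
import re

# Toy pronunciation lexicon (hypothetical). Real G2P libraries rely on large
# dictionaries plus fallback rules for unknown words.
LEXICON = {
    "hello": "həlˈoʊ",
    "world": "wˈɜːld",
}


def to_phonemes(text: str) -> str:
    """Look up each word in the lexicon and join the IPA strings with spaces."""
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(LEXICON[word] for word in words)  # KeyError if unknown


def build_vocab(phonemes: str) -> dict[str, int]:
    """Assign an integer ID to each distinct symbol; 0 is left free for padding."""
    return {symbol: i + 1 for i, symbol in enumerate(sorted(set(phonemes)))}


def to_token_ids(phonemes: str, vocab: dict[str, int]) -> list[int]:
    """Map each IPA symbol (one Unicode code point) to its integer ID."""
    return [vocab[symbol] for symbol in phonemes]


if __name__ == "__main__":
    ipa = to_phonemes("Hello, world!")
    vocab = build_vocab(ipa)
    print(ipa)
    print(to_token_ids(ipa, vocab))
```

In a real pipeline the vocabulary is fixed at training time rather than built from the input, so the same symbol always maps to the same ID.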

Capabilities

Version 1.0 supports 8 languages and ships with a library of 54 distinct voices. It supports various audio output formats and handles long-form content by processing text in segments. The model is also resilient to quantization, which allows its size to be reduced further for optimized deployments.
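Segment-wise processing of long text can be sketched as follows. This is a minimal illustration, not Kokoro's actual segmentation logic: the function name, the sentence-boundary regular expression, and the `max_chars` limit are assumptions made for the example.

```python
import re


def split_into_segments(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into segments of at most max_chars.

    A single sentence longer than max_chars is kept as its own oversized
    segment rather than being cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    for segment in split_into_segments("One. Two! Three? Four.", max_chars=10):
        print(segment)
```

Each segment would then be synthesized separately and the resulting audio concatenated. Splitting at sentence boundaries keeps the joins at natural pauses.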

Training and Data

Kokoro was trained exclusively on permissively licensed and non-copyrighted audio data, including public domain recordings and synthetic audio. Training the v1.0 release took approximately 1,000 GPU hours on A100 instances, with a focus on transparency and reproducibility. The model and its weights are released under the Apache-2.0 license.
