Fish Speech 1.5 by Fish Audio: Benchmarks, Rankings & Model Details

Fish Speech 1.5 is a multilingual text-to-speech (TTS) model developed by Fish Audio. It uses a Dual Autoregressive (DualAR) architecture that pairs a transformer-based large language model (LLM) for semantic token prediction with a Firefly-GAN (VQ-GAN) acoustic decoder for audio reconstruction. The model does not depend on phoneme-level preprocessing; it relies on its generalization capabilities to process raw text directly in 13 languages, including English, Chinese, Japanese, German, French, Spanish, and Arabic.
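The two-stage data flow described above can be illustrated with a toy sketch. This is not the Fish Speech API: every name here (`predict_semantic_tokens`, `decode_to_waveform`, `synthesize`, `CODEBOOK_SIZE`) is hypothetical, and both stages are deterministic stand-ins for the real LLM and Firefly-GAN decoder. It only shows that raw text goes in without a phoneme step, discrete tokens sit in the middle, and audio samples come out.

```python
from typing import List

CODEBOOK_SIZE = 1024  # hypothetical codebook size, chosen only for illustration


def predict_semantic_tokens(text: str, frames_per_byte: int = 2) -> List[int]:
    """Stand-in for the LLM stage: map raw text (no phonemes) to discrete tokens."""
    tokens: List[int] = []
    for byte in text.encode("utf-8"):
        for k in range(frames_per_byte):
            # A real model would sample tokens autoregressively; this is a fixed hash.
            tokens.append((byte * 31 + k) % CODEBOOK_SIZE)
    return tokens


def decode_to_waveform(tokens: List[int], samples_per_token: int = 4) -> List[float]:
    """Stand-in for the acoustic decoder: expand each token into samples in [-1, 1]."""
    waveform: List[float] = []
    for token in tokens:
        level = 2.0 * token / (CODEBOOK_SIZE - 1) - 1.0
        waveform.extend([level] * samples_per_token)
    return waveform


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> semantic tokens -> waveform samples."""
    return decode_to_waveform(predict_semantic_tokens(text))
```

Keeping the stages behind separate functions mirrors the split in the description: the token predictor and the decoder only share the discrete token sequence as an interface.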

Architecture and Capabilities

The model was trained on more than 1 million hours of multilingual audio. This training scale enables zero-shot voice cloning: the model can replicate a target speaker's voice from only a few seconds of reference audio. It also supports cross-lingual synthesis, in which a speaker's characteristics from one language are applied to text in another. In addition, the model offers control over more than 64 emotional expressions and voice styles through specific text markers.

Performance

Fish Speech 1.5 is optimized for low-latency generation, achieving a real-time factor of approximately 1:7 (roughly seven seconds of audio per second of compute) on modern GPU hardware such as the NVIDIA RTX 4090. The model supports long-form audio generation through an expanded context window and prompt-based synthesis. The weights are released under the CC-BY-NC-SA-4.0 license, while the inference and training code is available under the Apache 2.0 license.
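As a rough worked example of what the 1:7 figure implies, the sketch below estimates compute time for a clip of a given length. It assumes the ratio means one second of compute per seven seconds of generated audio; the function names and the `speedup` parameter are illustrative, and actual throughput will vary with hardware, batch size, and text.

```python
def generation_seconds(audio_seconds: float, speedup: float = 7.0) -> float:
    """Estimate compute time needed to synthesize `audio_seconds` of speech.

    `speedup` is seconds of audio produced per second of compute
    (7.0 under the assumed reading of the ~1:7 figure).
    """
    if audio_seconds < 0 or speedup <= 0:
        raise ValueError("audio_seconds must be >= 0 and speedup must be > 0")
    return audio_seconds / speedup


def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """Real-time factor as compute time divided by audio duration (below 1 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be > 0")
    return compute_seconds / audio_seconds


if __name__ == "__main__":
    # Under the assumed 1:7 ratio, a 70-second clip needs about 10 seconds of compute.
    print(generation_seconds(70.0))
```

Under this reading, a ten-minute narration (600 seconds of audio) would take on the order of 86 seconds to generate on comparable hardware.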
