Fish Audio S2 Pro

Open-weights model by Fish Audio. Released Mar 2026. Parameters: 4.4B. AA Arena rank: #9.

Fish Audio S2 Pro is an open-weights large audio model (LAM) designed for expressive text-to-speech (TTS) and voice cloning. It is the flagship successor to the Fish Speech S1 series, introducing a novel Dual-Autoregressive (Dual-AR) architecture to balance high-fidelity acoustic generation with efficient inference. The model supports over 80 languages and is optimized for low-latency production streaming, achieving a time-to-first-audio (TTFA) of approximately 100ms on high-end hardware.

Architecture and Design

The architecture comprises two specialized stages. The Slow AR component, a 4-billion parameter model based on a decoder-only transformer backbone, operates along the time axis to predict primary semantic codebooks and capture prosodic structure. The Fast AR component, a 400-million parameter model, generates the remaining residual codebooks for each time step. This hierarchical approach uses Residual Vector Quantization (RVQ) with 10 codebooks to reconstruct 44.1kHz audio while keeping the token count manageable for the transformer.
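The two-stage decoding loop described above can be sketched as follows. This is a toy illustration of the Dual-AR control flow only: the stand-in predictors, codebook size, and frame count are my own assumptions, not Fish Audio's implementation.

```python
import numpy as np

N_CODEBOOKS = 10   # RVQ codebooks per frame (per the article)
VOCAB = 1024       # assumed codebook size, for illustration
N_FRAMES = 5       # frames to generate in this toy run

rng = np.random.default_rng(0)

def slow_ar_step(history):
    """Stand-in for the 4B Slow AR model: predicts the primary
    (semantic) codebook token for the next time step."""
    return int(rng.integers(0, VOCAB))

def fast_ar_step(primary, partial_residuals):
    """Stand-in for the 400M Fast AR model: predicts one residual
    codebook token, conditioned on the primary token and the
    residual tokens already emitted for this frame."""
    return int(rng.integers(0, VOCAB))

def generate(n_frames):
    frames = []
    for _ in range(n_frames):
        primary = slow_ar_step(frames)          # Slow AR: one step along the time axis
        residuals = []
        for _ in range(N_CODEBOOKS - 1):        # Fast AR: remaining codebooks for this frame
            residuals.append(fast_ar_step(primary, residuals))
        frames.append([primary] + residuals)    # one complete RVQ frame
    return frames

codes = generate(N_FRAMES)
# Each frame is a list of 10 codebook indices for the audio detokenizer.
```

The point of the split is visible in the loop structure: the large model runs once per time step, while the cheaper model fills in the depth axis, which keeps per-frame cost low enough for streaming.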

A defining feature of S2 Pro is its support for fine-grained inline control via natural language. Users can embed free-form textual descriptions within brackets—such as [whisper], [laugh], or [professional broadcast tone]—directly at specific word positions to steer prosody and emotion. Because these tags are processed as standard text rather than predefined tokens, the system supports thousands of unique expressive cues. Additionally, the model natively handles multi-speaker dialogue and zero-shot voice cloning from reference audio samples ranging from 10 to 30 seconds.

Performance and Capabilities

During training, the system leveraged a unified data-and-reward pipeline, where models used for data filtering served as reward models for reinforcement learning (RL) alignment. On benchmarks like the Audio Turing Test and EmergentTTS-Eval, S2 Pro demonstrated high win rates against both open and closed-source baselines in paralinguistics and syntactic complexity. The model is compatible with advanced serving optimizations including continuous batching and RadixAttention through its structural isomorphism with standard autoregressive language models.

Prompting Guide

  • Inline Emotion Tags: Use brackets to insert style cues at any point in the text, for example: [excited] Hello! [whispers] How are you?
  • Multi-Speaker Dialogue: Use speaker tokens to switch between different cloned voices within a single pass, for example: <|speaker:1|> Hello! <|speaker:2|> Nice to meet you.
  • Voice Cloning: Provide 10–30 seconds of clear reference audio to extract the target speaker's features.
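Because the tags and speaker tokens above are plain text, prompts can be assembled with ordinary string formatting. The helper names below are my own for illustration; they are not part of any official Fish Audio SDK.

```python
def speaker(idx: int, text: str) -> str:
    """Wrap text with the dialogue speaker token shown above."""
    return f"<|speaker:{idx}|> {text}"

def tag(cue: str, text: str) -> str:
    """Prefix text with a free-form inline style cue in brackets."""
    return f"[{cue}] {text}"

prompt = " ".join([
    speaker(1, tag("excited", "Hello!")),
    speaker(2, tag("whispers", "How are you?")),
])
print(prompt)
# <|speaker:1|> [excited] Hello! <|speaker:2|> [whispers] How are you?
```

Since cues are free-form text rather than a fixed token set, tag("professional broadcast tone", ...) works the same way as the short examples above.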
