OpenAudio S1 is a large-scale audio foundation model developed by Fish Audio. It is a unified system for high-fidelity text-to-speech (TTS) and voice cloning, designed to reproduce the expressiveness and nuance of professional human speech. The model uses a dual-autoregressive architecture that models semantic and acoustic information jointly in a single stage, reducing the artifacts and information loss associated with traditional multi-stage pipelines.
Trained on more than 2 million hours of multilingual audio, OpenAudio S1 supports zero-shot and few-shot voice cloning from audio samples as short as 10 seconds. It is further refined with online reinforcement learning from human feedback (RLHF) using the Group Relative Policy Optimization (GRPO) algorithm, which is applied to improve prosody and intonation. This training helps the model follow complex instructions and precise emotional cues.
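The core idea behind GRPO can be illustrated with a small sketch: several candidate outputs are sampled for the same prompt, each is scored, and every score is normalized against the mean and spread of its own group rather than against a learned value function. The code below is a minimal illustration under stated assumptions, not Fish Audio's training code. The function name `group_relative_advantages`, the example reward values, and the choice of population standard deviation are all assumptions made for the example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one (hypothetical) preference score per candidate
    generated for the same prompt. Population std is an assumption here;
    implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when all candidates score the same.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for four synthesized readings of one sentence.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Candidates that score above their group's average receive a positive advantage and are reinforced, while those below the average receive a negative one. When every candidate scores the same, all advantages are zero and the group contributes no learning signal.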
The model offers fine-grained control over delivery through more than 50 distinct emotion and tone markers, written in parentheses, such as (angry), (whispering), (joyful), and (chuckling). It is released in two variants: a flagship 4B-parameter model and a distilled 0.5B-parameter model, OpenAudio S1-mini, which is optimized for lower-latency applications while retaining the core capabilities.
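As a rough illustration of how parenthesized markers might be handled on the client side before text is sent for synthesis, the sketch below separates recognized markers from the spoken text. It assumes only the four markers named above; the full marker list, the helper name `split_markers`, and the rule that unrecognized parentheticals are left in the text are all hypothetical choices for this example, not part of any documented OpenAudio interface.

```python
import re

# Only the four markers named in the text; the full set is not listed here.
KNOWN_MARKERS = {"angry", "whispering", "joyful", "chuckling"}
MARKER_RE = re.compile(r"\(([a-z ]+)\)")


def split_markers(text):
    """Return (markers, plain_text) for a marker-annotated string.

    Recognized markers are collected in order and removed from the text.
    Unrecognized parentheticals are kept, since they may be ordinary prose.
    """
    markers = [m for m in MARKER_RE.findall(text) if m in KNOWN_MARKERS]

    def drop_known(match):
        return "" if match.group(1) in KNOWN_MARKERS else match.group(0)

    plain = MARKER_RE.sub(drop_known, text)
    # Collapse the whitespace left behind by removed markers.
    return markers, " ".join(plain.split())


markers, plain = split_markers("(angry) I told you twice. (whispering) Not here.")
```

A helper like this can be used to check a script for misspelled markers before synthesis, or to recover the plain transcript for captions.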