SIMBA 1.6 (Scaling Inference-based Model for Better Audio) is a proprietary text-to-speech (TTS) model developed by Speechify's AI Research Lab. Designed to bridge the gap between synthetic and human narration, it focuses on generating highly expressive audio with natural intonation and prosody. The model is engineered for production-grade workloads, offering a balance of high-fidelity output and low-latency performance.
As a core component of the Speechify Voice API, SIMBA 1.6 supports advanced vocal features, including zero-shot voice cloning, which creates a custom voice from a 10 to 30 second audio sample. It also offers emotion control and full SSML support, enabling fine-grained adjustments to pitch, emphasis, and speaking rate to match the context of the text.
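
To make the SSML-based adjustments concrete, the following is a minimal sketch of how a client might assemble markup that changes speaking rate and pitch and stresses one phrase. It uses only the standard W3C SSML elements `<speak>`, `<prosody>`, and `<emphasis>`. The helper name `build_ssml` and its parameters are hypothetical, and which elements and attribute values SIMBA 1.6 actually honors is an assumption that should be checked against the Speechify Voice API documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(text: str, emphasized: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap `text` in <speak>/<prosody> and mark the first occurrence of
    `emphasized` (a non-empty phrase) with <emphasis level="strong">.

    Hypothetical helper: element support by any given TTS engine is assumed.
    """
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    before, found, after = text.partition(emphasized)
    prosody.text = before
    if found:
        emphasis = ET.SubElement(prosody, "emphasis", {"level": "strong"})
        emphasis.text = emphasized
        emphasis.tail = after  # text that follows the emphasized phrase
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("This release is really fast.", "really", rate="slow", pitch="+2st")
print(ssml)
```

Building the markup with an XML library rather than string concatenation keeps the document well-formed even when the input text contains reserved characters.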
The model's architecture is optimized for long-form stability, so the voice stays consistent over extended readings such as audiobooks or lengthy articles. In independent benchmarking, SIMBA 1.6 has been noted for its high generation speed, measured in characters per second (CPS), while maintaining competitive quality scores in user-preference evaluations such as the Speech Arena.
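
As an illustration of the CPS metric, the sketch below divides the number of input characters by the wall-clock time taken to synthesize them. This definition is an assumption: a given benchmark may count characters differently (for example, excluding whitespace or markup). The function name is hypothetical, and the figures are placeholders chosen for easy arithmetic, not measured results for SIMBA 1.6.

```python
def characters_per_second(text: str, generation_seconds: float) -> float:
    """Return input characters synthesized per second of wall-clock time.

    Assumed definition of CPS; real benchmarks may count characters differently.
    """
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return len(text) / generation_seconds


# Placeholder values for illustration only, not benchmark data.
sample_text = "a" * 1200
cps = characters_per_second(sample_text, 2.5)
print(f"{cps:.0f} CPS")
```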
SIMBA 1.6 is built to handle structured content, integrating with document intelligence layers that process complex PDFs and web pages. This lets the model skip non-narrative text such as headers and footers and deliver a coherent, logically ordered audio stream to end users.
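
How Speechify's document intelligence layer detects non-narrative text is not described here. Purely as an illustration of the idea, the sketch below applies one common heuristic: a line that appears at the top or bottom of most pages (after replacing digits, so that "Page 1" and "Page 2" match) is treated as a header or footer and dropped before narration. The function name, the `min_share` threshold, and the sample pages are all hypothetical.

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    """Collapse digit runs so numbered footers compare equal across pages."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages: list[list[str]], min_share: float = 0.6) -> list[str]:
    """Drop first/last lines that recur on at least `min_share` of pages.

    Illustrative heuristic only; not the method used by any specific product.
    """
    if len(pages) < 2:
        # Repetition cannot be judged from a single page, so keep everything.
        return [line for lines in pages for line in lines]

    edge_counts = Counter()
    for lines in pages:
        if lines:
            # A set avoids double counting on one-line pages.
            edge_counts.update({_normalize(lines[0]), _normalize(lines[-1])})
    threshold = min_share * len(pages)
    boilerplate = {key for key, n in edge_counts.items() if n >= threshold}

    narrative = []
    for lines in pages:
        body = list(lines)
        if body and _normalize(body[0]) in boilerplate:
            body = body[1:]
        if body and _normalize(body[-1]) in boilerplate:
            body = body[:-1]
        narrative.extend(body)
    return narrative


pages = [
    ["Annual Report", "Revenue grew in the first quarter.", "Page 1"],
    ["Annual Report", "Costs stayed flat.", "Page 2"],
    ["Annual Report", "Outlook remains stable.", "Page 3"],
]
narration = strip_repeated_edges(pages)
print(narration)
```

A production system would likely also use layout cues such as position on the page and font size, which plain text lines do not carry.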