Google logo
Google

Gemini 3.1 Flash TTS

Released Apr 2026

Gemini 3.1 Flash TTS is a production-grade text-to-speech model released by Google on April 15, 2026. Developed as part of the Gemini 3.1 family, it focuses on high-fidelity, expressive audio generation while maintaining the low latency and cost-efficiency characteristic of the Flash series. The model is designed to handle complex narration tasks, providing more granular control over vocal performance than previous iterations.

A central feature of Gemini 3.1 Flash TTS is its system of over 200 audio tags, which allow users to guide vocal style, pacing, and emotion using natural language commands. By embedding tags such as [whispered], [excited], or [shouting] directly into the text, the model can adjust its delivery mid-sentence. This instruction-based workflow eliminates the need for legacy markup like SSML, making it easier to direct specific performances for audiobooks, podcasts, and digital assistants.

Capabilities and Performance

The model supports native multi-speaker dialogue, allowing for consistent character voices and natural turn-taking within a single generation pass. It covers more than 70 languages and includes specialized regional accent controls, such as American "Southern," British "RP," and various local dialects. In benchmarking by Artificial Analysis, Gemini 3.1 Flash TTS achieved an Elo score of 1,211, placing it in the top tier of TTS models for its balance of audio quality and generation speed.

For safety and content authenticity, the model integrates Google's SynthID watermarking technology. This digital watermark is woven directly into the audio output, providing a method for identifying AI-generated content without affecting the audible quality of the speech. The model supports a text input limit of up to 16,000 tokens and is optimized for high-volume production environments.

Rankings & Comparison