Google just dropped Gemini 3.1 Flash TTS, and for once, the hype might actually be justified. I’ve been testing AI speech models for years, and the biggest frustration has always been the lack of fine-grained control. You’d get a voice that sounds decent, but try to make it speak faster, add a pause, or sound excited about something? Good luck.
This new model from the Gemini team finally addresses that. The headline feature is something they call “audio tags” — think of them as inline commands you can embed directly into your text input. Want the AI to speak faster? Just add a tag. Need a dramatic pause before the punchline? There’s a tag for that. It’s not revolutionary in concept (we’ve seen similar approaches in niche TTS tools), but Google is baking it into a production-grade model that supports over 70 languages and outputs genuinely natural-sounding speech.
Quality that actually competes
On the Artificial Analysis TTS leaderboard, which runs thousands of blind preference tests, Gemini 3.1 Flash TTS scored an Elo of 1,211. That’s competitive with the best proprietary models out there, and significantly better than what Google was shipping before. The model also lands in what Artificial Analysis calls the “most attractive quadrant” — a fancy way of saying it balances high quality with low cost. That matters if you’re building anything at scale.
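To put an Elo number in perspective, it helps to translate a rating gap into an expected preference rate. The sketch below uses the conventional Elo formula with a 400-point scale; whether Artificial Analysis uses exactly this scale is an assumption on my part, and the rival rating is a made-up figure for illustration, not a real leaderboard entry.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the conventional Elo model.

    Assumes the standard logistic curve with a 400-point scale. The
    leaderboard's exact scoring method is not confirmed by this article.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


gemini_tts = 1211        # Elo quoted in this article
hypothetical_rival = 1150  # invented comparison rating, for illustration only

rate = expected_win_rate(gemini_tts, hypothetical_rival)
print(f"Expected preference rate vs. rival: {rate:.1%}")
```

The takeaway is that Elo gaps compress quickly: a lead of a few dozen points means listeners prefer one model only slightly more often than a coin flip.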
Audio tags: the real differentiator
Here’s how the audio tags work in practice. You write your text as normal and drop in bracketed natural-language commands such as [speak faster], [whisper], or [pause 500ms]. The model interprets these and adjusts the output accordingly. There’s no separate configuration file or complex API call; the instructions live in the text itself.
The idea isn’t new, but Google’s implementation feels more robust. The tags are granular enough to control pacing, vocal style, and even emotional tone. For developers building interactive voice applications, that’s a big deal: you can finally script dialogue that doesn’t sound like a robot reading a manual.
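Because the tags are just text, preparing input is ordinary string handling. Here is a minimal sketch of how you might compose a tagged prompt and recover the plain transcript afterwards. The helper names (`tag`, `build_prompt`, `strip_tags`) are my own, and the bracket syntax is taken from the examples above rather than from an official specification, so treat the details as assumptions.

```python
import re

# Matches one bracketed tag such as "[whisper]". The bracket syntax follows
# the examples in this article; it is not an official grammar.
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def tag(instruction: str) -> str:
    """Wrap a natural-language instruction in square brackets."""
    instruction = instruction.strip()
    if not instruction or "[" in instruction or "]" in instruction:
        raise ValueError("instruction must be non-empty and contain no brackets")
    return f"[{instruction}]"


def build_prompt(*parts: str) -> str:
    """Join tags and plain text into one string to send as TTS input."""
    return " ".join(p.strip() for p in parts if p.strip())


def strip_tags(text: str) -> str:
    """Remove tags to recover the spoken transcript, e.g. for captions."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()


prompt = build_prompt(
    tag("speak faster"), "You will not believe what happened next.",
    tag("pause 500ms"),
    tag("whisper"), "It actually worked.",
)
print(prompt)
print(strip_tags(prompt))
```

Keeping a `strip_tags` step around is handy if you also need subtitles or a searchable transcript, since the tags themselves are never meant to be read aloud.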
Multi-speaker and multilingual
Another feature worth calling out: native multi-speaker dialogue. The model can switch between different voices within a single audio stream, which makes it ideal for generating podcasts, audiobooks, or any content with multiple characters. Combined with support for 70+ languages, this covers a lot of use cases out of the box.
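For multi-speaker content, one plausible way to prepare input is a script with one labelled line per turn. The sketch below assumes a "Speaker: text" layout; the `Turn` class, the helper names, and the speaker labels are all hypothetical, and how labels map to actual voices depends on the API configuration, which this article does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str  # label for the voice; hypothetical, not an official voice name
    text: str     # what the speaker says, optionally with bracketed audio tags


def render_script(turns: list[Turn]) -> str:
    """Render turns as one 'Speaker: text' line each (assumed layout)."""
    lines = []
    for turn in turns:
        speaker = turn.speaker.strip()
        if not speaker or ":" in speaker or "\n" in speaker:
            raise ValueError(f"invalid speaker label: {turn.speaker!r}")
        # Collapse internal whitespace so each turn stays on a single line.
        lines.append(f"{speaker}: {' '.join(turn.text.split())}")
    return "\n".join(lines)


def distinct_speakers(turns: list[Turn]) -> list[str]:
    """Speaker labels in order of first appearance."""
    return list(dict.fromkeys(t.speaker.strip() for t in turns))


dialogue = [
    Turn("Host", "Welcome back to the show."),
    Turn("Guest", "[whisper] Glad to be here."),
    Turn("Host", "Let's get  started."),
]
script = render_script(dialogue)
print(script)
print(distinct_speakers(dialogue))
```

Collecting the distinct speakers up front makes it easy to check that every label in the script has a voice assigned before you spend an API call on it.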
Availability and watermarking
Right now, Gemini 3.1 Flash TTS is in preview. Developers can access it through the Gemini API and Google AI Studio. Enterprise teams get it via Vertex AI. Workspace users will find it in Google Vids. All generated audio is watermarked with SynthID, Google’s tool for identifying AI-generated content. That’s a responsible move, especially as deepfake audio becomes harder to spot.
The catch
Is it perfect? No. Preview means rough edges. I expect some tags won’t work as intended in every language. The multi-speaker feature is impressive but still feels a bit rigid compared to truly dynamic voice generation. And while the cost is low, it’s not free — you’re paying for API calls.
Still, this is the most promising TTS model Google has released in years. If you build voice applications, it’s worth testing. If you just want to hear what AI speech sounds like in 2026, the samples speak for themselves.