Google’s Gemini 3.1 Flash TTS ships with 200 audio tags and undercuts ElevenLabs on price

Google released Gemini 3.1 Flash TTS on April 15. It landed with a 1,211 Elo on the Artificial Analysis TTS leaderboard — enough to make it the company’s best-sounding speech model to date, and close enough to ElevenLabs that the comparison is no longer embarrassing for Google. The bigger news is what sits underneath that number.

The model ships with more than 200 audio tags. Developers can steer delivery by typing [whispering] or [with a vocal smile] directly into the prompt. You can write “Brixton, London accent, infectious energy, slightly breathless” and the model treats it like a director’s note. Simon Willison tested regional British accents and reported they actually hold up.
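The tags ride along in the prompt text itself, so a request is just a transcript with inline markers plus a style note up front. Below is a minimal sketch of a `generateContent` request body, assuming the speech-config shape of Google's earlier TTS preview models carries over to this release; the voice name `Kore` is illustrative.

```python
# Build a generateContent request body for a single-speaker TTS clip.
# The [whispering] tag and the accent note come straight from the prompt;
# the config field names mirror Google's earlier TTS previews and may
# differ for this release.

def tts_request(text: str, style: str, voice: str = "Kore") -> dict:
    """Assemble the JSON body for a text-to-speech generateContent call."""
    return {
        "model": "gemini-3.1-flash-tts-preview",
        "contents": [{"parts": [{"text": f"{style}: {text}"}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice}
                }
            },
        },
    }

body = tts_request(
    "[whispering] Don't tell anyone, but the launch is tomorrow.",
    style="Brixton, London accent, infectious energy, slightly breathless",
)
```

The style note acts as the director's instruction for the whole clip, while bracketed tags steer delivery word by word.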

Multi-speaker dialogue is native. That matters. Anyone who has tried to stitch a two-person podcast from separate voice generations knows the pauses never quite land and the energy resets between turns. Gemini 3.1 Flash TTS keeps the conversational thread intact across speakers — useful for audiobook dialogue and voice agents that simulate handoffs without audible seams.

The economics are the real lever. Google is charging $0.018 per minute of audio output through the Gemini API. That sits below most ElevenLabs agent tiers and roughly matches OpenAI’s pricing. At 10,000 agent-hours a month, the gap against premium ElevenLabs plans runs into five figures. Artificial Analysis placed the model in what it calls the “most attractive quadrant” — high quality at low cost. That’s the exact position ElevenLabs used to own alone.
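The arithmetic behind that five-figure gap is short. At $0.018 per output minute, a month of agent audio prices out as follows; the $0.05-per-minute competitor rate is an illustrative assumption for a premium plan, not a quoted ElevenLabs price.

```python
# Monthly audio cost at a given per-minute output rate.
def monthly_cost(agent_hours: float, rate_per_min: float) -> float:
    """Total monthly spend for a fleet of voice agents."""
    return agent_hours * 60 * rate_per_min

gemini = monthly_cost(10_000, 0.018)   # $10,800 at Google's published rate
premium = monthly_cost(10_000, 0.05)   # $30,000 at an assumed premium rate
gap = premium - gemini                 # $19,200 -- five figures
```

Even halving the assumed competitor rate leaves a gap of several thousand dollars a month, which is why the pricing, not the Elo score, is the lever here.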

Availability is broad from day one. Developers reach it through Google AI Studio and the Gemini API with the model ID gemini-3.1-flash-tts-preview. Enterprises hit the same model through Vertex AI. Workspace users get it inside Google Vids, where Google also shipped 16 new languages for AI voiceover on the same day. Total language support sits past 70.

One caveat worth stating: this is a preview release. Input is capped at 8,192 tokens, output at 16,384. The model can't do function calling or streaming through the Live API. Every clip carries a SynthID watermark baked into the audio itself, a reasonable safety move, but worth knowing if you plan to layer additional processing on top. For a pure voice-cloning studio workflow with hand-tuned emotional range, ElevenLabs still has the edge. The Google model is built for scale and control, not for replacing a studio sound engineer.

Developers building voice agents were waiting for exactly this release. Google shipped something priced for production and expressive enough to use, wrapped inside the API they were already calling. ElevenLabs just lost its easiest sales pitch.


Sources

Google Blog
Simon Willison
SiliconANGLE

This article is AI-generated.