Skip to main content

Gemini 3.1 Flash Live API: Model ID, Pricing, and Quickstart Guide (March 2026)

A
15 min readAPI Guides

If you mean Google's new real-time voice model, the official model ID is `gemini-3.1-flash-live-preview` and you use it through the Gemini Live API. Start new voice-agent work on 3.1, but do not assume it is a drop-in replacement for Gemini 2.5 Flash Live.

Gemini 3.1 Flash Live API: Model ID, Pricing, and Quickstart Guide (March 2026)

If by "Gemini 3.1 Flash Live API" you mean Google's new real-time voice model, the official model you want is gemini-3.1-flash-live-preview, and the API surface is the Gemini Live API. That naming matters because Google split the useful details across several pages: the model page, the Live API overview, the capabilities guide, the pricing page, the ephemeral token guide, and the March 26, 2026 launch post.

The short answer is this: start new low-latency voice-agent builds on Gemini 3.1 Flash Live today. But do not treat it as a clean drop-in upgrade from gemini-2.5-flash-native-audio-preview-12-2025. Google improved the voice quality, latency posture, and operational ceiling, yet it also changed the thinking config, server event shape, incremental input model, and tool-use behavior. If your 2.5 stack depended on async function calling, proactive audio, or affective dialog, migrating blindly will make your app worse before it makes it better.

Evidence note: this guide is based on Google's official developer docs and launch post, all rechecked on March 28, 2026. Where Google's own public pages still conflict or stay vague, I keep the uncertainty explicit instead of filling it in from memory.

TL;DR

If you need to know this firstCurrent answer
Exact model IDgemini-3.1-flash-live-preview
Exact API surfaceGemini Live API over a stateful WebSocket connection
Launch dateMarch 26, 2026
Best starting use casereal-time voice agents that need low latency, multimodal awareness, and stronger audio quality
Default recommendationstart new voice-agent work on 3.1
Main reason to stay on 2.5 for nowyou still need async function calling, proactive audio, or affective dialog
Biggest migration changesthinkingBudget becomes thinkingLevel, one server event can contain multiple parts, send_realtime_input replaces incremental send_client_content, and tool calls are synchronous only
Paid pricing shapetext in \$0.75 / 1M tokens, audio in \$3 / 1M tokens or \$0.005 / min, image/video in \$1 / 1M tokens or \$0.002 / min, text out \$4.50 / 1M tokens, audio out \$12 / 1M tokens or \$0.018 / min
Browser-safe pathmint ephemeral tokens on your backend, then connect directly from the client
Hidden gotchadefault turn coverage now includes all video frames, which can raise costs if you stream video continuously

What Google actually launched on March 26, 2026

Google's launch post describes Gemini 3.1 Flash Live as its latest real-time audio model and says it is available to developers in preview through the Gemini Live API in Google AI Studio. The official model page narrows that into the exact developer contract: the model code is gemini-3.1-flash-live-preview, it accepts text, images, audio, and video, and it is built for low-latency dialogue with acoustic nuance detection, numeric precision, and multimodal awareness.

That may sound like a branding detail, but it clears up a common setup mistake. There is no separate standalone product called "Gemini 3.1 Flash Live API." There is a Gemini Live API, and this is one of the models you run on top of it. The Live API itself is a stateful WebSocket session designed for streaming interactions, interruptions, and continuous media, not a normal request-response loop like generateContent.

The model is stronger than a plain "voice bot" label suggests. Google's docs say it supports function calling and Google Search grounding, which means the intended use case is not passive speech synthesis. It is voice-ready agents that can hear, reason, call tools, and respond naturally in the same session. The same docs also state a January 2025 knowledge cutoff, so if your agent needs current information, you should plan to add search grounding or your own retrieval instead of assuming the model's base knowledge is enough.

There are two technical nuances worth catching early. First, the model page lists text and audio output, but the Live API capabilities guide says native audio models only support the AUDIO response modality. In practice, if you want readable text alongside the spoken response, Google's guidance is to use output audio transcription rather than assuming a plain text-only response path. Second, Google's launch post says all generated audio is watermarked with SynthID. If you are building consumer-facing voice experiences, that safety behavior is part of the product contract, not an afterthought.

Should you start with Gemini 3.1 Flash Live or stay on 2.5?

For new builds, Gemini 3.1 Flash Live is the right default. Google's own launch post positions it as the new top-quality audio model, and the migration notes show that the docs team expects people to move from gemini-2.5-flash-native-audio-preview-12-2025 to 3.1. But "move to 3.1" is not the same thing as "3.1 is a strict superset."

The practical question is whether your current app depends on capabilities that 2.5 still exposes and 3.1 does not. That is why the migration table matters more than the benchmark headlines.

Practical differenceGemini 3.1 Flash LiveGemini 2.5 Flash Live
Model IDgemini-3.1-flash-live-previewgemini-2.5-flash-native-audio-preview-12-2025
Launch / latest updatelaunched March 26, 2026latest update September 2025
Output token limit65,5368,192
Thinking controlthinkingLevel with minimal, low, medium, highthinkingBudget
Server event shapeone event can contain multiple parts at onceone part per event
Incremental text flowsend_client_content only seeds initial history; use send_realtime_input during live interactionsend_client_content works throughout the conversation
Default turn coverageTURN_INCLUDES_AUDIO_ACTIVITY_AND_ALL_VIDEOTURN_INCLUDES_ONLY_ACTIVITY
Async function callingnot supportedsupported
Proactive audionot supportedsupported
Affective dialognot supportedsupported
Best fit todaynew voice-agent builds and teams that want the newest model qualityexisting 2.5 stacks that still depend on the missing 2.5-only features

Two things jump out from that comparison.

First, 3.1 is better when your core problem is live conversation quality. Google's launch post says it improved tonal understanding, handles complex tasks more reliably, and performs better on audio-agent benchmarks. The output-token jump from 8,192 to 65,536 also gives 3.1 more room for richer turn handling and longer responses than the 2.5 model page advertised.

Second, 3.1 is not yet the safest choice for every existing 2.5 deployment. If your agent relied on behavior: NON_BLOCKING function declarations so the model could keep talking while tools ran in the background, that pattern does not carry over today. Google's capabilities guide is explicit that Gemini 3.1 Flash Live currently uses sequential tool calling only. The same is true if your product experience leaned on proactive audio or affective dialog. Those are 2.5-era capabilities that 3.1 has not brought forward yet.

So the decision is simple enough to use:

  • If you are starting from zero, build on 3.1.
  • If you already have a 2.5 production flow, migrate only after you check whether sync tools, changed event parsing, and missing 2.5-only features are acceptable.

Gemini 3.1 Flash Live and Gemini 2.5 Flash Live compared across thinking, events, tools, and missing features

Pricing is finally readable enough to model live voice cost

Google's pricing page is unusually useful here because it exposes both token-based and minute-based rates. For normal text models, developers often have to estimate everything from tokens. For a live voice product, per-minute pricing is much closer to how teams actually think about support calls, tutoring sessions, intake flows, and voice copilots.

MeterCurrent paid pricing
Text input\$0.75 / 1M tokens
Audio input\$3.00 / 1M tokens or \$0.005 / minute
Image/video input\$1.00 / 1M tokens or \$0.002 / minute
Text output\$4.50 / 1M tokens
Audio output\$12.00 / 1M tokens or \$0.018 / minute
Grounding with Google Search5,000 free prompts per month shared across Gemini 3, then \$14 / 1,000 search queries

That lets you do useful operator math immediately. If you assume a voice session with continuous audio flowing both ways, the official per-minute rates imply roughly $0.023 per minute of audio traffic alone. That means a 10-minute session is about $0.23 before you add search grounding, images, video, app overhead, or your own infrastructure. That is not a quote from Google. It is a direct calculation from the published \$0.005 / minute input-audio rate plus the \$0.018 / minute output-audio rate.

Search grounding is the other cost line that deserves more attention than it usually gets. At \$14 / 1,000 search queries after the free shared allowance, each grounded query is about $0.014. Five searches inside a live call adds around $0.07. That is still not expensive in absolute terms, but it becomes material if your product grounds aggressively on every turn.

The more dangerous cost trap is video. Google's 3.1 migration notes say the default turn coverage now includes all video frames instead of only detected activity. If your old 2.5 app kept a camera stream running continuously while voice was the real task, 3.1 can quietly meter more image/video input than you expected. That is why a low-latency audio agent should not casually stream video just because the API allows it.

One thing Google does not publish cleanly on the public docs page is exact Live-model RPM/RPD by tier. The rate-limits page tells developers to check AI Studio for their active quotas, and it also warns that preview models have tighter limits. That is the right place to be conservative. Use the public pricing page for cost shape, but use AI Studio for the current quota reality in your project.

The fastest working path is backend first, browser second

If you want the shortest safe path, start server-to-server with the GenAI SDK. The Live API overview says that is the default secure architecture, and the capabilities guide shows the basic connection pattern: open a Live session, request AUDIO, then send send_realtime_input events as audio, text, or video arrives.

python
import asyncio from google import genai client = genai.Client() MODEL = "gemini-3.1-flash-live-preview" async def main(): config = {"response_modalities": ["AUDIO"]} async with client.aio.live.connect(model=MODEL, config=config) as session: await session.send_realtime_input( text="Say hello and introduce yourself in one sentence." ) async for response in session.receive(): if response.server_content and response.server_content.model_turn: for part in response.server_content.model_turn.parts: if part.inline_data: audio_bytes = part.inline_data.data # Stream or play audio_bytes here. if response.text: print(response.text) if response.server_content and response.server_content.turn_complete: break asyncio.run(main())

When you move from a toy example to real audio, Google's docs say your input should be raw 16-bit PCM audio at 16kHz, little-endian, passed with a MIME type such as audio/pcm;rate=16000. Output audio comes back as 24kHz PCM. The same guide also says audio-only sessions are limited to 15 minutes, and audio-plus-video sessions are limited to 2 minutes, unless you adopt the session-management patterns Google documents for longer conversations.

If you need a browser-direct path, Google's answer is not "ship your API key to the frontend." It is ephemeral tokens. The ephemeral-token guide says they are currently only compatible with the Live API, which makes sense: this is exactly the kind of surface where teams want client-side latency without exposing a long-lived credential.

The important defaults are:

  • you usually get 1 minute to start a new session with the issued token
  • you usually get 30 minutes to keep sending messages over that connection
  • the browser uses the token as if it were an API key
  • if you need reconnect behavior, Google's guide points you to session resumption

That creates a clean deployment rule:

  • backend only: simplest secure path
  • browser direct: only after your backend mints ephemeral tokens

Gemini Live integration paths showing backend WebSocket flow and browser-direct flow with ephemeral tokens between backend and client

The migration traps that will waste your first day

The fastest way to lose time on Gemini 3.1 Flash Live is to assume that your 2.5 integration will fail loudly. Some parts will. Others will fail softly and just behave strangely.

1. Stop sending thinkingBudget.
Gemini 3.1 uses thinkingLevel instead. Google's migration notes say the default is minimal for low latency. If you keep tuning 2.5-style thinking budgets, you are tuning the wrong control surface.

2. Process every part in every server event.
The 3.1 docs say a single server event can now include multiple parts, such as audio chunks plus transcript content. Old code that assumed one part per message can silently drop data. This is one of the most important migration fixes because the session will still look "mostly fine" while losing valuable information.

3. Use send_realtime_input for live updates.
On 2.5, send_client_content could keep doing incremental work during the conversation. On 3.1, Google says it is only for seeding initial history when you configure initial_history_in_client_content. If your app updates the conversation live, switch that path to send_realtime_input.

4. Treat tool calls as blocking for now.
This is the biggest operational downgrade from 2.5 for agent builders. The capabilities guide is explicit: Gemini 3.1 Flash Live does not support non-blocking function behavior yet. The model waits for the tool response before it continues. If your old experience depended on background tools while the conversation kept moving, you need to redesign the tool loop or stay on 2.5 for longer.

5. Remove proactive-audio and affective-dialog config.
Those features are not yet supported on 3.1. Do not leave dead config in place and then debug phantom behavior.

6. Re-think video before you stream it.
Because the default turn coverage now includes all video frames, blindly forwarding a camera feed is a billing decision as much as a UX decision. If the user's voice is the real signal and video is only occasionally useful, gate video harder or send it conditionally.

7. If you want text, plan for transcription rather than a text-only response modality.
Google's docs are a little awkward here because the model page lists text and audio output, while the capabilities guide says native audio models only support AUDIO response modality. The safest practical reading is this: if text matters to your product, turn on transcription support and handle transcripts explicitly instead of assuming the model behaves like a normal text-first endpoint.

When Gemini 3.1 Flash Live is the wrong choice

A lot of developers will still pick 3.1 after reading this, and that is reasonable. But there are real cases where it is not the best next move.

If you are not building a real-time voice product, the Live API adds needless complexity.
WebSocket session management, PCM streaming, interruptions, and ephemeral-token logic are worth it for voice-first experiences. They are not worth it for ordinary multimodal chat flows that could run on a standard Gemini text model.

If your current 2.5 system depends on async tool use, 3.1 may feel worse even if the voice sounds better.
That is the cleanest reason to delay migration. Voice quality is not the only dimension that matters in a production agent.

If you cannot issue ephemeral tokens safely, do not connect the browser directly.
The whole point of the token guide is to avoid shipping a long-lived API key to untrusted clients. If your backend is not ready to provision short-lived auth, keep the Live session on your server for now.

If you only need speech synthesis, use a speech-generation model instead of a live dialogue model.
The Live model is designed for conversational turn handling, not just "convert text to audio as cheaply as possible." When dialogue intelligence is not the job, the larger Live contract is overkill.

Cost and session guardrails showing audio cost per minute, session-duration ceilings, grounded-search cost, and the video-frame warning

FAQ

What is the exact model string for Gemini 3.1 Flash Live?

Use gemini-3.1-flash-live-preview.

Is Gemini 3.1 Flash Live GA?

No. Google's model page and Live docs both label it as preview as of March 2026.

Yes. Google's docs say it supports function calling and Search grounding. But the current limitation is important: function calling is synchronous only.

Can I connect directly from a web app?

Yes, but Google's recommended production path is to mint ephemeral tokens on your backend and let the client use those tokens to open the Live API session. Do not expose a long-lived API key in the browser.

How long can sessions run?

Google's capabilities guide says audio-only sessions are limited to 15 minutes, and audio-plus-video sessions are limited to 2 minutes. If you need longer interactions, use the documented session-management and resumption patterns.

Does the model return text or only audio?

The model page lists text and audio output, but the Live API capabilities guide says native audio models use AUDIO response modality. If your application needs readable text, enable or process output audio transcription instead of assuming a plain text-first output contract.

Is output audio watermarked?

Yes. Google's launch post says all audio generated by Gemini 3.1 Flash Live is watermarked with SynthID.

What is the best migration rule in one sentence?

Start new voice-agent work on Gemini 3.1 Flash Live, but keep Gemini 2.5 Flash Live only if your current design still depends on async tools, proactive audio, or affective dialog.

Share:

laozhang.ai

One API, All AI Models

AI Image

Gemini 3 Pro Image

$0.05/img
80% OFF
AI Video

Sora 2 · Veo 3.1

$0.15/video
Async API
AI Chat

GPT · Claude · Gemini

200+ models
Official Price
Served 100K+ developers
|@laozhang_cn|Get $0.1