Realtime API vs Transcription Pipeline Cost: When Live Voice Is Worth the Spend

Jun 14, 2026 • 13 min read • API Guides

Realtime is worth paying for when live spoken interaction is the product. For transcripts, archives, QA, summaries, and compliance, start with a transcription-first route and model the variable components separately.

Audio route decision board for Realtime voice, live transcript, bounded transcript, pipeline, and hybrid options.

Realtime API is worth the extra cost when your product sells live spoken interaction: interruption, fast turn-taking, tool use during a call, and a voice response that feels present. If the product mainly needs a transcript, summary, archive, QA review, compliance trail, or analytics feed, start with a transcription-first route.

Pick the route before you read the price table:

Live spoken assistant: use gpt-realtime-2 when voice response quality, interruption, and low latency are the value.
Live transcript only: use gpt-realtime-whisper when the product needs streaming text but not a spoken assistant.
Bounded audio to text: use gpt-4o-transcribe or gpt-4o-mini-transcribe when the audio can be uploaded or processed as a request.
Custom pipeline: combine STT, a text model, optional TTS, telephony, storage, monitoring, and QA only when those components are actually needed.
Hybrid: reserve Realtime for high-value live moments and use transcription-first processing for archive, review, and summaries.

As of June 14, 2026, OpenAI lists gpt-realtime-whisper at $0.017/minute, gpt-4o-transcribe at an estimated $0.006/minute, and gpt-4o-mini-transcribe at an estimated $0.003/minute. By direct conversion, that is $1.02/hour, $0.36/hour, and $0.18/hour for duration-billed transcription. gpt-realtime-2 voice-agent sessions are different: audio input and output are token-billed, and one counted hour of user audio plus one generated hour of assistant audio creates a $5.76/hour media-token floor before text tokens, tool calls, repeated conversation history, optional input transcription, and pipeline or telephony components.

The stop rule is simple: do not pay for spoken assistant output when text is enough. The rest of the page gives the worksheet for deciding when low-latency speech is worth the extra spend, how Realtime costs grow, and which pipeline costs must stay variable until you verify them for your own stack.

Fast answer: choose by product job

The cost decision is not "Realtime API versus transcription pipeline" in the abstract. It is the product job first, then the OpenAI route, then the billing unit.

Product job	First route to price	Cost unit to model	Use it when	Stop rule
Live spoken assistant	`gpt-realtime-2`	audio and text tokens per Response	users need interruption, turn-taking, tool use, and spoken output during the session	if text output is enough, do not start here
Live transcript only	`gpt-realtime-whisper`	minutes of live audio	the product needs streaming text while someone is speaking	if audio can wait, price bounded transcription instead
Bounded audio to text	`gpt-4o-transcribe` or `gpt-4o-mini-transcribe`	minutes of submitted audio	files, recordings, uploads, post-call review, summaries, QA, or compliance can run after capture	if live deltas are required, use a realtime transcription route
Custom pipeline	STT -> text model -> optional TTS -> operations layer	separate component meters	teams need control, vendor mix, telephony fit, auditability, or component-level optimization	do not add TTS, telephony, or a text model unless the workflow needs them
Hybrid	Realtime for the live moment, transcription-first for the back office	combined meters	live help creates value, but archive, review, and analytics do not need spoken output	measure where the live session ends and back-office processing begins

That route table prevents the most common budget error. A voice agent and a transcript are both audio products, but they do not buy the same thing. The voice agent buys an interactive spoken loop. A transcription pipeline buys text and downstream processing.
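The route table can be read as a short decision function. The following is a minimal sketch, not an official selector: `ProductJob`, its field names, and `first_route_to_price` are hypothetical, and the order in which the checks run (spoken reply first, then component control, then live text) is an assumption about how to prioritize the stop rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductJob:
    """Hypothetical description of what the product must do with audio."""
    needs_spoken_reply: bool               # assistant speaks back during the session
    needs_live_text: bool                  # transcript deltas while someone is talking
    needs_back_office: bool = False        # archive, QA, summaries after the live part
    needs_component_control: bool = False  # vendor mix, telephony fit, auditability


def first_route_to_price(job: ProductJob) -> str:
    """Return the first route to price, following the table's stop rules."""
    if job.needs_spoken_reply:
        # Spoken output is the only reason to start with the voice-agent route.
        if job.needs_back_office:
            return "hybrid"
        return "gpt-realtime-2"
    if job.needs_component_control:
        return "custom pipeline"
    if job.needs_live_text:
        return "gpt-realtime-whisper"
    # Text is enough and the audio can wait: bounded transcription.
    return "gpt-4o-transcribe or gpt-4o-mini-transcribe"


captions = ProductJob(needs_spoken_reply=False, needs_live_text=True)
print(first_route_to_price(captions))
```

A live-captions product has no spoken reply, so the function never reaches the voice-agent route, which is the point of the stop rule.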

OpenAI audio routes use different cost units before any fair comparison.

Current OpenAI prices that matter

Use OpenAI's current price rows as anchors, then add your own measured workload. Checked on June 14, 2026, the OpenAI pricing page lists these rows for the routes in this decision:

Route	Current OpenAI price row	Hourly intuition	What it does not include
`gpt-realtime-whisper`	`$0.017/minute`	`$1.02/hour` by direct conversion	custom text model work, storage, monitoring, telephony, and non-OpenAI components
`gpt-4o-transcribe`	estimated `$0.006/minute`	`$0.36/hour` by direct conversion	post-processing, summaries, classification, storage, and orchestration
`gpt-4o-mini-transcribe`	estimated `$0.003/minute`	`$0.18/hour` by direct conversion	accuracy review, domain tuning, post-processing, and operations work
`gpt-realtime-2` audio input	`$32.00/1M audio input tokens`	`$1.152/hour` for one counted hour of user audio at 1 token per 100 ms	assistant audio, text, tools, history growth, optional transcription, and pipeline components
`gpt-realtime-2` audio output	`$64.00/1M audio output tokens`	`$4.608/hour` for one generated hour of assistant audio at 1 token per 50 ms	user audio, text, tools, history growth, optional transcription, and pipeline components

The duration-billed transcription math is direct:

```text
gpt-realtime-whisper:      60 minutes * $0.017  = $1.02/hour
gpt-4o-transcribe:         60 minutes * $0.006  = $0.36/hour
gpt-4o-mini-transcribe:    60 minutes * $0.003  = $0.18/hour
```
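The same conversion can be kept as a small helper so the rates live in one place. This is a sketch under the assumption that the minute rows quoted above are still current; `PER_MINUTE_USD` and `transcription_cost` are illustrative names, not part of any OpenAI SDK.

```python
# Per-minute rows as quoted in this article (checked June 14, 2026).
# Recheck them before use; these are inputs, not constants.
PER_MINUTE_USD = {
    "gpt-realtime-whisper": 0.017,
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
}


def transcription_cost(model: str, audio_minutes: float) -> float:
    """Duration-billed cost in USD for the given minutes of audio."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    return PER_MINUTE_USD[model] * audio_minutes


for model in PER_MINUTE_USD:
    print(f"{model}: ${transcription_cost(model, 60):.2f}/hour")
```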

The gpt-realtime-2 media-token floor is also straightforward, but it is only a floor:

```text
User audio input:       36,000 tokens/hour * $32 / 1,000,000 = $1.152/hour
Assistant audio output: 72,000 tokens/hour * $64 / 1,000,000 = $4.608/hour
Media-token floor:      $1.152 + $4.608 = $5.76/hour
```
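The floor is easier to reuse when user and assistant audio are separate inputs, because real sessions rarely contain a full hour of each. A minimal sketch, assuming the token rates and price rows quoted above; the function names are hypothetical.

```python
MS_PER_HOUR = 3_600_000


def audio_tokens(duration_hours: float, ms_per_token: int) -> float:
    """Convert an audio duration to tokens at a fixed ms-per-token rate."""
    return duration_hours * MS_PER_HOUR / ms_per_token


def media_token_floor(
    user_audio_hours: float,
    assistant_audio_hours: float,
    input_usd_per_1m: float = 32.0,   # gpt-realtime-2 audio input row quoted above
    output_usd_per_1m: float = 64.0,  # gpt-realtime-2 audio output row quoted above
) -> float:
    """Audio-token floor in USD. Excludes text, tools, history, and transcription."""
    input_tokens = audio_tokens(user_audio_hours, 100)      # 1 token per 100 ms
    output_tokens = audio_tokens(assistant_audio_hours, 50)  # 1 token per 50 ms
    return (input_tokens * input_usd_per_1m + output_tokens * output_usd_per_1m) / 1_000_000


print(f"${media_token_floor(1, 1):.2f}")      # one counted hour of each
print(f"${media_token_floor(0.5, 0.1):.4f}")  # a quieter, hypothetical session
```

Because the output row is priced higher and tokenized twice as densely, assistant speech time moves this number four times as fast as user talk time.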

Use that $5.76/hour number as a warning label, not as a quote. It assumes one counted hour of user audio and one generated hour of assistant audio, then excludes the parts that often move the real bill: text tokens, repeated conversation context, tools, optional input transcription, session design, telephony, and any pipeline outside the Realtime session.

Why Realtime voice-agent cost grows

The Realtime cost guide explains why a voice-agent session is not priced like a simple audio file. Cost accrues when a Response is created and is based on input and output tokens, with input transcription billed separately when enabled. OpenAI also notes that connections and network bandwidth are not currently billed, but that does not make the session free while it is open.

The practical drivers are:

User audio: counted at 1 token per 100 ms.
Assistant audio: counted at 1 token per 50 ms.
Response count: each Response is a new model generation event.
Conversation history: previous conversation content is sent again, so later turns can become more expensive.
Empty audio control: VAD can filter out empty input audio so it is not counted, unless the client manually adds that audio to the conversation.
Text and tool work: instructions, tool schemas, tool results, and text output still matter.
Optional input transcription: if enabled, it uses a separate transcription model and rate card.

Realtime voice-agent sessions are shaped by user audio, assistant audio, Responses, history growth, VAD, and input transcription.

This is why "Realtime costs X dollars per hour" is usually too coarse for launch planning. A quiet 60-minute support call with short assistant replies can have a very different bill from a 20-minute tutoring session where the assistant speaks often, uses tools, and carries long instructions and a growing history through many Responses.
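To see how history growth changes the per-turn bill, here is a deliberately simplified sketch. It is a pessimistic toy model, not OpenAI's billing logic, and it rests on several assumptions: every Response re-reads all earlier audio as input, earlier assistant audio re-enters at the same token count it was generated with, no discount for cached input applies, and text, tool, and instruction tokens are ignored. `per_response_audio_cost` is a hypothetical name.

```python
def per_response_audio_cost(
    turns: list[tuple[float, float]],
    input_usd_per_1m: float = 32.0,
    output_usd_per_1m: float = 64.0,
) -> list[float]:
    """Audio-only cost of each Response under the toy model described above.

    turns: (user_seconds, assistant_seconds) pairs, one pair per Response.
    """
    history_tokens = 0.0
    costs = []
    for user_s, assistant_s in turns:
        user_tokens = user_s * 10            # 1 token per 100 ms
        assistant_tokens = assistant_s * 20  # 1 token per 50 ms
        # Assumption: all earlier audio is billed again as input on this Response.
        input_tokens = history_tokens + user_tokens
        cost = (input_tokens * input_usd_per_1m
                + assistant_tokens * output_usd_per_1m) / 1_000_000
        costs.append(cost)
        history_tokens += user_tokens + assistant_tokens
    return costs


costs = per_response_audio_cost([(10, 10)] * 5)
for turn, cost in enumerate(costs, start=1):
    print(f"Response {turn}: ${cost:.4f}")
```

Even with identical turns, each Response costs more than the one before it in this model, which is why late-session history size belongs in the pilot metrics.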

You can reduce waste without changing the product route:

Keep VAD configured and avoid manually adding silence, so empty audio is not counted as input.
End or summarize long sessions when the old history no longer improves the next answer.
Keep tool schemas and system instructions tight.
Avoid assistant monologues when a short spoken answer is enough.
Enable input transcription only when the product needs a transcript from inside the Realtime session.
Measure real Response count and assistant-audio duration before committing to a public margin.

The upside is also real. gpt-realtime-2 buys a native spoken interaction loop: low latency, interruption handling, and voice response quality in one session. If those traits change conversion, containment, accessibility, or completion rate, the extra cost can be the product value rather than waste.

Transcription-first pipeline cost model

A transcription-first pipeline starts cheaper because the first paid step can be plain speech-to-text. For files and bounded audio, the Speech to text guide points to transcription routes such as gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize; file uploads are capped at 25 MB and are a different shape from live transcript deltas. For live text without a spoken assistant, the Realtime transcription guide maps the job to gpt-realtime-whisper.

Do not stop the worksheet at STT if the product does more than produce text. Count the stack honestly:

Component	Count as	Price treatment
Audio capture and transport	app, browser, WebRTC, telephony, or recording infrastructure	variable; depends on your product stack
STT	`gpt-realtime-whisper`, `gpt-4o-transcribe`, or `gpt-4o-mini-transcribe`	official OpenAI minute rows can anchor the estimate
Text model	summarization, extraction, routing, QA, coaching, moderation, or agent logic	variable; price the actual model, tokens, cache, and retries
Optional TTS	speech output after text processing	variable unless separately verified for the launch route
Telephony	PSTN, SIP, call recording, carrier fees, phone numbers, compliance features	variable; verify provider and region
Storage and retrieval	audio files, transcripts, embeddings, logs, retention policy	variable; count retention and privacy requirements
Monitoring and QA	human review, audits, metrics, failure replay, alerting	variable; often larger than the STT row in regulated workflows

A transcription-first pipeline starts with STT and adds text model, optional TTS, telephony, storage, monitoring, and stop rules.

The pipeline can still be cheaper and more predictable than a live voice-agent session, especially when the product does not need spoken output. It can also be easier to debug because each component produces an artifact: audio, transcript, text-model output, summary, classification, or audit log.
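One way to keep the stack honest is to refuse a total until every variable component has a verified number. The sketch below assumes that convention; `PipelineEstimate` and its fields are hypothetical, and only the STT rate comes from the price rows quoted in this article.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PipelineEstimate:
    """Hypothetical worksheet row set for a transcription-first pipeline."""
    audio_minutes: float
    stt_usd_per_minute: float = 0.006  # gpt-4o-transcribe row quoted in this article
    # None means "not verified yet"; 0.0 means "verified as not needed".
    variable_usd: dict[str, Optional[float]] = field(default_factory=dict)

    def unverified(self) -> list[str]:
        """Names of components that still have no verified cost."""
        return sorted(name for name, usd in self.variable_usd.items() if usd is None)

    def total(self) -> float:
        """STT plus all variable components; fails while anything is unverified."""
        missing = self.unverified()
        if missing:
            raise ValueError(f"unverified components: {missing}")
        return self.audio_minutes * self.stt_usd_per_minute + sum(self.variable_usd.values())


estimate = PipelineEstimate(
    audio_minutes=600,
    variable_usd={"text_model": 1.50, "tts": 0.0, "storage": None},  # made-up figures
)
print(estimate.unverified())
estimate.variable_usd["storage"] = 0.25
print(f"${estimate.total():.2f}")
```

Treating an unverified component as an error rather than as zero stops a cheap STT row from standing in for the whole pipeline.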

The tradeoff is product latency and integration load. A cascaded STT -> text model -> TTS design has more moving parts. Streaming each component can reduce delay, but it does not make the pipeline the same as a native spoken session. Choose it because the workflow is text-centered or because component control matters, not because it sounds universally cheaper.

When live voice is worth paying for

Pay for gpt-realtime-2 when spoken interaction changes user behavior during the session. Good candidates include:

sales or onboarding calls where interruption and turn-taking affect completion
tutoring, coaching, or accessibility flows where the user should not wait for a transcript-first loop
voice agents that call tools while the user is still in the conversation
consumer voice experiences where speech quality and low latency are the interface
support containment where a completed spoken resolution avoids a human escalation

The pilot question is not "Is Realtime more expensive than STT?" It is "Does live speech produce enough value to justify the additional meter?" Measure:

Metric	Why it matters
real user talk time	drives counted user audio
assistant speech time	drives audio output and often dominates the media-token floor
Responses per session	controls how often the model generates
average history size by turn	reveals late-session cost growth
tool calls per session	captures hidden text and tool context
containment or completion lift	tells you whether the voice loop pays back
human fallback rate	catches cases where a cheaper transcript route would have worked

If those metrics show that live spoken interaction improves the business or user outcome, Realtime can be the correct spend even when a transcription route has a lower raw hourly row.
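The pilot question reduces to a break-even check: how much completion or containment lift does the extra meter need to buy? A minimal sketch, assuming each additional completed session is worth a fixed dollar amount to the business; `break_even_lift` and the example figures are hypothetical.

```python
def break_even_lift(
    realtime_usd_per_session: float,
    alternative_usd_per_session: float,
    value_per_completion_usd: float,
) -> float:
    """Minimum lift in completion rate (as a fraction) that pays for the live route."""
    if value_per_completion_usd <= 0:
        raise ValueError("value_per_completion_usd must be positive")
    extra = max(realtime_usd_per_session - alternative_usd_per_session, 0.0)
    return extra / value_per_completion_usd


# Made-up pilot numbers: $0.90 measured per Realtime session,
# $0.06 for a transcript-first session, $20 of value per completed session.
lift = break_even_lift(0.90, 0.06, 20.0)
print(f"Realtime pays back above a {lift:.1%} completion lift")
```

If the measured lift in the pilot is below this threshold, the cheaper route wins on cost even though the voice experience may feel better.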

When transcription-first wins

Start transcription-first when text is the product artifact. Typical winners include meeting summaries, call QA, compliance review, searchable archives, support analytics, coaching notes, medical or legal intake drafts that need review, asynchronous voice notes, and post-call classification.

The stop rules are practical:

If the user does not need the assistant to speak back, do not pay for assistant audio output.
If the transcript can arrive after the audio ends, compare bounded transcription before live transcription.
If the product needs live captions but not a spoken assistant, price gpt-realtime-whisper before gpt-realtime-2.
If summaries and classifications run after capture, budget them as text-model work outside the STT row.
If the workflow needs a compliance trail, pipeline artifacts can be easier to inspect than a live spoken loop alone.

This route also gives teams finer control over quality gates. You can store the original audio, rerun transcription, compare model outputs, inspect prompt changes, and batch non-urgent work. For many operations teams, that control is worth more than shaving a few seconds from the transcript path.

Hybrid budget pattern

Hybrid is often the best production shape. Use Realtime for the live segment where speech changes the result, then use transcription-first processing for the parts that do not need spoken output.

A simple hybrid sequence looks like this:

Start a gpt-realtime-2 session only for the live interaction that needs interruption, turn-taking, and speech.
Capture session metadata: user audio duration, assistant audio duration, Response count, tool calls, and whether input transcription was enabled.
Store or export the transcript artifact only if the product and privacy policy need it.
Run post-call summaries, QA, compliance classification, analytics, and search indexing through text or transcription-first routes.
Review sample sessions weekly until the boundary between the live Realtime segment and back-office processing is stable.

This keeps the most expensive route attached to the part of the product that actually needs it. A live onboarding agent might use Realtime for the call, then use lower-cost post-processing for CRM notes and QA. A call-center monitor might use live transcription for supervisor visibility, then run summaries and compliance checks after the call. A voice note app might avoid Realtime entirely because it only needs accurate text and a clean summary after recording.
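The session metadata from the sequence above is enough to split a hybrid budget into its live and back-office parts. This sketch assumes the recorded audio is transcribed again after the call with `gpt-4o-mini-transcribe`, which is a design choice rather than a requirement; `LiveSession` and `hybrid_floor` are hypothetical names, and the result is still a floor that excludes text, tool, and history costs.

```python
from dataclasses import dataclass


@dataclass
class LiveSession:
    """Hypothetical per-session record captured during the pilot."""
    user_audio_s: float
    assistant_audio_s: float
    responses: int
    tool_calls: int
    input_transcription: bool
    recorded_audio_min: float  # audio sent to post-call transcription


def hybrid_floor(sessions: list[LiveSession], stt_usd_per_min: float = 0.003) -> dict[str, float]:
    """Split the floor into the Realtime media part and the back-office STT part."""
    live = sum(
        s.user_audio_s / 3600 * 1.152          # audio input floor per counted hour
        + s.assistant_audio_s / 3600 * 4.608   # audio output floor per generated hour
        for s in sessions
    )
    back_office = sum(s.recorded_audio_min for s in sessions) * stt_usd_per_min
    return {"live_media_floor": live, "back_office_stt": back_office}


pilot = [LiveSession(1800, 900, responses=24, tool_calls=3,
                     input_transcription=False, recorded_audio_min=45)]
print(hybrid_floor(pilot))
```

Keeping the two totals separate shows how much of the spend is attached to the live moment and how much to processing that never needed spoken output.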

Budget worksheet

Use three estimates: transcription-only, Realtime media floor, and full pipeline. Then replace the placeholders with measured pilot data.

Worksheet line	Formula
Live transcript-only cost	`live_audio_minutes * $0.017` for `gpt-realtime-whisper`
Bounded high-accuracy transcription cost	`audio_minutes * $0.006` for `gpt-4o-transcribe`
Bounded low-cost transcription cost	`audio_minutes * $0.003` for `gpt-4o-mini-transcribe`
Realtime user-audio floor	`user_audio_hours * $1.152` for `gpt-realtime-2` audio input
Realtime assistant-audio floor	`assistant_audio_hours * $4.608` for `gpt-realtime-2` audio output
Realtime media-token floor	`user_audio_floor + assistant_audio_floor`
Realtime session estimate	`media floor + text tokens + tool tokens + history growth + optional input transcription`
Pipeline estimate	`STT + text model + optional TTS + telephony + storage + monitoring + QA`

Run the worksheet with low, typical, and high usage. For Realtime, change assistant speech time and Response count before changing user talk time; assistant output and repeated context often reveal the real margin risk. For a transcription-first route, change audio duration, text-model output length, retry rate, storage retention, and human-review load.
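A low, typical, and high run of the Realtime floor might look like the sketch below. The scenario values are placeholders invented for illustration, not measurements; replace them with pilot data. `realtime_floor` and `SCENARIOS` are hypothetical names, and the hourly constants are the worksheet rows above.

```python
def realtime_floor(user_audio_hours: float, assistant_audio_hours: float) -> float:
    """Media-token floor in USD from the two worksheet rows above."""
    return user_audio_hours * 1.152 + assistant_audio_hours * 4.608


# Placeholder (user_audio_hours, assistant_audio_hours) per hour of session time.
SCENARIOS = {
    "low": (0.20, 0.05),
    "typical": (0.30, 0.15),
    "high": (0.40, 0.35),
}

for name, (user_h, assistant_h) in SCENARIOS.items():
    floor = realtime_floor(user_h, assistant_h)
    print(f"{name:>8}: ${floor:.4f} per session-hour before text, tools, and history")
```

In these placeholder scenarios user talk time only doubles between low and high, yet the floor more than quadruples because assistant speech grows sevenfold.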

Before launch, capture evidence from real pilot sessions:

median and p95 user audio duration
median and p95 assistant audio duration
Response count by session
average conversation history size by late turn
input transcription usage and model
text tokens used by tools, summaries, and follow-up processing
failed or retried sessions
human review minutes per transcript
current OpenAI price rows on the launch date
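Median and p95 can be computed from the pilot log with the Python standard library. A minimal sketch, assuming durations are recorded in seconds; the sample values are made up, and `median_and_p95` is a hypothetical helper. It uses the inclusive quantile method so the p95 never exceeds the largest observed value.

```python
import statistics


def median_and_p95(values: list[float]) -> tuple[float, float]:
    """Return (median, p95) for a list with at least two observations."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(values, n=20, method="inclusive")[-1]
    return statistics.median(values), p95


# Made-up assistant audio durations in seconds, one per pilot session.
assistant_audio_s = [60, 120, 180, 240, 600]
median, p95 = median_and_p95(assistant_audio_s)
print(f"median={median}s p95={p95}s")
```

Budgeting against the p95 rather than the median keeps a handful of long, talkative sessions from erasing the margin.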

Recheck prices on the day you deploy. Model IDs, availability, minute rows, token rows, and account-specific access can change, and old calculators age quickly.

For adjacent cost-methodology reading, see the broader API budgeting guide in Claude API vs OpenAI API Pricing. If you are comparing standalone speech-to-text products beyond OpenAI, the Grok Speech-to-Text API guide shows a separate STT route-selection pattern.

FAQ

Is Realtime API always more expensive than a transcription pipeline?

No. Realtime voice-agent sessions usually have a higher cost floor than plain transcription, but the better question is whether live spoken interaction creates value. If the product needs interruption, low latency, tool use during the call, and spoken output, Realtime can be worth paying for. If the product needs text artifacts, a transcription-first route usually starts cheaper and is easier to budget.

Is `gpt-realtime-whisper` the same as using `gpt-realtime-2`?

No. gpt-realtime-whisper is for live transcription-only workflows that need transcript deltas without spoken assistant output. gpt-realtime-2 is the Realtime voice-agent route for live spoken assistant sessions. Treating both as one "Realtime API" row is the source of many bad cost comparisons.

Why not call `$5.76/hour` the Realtime hourly price?

Because it is only the media-token floor for one counted hour of user audio plus one generated hour of assistant audio at the current gpt-realtime-2 audio token rows. It excludes text tokens, repeated conversation history, tools, optional input transcription, special tokens, and any telephony or pipeline components outside the Realtime session.

Which route should I use for live captions?

Start with realtime transcription-only. The OpenAI route to price is gpt-realtime-whisper when the product needs live transcript deltas but not a spoken assistant. If the audio can wait until recording ends, compare gpt-4o-transcribe and gpt-4o-mini-transcribe instead.

Does a custom STT -> LLM -> TTS pipeline always beat Realtime?

No. A pipeline can be cheaper for text-centered work and more controllable for compliance, telephony, debugging, and vendor mix. It also adds integration work and component latency. If the user experience depends on natural interruption and spoken response quality, a native Realtime voice-agent session may be the better route even with a higher meter.

What is the safest production rule?

Pick the route first, use current official OpenAI rows for the parts OpenAI prices directly, label everything else as a variable, and pilot with real sessions before setting margin. Do not pay for spoken assistant output when text is enough, and do not call a media-token floor a final bill.
