Operations & Automation

AI Voice Agents: When to Build and When to Skip

By Ashit Vora · 7 min read

What Matters

  • Production voice AI requires a streaming pipeline that runs STT, LLM inference, and TTS concurrently to achieve sub-500ms response times.
  • The STT-to-TTS pipeline has three latency bottlenecks - speech-to-text transcription, LLM reasoning, and text-to-speech synthesis - and each must be optimized independently.
  • Users tolerate up to 800ms of silence before conversations feel unnatural; above that, completion rates drop significantly.
  • Handling interruptions (barge-in detection) and natural turn-taking are harder engineering problems than the core speech pipeline.

AI voice agents handle phone calls, voice commands, and spoken interactions autonomously. Unlike text-based AI agents that process typed input, voice agents add two demanding constraints: sub-second latency and natural-sounding speech. Get either wrong and the conversation feels broken.

TL;DR
AI voice agents use a three-stage pipeline: speech-to-text (STT), LLM reasoning, and text-to-speech (TTS). The total round-trip must stay under 800ms to feel natural. Current top-performing systems achieve 400-600ms. The biggest technical challenges are reducing latency at each stage, handling interruptions (barge-in), and managing turn-taking. Voice AI is production-ready for structured conversations (appointment booking, order status) but still struggles with open-ended, emotionally complex calls.

Voice Agent Pipeline

  • Stage 1: Speech-to-Text (STT), 100-300ms. Audio is captured and converted to text using streaming providers like Deepgram (100-200ms streaming), Google Speech-to-Text, or Whisper.
  • Stage 2: LLM Reasoning, 200-500ms. Transcribed text goes to the LLM, which generates a response based on conversation context, tools, and instructions. Streaming starts sending text to TTS as the first tokens arrive.
  • Stage 3: Text-to-Speech (TTS), 100-200ms. The LLM response is converted to natural-sounding speech using ElevenLabs (highest quality), Cartesia (low latency), or PlayHT. Audio streams out as sentences are generated.

The Voice Pipeline

Every voice agent follows the same three-stage pipeline:

Stage 1: Speech-to-Text (STT)

The user speaks. The audio is captured and converted to text. This stage takes 100-300ms depending on the provider and whether you use streaming.

Key providers: Deepgram (fastest, 100-200ms streaming), Google Speech-to-Text, Azure Speech Services, AssemblyAI, Whisper (open source, higher latency).

Optimization: Use streaming STT. Instead of waiting for the user to finish speaking, process audio in chunks as it arrives. Deepgram's streaming API starts returning partial transcripts within 100ms.
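A minimal sketch of the chunked approach: raw PCM is framed into small fixed-size chunks and sent as it arrives rather than after end of speech. The `send_chunk` callable stands in for a provider call (such as a send on Deepgram's streaming websocket); the exact API depends on the SDK you use.

```python
import asyncio

CHUNK_MS = 20          # frame size sent to the STT provider
SAMPLE_RATE = 16000    # 16 kHz mono PCM
BYTES_PER_SAMPLE = 2   # 16-bit audio

def frame_audio(pcm: bytes, chunk_ms: int = CHUNK_MS) -> list[bytes]:
    """Split raw PCM into fixed-size frames for streaming STT."""
    frame_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]

async def stream_to_stt(pcm: bytes, send_chunk) -> None:
    """Send frames as they arrive instead of waiting for end of speech.
    Partial transcripts come back asynchronously on a separate task."""
    for frame in frame_audio(pcm):
        await send_chunk(frame)  # hypothetical provider call
```

In production the frames come from a live microphone or telephony stream rather than a buffer, but the framing logic is the same.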

Stage 2: LLM Reasoning

The transcribed text goes to the LLM. The LLM generates a response based on the conversation context, tools, and instructions. This is the slowest stage - typically 200-500ms for the first tokens.

Optimization: Use streaming LLM responses. Start sending text to the TTS engine as soon as the first tokens arrive, not after the full response is generated. This overlaps Stage 2 and Stage 3.
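One way to implement this overlap is to re-chunk the token stream into sentences, so TTS can start synthesizing the first sentence while the LLM is still generating the rest. A minimal sketch, where `tokens` stands in for any streaming LLM response:

```python
import re

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_tokens(tokens):
    """Re-chunk a streaming token iterator into complete sentences
    so TTS can begin before the full LLM response is generated."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would be handed straight to the TTS engine (e.g. `for s in sentences_from_tokens(stream): tts.speak(s)`), collapsing Stages 2 and 3 into one overlapping window.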

Stage 3: Text-to-Speech (TTS)

The LLM's text response is converted to speech audio. Modern TTS engines produce natural-sounding speech with emotion and pacing.

Key providers: ElevenLabs (highest quality), Cartesia (low latency), PlayHT, Azure Neural Voices, Google WaveNet.

Optimization: Stream TTS output. Start playing audio from the first sentence while later sentences are still being generated.

Total Pipeline Latency

Stage | Standard | Optimized (Streaming)
STT | 300-500ms | 100-200ms
LLM | 500-2,000ms | 200-400ms (first tokens)
TTS | 200-500ms | 100-200ms (first audio)
Total | 1,000-3,000ms | 400-800ms

The optimized pipeline achieves latency comparable to human conversation pauses (300-700ms between turns).

400-800ms: optimized voice pipeline latency, down from 1,000-3,000ms with non-streaming approaches.


Handling Interruptions (Barge-In)

Humans interrupt each other constantly. A voice agent must handle interruptions gracefully:

  1. Detect the interruption: Monitor the audio input while the agent is speaking. When the user starts talking, the agent should stop.
  2. Stop playback: Immediately cease the current TTS output.
  3. Process the interruption: Send the new user speech through the STT pipeline.
  4. Discard unspoken text: If the LLM generated text that wasn't spoken yet, decide whether to discard it or save it for later.
  5. Respond to the new input: Generate a new response that acknowledges the context shift.
Key Insight
Barge-in detection is one of the hardest problems in voice AI. False positives (background noise triggering interruption) break the conversation flow. False negatives (missing a real interruption) make the agent seem unresponsive.
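The interruption flow above can be sketched as a small state machine. This is an illustrative skeleton, not a production VAD: `player` and `pipeline` are hypothetical objects, and the debounce threshold is an assumption.

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInHandler:
    """Minimal barge-in sketch: debounce short bursts as noise,
    stop playback and re-route audio on a real interruption."""

    def __init__(self, player, pipeline, min_speech_ms: int = 250):
        self.player = player              # hypothetical TTS playback handle
        self.pipeline = pipeline          # hypothetical STT pipeline handle
        self.min_speech_ms = min_speech_ms  # debounce against background noise
        self.state = AgentState.LISTENING

    def on_user_speech(self, duration_ms: int) -> bool:
        """Called by voice-activity detection while the agent speaks.
        Returns True if the speech was treated as a real interruption."""
        if self.state is not AgentState.SPEAKING:
            return False
        if duration_ms < self.min_speech_ms:
            return False               # likely a false positive; keep talking
        self.player.stop()             # step 2: cease TTS playback
        self.state = AgentState.LISTENING
        self.pipeline.transcribe()     # step 3: route new speech to STT
        return True
```

The `min_speech_ms` debounce is the crude lever between false positives and false negatives; real systems combine it with energy thresholds and speaker separation.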

Turn-Taking

Knowing when the user has finished speaking is surprisingly difficult. Silence alone isn't a reliable signal - people pause mid-sentence to think.

Approaches to end-of-turn detection:

  • Silence duration: Wait for 500-700ms of silence. Simple but causes awkward pauses.
  • Prosodic cues: Detect falling pitch and slowing tempo that signal sentence endings. More natural but harder to implement.
  • Semantic analysis: Use the STT transcript to predict whether the sentence is complete. Most accurate but adds latency.
  • Hybrid: Combine silence detection with semantic completeness checking. Best results in practice.

The ideal system adapts its turn-taking behavior to the conversation. During rapid exchanges, use shorter silence thresholds. During complex questions, wait longer.
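A hybrid check like the one described can be sketched as a function that adapts the silence threshold to how complete the transcript looks. The keyword heuristic and thresholds here are illustrative stand-ins for the small classifier a production system would use.

```python
def is_end_of_turn(transcript: str, silence_ms: int,
                   base_threshold_ms: int = 600) -> bool:
    """Hybrid end-of-turn detection: silence duration combined with a
    cheap semantic-completeness heuristic over the live transcript."""
    text = transcript.rstrip()
    looks_complete = text.endswith((".", "!", "?"))
    trailing_filler = text.lower().endswith(("and", "but", "so", "um", "uh"))
    threshold = base_threshold_ms
    if looks_complete:
        threshold = 300    # sentence reads finished: respond sooner
    elif trailing_filler:
        threshold = 1200   # user is probably still thinking: wait longer
    return silence_ms >= threshold
```

The same shape extends naturally to the adaptive behavior above: shrink `base_threshold_ms` during rapid exchanges and grow it for complex questions.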

"Latency is the make-or-break metric in voice AI. We've seen demos that looked impressive at 600ms fall apart in production at 1,200ms because nobody optimized the STT-to-LLM handoff. Get the pipeline right before you worry about voice selection or personality." - Ashit Vora, Captain at 1Raft

Voice Quality and Personality

The voice your agent uses defines its personality. Consider:

Voice selection: Choose a voice that matches your brand and use case. A healthcare appointment system needs a calm, reassuring voice. A sales agent needs an energetic, warm voice.

Pacing and prosody: Control speaking speed, pause duration, and emphasis. Slower for important information. Faster for routine confirmations. Pauses before key numbers.

Emotional range: Modern TTS engines support emotional modulation. The agent should sound empathetic during complaints, confident during problem resolution, and warm during greetings.

Consistency: Use the same voice across all interactions. Switching voices between calls breaks trust.

Production Considerations

Call Recording and Compliance

Record all calls with appropriate disclosure. Many jurisdictions require informing callers they're speaking with AI. Build the disclosure into the greeting: "Hi, this is an AI assistant from [Company]. How can I help you?"

Fallback to Human

Always provide a path to a human agent. "Let me connect you with a team member" should be triggered by: low confidence in understanding, user frustration signals, complex requests beyond the agent's scope, or explicit user request.
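The four triggers above reduce to a simple policy check run on every turn. The confidence and frustration thresholds here are assumptions for illustration, not benchmarked values.

```python
def should_escalate(confidence: float, frustration_signals: int,
                    in_scope: bool, user_asked_for_human: bool) -> bool:
    """Return True when the call should be handed to a human agent."""
    if user_asked_for_human:
        return True                 # explicit request always wins
    if not in_scope:
        return True                 # request beyond the agent's scope
    if confidence < 0.6:            # low STT/NLU confidence (assumed cutoff)
        return True
    if frustration_signals >= 2:    # e.g. raised voice, repeated rephrasing
        return True
    return False
```

Keeping this as one pure function makes the escalation policy easy to audit and tune without touching the pipeline code.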

Telephony Integration

Voice agents connect to phone systems via SIP trunking or WebRTC. Providers like Twilio, Vonage, and Retell handle the telephony layer. Your AI agent handles the conversation logic while the telephony provider handles the transport.

Cost Per Call

  • STT: $0.005-0.02 per minute
  • LLM: $0.01-0.10 per call (depending on conversation length and model)
  • TTS: $0.01-0.05 per minute
  • Telephony: $0.01-0.05 per minute
  • Total: $0.05-0.25 per minute, or $0.25-1.50 for a typical 5-minute call
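Plugging midpoints of the ranges above into a per-call model (illustrative defaults, not provider quotes):

```python
def call_cost(minutes: float, stt_per_min: float = 0.01,
              tts_per_min: float = 0.03, telephony_per_min: float = 0.03,
              llm_per_call: float = 0.05) -> float:
    """Estimate per-call cost: per-minute components plus a flat LLM cost."""
    per_minute = stt_per_min + tts_per_min + telephony_per_min
    return round(minutes * per_minute + llm_per_call, 2)
```

With these midpoints a 5-minute call comes out at $0.40, inside the $0.25-$1.50 range above; swap in your actual provider rates to budget real traffic.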

Compare to human agent cost of $1-3 per minute (fully loaded), and the economics are compelling for high-volume use cases. Gartner predicts conversational AI will reduce contact center agent labor costs by $80 billion by 2026 - and voice is the channel where the bulk of that shift happens. This is why AI customer service agents using voice are growing faster than text-only alternatives.


Where Voice AI Works Today

High-confidence use cases:

  • Appointment scheduling and confirmation
  • Order status and tracking
  • Restaurant reservations
  • Payment reminders
  • Survey collection
  • After-hours call handling

Emerging use cases:

  • Technical support with troubleshooting flows
  • Sales qualification calls
  • Insurance claims intake
  • Healthcare symptom triage

Not ready yet:

  • Emotionally sensitive calls (collections, bad news delivery)
  • Complex negotiations
  • Unstructured conversations with no clear goal

Gartner predicts agentic AI will autonomously resolve 80% of common customer service issues without human intervention by 2029 - including voice interactions. The "not ready yet" list will shrink significantly over the next 24 months.

Voice AI is advancing fast. The latency and quality gaps are closing each quarter. For structured, high-volume conversations, it is production-ready today.

At 1Raft, we have built production voice agents handling thousands of daily calls across hospitality and fintech. The pattern that works: start with a structured use case (appointment booking, order status), nail sub-500ms latency, then expand scope. Our AI agent development team ships voice-capable agents in 12-week sprints, with latency optimization as a core engineering focus from day one.
