Operations & Automation

AI Voice Agents: When to Build and When to Skip

By Ashit Vora · 7 min read

What Matters

  • Production voice AI requires a streaming pipeline that runs STT, LLM inference, and TTS concurrently to achieve sub-500ms response times.
  • The STT-to-TTS pipeline has three latency bottlenecks - speech-to-text transcription, LLM reasoning, and text-to-speech synthesis - and each must be optimized independently.
  • Users tolerate up to 800ms of silence before conversations feel unnatural; above that, completion rates drop significantly.
  • Handling interruptions (barge-in detection) and natural turn-taking are harder engineering problems than the core speech pipeline.

AI voice agents handle phone calls, voice commands, and spoken interactions autonomously. Unlike text-based AI agents that process typed input, voice agents add two demanding constraints: sub-second latency and natural-sounding speech. Get either wrong and the conversation feels broken.

TL;DR
AI voice agents use a three-stage pipeline: speech-to-text (STT), LLM reasoning, and text-to-speech (TTS). The total round-trip must stay under 800ms to feel natural. Current top-performing systems achieve 400-600ms. The biggest technical challenges are reducing latency at each stage, handling interruptions (barge-in), and managing turn-taking. Voice AI is production-ready for structured conversations (appointment booking, order status) but still struggles with open-ended, emotionally complex calls.

Voice Agent Pipeline

  • Stage 1: Speech-to-Text (STT), 100-300ms. Audio is captured and converted to text using streaming providers like Deepgram (100-200ms streaming), Google Speech-to-Text, or Whisper.
  • Stage 2: LLM Reasoning, 200-500ms. Transcribed text goes to the LLM, which generates a response based on conversation context, tools, and instructions. Streaming starts sending text to TTS as the first tokens arrive.
  • Stage 3: Text-to-Speech (TTS), 100-200ms. The LLM response is converted to natural-sounding speech using ElevenLabs (highest quality), Cartesia (low latency), or PlayHT. Audio streams out as sentences are generated.

The Voice Pipeline

Every voice agent follows the same three-stage pipeline:

Stage 1: Speech-to-Text (STT)

The user speaks. The audio is captured and converted to text. This stage takes 100-300ms depending on the provider and whether you use streaming.

Key providers: Deepgram (fastest, 100-200ms streaming), Google Speech-to-Text, Azure Speech Services, AssemblyAI, Whisper (open source, higher latency).

Optimization: Use streaming STT. Instead of waiting for the user to finish speaking, process audio in chunks as it arrives. Deepgram's streaming API starts returning partial transcripts within 100ms.
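A minimal sketch of the chunked approach: raw PCM is framed into small fixed-size chunks and sent as it arrives rather than after end of speech. The `send_chunk` callable stands in for a provider call (such as a send on Deepgram's streaming websocket); the exact API depends on the SDK you use.

```python
import asyncio

CHUNK_MS = 20          # frame size sent to the STT provider
SAMPLE_RATE = 16000    # 16 kHz mono PCM
BYTES_PER_SAMPLE = 2   # 16-bit audio

def frame_audio(pcm: bytes, chunk_ms: int = CHUNK_MS) -> list[bytes]:
    """Split raw PCM into fixed-size frames for streaming STT."""
    frame_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]

async def stream_to_stt(pcm: bytes, send_chunk) -> None:
    """Send frames as they arrive instead of waiting for end of speech.
    Partial transcripts come back asynchronously on a separate task."""
    for frame in frame_audio(pcm):
        await send_chunk(frame)  # hypothetical provider call
```

In production the frames come from a live microphone or telephony stream rather than a buffer, but the framing logic is the same.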

Stage 2: LLM Reasoning

The transcribed text goes to the LLM. The LLM generates a response based on the conversation context, tools, and instructions. This is the slowest stage - typically 200-500ms for the first tokens.

Optimization: Use streaming LLM responses. Start sending text to the TTS engine as soon as the first tokens arrive, not after the full response is generated. This overlaps Stage 2 and Stage 3.
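One way to implement this overlap is to re-chunk the token stream into sentences, so TTS can start synthesizing the first sentence while the LLM is still generating the rest. A minimal sketch, where `tokens` stands in for any streaming LLM response:

```python
import re

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_tokens(tokens):
    """Re-chunk a streaming token iterator into complete sentences
    so TTS can begin before the full LLM response is generated."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would be handed straight to the TTS engine (e.g. `for s in sentences_from_tokens(stream): tts.speak(s)`), collapsing Stages 2 and 3 into one overlapping window.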

Stage 3: Text-to-Speech (TTS)

The LLM's text response is converted to speech audio. Modern TTS engines produce natural-sounding speech with emotion and pacing.

Key providers: ElevenLabs (highest quality), Cartesia (low latency), PlayHT, Azure Neural Voices, Google WaveNet.

Optimization: Stream TTS output. Start playing audio from the first sentence while later sentences are still being generated.

Total Pipeline Latency

Stage | Standard | Optimized (Streaming)
STT | 300-500ms | 100-200ms
LLM | 500-2,000ms | 200-400ms (first tokens)
TTS | 200-500ms | 100-200ms (first audio)
Total | 1,000-3,000ms | 400-800ms

The optimized pipeline achieves latency comparable to human conversation pauses (300-700ms between turns).

400-800ms: optimized voice pipeline latency, down from 1,000-3,000ms with non-streaming approaches.


Handling Interruptions (Barge-In)

Humans interrupt each other constantly. A voice agent must handle interruptions gracefully:

  1. Detect the interruption: Monitor the audio input while the agent is speaking. When the user starts talking, the agent should stop.
  2. Stop playback: Immediately cease the current TTS output.
  3. Process the interruption: Send the new user speech through the STT pipeline.
  4. Discard unspoken text: If the LLM generated text that wasn't spoken yet, decide whether to discard it or save it for later.
  5. Respond to the new input: Generate a new response that acknowledges the context shift.
Key Insight
Barge-in detection is one of the hardest problems in voice AI. False positives (background noise triggering interruption) break the conversation flow. False negatives (missing a real interruption) make the agent seem unresponsive.
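The interruption flow above can be sketched as a small state machine. This is an illustrative skeleton, not a production VAD: `player` and `pipeline` are hypothetical objects, and the debounce threshold is an assumption.

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInHandler:
    """Minimal barge-in sketch: debounce short bursts as noise,
    stop playback and re-route audio on a real interruption."""

    def __init__(self, player, pipeline, min_speech_ms: int = 250):
        self.player = player              # hypothetical TTS playback handle
        self.pipeline = pipeline          # hypothetical STT pipeline handle
        self.min_speech_ms = min_speech_ms  # debounce against background noise
        self.state = AgentState.LISTENING

    def on_user_speech(self, duration_ms: int) -> bool:
        """Called by voice-activity detection while the agent speaks.
        Returns True if the speech was treated as a real interruption."""
        if self.state is not AgentState.SPEAKING:
            return False
        if duration_ms < self.min_speech_ms:
            return False               # likely a false positive; keep talking
        self.player.stop()             # step 2: cease TTS playback
        self.state = AgentState.LISTENING
        self.pipeline.transcribe()     # step 3: route new speech to STT
        return True
```

The `min_speech_ms` debounce is the crude lever between false positives and false negatives; real systems combine it with energy thresholds and speaker separation.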

Turn-Taking

Knowing when the user has finished speaking is surprisingly difficult. Silence alone isn't a reliable signal - people pause mid-sentence to think.

Approaches to end-of-turn detection:

  • Silence duration: Wait for 500-700ms of silence. Simple but causes awkward pauses.
  • Prosodic cues: Detect falling pitch and slowing tempo that signal sentence endings. More natural but harder to implement.
  • Semantic analysis: Use the STT transcript to predict whether the sentence is complete. Most accurate but adds latency.
  • Hybrid: Combine silence detection with semantic completeness checking. Best results in practice.

The ideal system adapts its turn-taking behavior to the conversation. During rapid exchanges, use shorter silence thresholds. During complex questions, wait longer.
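A hybrid check like the one described can be sketched as a function that adapts the silence threshold to how complete the transcript looks. The keyword heuristic and thresholds here are illustrative stand-ins for the small classifier a production system would use.

```python
def is_end_of_turn(transcript: str, silence_ms: int,
                   base_threshold_ms: int = 600) -> bool:
    """Hybrid end-of-turn detection: silence duration combined with a
    cheap semantic-completeness heuristic over the live transcript."""
    text = transcript.rstrip()
    looks_complete = text.endswith((".", "!", "?"))
    trailing_filler = text.lower().endswith(("and", "but", "so", "um", "uh"))
    threshold = base_threshold_ms
    if looks_complete:
        threshold = 300    # sentence reads finished: respond sooner
    elif trailing_filler:
        threshold = 1200   # user is probably still thinking: wait longer
    return silence_ms >= threshold
```

The same shape extends naturally to the adaptive behavior above: shrink `base_threshold_ms` during rapid exchanges and grow it for complex questions.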

"Latency is the make-or-break metric in voice AI. We've seen demos that looked impressive at 600ms fall apart in production at 1,200ms because nobody optimized the STT-to-LLM handoff. Get the pipeline right before you worry about voice selection or personality." - Ashit Vora, Captain at 1Raft

Voice Quality and Personality

The voice your agent uses defines its personality. Consider:

Voice selection: Choose a voice that matches your brand and use case. A healthcare appointment system needs a calm, reassuring voice. A sales agent needs an energetic, warm voice.

Pacing and prosody: Control speaking speed, pause duration, and emphasis. Slower for important information. Faster for routine confirmations. Pauses before key numbers.

Emotional range: Modern TTS engines support emotional modulation. The agent should sound empathetic during complaints, confident during problem resolution, and warm during greetings.

Consistency: Use the same voice across all interactions. Switching voices between calls breaks trust.

Production Considerations

Call Recording and Compliance

Record all calls with appropriate disclosure. Many jurisdictions require informing callers they're speaking with AI. Build the disclosure into the greeting: "Hi, this is an AI assistant from [Company]. How can I help you?"

Fallback to Human

Always provide a path to a human agent. "Let me connect you with a team member" should be triggered by: low confidence in understanding, user frustration signals, complex requests beyond the agent's scope, or explicit user request.
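The four triggers above reduce to a simple policy check run on every turn. The confidence and frustration thresholds here are assumptions for illustration, not benchmarked values.

```python
def should_escalate(confidence: float, frustration_signals: int,
                    in_scope: bool, user_asked_for_human: bool) -> bool:
    """Return True when the call should be handed to a human agent."""
    if user_asked_for_human:
        return True                 # explicit request always wins
    if not in_scope:
        return True                 # request beyond the agent's scope
    if confidence < 0.6:            # low STT/NLU confidence (assumed cutoff)
        return True
    if frustration_signals >= 2:    # e.g. raised voice, repeated rephrasing
        return True
    return False
```

Keeping this as one pure function makes the escalation policy easy to audit and tune without touching the pipeline code.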

Telephony Integration

Voice agents connect to phone systems via SIP trunking or WebRTC. Providers like Twilio, Vonage, and Retell handle the telephony layer. Your AI agent handles the conversation logic while the telephony provider handles the transport.

Cost Per Call

  • STT: $0.005-0.02 per minute
  • LLM: $0.01-0.10 per call (depending on conversation length and model)
  • TTS: $0.01-0.05 per minute
  • Telephony: $0.01-0.05 per minute
  • Total: $0.05-0.25 per minute, or $0.25-1.50 for a typical 5-minute call
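Plugging midpoints of the ranges above into a per-call model (illustrative defaults, not provider quotes):

```python
def call_cost(minutes: float, stt_per_min: float = 0.01,
              tts_per_min: float = 0.03, telephony_per_min: float = 0.03,
              llm_per_call: float = 0.05) -> float:
    """Estimate per-call cost: per-minute components plus a flat LLM cost."""
    per_minute = stt_per_min + tts_per_min + telephony_per_min
    return round(minutes * per_minute + llm_per_call, 2)
```

With these midpoints a 5-minute call comes out at $0.40, inside the $0.25-$1.50 range above; swap in your actual provider rates to budget real traffic.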

Compare to human agent cost of $1-3 per minute (fully loaded), and the economics are compelling for high-volume use cases. Gartner predicts conversational AI will reduce contact center agent labor costs by $80 billion by 2026 - and voice is the channel where the bulk of that shift happens. This is why AI customer service agents using voice are growing faster than text-only alternatives.


Where Voice AI Works Today

High-confidence use cases:

  • Appointment scheduling and confirmation
  • Order status and tracking
  • Restaurant reservations
  • Payment reminders
  • Survey collection
  • After-hours call handling

Emerging use cases:

  • Technical support with troubleshooting flows
  • Sales qualification calls
  • Insurance claims intake
  • Healthcare symptom triage

Not ready yet:

  • Emotionally sensitive calls (collections, bad news delivery)
  • Complex negotiations
  • Unstructured conversations with no clear goal

Gartner predicts agentic AI will autonomously resolve 80% of common customer service issues without human intervention by 2029 - including voice interactions. The "not ready yet" list will shrink significantly over the next 24 months.

Voice AI is advancing fast. The latency and quality gaps are closing each quarter. For structured, high-volume conversations, it is production-ready today.

At 1Raft, we have built production voice agents handling thousands of daily calls across hospitality and fintech. The pattern that works: start with a structured use case (appointment booking, order status), nail sub-500ms latency, then expand scope. Our AI agent development team ships voice-capable agents in 12-week sprints, with latency optimization as a core engineering focus from day one.
