What Matters
- Production voice AI requires a streaming pipeline that runs STT, LLM inference, and TTS concurrently to achieve sub-500ms response times.
- The STT-to-TTS pipeline has three latency bottlenecks: speech-to-text transcription, LLM reasoning, and text-to-speech synthesis; each must be optimized independently.
- Users tolerate up to 800ms of silence before conversations feel unnatural; above that, completion rates drop significantly.
- Handling interruptions (barge-in detection) and natural turn-taking are harder engineering problems than the core speech pipeline.
AI voice agents handle phone calls, voice commands, and spoken interactions autonomously. Unlike text-based AI agents that process typed input, voice agents add two demanding constraints: sub-second latency and natural-sounding speech. Get either wrong and the conversation feels broken.
The Voice Pipeline
Every voice agent follows the same three-stage pipeline:
Stage 1: Speech-to-Text (STT)
The user speaks. The audio is captured and converted to text. This stage takes 100-300ms depending on the provider and whether you use streaming.
Key providers: Deepgram (fastest, 100-200ms streaming), Google Speech-to-Text, Azure Speech Services, AssemblyAI, Whisper (open source, higher latency).
Optimization: Use streaming STT. Instead of waiting for the user to finish speaking, process audio in chunks as it arrives. Deepgram's streaming API starts returning partial transcripts within 100ms.
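The chunked approach above can be sketched as a generator that yields a partial transcript per audio chunk. This is a minimal illustration, not any provider's actual API; the `stt_decode` callback is a stand-in for a real incremental decoder.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass
class Transcript:
    text: str
    is_final: bool

def stream_transcripts(
    audio_chunks: Iterator[bytes],
    stt_decode: Callable[[bytes], str],
) -> Iterator[Transcript]:
    """Feed audio to the STT engine chunk by chunk, yielding partial
    transcripts immediately instead of waiting for end of speech."""
    buffer = b""
    for chunk in audio_chunks:
        buffer += chunk
        # A real engine decodes incrementally; this sketch re-decodes
        # the whole buffer for simplicity.
        yield Transcript(text=stt_decode(buffer), is_final=False)
    yield Transcript(text=stt_decode(buffer), is_final=True)
```

Downstream stages can act on partials (for example, to start LLM prefill) and commit only when `is_final` is set.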
Stage 2: LLM Reasoning
The transcribed text goes to the LLM. The LLM generates a response based on the conversation context, tools, and instructions. This is the slowest stage - typically 200-500ms for the first tokens.
Optimization: Use streaming LLM responses. Start sending text to the TTS engine as soon as the first tokens arrive, not after the full response is generated. This overlaps Stage 2 and Stage 3.
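A minimal sketch of this Stage 2/3 overlap: accumulate streamed tokens and hand each sentence to TTS as soon as it closes, rather than waiting for the full reply. The sentence-boundary regex is deliberately naive (it would split on decimals followed by spaces), and the function is illustrative, not a library API.

```python
import re
from typing import Iterable, Iterator

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed LLM tokens and emit a chunk as soon as a
    sentence boundary appears, so TTS can start before the reply ends."""
    buf = ""
    for tok in tokens:
        buf += tok
        while True:
            # Naive boundary: sentence-ending punctuation plus whitespace.
            m = re.search(r"[.!?]\s", buf)
            if not m:
                break
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at stream end
```

Each yielded sentence would be submitted to the TTS engine immediately, so the first audio chunk plays while later sentences are still being generated.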
Stage 3: Text-to-Speech (TTS)
The LLM's text response is converted to speech audio. Modern TTS engines produce natural-sounding speech with emotion and pacing.
Key providers: ElevenLabs (highest quality), Cartesia (low latency), PlayHT, Azure Neural Voices, Google WaveNet.
Optimization: Stream TTS output. Start playing audio from the first sentence while later sentences are still being generated.
Total Pipeline Latency
| Stage | Standard | Optimized (Streaming) |
|---|---|---|
| STT | 300-500ms | 100-200ms |
| LLM | 500-2000ms | 200-400ms (first tokens) |
| TTS | 200-500ms | 100-200ms (first audio) |
| Total | 1000-3000ms | 400-800ms |
The optimized pipeline achieves latency comparable to human conversation pauses (300-700ms between turns), down from 1,000-3,000ms with non-streaming approaches.
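The 400-800ms figure follows from summing each stage's time-to-first-output: streaming overlaps the remainder of each stage, but the first chunk still flows through the stages in sequence. A quick check against the table's best and worst cases:

```python
def first_audio_latency(stt_ms: int, llm_first_token_ms: int,
                        tts_first_audio_ms: int) -> int:
    """Time to first audible response in a streaming pipeline: later
    content overlaps across stages, but the first chunk must still pass
    through STT, then LLM, then TTS, so first-audio latency is the sum
    of each stage's time-to-first-output."""
    return stt_ms + llm_first_token_ms + tts_first_audio_ms
```

Plugging in the streaming column's bounds (100+200+100 and 200+400+200) reproduces the 400-800ms total round-trip.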
Handling Interruptions (Barge-In)
Humans interrupt each other constantly. A voice agent must handle interruptions gracefully:
- Detect the interruption: Monitor the audio input while the agent is speaking. When the user starts talking, the agent should stop.
- Stop playback: Immediately cease the current TTS output.
- Process the interruption: Send the new user speech through the STT pipeline.
- Discard unspoken text: If the LLM generated text that wasn't spoken yet, decide whether to discard it or save it for later.
- Respond to the new input: Generate a new response that acknowledges the context shift.
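Steps 1-3 above map naturally onto task cancellation. A minimal asyncio sketch, where `play_tts` and `wait_for_user_speech` are placeholder coroutines standing in for real audio playback and voice-activity detection:

```python
import asyncio
import contextlib

async def speak_with_barge_in(play_tts, wait_for_user_speech) -> str:
    """Play TTS while monitoring the mic; cancel playback the moment
    the user starts talking (barge-in)."""
    playback = asyncio.create_task(play_tts())
    barge_in = asyncio.create_task(wait_for_user_speech())
    done, _pending = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    if barge_in in done:
        playback.cancel()  # stop speaking immediately
        with contextlib.suppress(asyncio.CancelledError):
            await playback
        return "interrupted"
    barge_in.cancel()  # agent finished its turn; stop the listener
    with contextlib.suppress(asyncio.CancelledError):
        await barge_in
    return "completed"
```

On "interrupted", the caller would discard any unspoken LLM text and route the new user audio back through the STT pipeline.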
Turn-Taking
Knowing when the user has finished speaking is surprisingly difficult. Silence alone isn't a reliable signal - people pause mid-sentence to think.
Approaches to end-of-turn detection:
- Silence duration: Wait for 500-700ms of silence. Simple but causes awkward pauses.
- Prosodic cues: Detect falling pitch and slowing tempo that signal sentence endings. More natural but harder to implement.
- Semantic analysis: Use the STT transcript to predict whether the sentence is complete. Most accurate but adds latency.
- Hybrid: Combine silence detection with semantic completeness checking. Best results in practice.
The ideal system adapts its turn-taking behavior to the conversation. During rapid exchanges, use shorter silence thresholds. During complex questions, wait longer.
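A hybrid detector of this kind can be sketched as a single check that scales the silence threshold by semantic completeness. The thresholds and the trailing-word heuristic below are illustrative assumptions, not tuned values:

```python
def end_of_turn(silence_ms: float, transcript: str,
                base_threshold_ms: float = 600) -> bool:
    """Hybrid end-of-turn check: require silence, but shorten the wait
    when the transcript already reads as a complete sentence and
    lengthen it when it looks unfinished."""
    text = transcript.strip()
    looks_complete = text.endswith((".", "!", "?"))
    # Crude stand-in for a semantic completeness model.
    looks_unfinished = text.endswith((" and", " but", " so", " because", ","))
    if looks_complete:
        return silence_ms >= base_threshold_ms * 0.5   # respond quickly
    if looks_unfinished:
        return silence_ms >= base_threshold_ms * 2     # give them time
    return silence_ms >= base_threshold_ms
```

In production, the `looks_complete`/`looks_unfinished` signals would come from the STT engine's punctuation model or a small classifier, and the base threshold would adapt to conversation tempo as described above.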
"Latency is the make-or-break metric in voice AI. We've seen demos that looked impressive at 600ms fall apart in production at 1,200ms because nobody optimized the STT-to-LLM handoff. Get the pipeline right before you worry about voice selection or personality." - Ashit Vora, Captain at 1Raft
Voice Quality and Personality
The voice your agent uses defines its personality. Consider:
Voice selection: Choose a voice that matches your brand and use case. A healthcare appointment system needs a calm, reassuring voice. A sales agent needs an energetic, warm voice.
Pacing and prosody: Control speaking speed, pause duration, and emphasis. Slower for important information. Faster for routine confirmations. Pauses before key numbers.
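Pacing controls like these are commonly expressed in SSML, which several TTS providers (Azure Neural Voices, Google WaveNet among them) accept. A small helper sketch that slows the speaking rate and inserts a pause before each run of digits; the 300ms default and helper name are assumptions for illustration:

```python
import re

def ssml_with_pacing(text: str, rate: str = "slow",
                     pause_ms: int = 300) -> str:
    """Wrap plain text in SSML: slow the speaking rate and insert a
    short break before each digit run (confirmation codes, totals)."""
    body = re.sub(r"(\d+)", rf'<break time="{pause_ms}ms"/>\1', text)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'
```

Check each target voice's SSML support before relying on specific attributes; coverage varies by provider and voice.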
Emotional range: Modern TTS engines support emotional modulation. The agent should sound empathetic during complaints, confident during problem resolution, and warm during greetings.
Consistency: Use the same voice across all interactions. Switching voices between calls breaks trust.
Production Considerations
Call Recording and Compliance
Record all calls with appropriate disclosure. Many jurisdictions require informing callers they're speaking with AI. Build the disclosure into the greeting: "Hi, this is an AI assistant from [Company]. How can I help you?"
Fallback to Human
Always provide a path to a human agent. "Let me connect you with a team member" should be triggered by: low confidence in understanding, user frustration signals, complex requests beyond the agent's scope, or explicit user request.
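The triggers above reduce to a simple predicate. The 0.6 confidence cutoff and two-strike frustration count here are illustrative thresholds, not recommendations:

```python
def should_escalate(confidence: float, frustration_signals: int,
                    user_asked_for_human: bool, in_scope: bool) -> bool:
    """Route the call to a human when any escalation trigger fires:
    an explicit ask, an out-of-scope request, low understanding
    confidence, or repeated frustration signals."""
    return (user_asked_for_human
            or not in_scope
            or confidence < 0.6         # threshold is an assumption
            or frustration_signals >= 2)
```

Running this check on every turn, rather than only when the agent is stuck, keeps the handoff early enough to preserve the caller's goodwill.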
Telephony Integration
Voice agents connect to phone systems via SIP trunking or WebRTC. Providers like Twilio, Vonage, and Retell handle the telephony layer. Your AI agent handles the conversation logic while the telephony provider handles the transport.
Cost Per Call
- STT: $0.005-0.02 per minute
- LLM: $0.01-0.10 per call (depending on conversation length and model)
- TTS: $0.01-0.05 per minute
- Telephony: $0.01-0.05 per minute
- Total: $0.05-0.25 per minute, or $0.25-1.50 for a typical 5-minute call
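The arithmetic behind the total, using midpoints of the per-minute ranges above (the specific rates are assumptions within those ranges):

```python
def call_cost(minutes: float, stt: float = 0.01, tts: float = 0.03,
              telephony: float = 0.03, llm_per_call: float = 0.05) -> float:
    """Per-call cost: per-minute STT/TTS/telephony rates plus a flat
    LLM charge that scales with conversation length, not airtime."""
    return round(minutes * (stt + tts + telephony) + llm_per_call, 2)
```

A typical 5-minute call at these midpoint rates lands around $0.40, inside the $0.25-$1.50 range quoted above; the worst case comes from long calls on premium STT/TTS tiers.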
Compare to human agent cost of $1-3 per minute (fully loaded), and the economics are compelling for high-volume use cases. Gartner predicts conversational AI will reduce contact center agent labor costs by $80 billion by 2026 - and voice is the channel where the bulk of that shift happens. This is why AI customer service agents using voice are growing faster than text-only alternatives.
In short: a typical 5-minute AI voice call costs $0.25-$1.50 end to end, versus $5-$15 for a human agent at $1-$3 per minute fully loaded; AI voice agents deliver 5-10x cost savings at high call volumes.
Where Voice AI Works Today
High-confidence use cases:
- Appointment scheduling and confirmation
- Order status and tracking
- Restaurant reservations
- Payment reminders
- Survey collection
- After-hours call handling
Emerging use cases:
- Technical support with troubleshooting flows
- Sales qualification calls
- Insurance claims intake
- Healthcare symptom triage
Not ready yet:
- Emotionally sensitive calls (collections, bad news delivery)
- Complex negotiations
- Unstructured conversations with no clear goal
Gartner predicts agentic AI will autonomously resolve 80% of common customer service issues without human intervention by 2029 - including voice interactions. The "not ready yet" list will shrink significantly over the next 24 months.
Voice AI is advancing fast. The latency and quality gaps are closing each quarter. For structured, high-volume conversations, it is production-ready today.
At 1Raft, we have built production voice agents handling thousands of daily calls across hospitality and fintech. The pattern that works: start with a structured use case (appointment booking, order status), nail sub-500ms latency, then expand scope. Our AI agent development team ships voice-capable agents in 12-week sprints, with latency optimization as a core engineering focus from day one.