
What We Learned Building Voice AI for Production

By Ashit Vora · 8 min read

What Matters

  • Voice AI response times above 800ms feel unnatural and cause conversation derailment - aim for sub-500ms consistently.
  • A streaming pipeline that processes speech-to-text, inference, and speech generation simultaneously is essential for production latency.
  • Users speak differently to AI than to humans - they are more direct, less patient, and actively test boundaries.
  • Monitor conversation quality metrics, not just completion rates, to catch degradation early.

Most voice AI demos sound great. A smooth conversation, a natural-sounding voice, a problem solved in 30 seconds. Then you put it in production. Real users call in. Background noise. Bad cell service. Accents the model hasn't heard. Interruptions. Angry customers who test every edge case you didn't plan for.

We've shipped voice AI phone agents that handle thousands of real calls every week. Here's what we learned about the gap between demo and production.

Voice AI response time zones

We measured this across thousands of calls. The pattern was consistent.

  • Green zone (under 500ms): Conversations feel natural. Users engage like they're talking to a person. This is the production target, and it requires a streaming pipeline.
  • Yellow zone (500-800ms): Acceptable but noticeable. Users start to speak more slowly and deliberately. Still functional, but the pause is perceptible.
  • Red zone (above 800ms): Users talk over the AI and assume the system is broken. Conversations derail and completion rates drop by 30-40%.

The Latency Challenge

Voice is unforgiving. In a text chatbot, a 3-second delay is fine. The user reads, types, waits. With voice, anything above 800ms feels wrong. The conversation loses its rhythm.

We measured this across thousands of calls. The pattern was consistent:

  • Under 500ms: Conversations feel natural. Users engage like they're talking to a person.
  • 500-800ms: Acceptable but noticeable. Users start to speak more slowly and deliberately.
  • Above 800ms: Users talk over the AI. Conversations derail. Completion rates drop by 30-40%.
Target response time: under 500ms. Anything above 800ms causes users to talk over the AI and derails conversations.

The 500ms target includes everything: capturing the user's speech, converting it to text, running inference on the LLM, generating a response, and converting that response back to speech. All within half a second.

Most teams build this as a sequential pipeline. Audio comes in, gets transcribed, goes to the LLM, comes back as text, gets synthesized to speech. Each step waits for the previous one. Total latency: 2-3 seconds. Way too slow.

The Streaming Pipeline

The fix is parallelism. Instead of waiting for each step to finish, we overlap them.

Speech-to-text starts streaming partial transcripts the moment the user begins speaking. The LLM receives these partial transcripts and begins generating a response before the user finishes their sentence. Text-to-speech starts synthesizing audio from the first few words of the LLM's response while the rest is still being generated.

Think of it like a relay race where the next runner starts moving before the baton arrives. Each stage overlaps with the previous one.

Sequential vs streaming voice AI pipeline

  • Speech-to-text: sequential waits for the full utterance; streaming emits partial transcripts live - this is where the overlap starts.
  • LLM inference: sequential waits for the full transcript; streaming begins generating before the user finishes speaking.
  • Text-to-speech: sequential waits for the full LLM response; streaming synthesizes audio from the first words.
  • Total latency: 2-3 seconds sequential vs under 500ms streaming - a 4-6x improvement.

Each stage overlaps with the previous one, like a relay race where the next runner starts moving before the baton arrives.
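
The overlap can be sketched with three toy asyncio stages standing in for real STT, LLM, and TTS services - the function names, canned response, and audio placeholders are all illustrative, not a real API:

```python
import asyncio

async def stt(audio):
    """Emit growing partial transcripts as audio chunks arrive (stand-in for streaming STT)."""
    words = []
    async for chunk in audio:
        words.append(chunk)           # pretend each chunk transcribes to one word
        yield " ".join(words)         # partial transcript, available immediately

async def llm(partials):
    """Consume partial transcripts as they stream in; yield response tokens one by one."""
    async for partial in partials:    # a real model could start reasoning on partials here
        pass
    for token in ["Your", "balance", "is", "$42."]:   # canned response for the sketch
        yield token

async def tts(tokens):
    """Synthesize audio from the first tokens while later ones are still being generated."""
    async for token in tokens:
        yield f"<audio:{token}>"      # placeholder for one synthesized audio frame

async def answer_call():
    async def mic():                  # audio arrives incrementally from the caller
        for chunk in ["what's", "my", "balance"]:
            await asyncio.sleep(0)
            yield chunk
    return [frame async for frame in tts(llm(stt(mic())))]

print(asyncio.run(answer_call()))
```

In production each stage is a network stream rather than an in-process generator, but the shape is the same: every consumer pulls from its producer as soon as items appear instead of waiting for the stage to finish.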

This architecture got us to consistent sub-500ms response times. But it introduced new problems.

Handling Interruptions

When the AI is mid-sentence and the user interrupts, you need to stop speech generation immediately. Not in 200ms. Now. Any delay and the AI talks over the user, which feels robotic and frustrating.

We built an interrupt detection system that monitors the audio input stream continuously. The moment it detects speech onset during AI output, it kills the current response and starts listening. The LLM discards its partial response and prepares for new input.
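
A minimal sketch of that barge-in path, using asyncio task cancellation in place of a real voice activity detector - the timings and names are illustrative:

```python
import asyncio

async def speak(words, spoken):
    """Simulated TTS playback: one word per audio chunk."""
    for word in words:
        spoken.append(word)
        await asyncio.sleep(0.05)      # playback time of one chunk

async def monitor_mic(speech_task, onset_after):
    """Stand-in for a VAD: the moment speech onset is detected, kill the response."""
    await asyncio.sleep(onset_after)
    speech_task.cancel()               # stop generation now, not after a delay

async def handle_turn():
    spoken = []
    speech = asyncio.create_task(
        speak(["Your", "bill", "is", "due", "on", "Friday"], spoken))
    vad = asyncio.create_task(monitor_mic(speech, onset_after=0.12))
    try:
        await speech                   # normally runs to completion...
    except asyncio.CancelledError:
        pass                           # ...on barge-in, discard the partial response
    await vad
    return spoken                      # only the words spoken before the interruption

print(asyncio.run(handle_turn()))
```

The key property is that cancellation lands between audio chunks, so playback stops mid-sentence rather than finishing the queued response.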

Graceful Fallbacks

Sometimes the LLM takes too long. Network latency spikes. The speech model stumbles on a difficult word. When this happens, you can't just leave dead air.

We built three fallback layers:

  1. Filler responses. If inference takes longer than 600ms, the system generates a brief acknowledgment - "Let me check on that" or "One moment" - while the real response finishes.
  2. Cached responses. For the 20 most common questions (which account for about 50% of calls), we pre-generate responses. No LLM round-trip needed.
  3. Graceful handoff. If the system can't produce a good response within 3 seconds, it transfers to a human agent with full conversation context. A bad AI answer is worse than a delayed human one.
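
The three layers can be sketched as a single response path. The cache contents, sentinel value, and timeout parameters below are illustrative, not our actual configuration:

```python
import asyncio

CACHED = {"what's my balance?": "Your balance is $42."}   # pre-generated top-20 answers

async def respond(question, llm_call, say=print,
                  filler_after=0.6, handoff_after=3.0):
    """Cache first; filler if inference is slow; hand off if it blows the total budget."""
    cached = CACHED.get(question.lower())
    if cached is not None:
        return cached                                      # layer 2: no LLM round-trip
    task = asyncio.ensure_future(llm_call(question))
    try:                                                   # shield: timeout must not kill inference
        return await asyncio.wait_for(asyncio.shield(task), timeout=filler_after)
    except asyncio.TimeoutError:
        say("One moment...")                               # layer 1: fill the dead air
    try:
        return await asyncio.wait_for(task, timeout=handoff_after - filler_after)
    except asyncio.TimeoutError:
        return "TRANSFER_TO_HUMAN"                         # layer 3: handoff with context
```

Note the `asyncio.shield`: the filler timeout must not cancel the in-flight inference, since the real answer usually arrives moments later.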

The Human Factor

The biggest surprise wasn't technical - it was behavioral. Users speak differently to AI than to humans. They're more direct, less patient, and more likely to test boundaries.

We expected the technical challenges. Latency, audio quality, model accuracy. What caught us off guard was how differently people speak to an AI agent compared to a human one.

They're more direct. With a human agent, people add pleasantries. "Hi, how are you? I was wondering if you could help me with..." With AI, they get straight to the point. "What's my balance?" Three words. The agent needs to handle both styles.

They test boundaries. About 15% of callers deliberately test the AI - asking irrelevant questions, speaking gibberish, trying to confuse it. Your system needs to handle this gracefully without getting stuck in a loop.

They're less patient with silence. A human can say "hmm, let me look that up" and the caller waits. When AI goes silent for even 2 seconds, callers assume it's broken and either hang up or start talking over it.

They speak in fragments. People trail off, correct themselves, speak in half-sentences. Your speech-to-text model and LLM need to handle "I want to - actually, no, can you check my - what's the, uh, my last bill amount?"
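
There is no clean algorithmic fix for this - in practice the LLM absorbs most of it - but even a crude pre-processing heuristic shows the shape of the problem. The filler list and correction markers here are illustrative only:

```python
import re

FILLERS = {"uh", "um", "er", "like"}

def last_intent(utterance):
    """Crude heuristic: keep only the segment after the caller's final
    self-correction, then strip filler words. A real system would leave
    most of this disambiguation to the LLM."""
    # split on correction markers: a dash, "actually", or "no,"
    segments = re.split(r"\s*(?:-|\bactually\b,?|\bno\b,)\s*", utterance.lower())
    last = segments[-1].strip(" ?.")
    words = [w for w in re.findall(r"[a-z$']+", last) if w not in FILLERS]
    return " ".join(words)

print(last_intent(
    "I want to - actually, no, can you check my - what's the, uh, my last bill amount?"))
```

Even this toy version recovers "last bill amount" from the tangle above, which is usually enough context for the model to answer.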

How users talk to AI vs human agents

  • Directness: callers give humans pleasantries and context-setting; they give the AI three words ("What's my balance?"). The agent must handle both styles.
  • Patience with silence: callers tolerate a human saying "let me look that up"; two seconds of AI silence reads as broken. Filler responses are essential.
  • Boundary testing: rare with humans; about 15% of callers test the AI deliberately. Plan for it.
  • Speech patterns: full sentences and filler words with humans; fragments, corrections, and half-sentences with AI. Speech-to-text must handle the messiness.


Monitoring What Actually Matters

Most teams monitor completion rate: did the call end successfully? That's table stakes. It doesn't tell you if the conversation was actually good.

Here's what we track:

  • Turn count. How many back-and-forth exchanges did it take to resolve the issue? Lower is better. A spike means the agent is asking redundant questions.
  • Interruption rate. How often did the user talk over the AI? High rates signal latency problems or responses that don't match what the user expected.
  • Fallback frequency. How often does the system hit a filler response, use a cached answer, or hand off to a human? Trending up means the model needs retraining.
  • Sentiment drift. Does the caller's tone shift during the conversation? If they start neutral and end frustrated, something went wrong - even if the call "completed successfully."

We review a sample of 50 calls per week manually. Automated metrics catch the trends. Human review catches the subtle quality issues that metrics miss.
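
A sketch of how these roll up per call, assuming a transcript log where each turn records the speaker, interruption and fallback flags, and a sentiment score - the schema is illustrative:

```python
def call_metrics(turns):
    """Roll one call's turns up into the four quality metrics.

    Each turn: {"speaker": "user"|"ai", "interrupted": bool,
                "fallback": bool, "sentiment": float}."""
    ai = [t for t in turns if t["speaker"] == "ai"]
    user = [t for t in turns if t["speaker"] == "user"]
    n_ai = max(len(ai), 1)                       # avoid division by zero
    return {
        "turn_count": len(turns),
        "interruption_rate": sum(t.get("interrupted", False) for t in ai) / n_ai,
        "fallback_frequency": sum(t.get("fallback", False) for t in ai) / n_ai,
        # negative drift = the caller ended more frustrated than they started
        "sentiment_drift": user[-1]["sentiment"] - user[0]["sentiment"] if user else 0.0,
    }

call = [
    {"speaker": "user", "sentiment": 0.2},
    {"speaker": "ai"},
    {"speaker": "user", "sentiment": -0.4, "interrupted": False},
    {"speaker": "ai", "interrupted": True, "fallback": True},
]
print(call_metrics(call))
```

Aggregating these per day, rather than per call, is what makes the trends visible before completion rate moves.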

The Production Checklist

After shipping voice AI agents across customer service, recruiting, and appointment scheduling, here's what we'd tell any team building for production:

  1. Optimize for latency first, accuracy second. A fast wrong answer that leads to a quick correction is better than a slow right answer that makes the user hang up.
  2. Build your fallback system before your happy path. Users will find every edge case you missed. Plan for failure.
  3. Test with real users in the first week. Synthetic test calls don't capture the messiness of real conversations. Get real audio into your pipeline early.
  4. Monitor conversation quality, not just completion. A call that "completes" with a frustrated customer is a failure.
  5. Plan for the 15% who test boundaries. They're not bugs. They're a feature of any public-facing system.

Voice AI is one of the hardest AI products to get right. But when it works - when a customer calls and gets their problem solved in 45 seconds without waiting on hold - it's one of the most impactful.

Before committing to a voice AI build, run the numbers first. Our free Voice AI Cost Calculator estimates infrastructure costs and monthly spend based on your call volume and stack choices.

Frequently asked questions

How fast does a production voice AI need to respond?

Production voice AI should respond in under 500ms to feel natural. Anything above 800ms causes users to talk over the AI, derailing conversations and dropping completion rates. Achieving this requires a streaming pipeline that processes speech-to-text, inference, and speech generation simultaneously.
