If you've ever talked to a voice AI agent and felt like something was missing, odds are it was latency. That awkward pause between when you stop speaking and when the AI responds can make or break the entire experience. Even pauses as short as 300 milliseconds feel unnatural, and anything above 1.5 seconds means your users are already checking out.
I went deep into this problem space while building github.com/saynaai/sayna, and let me tell you: achieving sub-second responsiveness is no trivial engineering challenge. It requires deep optimizations across the whole system, from how you handle audio streaming to your choice of STT, LLM, and TTS providers.
The difference between a 300ms and 800ms response time can mean the difference between a natural, engaging conversation and a frustrating, robotic interaction that drives customers away.
The Voice Agent Latency Stack
Let's break down where your milliseconds actually go. A typical voice agent follows this pattern:
User speaks → STT → LLM → TTS → User hears response
Each component adds its own delay. Here is what a realistic latency breakdown looks like:
| Component | Typical Range | Target for sub-second |
|---|---|---|
| End of utterance detection | 100-300ms | 150-200ms |
| STT processing | 150-500ms | 100-200ms |
| LLM (TTFT) | 200-800ms | 150-300ms |
| TTS (TTFB) | 100-500ms | 80-150ms |
| Network overhead | 50-200ms | 20-50ms |
Add these up and you're easily looking at 1-2+ seconds if you aren't careful. For sub-second response times, you have to squeeze every component, consistently.
Where Milliseconds Actually Die
After building real-time voice systems, I've identified the biggest latency killers that most developers overlook:
1. End-of-Utterance Detection
This is probably the most overlooked component: how do you know when the user has finished speaking? Most naive implementations use silence timeouts, waiting 1-1.5 seconds of silence before processing - that's already half your latency budget gone before you even start!
Smart end-of-utterance (EOU) detection using ML models can cut this to 150-200ms with tools such as the TurnDetector or Silero VAD, which can tell when someone has finished a sentence vs. just taken a breath.
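To make this concrete, here's a minimal sketch of what adaptive endpointing can look like, assuming a VAD gives you a per-frame speech probability and a turn-detection model gives you an end-of-turn probability (both are just plain floats here, and the thresholds are illustrative, not tuned values):

```python
import time

# Hedged sketch of adaptive endpointing. Assumption: a VAD (e.g. Silero VAD)
# supplies speech_prob per frame and a turn-detection model supplies eou_prob.
# The more confident the model is that the turn ended, the less silence we wait for.

SILENCE_FLOOR_MS = 150    # never commit faster than this
SILENCE_CEIL_MS = 1000    # worst-case fallback timeout

def silence_budget_ms(eou_prob: float) -> float:
    """Shrink the silence timeout when the model is confident the turn ended."""
    return SILENCE_CEIL_MS - eou_prob * (SILENCE_CEIL_MS - SILENCE_FLOOR_MS)

class EndpointDetector:
    def __init__(self) -> None:
        self.silence_started: float | None = None

    def on_frame(self, speech_prob: float, eou_prob: float) -> bool:
        """Return True when we should stop waiting and send the turn downstream."""
        now = time.monotonic()
        if speech_prob > 0.5:             # the user is still talking
            self.silence_started = None
            return False
        if self.silence_started is None:  # silence just began
            self.silence_started = now
        elapsed_ms = (now - self.silence_started) * 1000
        return elapsed_ms >= silence_budget_ms(eou_prob)

# Usage inside the audio loop:
#   if detector.on_frame(vad_prob, turn_prob): commit_turn()
```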
2. Sequential vs Streaming Processing
The traditional approach processes each step sequentially:
User stops speaking
↓ Wait for silence timeout: 1500ms
↓ STT processes complete audio: 600ms
↓ LLM generates complete response: 3000ms
↓ TTS converts complete response: 1000ms
↓ User starts hearing response
Total: 6+ seconds. Terrible.
The streaming approach changes everything:
User stops speaking
↓ Smart EOU detection: 200ms
↓ STT streams first words: 300ms TTFB
↓ LLM starts generating: 400ms TTFT
↓ TTS starts speaking: 300ms TTFB
↓ User starts hearing response
While the LLM is generating token 50, the TTS has already spoken tokens 1-30 and the user is already hearing tokens 1-10. This parallel processing dramatically reduces perceived latency.
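Here's a rough sketch of what that overlap looks like in code, with all three provider calls stubbed out as async generators (real streaming SDKs differ in their APIs, but the shape of the pipeline is the point):

```python
import asyncio

# Minimal sketch of a streamed STT -> LLM -> TTS pipeline. All provider calls
# below are stand-ins; real streaming SDKs expose similarly shaped async streams.

async def stt_stream(audio_frames):
    async for frame in audio_frames:
        # A streaming STT provider would emit interim/final transcripts here.
        yield f"<partial transcript, {len(frame)} audio bytes so far>"

async def llm_stream(prompt):
    for token in ["Sure,", " here", " is", " an", " answer."]:
        await asyncio.sleep(0.02)        # pretend time-to-next-token
        yield token

async def tts_stream(text_chunks):
    async for chunk in text_chunks:
        # A streaming TTS provider returns audio as soon as it has enough text.
        yield b"audio-bytes-for:" + chunk.encode()

async def pipeline(audio_frames, play_audio):
    transcript = ""
    async for partial in stt_stream(audio_frames):
        transcript = partial             # keep the latest hypothesis
    # LLM generation starts on the final transcript, but TTS and playback
    # start as soon as the first tokens arrive - they never wait for the full response.
    async for audio in tts_stream(llm_stream(transcript)):
        await play_audio(audio)          # user hears speech while tokens still stream

async def demo():
    async def mic():                     # fake audio source: three short frames
        for _ in range(3):
            yield b"\x00" * 640

    async def speaker(audio):
        print("playing", audio)

    await pipeline(mic(), speaker)

asyncio.run(demo())
```

The key property is that `play_audio` runs while `llm_stream` is still yielding tokens, so the user hears the beginning of the answer long before generation finishes.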
3. Provider Selection Matters More Than You Think
Here's something I learned the hard way: not all providers are created equal and the differences are massive.
STT providers differ significantly in streaming capabilities - Deepgram Nova-3 achieves sub-300ms latency with competitive accuracy, while AssemblyAI's Universal-2 offers strong accuracy but may have different latency characteristics. The key is finding providers that support true streaming with interim results.
TTS providers have even more variance: Cartesia's Turbo mode can achieve 40ms TTFB, ElevenLabs Flash delivers 75ms delay, and Deepgram's Aura-2 targets 200ms TTFB. These differences quickly compound in real conversations.
LLM time-to-first-token (TTFT) is your bottleneck. Claude, GPT-4, and other frontier models are powerful but can reach 400-800ms TTFT. Smaller models, like those served by Groq, can hit 100-200ms TTFT, but with different capability tradeoffs.
The reality is that you can't optimize everything with a single provider - different use cases require different provider combinations.
Why Multi-Provider Architecture is Essential
This is where I get opinionated: if you're building a production voice agent and locking yourself into a single provider stack, you're setting yourself up for pain.
Latency variation is real. Providers have good and bad days: a provider that normally gives you 150ms might spike to 800ms during peak load. Having fallback options isn't just nice to have; it's essential for production reliability.
Geographic distribution matters. Your inference provider might only support Europe/US regions - if you're in Australia, that's an extra 200-300ms round trip. Being able to route to different providers based on location is a game changer.
Quality/latency tradeoffs differ by context. A simple "yes/no" response doesn't need the most sophisticated TTS voice, while a complex explanation might warrant the extra 100ms for better prosody. Provider abstraction lets you make these decisions dynamically.
Costs vary wildly. TTS pricing ranges from $0.01 to $0.30+ per minute based on the provider and quality tier. Being able to route on cost versus quality requirements saves significant money at scale.
This is exactly why we built github.com/saynaai/sayna with a pluggable provider architecture from day one: configure multiple STT and TTS providers, and the system can route based on latency requirements, quality needs, or cost constraints - no vendor lock-in, no single point of failure.
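As an illustration only (this is not Sayna's actual API), a provider abstraction can be as simple as each provider advertising rough latency and cost, with a router picking one per request:

```python
from dataclasses import dataclass

# Illustrative sketch of provider abstraction: providers advertise rough
# latency and cost, and a router picks one per request. Names and numbers
# here are made up for the example.

@dataclass
class TTSProvider:
    name: str
    ttfb_ms: int          # typical time-to-first-byte
    cost_per_min: float   # USD per minute of synthesized audio

PROVIDERS = [
    TTSProvider("fast-and-cheap", ttfb_ms=80, cost_per_min=0.02),
    TTSProvider("premium-voice", ttfb_ms=150, cost_per_min=0.25),
]

def pick_provider(max_ttfb_ms: int, budget_per_min: float) -> TTSProvider:
    candidates = [
        p for p in PROVIDERS
        if p.ttfb_ms <= max_ttfb_ms and p.cost_per_min <= budget_per_min
    ]
    if not candidates:
        # Fall back to the lowest-latency option rather than failing the call.
        return min(PROVIDERS, key=lambda p: p.ttfb_ms)
    return min(candidates, key=lambda p: p.cost_per_min)

# A short confirmation can take the cheap voice; a long explanation the premium one.
print(pick_provider(max_ttfb_ms=100, budget_per_min=0.05).name)
```

In practice the router would also consult live health and latency stats per provider, but the shape stays the same.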
The Streaming Architecture That Actually Works
After a lot of trial and error, here is the architecture pattern that consistently delivers sub-second responses:
WebRTC for Transport
Traditional telephony adds 100-200ms of fixed latency from the network stack alone. WebRTC with globally distributed infrastructure (like LiveKit) can cut audio transport down to 20-50ms - a massive win.
Streaming All The Way Down
Every component needs to support streaming:
- STT should send interim results every 50-100ms as it transcribes
- LLM must stream tokens, not wait for complete generation
- TTS needs to start audio synthesis from partial text, not wait for full sentences
The challenge is coordinating all these streams. Most TTS providers need at least a sentence boundary to produce good prosody, so you need smart buffering that accumulates enough text for quality synthesis without adding noticeable delay.
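A minimal version of that buffering looks something like this: accumulate LLM tokens until a sentence boundary or a size cap, then flush the chunk to TTS (the regex and cap below are illustrative):

```python
import re

# Sketch of sentence-boundary buffering: hold LLM tokens until we hit a
# sentence end (or a size cap), then hand that chunk to TTS so synthesis
# starts early without chopping prosody mid-clause.

SENTENCE_END = re.compile(r"[.!?…]\s*$")
MAX_BUFFER_CHARS = 120   # don't wait forever on a long run-on sentence

class SentenceBuffer:
    def __init__(self) -> None:
        self.buffer = ""

    def push(self, token: str) -> str | None:
        """Add an LLM token; return a chunk ready for TTS, or None."""
        self.buffer += token
        if SENTENCE_END.search(self.buffer) or len(self.buffer) >= MAX_BUFFER_CHARS:
            chunk, self.buffer = self.buffer, ""
            return chunk
        return None

    def flush(self) -> str | None:
        """Call at end of generation to send any trailing text."""
        chunk, self.buffer = self.buffer.strip(), ""
        return chunk or None

buf = SentenceBuffer()
for tok in ["Hello", " there", ".", " How", " can", " I", " help", "?"]:
    chunk = buf.push(tok)
    if chunk:
        print("send to TTS:", chunk)
```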
Connection Reuse and Keep-Alive
Every HTTP connection setup adds latency, and every DNS lookup adds latency. For voice agents you want the following (a sketch follows the list):
- Persistent WebSocket connections to your STT provider
- gRPC streaming where available (much lower overhead than REST)
- Connection pooling for TTS requests
- DNS caching to avoid lookups in the critical path
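Here's a sketch of the connection-reuse piece, assuming aiohttp as the HTTP client and a hypothetical TTS endpoint: one shared session keeps a warm connection pool and a DNS cache, so the per-request critical path skips handshakes and lookups.

```python
import aiohttp

# Sketch of connection reuse for TTS requests. One shared ClientSession keeps
# TCP/TLS connections warm and caches DNS; the endpoint shape is hypothetical.

_session: aiohttp.ClientSession | None = None

def get_session() -> aiohttp.ClientSession:
    global _session
    if _session is None or _session.closed:
        connector = aiohttp.TCPConnector(
            limit_per_host=10,     # small pool of warm connections per provider
            ttl_dns_cache=300,     # cache DNS lookups for 5 minutes
            keepalive_timeout=60,  # keep idle connections open between turns
        )
        _session = aiohttp.ClientSession(connector=connector)
    return _session

async def synthesize(text: str, url: str, api_key: str) -> bytes:
    # The request body/headers are placeholders; the point is reusing the session.
    async with get_session().post(
        url, json={"text": text}, headers={"Authorization": f"Bearer {api_key}"}
    ) as resp:
        resp.raise_for_status()
        return await resp.read()
```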
Co-location
This may sound obvious, but most people get it wrong. Your STT, LLM, and TTS should all be in the same region, ideally the same VPC. Cross-region calls add 50-100ms each way; three cross-region calls per utterance is 300-600ms of network latency alone.
Measuring What Matters
You can't optimize what you don't measure. Here are the metrics that actually matter for voice agent latency (a small tracking sketch follows the list):
Time-to-First-Byte (TTFB) for each component: this tells you when each service starts responding, not when it finishes.
P95/P99 latency, not medians: voice agents need consistent performance - a 200ms median with a 2s P99 will feel terrible to the users hitting the tail.
End-to-End Latency from User Silence to First Audio: This is what users actually experience.
Latency distribution across geographical regions: Your US users might be happy while your APAC users are suffering.
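A tiny in-process tracker is enough to get started: record TTFB per component and read the tail, not the median. (Shipping these samples to a real metrics backend is omitted here, and the numbers below are made up.)

```python
import statistics
from collections import defaultdict

# Minimal sketch of per-component latency tracking with tail percentiles.

class LatencyTracker:
    def __init__(self) -> None:
        self.samples_ms: dict[str, list[float]] = defaultdict(list)

    def record(self, component: str, ms: float) -> None:
        self.samples_ms[component].append(ms)

    def percentile(self, component: str, pct: int) -> float:
        data = sorted(self.samples_ms[component])
        # quantiles with n=100 gives 99 cut points; index pct-1 approximates Ppct
        return statistics.quantiles(data, n=100)[pct - 1]

tracker = LatencyTracker()
for ms in [180, 190, 210, 220, 250, 300, 1900]:   # note the ugly tail sample
    tracker.record("llm_ttft", ms)

print("median:", statistics.median(tracker.samples_ms["llm_ttft"]))
print("p95:", tracker.percentile("llm_ttft", 95))
```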
The Hedge Strategy
Here is an advanced technique that can significantly improve tail latency: hedging. Launch parallel requests to multiple LLM providers and use whichever returns first.
This sounds costly, but it's usually worth the extra spend on the critical path of a voice conversation. Cancel the slower request as soon as the faster one starts returning, which minimizes waste.
The same principle applies to TTS: if you have multiple providers configured, racing them for the first response can dramatically reduce your P99 latency.
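Here's a minimal hedging sketch with asyncio, with both provider calls stubbed out: fire the same request at two providers, take whichever finishes first, and cancel the loser.

```python
import asyncio

# Sketch of request hedging: race two providers and cancel the slower one.
# The provider calls are simulated with sleeps; swap in real client calls.

async def call_provider(name: str, delay_s: float, prompt: str) -> str:
    await asyncio.sleep(delay_s)            # simulated provider latency
    return f"{name}: response to {prompt!r}"

async def hedged_completion(prompt: str) -> str:
    tasks = [
        asyncio.create_task(call_provider("provider-a", 0.12, prompt)),
        asyncio.create_task(call_provider("provider-b", 0.45, prompt)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:                    # cancel the slower request to limit waste
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_completion("What are your opening hours?")))
```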
What's Next for Voice AI Latency
The industry is moving fast, with speech-to-speech models like GPT-4o Realtime that promise 200-300ms end-to-end latency by eliminating the STT-LLM-TTS pipeline entirely - but they come with tradeoffs: less control, higher cost, and sometimes worse handling of enterprise requirements like custom vocabularies.
My bet is that we will see a hybrid future: Speech-to-Speech for simple interactions, cascaded pipelines for complex use cases that require fine control. Platforms that let you choose between approaches will win.
The bottleneck is no longer the models, it is the orchestration: How you coordinate routing, streaming, state management and failover across components determines your real-world performance.
Wrapping Up
Achieving a sub-second voice agent latency is absolutely possible, but it requires intentional architecture decisions:
- Stream everything and process in parallel
- Select providers wisely based on your latency budget
- Don't lock yourself into a single vendor - provider abstraction is essential
- Measure end-to-end, not just individual components
- Optimize the network path with co-location and connection reuse
At Sayna, we've built this entire infrastructure layer so you can focus on your agent logic rather than latency optimization. The voice layer handles provider routing, streaming orchestration, and all the complexity described above.
If you're building voice-first AI applications and your latency is keeping you up at night, I'd love to hear how you solve it. The voice AI space is moving incredibly fast and there's always more to learn.
Don't forget to share this if you found it helpful!