Introduction #
In the previous post I mapped the abstract anatomy of a real-time voice agent — STT, LLM, TTS, VAD, orchestrator, RAG, and the latency budget that ties them together. That post was the blueprint. This one is the construction site.
Every code snippet, every latency number, every architectural choice in this piece is pulled directly out of a working voice-agent platform I built — the same one I shipped as Velox AI. So instead of whiteboard sketches, you're getting the production answers: how the pieces are actually wired, what surprised me, what I'd reverse if I started over, and where the milliseconds really go.
What follows is a concrete blueprint you can re-implement on your own stack. Every vendor I picked is named alongside 2–3 drop-in alternatives, because the architecture is what matters; the providers are just plug-ins.
The Conversation Lifecycle (Live View)
Before we go anywhere, here's what we're optimising for. One full conversational turn — from the moment a user finishes speaking to the moment they hear the agent reply — passes through nine stages and crosses three networks. Every millisecond on this rail is a millisecond the user can hear:
[Figure: timeline of one conversational turn, starting at UtteranceEnd (final transcript ready).]
Roughly 850 ms end-to-end. Sounds slow on paper. Feels almost-but-not-quite-human in practice, and most of the engineering that follows is about chipping away at that number without losing audio quality or context.
One Pipeline, Many Vendors #
Before any code: the single most important architectural decision you'll make is whether your pipeline is vendor-locked or vendor-agnostic. Every voice-agent tutorial on the internet hard-codes one stack — "OpenAI Whisper + GPT-4 + ElevenLabs" — and ships it. That works for a demo. It does not work in production.
The reason: the optimal stack depends on language, latency budget, voice taste, and price. An English customer-support bot in the US wants Deepgram + Groq + Deepgram Aura. A Hindi receptionist wants Sarvam everywhere. A premium character-voice game NPC wants ElevenLabs. A privacy-paranoid customer wants self-hosted Piper. Lock yourself into one stack and you've locked yourself out of those use cases.
The fix is an adapter pattern — one interface per layer, many implementations behind it, chosen per agent at config time:
from typing import AsyncIterator, Protocol

# Every STT provider implements the same protocol.
class STTAdapter(Protocol):
    async def send_audio(self, pcm: bytes) -> None: ...
    # Plain `def` on purpose: implementations are async generators, so calling
    # events() returns the iterator directly and is not awaited.
    def events(self) -> AsyncIterator[STTEvent]: ...
    async def close(self) -> None: ...

# Picking is a one-liner at call setup:
stt = build_stt(agent.stt_provider, language=agent.stt_language)
#   └─► returns DeepgramAdapter | SarvamAdapter | AssemblyAIAdapter | …
# The orchestrator never imports a specific vendor.
async for event in stt.events():
    handle(event)
Three layers, three adapters. STT, LLM, TTS. Swap any one without touching the other two, the orchestrator, the audio plumbing, or the UI. The block below shows what I shipped and what you can plug in instead — the section after that goes through each in detail.
The provider-agnostic design isn't just future-proofing — it's the only honest answer to the question "which is the best STT/LLM/TTS?" The answer is it depends, and your architecture has to make that answer cheap to act on.
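One way a factory like build_stt could be wired is a small registry keyed by provider name. This is a minimal sketch under assumptions, not the code Velox runs: the register_stt decorator, the _STT_REGISTRY dict and the EchoAdapter stand-in are hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry: provider name -> adapter class (or any factory).
_STT_REGISTRY: Dict[str, Callable[..., object]] = {}

def register_stt(name: str):
    """Class decorator that files an adapter under a provider name."""
    def decorator(cls):
        _STT_REGISTRY[name] = cls
        return cls
    return decorator

def build_stt(provider: str, **options):
    """Look up the adapter for `provider` and construct it with the agent's options."""
    try:
        factory = _STT_REGISTRY[provider]
    except KeyError:
        known = ", ".join(sorted(_STT_REGISTRY))
        raise ValueError(f"unknown STT provider {provider!r}; known: {known}") from None
    return factory(**options)

@register_stt("echo")
class EchoAdapter:
    """Stand-in adapter so the sketch runs without any vendor SDK."""
    def __init__(self, language: str = "en"):
        self.language = language

# Per-agent selection stays a one-liner; the caller never names a vendor class.
stt = build_stt("echo", language="hi")
```

Adding a provider then means writing one adapter class and decorating it; nothing in the orchestrator changes.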
System Architecture #
Zoom out. Here's the whole system on one diagram — browser, dashboard backend, voice runtime, persistence, and every external provider it talks to:
┌────────────────────────────────────────────────────────────────────┐
│ BROWSER (Next.js 15) │
│ Dashboard UI useVoiceAgent() hook AudioWorklet │
│ (Agent Builder) ◄──► (WebSocket + control msgs) ◄►(48k→16k PCM) │
└──────┬────────────────────────┬────────────────────────┬───────────┘
│ HTTPS │ WebSocket │ Web Audio API
│ (NextAuth, BFF) │ (binary PCM + JSON) │ (gapless playback)
▼ ▼
┌─────────────────────┐ ┌─────────────────────────────────────────┐
│ Next.js API routes │ │ FastAPI BACKEND (Python) │
│ /api/agents/* │ │ ┌───────────────────────────────────┐ │
│ /api/auth/* │ │ │ WebSocket handler /ws/agent/{id}│ │
│ /api/rag (proxy) │ │ └───────────────┬───────────────────┘ │
└──────┬──────────────┘ │ │ │
│ │ ┌───────────────▼───────────────────┐ │
▼ │ │ llm_orchestrator (per-call task) │ │
┌─────────────────────┐ │ │ • TaskManager (interrupt signal) │ │
│ MongoDB Atlas │◄─┼──┤ • Sentence-buffered LLM → TTS │ │
│ users / agents / │ │ │ • Tool dispatch │ │
│ calls (recordings) │ │ └──┬──────────┬──────────┬──────────┘ │
└─────────────────────┘ │ ▼ ▼ ▼ │
┌─────────────────────┐ │ ┌──────┐ ┌──────┐ ┌──────┐ │
│ Redis │◄─┼──┤ STT │ │ LLM │ │ TTS │ │
│ • agent cfg cache │ │ │ adptr│ │ adptr│ │ adptr│ │
│ • rate-limit ctrs │ │ └──────┘ └──────┘ └──────┘ │
└─────────────────────┘ │ │ via WS │ HTTP │ HTTP │
┌─────────────────────┐ │ ▼ ▼ ▼ │
│ Qdrant Cloud │◄─┼──── External providers (Deepgram, Groq, ─┤
│ agent_knowledge │ │ Sarvam, Cerebras, NVIDIA, ElevenLabs)│
│ 384-dim cosine │ │ │
└─────────────────────┘ └──────────────────────────────────────────┘
Control plane vs data plane
The clearest mental model for a voice-agent platform is the same one cloud infra teams use: control plane vs data plane.
- Control plane = the dashboard. Where agents are defined. A user signs in, edits an agent (system prompt, provider choices, voice, knowledge base), and the result lands in MongoDB. Pure CRUD, no real-time anything, no exotic infrastructure.
- Data plane = the voice runtime. Where agents run. A WebSocket opens, the backend reads the agent config from Mongo (cached in Redis), spins up an asyncio task graph (STT loop, LLM stream, TTS worker, audio sender), and streams audio in both directions until the user hangs up.
The two share exactly one thing: the agent document. They can be deployed together (one box, two containers, which is what I actually run today on AWS) or scaled independently the moment one of them gets hot. Because the data plane holds no cross-call in-process state — everything per-call lives on a single asyncio task graph — you can horizontally scale it by adding more boxes behind a load balancer the day you need to. That decoupling is worth more than any specific tech choice.
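The "read the agent config from Mongo, cached in Redis" step is a cache-aside read. The sketch below shows the shape of it with in-memory stand-ins so it runs anywhere; FakeCache, FakeAgentStore, the agent_cfg: key prefix and the 60-second TTL are all assumptions for illustration, not details from the real system.

```python
import asyncio
import json
import time

class FakeCache:
    """In-memory stand-in for a Redis GET / SET-with-expiry pair."""
    def __init__(self):
        self._data = {}

    async def get(self, key):
        value, expires_at = self._data.get(key, (None, 0.0))
        return value if time.monotonic() < expires_at else None

    async def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

class FakeAgentStore:
    """In-memory stand-in for the agents collection; counts reads."""
    def __init__(self, docs):
        self._docs = docs
        self.reads = 0

    async def find_agent(self, agent_id):
        self.reads += 1
        return self._docs.get(agent_id)

async def load_agent_config(agent_id, cache, store, ttl_seconds=60):
    """Cache-aside read: try the cache, fall back to the store, then populate."""
    key = f"agent_cfg:{agent_id}"
    cached = await cache.get(key)
    if cached is not None:
        return json.loads(cached)
    doc = await store.find_agent(agent_id)
    if doc is None:
        raise LookupError(f"no agent {agent_id!r}")
    await cache.set(key, json.dumps(doc), ttl_seconds)
    return doc

async def demo():
    cache = FakeCache()
    store = FakeAgentStore({"a1": {"stt_provider": "deepgram", "stt_language": "en"}})
    first = await load_agent_config("a1", cache, store)   # miss: hits the store
    second = await load_agent_config("a1", cache, store)  # hit: served from cache
    return first, second, store.reads

first, second, reads = asyncio.run(demo())
```

Because the config is the only thing the two planes share, this one read is the whole coupling surface between dashboard and runtime.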
Modular monolith, not microservices. Single FastAPI process, but cleanly partitioned modules (services/stt/*, services/tts/*, state/, tools/). In-process function calls beat any RPC for latency, and the module boundaries are already shaped so that extracting a worker pool later would be a refactor, not a rewrite.
The Tech Stack #
Concrete choices, with the "why" in one line and the alternatives you can swap in on the right. None of these is the One True Choice — they're the trade-offs that fit a real-time, B2B, multi-tenant SaaS.
| Layer | My pick | Why | Alternatives |
|---|---|---|---|
| Frontend | Next.js 15 + React 19 + Tailwind | App Router + RSC for the dashboard; Framer Motion for the UI animations. | Remix, Astro, SvelteKit, vanilla Vite + React |
| Dashboard API | Next.js API routes | Co-located BFF. No second service for CRUD. | tRPC, Express, Hono, NestJS |
| Voice runtime | Python 3 + FastAPI + asyncio | Every AI SDK ships a Python client first; asyncio handles hundreds of IO-bound tasks per call. | Node.js + uWebSockets, Go + Gorilla WS, Elixir Phoenix |
| Primary DB | MongoDB Atlas | Flexible nested agent docs. (Honest: I'd pick Postgres + JSONB if I started over.) | Postgres + JSONB, Supabase, PlanetScale, DynamoDB |
| Vector DB | Qdrant Cloud | Multi-tenant via payload filter; cloud-hosted means I don't run it. | Pinecone, Weaviate, Milvus, pgvector, Chroma |
| Embeddings | all-MiniLM-L6-v2 (local) | Free, sub-50 ms, 384-dim, runs on CPU. Good enough for SMB-sized KBs. | OpenAI text-embedding-3-small, Voyage, BGE, Cohere |
| Cache / KV | Redis | Agent config cache + per-IP rate limit. Not used as a queue (everything in-process is asyncio). | Dragonfly, KeyDB, Memcached, in-memory LRU |
| Auth | NextAuth.js 5 | Handles Google/GitHub OAuth + credentials; admin runs a separate token scheme. | Clerk, Supabase Auth, Auth0, custom JWT |
| Storage | Cloudinary | Call recordings (WAV) + KB source files. Public-CDN URLs out of the box. | S3, Cloudflare R2, Backblaze B2, GCS |
| Hosting | AWS EC2 + Docker + Nginx | One box in ap-south-1 (Mumbai), two containers, TLS via Let's Encrypt. Boring on purpose. | Fly.io, Render, Railway, Hetzner, K8s |
What I'd reverse: MongoDB. The "flexible schema" benefit didn't pay off: every agent ends up with the same fields, and the relational bits (user → agents, agent → calls, agent → KB files) want a relational store. Postgres + JSONB would have been easier to query, easier to migrate, and cheaper to operate. If you're starting fresh, start there.
The Real-Time Pipeline #
Audio transport: raw WebSocket, not WebRTC
Most tutorials reach for WebRTC because it's "the real-time standard." For a browser-to-server voice agent you almost certainly don't need it. WebRTC buys you three things: NAT traversal, jitter buffering, and AEC. None of those are valuable here:
- NAT traversal — you have a public server. No peer-to-peer hole-punching.
- Jitter buffering — you're going to build your own client-side ~150 ms buffer anyway (TTS chunks arrive bursty).
- AEC: getUserMedia({ echoCancellation: true }) already does it on the browser side.
Skipping WebRTC means no SDP exchange, no STUN/TURN servers, no SFU. Just one TCP+TLS connection carrying binary WebSocket frames in both directions. Lower setup latency, far less infra. The audio format on the wire:
| Hop | Format | Rate |
|---|---|---|
| Mic → AudioWorklet | Float32 mono | 48 kHz |
| AudioWorklet → WebSocket | Int16 PCM (Linear16) | 16 kHz |
| Server → STT | Same Int16 | 16 kHz |
| TTS → server → client | Raw PCM bytes | 16–24 kHz (provider-dependent) |
The 48 kHz Float32 → 16 kHz Int16 downsample is the single most important client-side detail. Doing it in the AudioWorklet (not the main thread) keeps the audio path off React's render loop; a React re-render stalling main-thread audio processing is what causes mysterious glitches in toy implementations.
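The arithmetic of that conversion is small enough to show. In the browser it lives in AudioWorklet JavaScript; the sketch below uses Python only to illustrate the steps, and it assumes the simplest possible filter (averaging each group of three samples), which is a crude low-pass and not necessarily what the real worklet does. The function name is hypothetical.

```python
import struct

def downsample_48k_to_16k(samples):
    """48 kHz Float32 in [-1, 1] -> 16 kHz little-endian Int16 bytes.

    48000 / 16000 = 3, so every 3 input samples become 1 output sample.
    """
    ints = []
    usable = len(samples) - len(samples) % 3  # ignore a trailing partial group
    for i in range(0, usable, 3):
        mean = (samples[i] + samples[i + 1] + samples[i + 2]) / 3.0
        clamped = max(-1.0, min(1.0, mean))   # avoid Int16 overflow on hot input
        ints.append(round(clamped * 32767))
    return struct.pack(f"<{len(ints)}h", *ints)

# 12 input samples -> 4 output samples -> 8 bytes on the wire.
frame = [0.0] * 3 + [1.0] * 3 + [-1.0] * 3 + [2.0] * 3
pcm = downsample_48k_to_16k(frame)
```

A production resampler would normally apply a proper anti-aliasing filter before decimating; the point here is only the 3:1 ratio, the clamp, and the scale to Int16.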
The orchestrator loop
Everything above is plumbing. The brain is one async function that owns the call's state machine: it consumes STT events, fires the LLM, slices its token stream into sentences, queues those sentences for TTS, and pushes audio bytes back over the WebSocket — while listening for interruptions the whole time.
# Per-call task graph (simplified from services/llm_orchestrator.py)
import asyncio

task_manager = TaskManager(websocket)  # audio queue, interrupt signal, llm task handle
history = [{"role": "system", "content": agent.system_prompt}]
stt = build_stt(agent.stt_provider, language=agent.stt_language)

# 1) STT loop: listens for transcripts & barge-in
async def stt_loop():
    async for event in stt.events():
        if event.type == "interim_transcript" and task_manager.is_busy:
            await task_manager.handle_interruption()  # see Barge-In section
        elif event.type == "utterance_end":
            asyncio.create_task(run_llm_and_tts(event.final_text))

# 2) The actual pipeline: LLM → sentence buffer → TTS queue
async def run_llm_and_tts(user_text):
    history.append({"role": "user", "content": user_text})
    sentence_buffer = ""
    async for delta in stream_llm(history[-25:], tools=TOOL_SCHEMAS):
        if delta.tool_call:  # tool calls handled separately
            await dispatch_tool(delta.tool_call)
            return
        sentence_buffer += delta.content or ""
        # extract_sentences is assumed to return (completed sentences, remainder),
        # slicing on punctuation boundaries
        sentences, sentence_buffer = extract_sentences(sentence_buffer)
        for sentence in sentences:
            if task_manager.interrupt_signal.is_set():  # bail on barge-in
                return
            await task_manager.tts_queue.put(sentence)
    if sentence_buffer.strip():
        await task_manager.tts_queue.put(sentence_buffer)  # flush tail (don't drop it!)

# 3) TTS worker: drains the sentence queue
# (`tts` is the agent's TTS adapter, chosen per agent like `stt` above)
async def tts_worker():
    while True:
        sentence = await task_manager.tts_queue.get()
        async for chunk in tts.synthesize(sentence):
            if task_manager.interrupt_signal.is_set():
                break
            await task_manager.audio_queue.put(chunk)

# 4) Audio sender: drains audio queue to the client
async def audio_sender():
    while True:
        chunk = await task_manager.audio_queue.get()
        await websocket.send_bytes(chunk)

await asyncio.gather(stt_loop(), tts_worker(), audio_sender())
Four concurrent tasks. Two queues between them (sentences, then audio). One shared TaskManager holding the interrupt signal that every task checks before doing anything. That's the entire architecture; everything else in this post is detail about how each of those four tasks behaves.
Sentence Buffering — The Magic Trick #
Why does Velox feel snappy when most voice bots feel laggy? Sentence buffering.
A naive pipeline waits for the LLM to finish generating, sends the whole response to TTS, waits for TTS to synthesise the whole thing, then plays it. That's three serial round-trips — users sit through ~2–3 seconds of dead air every turn. The fix is to chunk the LLM token stream by punctuation boundaries and pipe each completed sentence to TTS the instant it lands:
import re

# The whole magic, in one regex.
SENTENCE_BOUNDARY = re.compile(r"([.!?:;])\s+|\n")
# Stream tokens in; whenever a terminator is seen, slice off the completed
# sentence and ship it to TTS. Everything after the cut stays in the buffer.
Three sub-decisions matter here:
- Boundary set: ., !, ?, :, ;, \n. Anything finer (token count, mid-clause) produces glitchy half-words from TTS. Anything coarser (whole paragraphs) starves the user of audio.
- Hold the tail. If the LLM ends mid-sentence (no terminator), the last buffered chunk has to be flushed at end-of-stream; otherwise the final clause silently disappears. (Ask me how I know.)
- Drop too-short fragments. Sentences shorter than ~3 chars are held back. Otherwise an "OK." costs a full TTS round-trip just to say one syllable.
Net effect: the user hears the first sentence ~600 ms after they stop speaking, not ~2.5 s. TTS for sentence 2 happens while sentence 1 is being played. The whole pipeline is naturally pipelined.
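Put together, the slicing step might look like the sketch below. It reuses the regex from above; the (sentences, leftover) return shape and the MIN_CHARS threshold of 4 (chosen so that a bare "OK." is held back and merged into the next sentence) are assumptions for illustration.

```python
import re

SENTENCE_BOUNDARY = re.compile(r"([.!?:;])\s+|\n")
MIN_CHARS = 4  # assumption: anything shorter waits for the next sentence

def extract_sentences(buffer):
    """Return (completed_sentences, leftover) for the text buffered so far."""
    sentences = []
    start = 0  # index where the next sentence begins
    for match in SENTENCE_BOUNDARY.finditer(buffer):
        # Keep the terminator (group 1) but not the whitespace after it.
        end = match.end(1) if match.group(1) else match.start()
        candidate = buffer[start:end].strip()
        if len(candidate) < MIN_CHARS:
            continue  # too short: leave it in place so it joins what follows
        sentences.append(candidate)
        start = match.end()
    return sentences, buffer[start:]

# Simulated token stream: feed deltas in, ship sentences out as they complete.
spoken, buffer = [], ""
for token in ["OK. Sure", " thing! Let me", " check"]:
    buffer += token
    done, buffer = extract_sentences(buffer)
    spoken.extend(done)
if buffer.strip():
    spoken.append(buffer.strip())  # end of stream: flush the tail
```

Because the regex requires whitespace after the terminator, a period at the very end of the buffer is not cut until the next token arrives or the tail is flushed, which also keeps decimals such as "3.5" in one piece.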
Barge-In, Implemented #
Barge-in — the user talking over the agent mid-sentence — was the bug class that bit the most before it was solved. The hardest part isn't detecting that the user spoke; it's making sure the agent shuts up within ~100 ms, across three concurrent tasks, two queues, and one TCP stream worth of buffered audio.
Two engineering choices made it tractable:
1. Detect barge-in semantically, not acoustically. A naive implementation watches mic energy — anything above a threshold counts as the user speaking. In real-world audio that fires on keyboard clicks, doors, AC fans, the user's own breathing. Velox uses the STT engine's interim transcripts instead: if Deepgram returns a partial that contains a recognisable word while the agent is speaking, that's a real interrupt. Otherwise it's noise — ignore it.
2. Treat interruption as a cascade, not a flag. When the interrupt fires, six different things have to happen in order. Get any of them wrong and you hear overlapping voices, ghost audio, or the agent picking up where it left off after a 2-second pause:
- An interim transcript arrives while task_manager.is_busy == True.
- interrupt_signal.set() fires; every loop checks this flag before doing anything.
- The in-flight LLM task is cancelled; it receives CancelledError and unwinds.
- A {"type":"control","action":"interrupt"} message goes over the WebSocket. The client clears its playback queue and stops every AudioBufferSourceNode currently scheduled.
- was_interrupted = True is flagged on the call state; the next LLM call will see [User interrupted you] injected into history so the model recovers gracefully.
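The cascade above can be sketched as a single handle_interruption method. This is an illustration under assumptions, not the real TaskManager: the attribute names follow the snippets in this post, but draining both queues, the send_text call and the FakeWebSocket stand-in are my guesses at a reasonable shape.

```python
import asyncio
import json

class TaskManager:
    """Hypothetical per-call state: queues, interrupt flag, in-flight LLM task."""
    def __init__(self, websocket):
        self.websocket = websocket
        self.tts_queue = asyncio.Queue()
        self.audio_queue = asyncio.Queue()
        self.interrupt_signal = asyncio.Event()
        self.llm_task = None
        self.was_interrupted = False

    async def handle_interruption(self):
        self.interrupt_signal.set()              # workers see this on their next check
        if self.llm_task and not self.llm_task.done():
            self.llm_task.cancel()               # CancelledError is raised inside the task
            try:
                await self.llm_task
            except asyncio.CancelledError:
                pass
        for queue in (self.tts_queue, self.audio_queue):
            while not queue.empty():             # assumption: drop everything unsent
                queue.get_nowait()
        await self.websocket.send_text(
            json.dumps({"type": "control", "action": "interrupt"}))
        self.was_interrupted = True              # lets the next LLM call add the notice

class FakeWebSocket:
    """Records outgoing messages so the sketch runs without a server."""
    def __init__(self):
        self.sent = []

    async def send_text(self, text):
        self.sent.append(text)

async def demo():
    ws = FakeWebSocket()
    tm = TaskManager(ws)
    tm.llm_task = asyncio.create_task(asyncio.sleep(60))  # pretend LLM stream
    await tm.tts_queue.put("sentence that was never spoken")
    await tm.audio_queue.put(b"\x00\x00")
    await asyncio.sleep(0)                                # let the fake task start
    await tm.handle_interruption()
    return tm, ws

tm, ws = asyncio.run(demo())
```

The flag is left set on purpose in this sketch; it would be cleared when the next turn starts, so that a worker in the middle of a synthesis loop cannot miss it.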
All of the above assumes the user's mic isn't capturing the agent's own speaker output and triggering false interrupts. The browser's built-in WebRTC AEC (via getUserMedia({ echoCancellation: true })) handles this for free in-browser. On the server side, you'd need reference-signal subtraction — subtract the TTS output from the inbound mic stream before sending it to STT — but for browser-first apps you genuinely don't have to write that code.
The known imperfection: when the user interrupts mid-sentence, the conversation history records the full intended response, not "what the user actually heard." If the user references something the agent never finished saying, the model can get confused. The fix is to truncate the history to the last word the user heard, which requires knowing which audio chunks actually made it past the jitter buffer. Not solved yet.
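For what it's worth, here is one rough way the truncation could be approached. It is purely a sketch of an unsolved problem: it assumes the client can report how many milliseconds of agent audio it actually played (played_ms, a hypothetical signal) and that the server knows each sentence's synthesized duration. Words inside a partially played sentence are apportioned linearly, which is only an approximation.

```python
def heard_text(sentences, durations_ms, played_ms):
    """Estimate the text the user heard before interrupting.

    sentences:    sentences queued for TTS, in order
    durations_ms: audio length of each synthesized sentence
    played_ms:    playback time reported by the client (hypothetical)
    """
    heard = []
    remaining = played_ms
    for sentence, duration in zip(sentences, durations_ms):
        if remaining >= duration:
            heard.append(sentence)       # this sentence played in full
            remaining -= duration
            continue
        # Partially played: keep a proportional share of its words.
        words = sentence.split()
        keep = int(len(words) * remaining / duration)
        if keep:
            heard.append(" ".join(words[:keep]) + "...")
        break
    return " ".join(heard)

partial = heard_text(
    ["Our return window is thirty days.", "Refunds go back to the original card."],
    [2000, 2400],
    3200,
)
```

The assistant message stored in history would then be this estimate instead of the full intended response.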
Knowledge Base & RAG #
A voice agent without business knowledge can't answer "what's your return policy?" It needs RAG — retrieval-augmented generation. The pipeline is dead simple but every parameter is a trade-off:
Drag-drop in dashboard
↓ multipart upload
Parse: PDF → PyPDF | DOCX → python-docx | TXT → utf-8 | Image → OCR
↓
Chunk: LangChain RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
↓
Embed: SentenceTransformer("all-MiniLM-L6-v2") → 384-dim vectors
↓
Upsert: Qdrant collection="agent_knowledge"
payload = { agent_id, file_id, filename, content, source, chunk_index }
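The Chunk step's two parameters are easier to reason about with the arithmetic spelled out. The sketch below is a deliberately simplified fixed-window chunker, not LangChain's RecursiveCharacterTextSplitter: it ignores paragraph and sentence boundaries and only shows how chunk_size and chunk_overlap interact. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Fixed-width character windows; each starts chunk_size - chunk_overlap after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 1200 characters -> windows starting at 0, 450 and 900.
sample = "".join(chr(65 + i % 26) for i in range(1200))
chunks = chunk_text(sample)
```

Each chunk repeats the last 50 characters of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk.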
Choices worth defending:
- 500 char / 50 overlap chunks, character-based, not token-based. Small enough to fit 3-5 in an LLM call without bloating cost; large enough to carry real context. The recursive splitter respects paragraph → sentence → word boundaries before falling back to mid-word.
- Local embeddings (all-MiniLM-L6-v2) over OpenAI's text-embedding-3-small. 384-dim is plenty for SMB-sized KBs (~50–500 chunks per agent), inference is sub-50 ms on CPU, and there's no per-token bill or rate limit on the hot path. The recall gap vs. OpenAI is not noticeable below a few thousand chunks. For technical / legal / multilingual KBs, swap in something stronger. Alternatives: OpenAI text-embedding-3, Voyage, Cohere, BGE.
- One Qdrant collection, payload-filter isolation. Every point has an agent_id field with a payload index on it; every query filters by it. Cheap, simple, scales to thousands of agents. If a customer ever demands physical separation, the migration is "one collection per tenant": a refactor, not a rewrite.
- Top-3 vector-only retrieval. No hybrid keyword search, no re-ranker. Top-3 with 500-char chunks puts ~1500 chars of context in the LLM, enough for concise factual answers that read naturally when spoken aloud.
from qdrant_client import models

# Tenant isolation = one line of Qdrant filter
query_filter = models.Filter(must=[
    models.FieldCondition(
        key="agent_id",
        match=models.MatchValue(value=agent_id),
    )
])
results = qdrant.search(
    collection_name="agent_knowledge",
    query_vector=embed(user_question),  # 384-dim
    query_filter=query_filter,
    limit=3,
)
The retrieved chunks aren't stuffed into the system prompt; they're injected as the response of a search_knowledge_base tool call. The LLM decides when to ask for them. That's crucial: it means the model can choose not to search when the user's just exchanging pleasantries, which saves a Qdrant round-trip every turn. Which leads us to…
Tool Calls Inside a Streaming Loop #
Tool calling on a non-streaming HTTP chat endpoint is easy. Tool calling inside a streaming voice pipeline is genuinely tricky, because tool calls arrive as partial JSON interleaved with regular content tokens, and you have to make decisions on the fly:
- Suppress any "narration" tokens the model emits alongside the tool call (it shouldn't say "I'm going to look that up…" out loud while it's already doing it).
- Mask the tool latency with a filler phrase so the user hears something.
- Execute the tool, append the result to history, then re-call the LLM, which streams the actual spoken answer back through the sentence buffer.
import random

async def run_llm_and_tts(user_text=None, _continuation=False):
    # On the continuation pass the tool result is already in history,
    # so there is no new user message to append.
    if not _continuation:
        history.append({"role": "user", "content": user_text})
    sentence_buffer = ""
    tool_call_buffer = None
    async for delta in stream_llm(history[-25:], tools=TOOL_SCHEMAS):
        # 1) Accumulate partial tool-call JSON across stream chunks
        if delta.tool_call:
            tool_call_buffer = accumulate(tool_call_buffer, delta.tool_call)
            continue  # don't TTS this delta
        # 2) Regular content tokens flow into the sentence buffer
        sentence_buffer += delta.content or ""
        # extract_sentences is assumed to return (completed sentences, remainder)
        sentences, sentence_buffer = extract_sentences(sentence_buffer)
        for sentence in sentences:
            if task_manager.interrupt_signal.is_set():
                return
            await task_manager.tts_queue.put(sentence)
    # 3) End of stream: if a tool was called, execute and recurse
    if tool_call_buffer:
        # Latency mask: queue a filler while the tool runs
        await task_manager.tts_queue.put(random.choice(FILLER_PHRASES))
        result = await execute_tool(tool_call_buffer)  # search KB, end call, etc.
        # Note: OpenAI-style APIs expect the assistant message carrying the
        # tool call to precede this tool message in history.
        history.append({"role": "tool", "content": result,
                        "tool_call_id": tool_call_buffer.id})
        await run_llm_and_tts(_continuation=True)  # second LLM call answers
    elif sentence_buffer.strip():
        await task_manager.tts_queue.put(sentence_buffer)  # flush tail
The filler phrase ("Let me check the knowledge base for that…") masks ~400–700 ms of dead air on knowledge searches. It's a small thing, but it's the difference between an agent that feels thoughtful and one that feels stuck. The same trick applies to slow LLMs in general — pick a generic "thinking" phrase, queue it before the LLM call, and the user hears engagement instead of silence.
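The accumulate helper in the loop above is where the partial JSON gets stitched together. A minimal sketch follows, assuming OpenAI-style streaming in which the call id and function name arrive in an early fragment and the arguments arrive as pieces of one JSON string; ToolCallDelta and ToolCallBuffer are hypothetical shapes, not a real SDK's types.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCallDelta:
    """Hypothetical shape of one streamed fragment of a tool call."""
    id: Optional[str] = None
    name: Optional[str] = None
    arguments: str = ""  # a piece of the JSON arguments string

@dataclass
class ToolCallBuffer:
    id: str = ""
    name: str = ""
    arguments: str = ""

    def parsed_arguments(self):
        """Parse only after the stream ends, when the JSON is complete."""
        return json.loads(self.arguments or "{}")

def accumulate(buffer, delta):
    """Merge one streamed fragment into the buffer, creating it on first use."""
    if buffer is None:
        buffer = ToolCallBuffer()
    if delta.id:
        buffer.id = delta.id
    if delta.name:
        buffer.name = delta.name
    buffer.arguments += delta.arguments
    return buffer

fragments = [
    ToolCallDelta(id="call_1", name="search_knowledge_base"),
    ToolCallDelta(arguments='{"query": "ret'),
    ToolCallDelta(arguments='urn policy"}'),
]
buf = None
for fragment in fragments:
    buf = accumulate(buf, fragment)
```

The important property is that nothing tries to parse the arguments mid-stream; any single fragment is almost never valid JSON on its own.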
Today: search_knowledge_base(query) and end_call(reason). The obvious next additions for a customer-care use case: transfer_to_human, send_sms, send_email, schedule_callback, book_appointment, lookup_order(id), verify_identity(otp). For user-defined tools, the design that's clearly right is per-agent encrypted credentials in the DB plus server-side execution, so credentials never reach the client. Not built yet, but the adapter shape is there.
Latency in Practice #
Real numbers from a production pipeline on a good connection (English path, Deepgram STT + NVIDIA NIM LLM + Deepgram Aura TTS, client in ap-south-1). These are estimates from log-stamped TTS first-byte events — honest disclaimer: there's no P50/P95 dashboard yet, which is genuinely the biggest operational gap left in the system.
Two observations worth taking away:
- The LLM is rarely the bottleneck anymore. Going in, I assumed "AI is slow" and built around hiding model latency. With Groq / Cerebras / NVIDIA NIM returning first tokens in 150–250 ms, the LLM is faster than the network round-trip to TTS. The real bottlenecks are (a) STT endpointing waiting for silence and (b) TTS providers that don't stream.
- Users notice silence after they speak, not delay before they're answered. 200 ms of silence between "Hi" and the agent reacting feels broken. 500 ms of audio delay once the agent is talking feels totally fine. That asymmetry should shape where you spend your optimisation budget.
Not a latency win — a perception win: switching from energy-based VAD to transcript-based barge-in. Before: any background sound (typing, doors, fans) interrupted the agent. After: only real speech does. Users stopped reporting that the AI "sounds skittish." Sometimes the biggest performance gain isn't a faster path — it's removing a path that fires when it shouldn't.
What's left to attack: Sarvam's TTS doesn't stream; it returns full utterances in one HTTP response. For a typical one-sentence reply that's 300–600 ms of dead air on Indian-language agents. The day Sarvam ships a streaming endpoint, that becomes a one-line config change for a 30% latency win. STT endpointing (utterance_end_ms=1000) is the runner-up: dropping it to 500 ms shaves half a second off perceived latency at the cost of occasional early triggers. It's a tunable knob, not a code change.
What Surprised Me #
Three things I didn't expect going in. Worth sharing because every voice-agent post on the internet was wrong about at least one of them:
- The LLM is the fastest part. See above. Most blog posts treat the LLM as the slow expensive thing you wrap with caching and prompt compression. With modern inference providers, it's faster than your network round-trip. Spend your optimisation budget on STT endpointing and TTS streaming — that's where the seconds live.
- Semantic caching isn't worth it for voice. Theory: 40% of customer-support queries are repeats; cache them. Reality: voice queries are wildly variable in phrasing ("how do I…" / "can you tell me…" / "what's the way to…" / "I want to know about…"), cache hit rates are low, and the operational complexity isn't worth the win when the uncached path is already ~700 ms.
- You need fewer queues than you think. Early prototypes had a queue between every stage of the pipeline. All you actually need is a sentence queue (LLM → TTS) and an audio queue (TTS → client). Everything else is in-process async/await — queues add latency and add a place for state to drift out of sync with the interrupt signal.
Want to talk to the thing in this post?
Everything above is live at Velox AI — the platform I built that lets businesses configure their own voice agents with this exact pipeline. Try the public demo, build your own agent, or browse the public library.