Case study
Enterprise voice AI platform
Real-time multi-tenant voice agent platform handling enterprise customer workflows with sub-second latency, full observability, and GDPR-grade isolation.
Headline metric
< 1s
response latency
Stack
- LiveKit
- Azure AI Foundry
- TypeScript
- Python
- FastAPI
The problem
A London-based enterprise needed a voice agent platform that could handle real customer workflows, not just demo-quality voice. Three constraints ruled out most off-the-shelf solutions: GDPR-grade data isolation across tenants, sub-second latency on the response loop, and integration with an existing CRM and telephony stack without rewriting either.
The approach
I architected the backend from scratch with the founder, then led the build. The core is a Python orchestration engine on LiveKit that handles the realtime media, with Azure AI Foundry powering the LLM layer and a RAG retrieval system over per-tenant knowledge bases. Telephony, CRM, and tool-calling are wired into the same agent loop so context flows in one direction across a call.
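To make the sub-second target concrete, here is a minimal sketch of timing each stage of a single agent turn against a one-second budget. It is an illustration, not the production loop: `run_turn`, the stage names, and the injectable `clock` parameter are all hypothetical, and the stages are plain callables standing in for retrieval, the LLM call, and speech synthesis.

```python
import time

BUDGET_SECONDS = 1.0  # assumption: mirrors the "< 1s" headline target


def run_turn(stages, payload, clock=time.perf_counter):
    """Run (name, fn) stages in order and record how long each one took.

    Returns (final_payload, per_stage_seconds, within_budget).
    `clock` is injectable so the timing logic can be tested deterministically.
    """
    timings = {}
    for name, fn in stages:
        started = clock()
        payload = fn(payload)
        timings[name] = clock() - started
    total = sum(timings.values())
    return payload, timings, total <= BUDGET_SECONDS


if __name__ == "__main__":
    # Hypothetical stand-in stages; real ones would call retrieval, the LLM, and TTS.
    stages = [
        ("retrieve", lambda ctx: {**ctx, "docs": ["refund policy"]}),
        ("llm", lambda ctx: {**ctx, "reply": "Sure, I can help with that."}),
    ]
    result, timings, ok = run_turn(stages, {"tenant": "acme", "text": "refund?"})
    print(timings, "within budget:", ok)
```

Recording per-stage timings rather than only the total is what makes a budget overrun attributable to a specific stage.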
Multi-tenancy is enforced at the data, runtime, and deployment layers. Each tenant gets isolated retrieval indexes, runtime workers, and audit trails. Deployment is automated through dynamically generated Python scripts that provision and update Azure infrastructure per tenant, so onboarding a new customer doesn't require manual ops work.
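As a rough sketch of the idea, per-tenant provisioning can be driven by deriving every resource name from the tenant ID and rendering a script from a template. Everything below is an assumption for illustration: the naming scheme, `resources_for`, `render_provisioning_script`, and the injected `create` callback are hypothetical, and no real Azure SDK calls are shown.

```python
import re
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class TenantResources:
    tenant_id: str
    search_index: str
    worker_pool: str
    audit_log: str


def resources_for(tenant_id: str) -> TenantResources:
    # Restrict IDs to a safe slug so they can be embedded in resource names.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]{1,30}", tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return TenantResources(
        tenant_id=tenant_id,
        search_index=f"kb-{tenant_id}",      # isolated retrieval index
        worker_pool=f"workers-{tenant_id}",  # isolated runtime workers
        audit_log=f"audit-{tenant_id}",      # isolated audit trail
    )


SCRIPT_TEMPLATE = Template('''\
# Generated provisioning script for tenant "$tenant_id" (illustrative only).
RESOURCES = {
    "search_index": "$search_index",
    "worker_pool": "$worker_pool",
    "audit_log": "$audit_log",
}


def provision(create):
    """Call the injected create(kind, name) once per resource."""
    for kind, name in RESOURCES.items():
        create(kind, name)
''')


def render_provisioning_script(tenant_id: str) -> str:
    """Return Python source for a tenant-specific provisioning script."""
    r = resources_for(tenant_id)
    return SCRIPT_TEMPLATE.substitute(
        tenant_id=r.tenant_id,
        search_index=r.search_index,
        worker_pool=r.worker_pool,
        audit_log=r.audit_log,
    )


if __name__ == "__main__":
    print(render_provisioning_script("acme"))
```

Because every name is a pure function of the tenant ID, two tenants cannot share an index, worker pool, or audit log by accident, and the generated script can be reviewed or diffed before it runs.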
Tech decisions worth noting
- LiveKit over building on raw WebRTC. The realtime primitives and turn-taking are battle-tested, and the open-source posture meant we could self-host where needed for compliance.
- Azure AI Foundry for the LLM layer. Aligned with the customer's existing cloud, simplifying procurement and security review.
- Python for the agent runtime. The agentic and RAG ecosystems are stronger in Python, and latency was good enough at our scale.
- TypeScript for the dashboard. Separate codebase, clean API boundary, faster UI iteration.
Outcome
[TODO: Neeraj] Anonymised outcome metrics. Examples: "achieved X ms median latency", "deployed to N tenants", "supports M concurrent calls per tenant", or qualitative wins like "passed enterprise security review on first pass".
What I learned
The hardest engineering wasn't the model layer. It was making the orchestration deterministic enough to debug. Voice agents fail in ways text agents don't (silence, partial speech, barge-in mid-tool-call), and the eval harness for these failure modes is something the open-source ecosystem still hasn't solved. Building one in-house was where most of the senior engineering time went.
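One way to picture "deterministic enough to debug" is a turn-taking state machine whose transitions are a pure lookup, so a recorded event sequence can be replayed offline as an eval case. The sketch below is an assumption-laden illustration, not the in-house harness: the states, event names, and action labels are all hypothetical.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    THINKING = "thinking"
    TOOL_CALL = "tool_call"
    SPEAKING = "speaking"


# (state, event) -> (next_state, actions). All names are illustrative.
TRANSITIONS = {
    (State.LISTENING, "user_final"): (State.THINKING, ["start_llm"]),
    (State.LISTENING, "silence_timeout"): (State.SPEAKING, ["play_reprompt"]),
    (State.THINKING, "tool_requested"): (State.TOOL_CALL, ["start_tool"]),
    (State.THINKING, "reply_ready"): (State.SPEAKING, ["start_tts"]),
    (State.THINKING, "barge_in"): (State.LISTENING, ["cancel_llm"]),
    (State.TOOL_CALL, "tool_done"): (State.THINKING, ["start_llm"]),
    (State.TOOL_CALL, "barge_in"): (State.LISTENING, ["cancel_tool"]),
    (State.SPEAKING, "speech_done"): (State.LISTENING, []),
    (State.SPEAKING, "barge_in"): (State.LISTENING, ["stop_tts"]),
}


def step(state, event):
    """Pure transition: unknown (state, event) pairs keep the state and are flagged."""
    return TRANSITIONS.get((state, event), (state, ["ignored"]))


def replay(events, start=State.LISTENING):
    """Replay a recorded event list and return the final state plus a full trace."""
    state, trace = start, []
    for event in events:
        nxt, actions = step(state, event)
        trace.append((state.value, event, nxt.value, tuple(actions)))
        state = nxt
    return state, trace


if __name__ == "__main__":
    # Barge-in during a tool call, followed by the tool's late result.
    final, trace = replay(["user_final", "tool_requested", "barge_in", "tool_done"])
    for row in trace:
        print(row)
    print("final:", final.value)
```

With this shape, a failure such as a late tool result arriving after a barge-in becomes an ordinary test case: the trace shows it was ignored instead of restarting the LLM.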