Tech Stack
What runs under the hood.
Every component is chosen against the same two constraints: sub-second voice latency, and horizontal scale bounded by provider capacity rather than our own hardware. Here is the whole stack, and the reason for each call.
The design principle is simple: the heavy machine learning is bought as streaming APIs; what we run ourselves is deliberately lightweight and stateless. That keeps per-call cost I/O-bound — no GPUs, no model hosting — so the platform scales out on cheap, identical nodes until it meets a provider's concurrency limit, which is a procurement lever rather than an engineering wall.
Real-time media — what we run
Pipecat
Real-time voice pipeline
Purpose-built for low-latency conversational voice — VAD → STT → LLM → TTS with barge-in. Provider-agnostic adapters; we don't hand-build a fragile turn-taking state machine.
FastAPI
Webhook + WebSocket server
Native asyncio — a single event loop holds many concurrent call sockets, matching the I/O-bound profile. First-class WebSocket support for bidirectional audio.
Silero VAD
On-box voice detection
A tiny CPU model that detects turn boundaries with no network round-trip — keeping barge-in latency low and avoiding a per-minute vendor charge for voice activity detection.
AI services — bought as streaming APIs
Claude · Anthropic
Dialogue + tool-calling
Haiku 4.5 for sub-second in-call turns; Opus for offline rule-book authoring. Strong tool-use lets the agent act mid-call, and prompt caching cuts cost on the system prompt and history.
Deepgram
Streaming speech-to-text
Low-latency streaming with partial transcripts — required for barge-in and natural turn-taking. Multilingual, with high concurrency tiers.
Cartesia
Streaming TTS · English
Very low time-to-first-audio with natural prosody; speech streams out before the full turn is generated, shrinking perceived latency.
Sarvam
Streaming TTS · Indian languages
Native Hindi, Hinglish and Indian-English quality that Western engines lack — so the India market is served without compromise.
Voyage
Embeddings for retrieval
Turns each tenant's documents into vectors for the knowledge base, so the agent grounds its answers in your content rather than the model's priors.
Telephony
Twilio
Global programmable voice
Bidirectional audio over WebSocket (Media Streams), global PSTN reach, and a mature API.
Exotel
India-local carrier
Better Indian connectivity, compliance and per-minute economics. Running both providers gives geo-routing, redundancy and no single-vendor lock-in.
Data
MongoDB
Timeline · calls · tenants
A document model that fits append-heavy, schema-flexible event and timeline data, with straightforward horizontal sharding.
PostgreSQL
Control-plane state
ACID and relational integrity exactly where it matters — tenancy and billing.
Qdrant
Vector database · RAG
Fast approximate-nearest-neighbour search over the embeddings, so retrieval grounds every answer in tenant-specific knowledge.
Redis
Queues · sessions · signal
In-memory speed for ephemeral hot-path state — and the source of the active-load signal that drives autoscaling.
MinIO
S3-compatible storage
Call recordings and assets on tenant-controlled storage with the S3 API — compliance-friendly, with no cloud lock-in.
Identity & secrets
Keycloak
OIDC identity provider
A realm with per-tenant groups gives clean multi-tenant identity isolation; standards-based RS256 JWTs are validated uniformly at every service hop.
HashiCorp Vault
Per-tenant secret store
Sealed, audited secret storage — provider keys and tenant tokens never sit in plaintext env or config.
Orchestration & runtime
k3s
Lightweight Kubernetes
The full Kubernetes API at a fraction of the footprint — a clean growth path from a single node to a multi-node fleet.
KEDA
Event-driven autoscaling
Scales on the real signal — active calls and queue depth — not CPU, which is the correct trigger for an I/O-bound workload.
nginx
Reverse proxy · TLS · WS
Battle-tested WebSocket upgrade and TLS termination, with sticky routing that pins a call to its worker for the call's lifetime.
Docker + systemd
Packaging + supervision
Reproducible, isolated dependency deployment, with reliable supervision of the services that run outside the Kubernetes path.
Python · async
Primary language
The AI and voice ecosystem is Python-first; asyncio matches the concurrency model, and one language spans the media, orchestration and control planes.
Want the architecture behind these choices — the four planes, the call data-flow, and the scaling model? See the Architecture page, or start free and drive it yourself.