Tech Stack

What runs under the hood.

Every component is chosen against the same two constraints: sub-second voice latency, and horizontal scale bounded by provider capacity rather than our own hardware. Here is the whole stack, and the reason behind each choice.

The design principle is simple: the heavy machine learning is bought as streaming APIs, and what we run ourselves is deliberately lightweight and stateless. That keeps each call's workload I/O-bound (no GPUs, no model hosting), so the platform scales out on cheap, identical nodes until it reaches a provider's concurrency limit. That limit is a procurement lever rather than an engineering wall.
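The scaling argument comes down to a small calculation. This sketch makes three assumptions. First, every live call holds one concurrent stream at each provider, so the smallest provider limit caps the whole platform. Second, the per-node capacity is a placeholder. Third, the provider limits are hypothetical, not real quotas.

```python
import math

def plan_capacity(
    target_calls: int,
    calls_per_node: int,
    provider_limits: dict[str, int],
) -> tuple[int, int, str]:
    """Return (servable_calls, nodes_needed, binding_constraint).

    Assumes each live call holds one concurrent stream at every provider,
    so the smallest provider limit caps the platform.
    """
    bottleneck = min(provider_limits, key=provider_limits.__getitem__)
    ceiling = provider_limits[bottleneck]
    servable = min(target_calls, ceiling)
    nodes = math.ceil(servable / calls_per_node)  # identical, stateless nodes
    binding = bottleneck if target_calls > ceiling else "demand"
    return servable, nodes, binding

# Hypothetical limits and per-node capacity, for illustration only.
limits = {"stt": 500, "llm": 2000, "tts": 800}
print(plan_capacity(1200, 100, limits))  # (500, 5, 'stt')
```

In this example, raising the STT tier moves the ceiling without any code change. That is the sense in which the limit is a procurement lever.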

Real-time media — what we run

Pipecat

Real-time voice pipeline

Purpose-built for low-latency conversational voice: VAD → STT → LLM → TTS, with barge-in. Its provider-agnostic adapters and built-in interruption handling mean we don't hand-build a fragile turn-taking state machine.
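At its core, barge-in is a race between the agent's playback and the caller starting to speak. The sketch below models that control flow in plain asyncio. It is not Pipecat's API. `transcribe`, `generate_reply` and `speak` are hypothetical stand-ins for the STT, LLM and TTS services.

```python
import asyncio
import contextlib
from typing import Optional

# Hypothetical stand-ins for the STT, LLM and TTS stages. In the real stack
# Pipecat's provider adapters fill these roles; this models control flow only.
async def transcribe(audio: str) -> str:
    await asyncio.sleep(0)  # placeholder for the streaming STT round-trip
    return audio.removeprefix("audio:")

async def generate_reply(text: str) -> list[str]:
    await asyncio.sleep(0)  # placeholder for the LLM call
    return [f"You said {text}.", "Anything else?", "I can also check your bill."]

async def speak(sentences: list[str], played: list[str]) -> None:
    for sentence in sentences:
        await asyncio.sleep(0.05)  # time taken to play one sentence
        played.append(sentence)

async def handle_turn(audio: str, barge_in: asyncio.Event, played: list[str]) -> bool:
    """Run STT -> LLM -> TTS for one turn; return True if the caller barged in."""
    reply = await generate_reply(await transcribe(audio))
    playback = asyncio.create_task(speak(reply, played))
    interrupt = asyncio.create_task(barge_in.wait())
    done, pending = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:  # cancel whichever side lost the race
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
    return interrupt in done

async def demo(interrupt_after: Optional[float]) -> tuple[list[str], bool]:
    played: list[str] = []
    barge_in = asyncio.Event()
    if interrupt_after is not None:
        # Simulate the VAD detecting caller speech part-way through playback.
        asyncio.get_running_loop().call_later(interrupt_after, barge_in.set)
    interrupted = await handle_turn("audio:hello", barge_in, played)
    return played, interrupted

# Barge-in is scheduled at 75 ms, during the second sentence.
print(asyncio.run(demo(0.075)))
```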

FastAPI

Webhook + WebSocket server

Native asyncio — a single event loop holds many concurrent call sockets, matching the I/O-bound profile. First-class WebSocket support for bidirectional audio.
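One event loop can carry many calls because each call spends nearly all its time waiting on network I/O. This stdlib sketch uses `asyncio.sleep` as a stand-in for socket reads. A FastAPI WebSocket endpoint is likewise an `async def` coroutine scheduled on an event loop.

```python
import asyncio
import time

async def simulated_call(call_id: int, io_wait: float) -> int:
    # Stand-in for one call's WebSocket: the coroutine spends its time awaiting
    # I/O, which hands the event loop to the other calls in the meantime.
    await asyncio.sleep(io_wait)
    return call_id

async def run_calls(n_calls: int, io_wait: float) -> tuple[int, float]:
    start = time.perf_counter()
    finished = await asyncio.gather(*(simulated_call(i, io_wait) for i in range(n_calls)))
    return len(finished), time.perf_counter() - start

count, elapsed = asyncio.run(run_calls(1_000, 0.1))
# Wall time stays close to one io_wait, far below 1,000 x io_wait.
print(f"{count} calls in {elapsed:.2f}s")
```

The flip side is that CPU-heavy work inside a handler would stall every call sharing the loop. This is one reason the heavy machine learning stays behind streaming APIs.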

Silero VAD

On-box voice detection

A tiny CPU model that detects turn boundaries with no network round-trip — keeping barge-in latency low and avoiding a per-minute vendor charge for voice activity detection.
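A VAD such as Silero produces a speech probability for each short audio frame. Turning those probabilities into turn boundaries takes a little hysteresis. The sketch uses assumed values (32 ms frames, a 0.5 threshold, 480 ms of silence to end a turn), chosen for illustration rather than taken from Silero.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TurnDetector:
    """Turns per-frame speech probabilities into turn events.

    frame_ms, threshold and min_silence_ms are illustrative values chosen for
    this sketch; the probabilities themselves would come from the VAD model.
    """
    frame_ms: int = 32
    threshold: float = 0.5
    min_silence_ms: int = 480
    in_speech: bool = False
    silence_ms: int = 0

    def push(self, prob: float) -> Optional[str]:
        if prob >= self.threshold:
            self.silence_ms = 0
            if not self.in_speech:
                self.in_speech = True
                return "speech_start"  # caller started talking: barge-in trigger
            return None
        if self.in_speech:
            self.silence_ms += self.frame_ms
            if self.silence_ms >= self.min_silence_ms:
                self.in_speech = False
                self.silence_ms = 0
                return "turn_end"  # pause long enough: hand the turn to the LLM
        return None

detector = TurnDetector()
events = [detector.push(p) for p in [0.1, 0.9, 0.8] + [0.1] * 15]
print([e for e in events if e])  # ['speech_start', 'turn_end']
```

A short pause below `min_silence_ms` does not end the turn, so the agent does not jump in while the caller draws breath.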

AI services — bought as streaming APIs

Claude · Anthropic

Dialogue + tool-calling

Haiku 4.5 for sub-second in-call turns; Opus for offline rule-book authoring. Strong tool-use lets the agent act mid-call, and prompt caching cuts cost on the system prompt and history.
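This sketch shows how the two model tiers and prompt caching could fit together. It builds a Messages API request body as a plain dict. The model identifiers are assumptions, so check Anthropic's current model list. `book_appointment` is a hypothetical tool.

```python
from typing import Any

# Assumed model identifiers; check Anthropic's current model list before use.
IN_CALL_MODEL = "claude-haiku-4-5"
AUTHORING_MODEL = "claude-opus-4-1"

# Hypothetical tool the agent could call mid-call.
BOOK_TOOL = {
    "name": "book_appointment",
    "description": "Book an appointment slot for the caller.",
    "input_schema": {
        "type": "object",
        "properties": {"slot_iso": {"type": "string"}},
        "required": ["slot_iso"],
    },
}

def build_request(
    system_prompt: str,
    tools: list[dict[str, Any]],
    history: list[dict[str, Any]],
    *,
    offline: bool = False,
    max_tokens: int = 300,
) -> dict[str, Any]:
    """Build a Messages API body with prompt-caching breakpoints.

    Caching covers the prompt prefix (tools, then system, then messages)
    up to each block marked with cache_control.
    """
    cache = {"type": "ephemeral"}
    messages = [dict(m) for m in history]  # copy so the caller's history is untouched
    if messages:
        content = messages[-1]["content"]
        if isinstance(content, str):
            blocks = [{"type": "text", "text": content}]
        else:
            blocks = [dict(b) for b in content]
        blocks[-1]["cache_control"] = cache  # cache the conversation so far
        messages[-1]["content"] = blocks
    body: dict[str, Any] = {
        "model": AUTHORING_MODEL if offline else IN_CALL_MODEL,
        "max_tokens": max_tokens,
        "system": [{"type": "text", "text": system_prompt, "cache_control": cache}],
        "messages": messages,
    }
    if tools:
        body["tools"] = tools
    return body
```

With the official `anthropic` Python SDK, such a body could be sent as `client.messages.create(**body)`. Marking both the system prompt and the latest turn lets the next turn reuse the cached prefix rather than reprocess it.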

Deepgram

Streaming speech-to-text

Low-latency streaming with partial transcripts — required for barge-in and natural turn-taking. Multilingual, with high concurrency tiers.
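Partial transcripts matter because the agent can stop talking on the first interim hypothesis instead of waiting for a final one. This sketch of that decision assumes Deepgram's live result messages carry these fields:

- `type`: "Results"
- `is_final` and `speech_final` flags
- the text at `channel.alternatives[0].transcript`

Verify the field names against Deepgram's documentation. The returned action names are hypothetical.

```python
from typing import Optional

def on_stt_message(msg: dict, agent_speaking: bool) -> Optional[tuple[str, str]]:
    """Map one streaming-STT message to a (hypothetical) pipeline action."""
    if msg.get("type") != "Results":
        return None  # metadata and other non-transcript messages
    alternatives = msg.get("channel", {}).get("alternatives", [])
    text = alternatives[0].get("transcript", "").strip() if alternatives else ""
    if not text:
        return None
    if not msg.get("is_final"):
        # Interim hypothesis: enough evidence to cut the agent's audio early.
        return ("barge_in", text) if agent_speaking else None
    if msg.get("speech_final"):
        return ("end_of_turn", text)  # caller finished: send the turn to the LLM
    return ("final_segment", text)    # stable text, but the caller is still talking

interim = {"type": "Results", "is_final": False,
           "channel": {"alternatives": [{"transcript": "wait"}]}}
print(on_stt_message(interim, agent_speaking=True))  # ('barge_in', 'wait')
```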

Cartesia

Streaming TTS · English

Very low time-to-first-audio with natural prosody; speech streams out before the full turn is generated, shrinking perceived latency.
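Speech can start before the turn is fully generated because the reply is streamed to TTS sentence by sentence. A minimal sketch of that regrouping follows. The regex is a heuristic of our own, not part of Cartesia's API.

```python
import re
from typing import Iterable, Iterator

# Heuristic boundary: sentence punctuation followed by whitespace. Requiring the
# whitespace avoids splitting decimals such as "3.5".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Regroup streamed LLM tokens into sentences that can be sent to TTS early."""
    buffer = ""
    for token in tokens:
        buffer += token
        *complete, buffer = _SENTENCE_END.split(buffer)
        yield from complete  # each of these can start synthesising immediately
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the LLM finishes

stream = ["Sure", ", I can", " help.", " It costs 3.5", " dollars.", " Anything", " else?"]
print(list(sentences_for_tts(stream)))
# ['Sure, I can help.', 'It costs 3.5 dollars.', 'Anything else?']
```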

Sarvam

Streaming TTS · Indian languages

Native Hindi, Hinglish and Indian-English quality that Western engines lack — so the India market is served without compromise.
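Serving both markets means each call's speech goes to one of two TTS vendors. This minimal routing sketch assumes the call's language is known as a BCP-47 tag. The language set is an illustrative subset, not a statement of Sarvam's coverage.

```python
# Illustrative subset only; check Sarvam's documentation for supported languages.
SARVAM_LANGUAGES = {"hi", "bn", "ta", "te", "mr", "kn", "ml", "gu"}

def pick_tts_vendor(lang_tag: str) -> str:
    """Route a call's TTS by its BCP-47 language tag (e.g. 'hi-IN', 'en-US')."""
    parts = lang_tag.lower().replace("_", "-").split("-")
    language = parts[0]
    region = parts[-1] if len(parts) > 1 else ""
    if language in SARVAM_LANGUAGES or (language == "en" and region == "in"):
        return "sarvam"    # Indian languages and Indian-English
    return "cartesia"      # default English voice path

print([pick_tts_vendor(t) for t in ("hi-IN", "en-IN", "en-US")])
# ['sarvam', 'sarvam', 'cartesia']
```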

Voyage

Embeddings for retrieval

Turns each tenant's documents into vectors for the knowledge base, so the agent grounds its answers in your content rather than the model's priors.
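Before a tenant's documents are embedded, they are usually split into passages small enough to retrieve precisely. This is a minimal sketch of overlapping word-window chunking. The window and overlap sizes are hypothetical and are counted in words, not model tokens.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split a document into overlapping word windows, ready for embedding."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the document
    return chunks

doc = " ".join(f"w{i}" for i in range(10))
print(chunk_text(doc, max_words=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap means a sentence cut at a window edge still appears whole in at least one neighbouring chunk. Each chunk would then be embedded and stored with its tenant identifier.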

Telephony

Twilio

Global programmable voice

Bidirectional audio over WebSocket (Media Streams), global PSTN reach, and a mature API.
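Twilio Media Streams send JSON text messages over the WebSocket, each with an `event` field. Inbound audio arrives in `media` messages as base64-encoded 8 kHz mu-law. This dependency-free sketch unpacks one message. The stdlib `audioop` module once did this decode but was removed in Python 3.13.

```python
import base64
import json
from typing import Optional

def ulaw_to_pcm16(data: bytes) -> list[int]:
    """Decode G.711 mu-law bytes to signed 16-bit linear PCM samples."""
    samples = []
    for byte in data:
        u = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
        sign = u & 0x80
        exponent = (u >> 4) & 0x07
        mantissa = u & 0x0F
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84  # remove the bias
        samples.append(-magnitude if sign else magnitude)
    return samples

def handle_twilio_message(raw: str) -> Optional[list[int]]:
    """Return PCM samples for a 'media' message, or None for control messages."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None  # e.g. 'connected', 'start', 'stop' carry no audio
    return ulaw_to_pcm16(base64.b64decode(msg["media"]["payload"]))

frame = json.dumps({
    "event": "media",
    "streamSid": "MZ-example",  # hypothetical stream id
    "media": {"track": "inbound", "payload": base64.b64encode(bytes([0xFF, 0x00])).decode()},
})
print(handle_twilio_message(frame))  # [0, -32124]
```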

Exotel

India-local carrier

Better Indian connectivity, compliance and per-minute economics. Running both providers gives geo-routing, redundancy and no single-vendor lock-in.
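This is a sketch of geo-routing with failover for outbound calls. The prefix table, the carrier names and the `is_healthy` probe are all hypothetical. The table assumes only Indian numbers are sent through Exotel first.

```python
from typing import Callable

# Hypothetical routing table: E.164 prefix -> carriers in order of preference.
ROUTES = {"+91": ["exotel", "twilio"]}
DEFAULT_ROUTE = ["twilio"]

def choose_carrier(number_e164: str, is_healthy: Callable[[str], bool]) -> str:
    """Pick a carrier for an outbound call, failing over down the preference list."""
    matches = [p for p in ROUTES if number_e164.startswith(p)]
    route = ROUTES[max(matches, key=len)] if matches else DEFAULT_ROUTE  # longest prefix wins
    for carrier in route:
        if is_healthy(carrier):
            return carrier
    raise RuntimeError(f"no healthy carrier for {number_e164}")

print(choose_carrier("+919876543210", lambda c: c != "exotel"))  # twilio
```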

Data

MongoDB

Timeline · calls · tenants

A document model that fits append-heavy, schema-flexible event and timeline data, with straightforward horizontal sharding.

PostgreSQL

Control-plane state

ACID and relational integrity exactly where it matters — tenancy and billing.
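For billing, ACID means a usage debit and its ledger entry commit together or not at all. The sketch uses the stdlib `sqlite3` module as a stand-in for PostgreSQL so it runs anywhere. The table and column names are hypothetical.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE balances (tenant_id TEXT PRIMARY KEY, cents INTEGER NOT NULL);
        CREATE TABLE ledger (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, cents INTEGER NOT NULL);
        INSERT INTO balances VALUES ('acme', 1000);
        """
    )
    return conn

def record_usage(conn: sqlite3.Connection, tenant_id: str, minutes: int, rate_cents: int) -> None:
    """Append a ledger row and debit the balance as one all-or-nothing unit."""
    cost = minutes * rate_cents
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("INSERT INTO ledger (tenant_id, cents) VALUES (?, ?)", (tenant_id, -cost))
        cur = conn.execute(
            "UPDATE balances SET cents = cents - ? WHERE tenant_id = ? AND cents >= ?",
            (cost, tenant_id, cost),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown tenant or insufficient balance")

conn = make_db()
record_usage(conn, "acme", minutes=10, rate_cents=50)      # debits 500
try:
    record_usage(conn, "acme", minutes=20, rate_cents=50)  # 1000 > remaining 500
except ValueError:
    pass
print(conn.execute("SELECT cents FROM balances").fetchone()[0])   # 500
print(conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0])  # 1: the failed debit left no row
```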

Qdrant

Vector database · RAG

Fast approximate-nearest-neighbour search over the embeddings, so retrieval grounds every answer in tenant-specific knowledge.
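Qdrant answers this query with an approximate index plus payload filtering. So that it runs without a server, the sketch below swaps in exact brute-force cosine search. It assumes each point carries a `tenant_id` payload field (a hypothetical name). Filtering before ranking keeps one tenant's documents out of another tenant's answers.

```python
import math
from typing import NamedTuple, Sequence

class Point(NamedTuple):
    id: str
    tenant_id: str  # hypothetical payload field
    vector: tuple[float, ...]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(points: list[Point], query: Sequence[float], tenant_id: str, k: int = 3) -> list[str]:
    """Exact top-k cosine search restricted to one tenant's points."""
    candidates = [p for p in points if p.tenant_id == tenant_id]  # filter first
    ranked = sorted(candidates, key=lambda p: cosine(query, p.vector), reverse=True)
    return [p.id for p in ranked[:k]]

points = [
    Point("a1", "acme", (1.0, 0.0)),
    Point("a2", "acme", (0.0, 1.0)),
    Point("g1", "globex", (1.0, 0.0)),
]
print(search(points, (0.9, 0.1), "acme", k=2))  # ['a1', 'a2']
```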

Redis

Queues · sessions · signal

In-memory speed for ephemeral hot-path state — and the source of the active-load signal that drives autoscaling.

MinIO

S3-compatible storage

Call recordings and assets on tenant-controlled storage with the S3 API — compliance-friendly, with no cloud lock-in.

Identity & secrets

Keycloak

OIDC identity provider

A realm with per-tenant groups gives clean multi-tenant identity isolation; standards-based RS256 JWTs are validated uniformly at every service hop.
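At each hop the token is checked in two steps. The RS256 signature should first be verified by a maintained JWT library against the realm's JWKS, which Keycloak publishes at `<issuer>/protocol/openid-connect/certs`. PyJWT's `jwt.decode(token, key, algorithms=["RS256"], ...)` is one option for that step. The sketch below covers the second step: mapping already-verified claims to a tenant. It assumes a Keycloak group-membership mapper emits a `groups` claim with full paths such as `/tenants/acme`. That naming convention is hypothetical.

```python
import time
from typing import Any, Optional

TENANT_PREFIX = "/tenants/"  # hypothetical group naming convention

def tenant_from_claims(claims: dict[str, Any], expected_issuer: str,
                       now: Optional[float] = None) -> str:
    """Check already signature-verified claims and return the caller's tenant id."""
    now = time.time() if now is None else now
    if claims.get("iss") != expected_issuer:
        raise PermissionError("unexpected issuer")
    if claims.get("exp", 0) <= now:
        raise PermissionError("token expired")
    tenants = [
        g[len(TENANT_PREFIX):]
        for g in claims.get("groups", [])
        # Accept only direct children like '/tenants/acme', not subgroups.
        if g.startswith(TENANT_PREFIX) and len(g) > len(TENANT_PREFIX)
        and "/" not in g[len(TENANT_PREFIX):]
    ]
    if len(tenants) != 1:
        raise PermissionError("token must map to exactly one tenant")
    return tenants[0]

ISS = "https://auth.example.com/realms/platform"  # hypothetical issuer URL
print(tenant_from_claims({"iss": ISS, "exp": 2000, "groups": ["/tenants/acme", "/staff"]}, ISS, now=1000))
# acme
```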

HashiCorp Vault

Per-tenant secret store

Sealed, audited secret storage — provider keys and tenant tokens never sit in plaintext env or config.

Orchestration & runtime

k3s

Lightweight Kubernetes

The full Kubernetes API at a fraction of the footprint — a clean growth path from a single node to a multi-node fleet.

KEDA

Event-driven autoscaling

Scales on the real signal — active calls and queue depth — not CPU, which is the correct trigger for an I/O-bound workload.
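KEDA feeds its metrics to a Kubernetes HorizontalPodAutoscaler. The HPA rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). For an average-value target, this reduces to ceil(total load / per-pod target). The sketch approximates that calculation. Counting load as active plus queued calls is an assumption, and so are the per-pod target and the replica bounds. It ignores the HPA's tolerance band and stabilisation windows.

```python
import math

def desired_replicas(
    active_calls: int,
    queued_calls: int,
    target_per_pod: int,
    min_replicas: int = 1,
    max_replicas: int = 50,
) -> int:
    """Approximate the replica count for an average-value (per-pod) target."""
    load = active_calls + queued_calls
    wanted = math.ceil(load / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))  # clamp to the bounds

print(desired_replicas(230, 20, target_per_pod=25))  # 10
```

CPU barely moves while a node holds hundreds of mostly idle sockets, so a CPU trigger would under-scale. Call counts track the work directly.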

nginx

Reverse proxy · TLS · WS

Battle-tested WebSocket upgrade and TLS termination, with sticky routing that pins a call to its worker for the call's lifetime.

Docker + systemd

Packaging + supervision

Reproducible, isolated dependency deployment, with reliable supervision of the services that run outside the Kubernetes path.

Python · async

Primary language

The AI and voice ecosystem is Python-first; asyncio matches the concurrency model, and one language spans the media, orchestration and control planes.

Want the architecture behind these choices — the four planes, the call data-flow, and the scaling model? See the Architecture page, or start free and drive it yourself.