Architecture

The architecture, end to end.

CallFunnel is the control plane above the model, the channel, and the CRM. Here is the whole system: four planes, a single call traced from dial tone to decision, and a scaling model bounded by provider capacity rather than our own hardware.

The system

Four planes

Responsibilities are split so each layer scales and fails independently. The media layer is stateless and disposable; durable memory lives below it.

01 · REAL-TIME MEDIA

Voice

voice-bot · FastAPI · Pipecat

Holds the live call: bidirectional audio, voice-activity detection, turn-taking and barge-in. Stateless across calls: it keeps call state only for the life of the call and hands anything durable to the layers below.

Twilio / Exotel WS, Silero VAD, Deepgram, Claude Haiku, Cartesia / Sarvam
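To make barge-in concrete, here is a minimal sketch in plain asyncio. It does not use Pipecat's actual API: `speak`, `speak_with_barge_in` and the `caller_spoke` event are hypothetical names, with the event standing in for whatever signal the VAD raises when the caller starts talking over the agent.

```python
import asyncio


async def speak(frames, played):
    """Play reply audio frame by frame (hypothetical stand-in for streaming TTS)."""
    for frame in frames:
        played.append(frame)
        await asyncio.sleep(0.01)  # pretend each frame takes 10 ms to send


async def speak_with_barge_in(frames, caller_spoke):
    """Play a reply, but stop as soon as the caller starts talking.

    `caller_spoke` is an asyncio.Event that a VAD callback would set.
    Returns (frames actually played, whether playback was interrupted).
    """
    played = []
    playback = asyncio.create_task(speak(frames, played))
    barge_in = asyncio.create_task(caller_spoke.wait())
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and playback not in done
    # Cancel whichever task is still running, then let both settle.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return played, interrupted


async def demo():
    caller_spoke = asyncio.Event()
    frames = [f"frame-{i}" for i in range(50)]
    # Simulate the caller interrupting roughly 35 ms into the reply.
    asyncio.get_running_loop().call_later(0.035, caller_spoke.set)
    return await speak_with_barge_in(frames, caller_spoke)
```

Running `asyncio.run(demo())` returns only the first few frames and `interrupted=True`. The point of the design is that playback is a cancellable task, so the agent goes quiet the moment the caller speaks.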

02 · ORCHESTRATION

Conductor

conductor

The brain between calls: workflows, the outbound dialer, the per-customer timeline, approval-in-the-loop, rule-book authoring and usage metering. This is where the agent decides and acts.

async Python, LLM tool-calling, Claude Opus · authoring, Slack approvals
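One way to picture approval-in-the-loop: a tool that needs sign-off waits on a future that a human resolves, with a timeout so the call never hangs. The sketch below is illustrative only. `ApprovalGate`, `request`, `resolve` and `offer_concession` are hypothetical names, the Slack integration is not shown, and treating a timeout as a denial is an assumption rather than documented behavior.

```python
import asyncio


class ApprovalGate:
    """Tracks pending approval requests (hypothetical helper, not the conductor's API)."""

    def __init__(self):
        self._pending = {}

    async def request(self, request_id, timeout_s):
        """Wait for a human decision; a timeout counts as a denial (assumption)."""
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        try:
            return await asyncio.wait_for(future, timeout=timeout_s)
        except asyncio.TimeoutError:
            return False
        finally:
            self._pending.pop(request_id, None)

    def resolve(self, request_id, approved):
        """Called when the approver answers, e.g. from a chat button handler."""
        future = self._pending.get(request_id)
        if future is not None and not future.done():
            future.set_result(approved)


async def offer_concession(gate, request_id, percent, timeout_s=0.5):
    """Tool body: grant the discount only if a human approved it in time."""
    approved = await gate.request(request_id, timeout_s)
    return f"{percent}% discount granted" if approved else "escalate to a human"


async def demo():
    gate = ApprovalGate()
    loop = asyncio.get_running_loop()
    # The approver answers req-1 after 20 ms; nobody answers req-2.
    loop.call_later(0.02, gate.resolve, "req-1", True)
    granted = await offer_concession(gate, "req-1", 10)
    timed_out = await offer_concession(gate, "req-2", 10, timeout_s=0.05)
    return granted, timed_out
```

In this sketch the approved request yields the discount and the unanswered one falls back to escalation, so the agent never acts on a concession without an explicit yes.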

03 · CONTROL / TENANCY

Control plane

cf-control-plane

Tenant lifecycle, authentication, per-tenant secret provisioning and billing. Every tenant is isolated here before a single byte of their data is touched.

Keycloak realm + groups, HashiCorp Vault, FastAPI

04 · DATA

Memory

shared stores

The durable layer: conversation and timeline state, the retrieval knowledge base, queues and sessions, and call recordings. State lives here so the media layer above can stay disposable.

MongoDB, PostgreSQL, Qdrant · vectors, Redis, MinIO

The latency path

Anatomy of a call

One conversational turn, from the caller's voice to the agent's reply. Everything on this path is tuned for sub-second response; everything that isn't latency-critical happens off it.

Audio in

Twilio / Exotel media stream

Turn detected

Silero VAD, on-box

Transcribed

Deepgram streaming partials

Reasoned

Claude Haiku turn + tools

Spoken

Cartesia / Sarvam streaming TTS

Audio out

back to the caller

Mid-turn, the model can call tools that reach the conductor — a record lookup, a concession that needs Slack approval, an escalation. The timeline write and the call recording are persisted asynchronously, off the latency path, so they never slow the conversation.
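Keeping persistence off the latency path can be sketched with a queue and a background worker. This is a simplified illustration, not the production code: `handle_turn`, `persistence_worker` and the in-memory `store` list are hypothetical stand-ins for the model turn and the database write.

```python
import asyncio


async def persistence_worker(queue, store):
    """Drain timeline events in the background and write them to the store."""
    while True:
        event = await queue.get()
        await asyncio.sleep(0.05)  # stand-in for a slow database write
        store.append(event)
        queue.task_done()


async def handle_turn(transcript, queue):
    """Latency path: produce the reply, enqueue the write, return immediately."""
    reply = f"echo: {transcript}"  # stand-in for the model turn
    queue.put_nowait({"caller": transcript, "agent": reply})
    return reply


async def demo():
    queue, store = asyncio.Queue(), []
    worker = asyncio.create_task(persistence_worker(queue, store))
    reply = await handle_turn("what is my balance", queue)
    written_at_reply = len(store)  # the reply is ready before the write lands
    await queue.join()  # at call end, wait for pending writes to flush
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return reply, written_at_reply, store
```

Here the reply comes back while the store is still empty, and the write completes later in the background. A slow database then costs durability lag, not conversational latency.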

The economics of scale

How it scales

Because the heavy machine learning is bought as streaming APIs, the platform's own per-call work is I/O-bound — hold two sockets, run light VAD, relay frames. No GPUs, no model hosting. That makes scale a matter of adding identical, inexpensive nodes.

250–400

concurrent calls per tuned 8-vCPU node

~5 nodes

to carry 1,000 concurrent calls, with headroom

Provider-bound

the ceiling is a procurement lever, not a hardware wall
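These figures can be sanity-checked with simple arithmetic. The helpers below are hypothetical and read "headroom" as spare fleet capacity at the target load, which is an interpretation rather than a stated definition.

```python
import math


def nodes_needed(target_calls, calls_per_node):
    """Smallest whole number of nodes that can carry target_calls."""
    return math.ceil(target_calls / calls_per_node)


def utilization(target_calls, nodes, calls_per_node):
    """Fraction of fleet capacity in use at the target load."""
    return target_calls / (nodes * calls_per_node)


low_end = nodes_needed(1000, 250)    # 4 nodes if each carries only 250 calls
high_end = nodes_needed(1000, 400)   # 3 nodes if each carries 400 calls
spare_low = 1 - utilization(1000, 5, 250)    # about 20% spare with 5 nodes
spare_high = 1 - utilization(1000, 5, 400)   # 50% spare with 5 nodes
```

So five nodes cover 1,000 concurrent calls even at the bottom of the per-node range, with roughly a fifth of the fleet left in reserve.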

Stateless and horizontal. Voice workers hold call state only for the call's lifetime; all durable state is externalized — so you scale out behind a sticky-WebSocket load balancer.
Warm-buffer autoscaling. KEDA scales on the active-call signal, not CPU, and keeps a warm buffer — because a live call can't absorb a cold start during a surge.
A deterministic ceiling. The binding limit is provider concurrency — telephony channels and model rate tiers. Raise a tier, add nodes, and capacity grows linearly rather than hitting an architectural wall.
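The warm buffer and the provider-bound ceiling can both be expressed as small calculations. This sketch is not KEDA's own algorithm or configuration, only the arithmetic behind the idea. The function names are hypothetical, and the provider limits in the example are made-up numbers.

```python
import math


def desired_replicas(active_calls, calls_per_node, warm_buffer_nodes, max_nodes):
    """Nodes needed for current calls, plus a warm buffer, capped at max_nodes."""
    needed = math.ceil(active_calls / calls_per_node)
    return min(needed + warm_buffer_nodes, max_nodes)


def capacity_ceiling(nodes, calls_per_node, telephony_channels, model_concurrency):
    """Concurrent calls the system can carry: the tightest of the three limits."""
    return min(nodes * calls_per_node, telephony_channels, model_concurrency)


# 600 live calls at 250 per node needs 3 nodes; one warm spare makes 4.
replicas = desired_replicas(600, 250, warm_buffer_nodes=1, max_nodes=10)

# With 5 nodes the fleet could hold 1,250 calls, but the model tier caps it at 800.
before = capacity_ceiling(5, 250, telephony_channels=1000, model_concurrency=800)

# Raising the model tier moves the ceiling to the next limit, the telephony channels.
after = capacity_ceiling(5, 250, telephony_channels=1000, model_concurrency=2000)
```

In the example, the binding limit moves from the model tier (800) to the telephony channels (1,000) once the tier is raised. In this model the ceiling is always whichever provider limit is tightest, never the node design itself.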

Trust boundary

Isolation & security

Identity at every hop. RS256 JWTs issued by Keycloak are validated at each service; per-tenant groups isolate identity and authorization.
Secrets sealed. Per-tenant credentials live in Vault — provider keys and tokens never sit in plaintext env or config.
PII redacted before the model. Sensitive fields are stripped before anything reaches the LLM or the knowledge base.
Tenant-scoped data. Recordings, timeline and knowledge are partitioned per tenant, with export-and-erase for data-subject requests.
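Redaction before the model can be as simple as replacing sensitive spans with typed placeholders. The sketch below is a deliberately small illustration: the two regular expressions and the `redact` helper are assumptions for demonstration, not the platform's actual rules, and real PII detection needs far broader coverage.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}


def redact(text):
    """Replace each matched span with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


safe = redact("Call me on +91 98765 43210 or mail asha@example.com")
# safe == "Call me on [PHONE] or mail [EMAIL]"
```

Typed placeholders keep the sentence readable to the model, which still knows a phone number was offered, while the value itself never leaves the trust boundary.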

Want the full component list and the reason behind each pick? See the Tech Stack. Or start free and drive a real call yourself.