CallFunnel.ai

Architecture

The architecture, end to end.

CallFunnel is the control plane above the model, the channel, and the CRM. Here is the whole system: four planes, a single call traced from dial tone to decision, and a scaling model bounded by provider capacity rather than our own hardware.

The system

Four planes

Responsibilities are split so each layer scales and fails independently. The media layer is stateless and disposable; durable memory lives below it.

01 · REAL-TIME MEDIA
Voice
voice-bot · FastAPI · Pipecat

Holds the live call: bidirectional audio, voice-activity detection, turn-taking and barge-in. Stateless across calls: it keeps call state only for the life of the call.

Twilio / Exotel WS | Silero VAD | Deepgram | Claude Haiku | Cartesia / Sarvam
02 · ORCHESTRATION
Conductor
conductor

The brain between calls: workflows, the outbound dialer, the per-customer timeline, approval-in-the-loop, rule-book authoring and usage metering. This is where the agent decides and acts.

async Python | LLM tool-calling | Claude Opus · authoring | Slack approvals
03 · CONTROL / TENANCY
Control plane
cf-control-plane

Tenant lifecycle, authentication, per-tenant secret provisioning and billing. Every tenant is isolated here before a single byte of their data is touched.

Keycloak realm + groups | HashiCorp Vault | FastAPI
04 · DATA
Memory
shared stores

The durable layer: conversation and timeline state, the retrieval knowledge base, queues and sessions, and call recordings. State lives here so the media layer above can stay disposable.

MongoDB | PostgreSQL | Qdrant · vectors | Redis | MinIO

The latency path

Anatomy of a call

One conversational turn, from the caller's voice to the agent's reply. Everything on this path is tuned for sub-second response; everything that isn't latency-critical happens off it.

01 · Audio in: Twilio / Exotel media stream
02 · Turn detected: Silero VAD, on-box
03 · Transcribed: Deepgram streaming partials
04 · Reasoned: Claude Haiku turn + tools
05 · Spoken: Cartesia / Sarvam streaming TTS
06 · Audio out: back to the caller

Mid-turn, the model can call tools that reach the conductor — a record lookup, a concession that needs Slack approval, an escalation. The timeline write and the call recording are persisted asynchronously, off the latency path, so they never slow the conversation.
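
To make "off the latency path" concrete, here is a minimal asyncio sketch of one turn. It is an illustration under stated assumptions, not CallFunnel's code: `transcribe`, `reason`, `synthesize` and `persist_timeline` are hypothetical stand-ins for the real stages, and the sleeps only simulate I/O. The point it shows is that the timeline write is scheduled as a background task, so the reply does not wait for it.

```python
import asyncio

# Hypothetical stand-ins for the real stages; names and delays are illustrative.
async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.01)
    return audio.decode()

async def reason(text: str) -> str:
    await asyncio.sleep(0.01)
    return f"reply to: {text}"

async def synthesize(reply: str) -> bytes:
    await asyncio.sleep(0.01)
    return reply.encode()

async def persist_timeline(store: list, entry: dict) -> None:
    await asyncio.sleep(0.2)  # deliberately slow write, to show it does not block the turn
    store.append(entry)

# Strong references to in-flight writes, so the tasks are not garbage-collected.
background: set[asyncio.Task] = set()

async def handle_turn(audio: bytes, store: list) -> bytes:
    text = await transcribe(audio)
    reply = await reason(text)
    # Schedule the write and move on; the latency path never awaits it.
    task = asyncio.create_task(persist_timeline(store, {"in": text, "out": reply}))
    background.add(task)
    task.add_done_callback(background.discard)
    return await synthesize(reply)

async def demo():
    store: list = []
    audio_out = await handle_turn(b"hello", store)
    writes_seen_at_reply = len(store)  # the slow write has not landed yet
    await asyncio.gather(*background)  # drain pending writes before shutdown
    return audio_out, writes_seen_at_reply, store
```

Running `asyncio.run(demo())` returns the synthesized reply while the timeline write is still pending. A production version would also need retry and error handling on the background task, which this sketch omits.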

The economics of scale

How it scales

Because the heavy machine learning is bought as streaming APIs, the platform's own per-call work is I/O-bound — hold two sockets, run light VAD, relay frames. No GPUs, no model hosting. That makes scale a matter of adding identical, inexpensive nodes.

250–400 concurrent calls per tuned 8-vCPU node
~5 nodes to carry 1,000 concurrent calls, with headroom
Provider-bound: the ceiling is a procurement lever, not a hardware wall
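
The node count above can be checked with a few lines of arithmetic. This is a sketch under one assumption the page does not state: "headroom" is taken to mean 25% spare capacity. The function names and the `provider_limit` parameter are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)

def nodes_needed(target_calls: int, calls_per_node: int, headroom_pct: int = 25) -> int:
    # Pad the target by the headroom (an assumed 25%), then divide by per-node capacity.
    padded = ceil_div(target_calls * (100 + headroom_pct), 100)
    return ceil_div(padded, calls_per_node)

def effective_capacity(nodes: int, calls_per_node: int, provider_limit: int) -> int:
    # Usable capacity is the smaller of what the fleet can hold and what
    # the telephony / model providers allow concurrently.
    return min(nodes * calls_per_node, provider_limit)

print(nodes_needed(1000, 250))              # conservative end of the 250-400 range
print(nodes_needed(1000, 400))              # optimistic end of the range
print(effective_capacity(5, 250, 1000))     # fleet could hold 1,250, provider caps at 1,000
```

With 250 calls per node and 25% headroom, the result is 5 nodes, which matches the "~5 nodes" figure. The last call shows the provider-bound case, where adding nodes alone would not raise capacity.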
  • Stateless and horizontal. Voice workers hold call state only for the call's lifetime; all durable state is externalized — so you scale out behind a sticky-WebSocket load balancer.
  • Warm-buffer autoscaling. KEDA scales on the active-call signal, not CPU, and keeps a warm buffer — because a live call can't absorb a cold start during a surge.
  • A deterministic ceiling. The binding limit is provider concurrency — telephony channels and model rate tiers. Raise a tier, add nodes, and capacity grows linearly rather than hitting an architectural wall.
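
The warm-buffer idea in the list above can be sketched as a replica target driven by active calls rather than CPU. This shows only the arithmetic, not KEDA configuration; `desired_workers` and all its parameters and defaults are hypothetical.

```python
def desired_workers(
    active_calls: int,
    calls_per_worker: int,
    warm_workers: int = 1,   # assumed size of the warm buffer
    min_workers: int = 1,
    max_workers: int = 20,   # assumed fleet cap
) -> int:
    # Workers needed to hold the calls that are live right now.
    busy = -(-active_calls // calls_per_worker)  # ceiling division
    # Keep spare, already-started workers so a surge never waits on a cold start.
    return max(min_workers, min(busy + warm_workers, max_workers))

print(desired_workers(0, 250))       # idle: only the floor is running
print(desired_workers(600, 250))     # 3 busy workers plus 1 warm
print(desired_workers(10_000, 250))  # clamped at the fleet cap
```

The buffer is expressed in whole workers here for simplicity. Sizing it as a number of spare call slots instead would work the same way.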

Trust boundary

Isolation & security

  • Identity at every hop. RS256 JWTs issued by Keycloak are validated at each service; per-tenant groups isolate identity and authorization.
  • Secrets sealed. Per-tenant credentials live in Vault — provider keys and tokens never sit in plaintext env or config.
  • PII redacted before the model. Sensitive fields are stripped before anything reaches the LLM or the knowledge base.
  • Tenant-scoped data. Recordings, timeline and knowledge are partitioned per tenant, with export-and-erase for data-subject requests.
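
As an illustration of "PII redacted before the model", here is a minimal regex-based sketch. The page does not say how redaction is implemented, so the patterns, placeholder tokens and the `redact` function are all assumptions. Real redaction typically covers many more field types and locale-specific formats.

```python
import re

# Illustrative patterns only; they are not exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace emails and phone-like digit runs before text reaches the LLM."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Call me at +91 98765 43210 or mail a.b@example.com"))
```

Short numbers such as an order ID of two digits pass through untouched, because the phone pattern requires a run of at least nine characters.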

Want the full component list and the reason behind each pick? See the Tech Stack. Or start free and drive a real call yourself.