How I Scaled an AI Agent to 300k+ Users on Kubernetes
Scaling a social-commerce AI agent to 300k+ users meant queueing, autoscaling on Kubernetes, and tight cost control. Here's the architecture that worked.
Scaling BikroyBuddy — an AI shopping agent for Bangladesh — from 5,000 to 300,000+ users required rethinking every layer of the architecture. The original single-instance Go server with synchronous LLM calls didn't survive 10x growth. Here's what the scaled architecture looks like and what broke along the way.
The original architecture (0–5k users)
WhatsApp → Webhook handler (Go, single EC2) → Claude API → Response
Simple. Worked. Broke at ~8,000 concurrent webhook connections: an EC2 t3.medium (4 GB RAM) runs out of memory when it has to hold open connections and buffer LLM responses at the same time.
The scaled architecture (300k+ users)
WhatsApp API
  → ALB (AWS Application Load Balancer)
    → Webhook receivers (Go, stateless, K8s Deployment, 3–20 replicas)
      → SQS FIFO queue (per-conversation ordering)
        → Message workers (Go, K8s Deployment, 10–50 replicas)
          → Intent classifier (Claude Haiku, 100ms SLA)
          → [Branch by intent]
            → Product search (pgvector, <20ms)
            → Negotiation handler (Claude Sonnet + state machine)
            → Simple reply (Claude Haiku)
          → Response sender (Go, WhatsApp API calls)
          → Redis (conversation state, 24-hour TTL)
          → PostgreSQL (user data, product catalog, permanent records)
Key changes: webhook receivers are now stateless (no in-memory state), message processing is async via SQS, and the LLM calls are isolated in worker pods that autoscale independently.
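The branch-by-intent step can be pictured as a small routing function. This is a minimal sketch, not BikroyBuddy's actual code: the `Intent` type, the intent names, the `Route` function, and the handler labels are all hypothetical stand-ins for the three branches in the diagram, and the fallback to the cheapest branch is an assumption.

```go
package main

import "fmt"

// Intent is a hypothetical label produced by the intent classifier.
type Intent string

const (
	IntentProductSearch Intent = "product_search"
	IntentNegotiation   Intent = "negotiation"
	IntentSimpleReply   Intent = "simple_reply"
)

// Route maps a classified intent to the branch that should handle it.
// Unknown intents fall back to the cheapest branch (an assumption).
func Route(intent Intent) string {
	switch intent {
	case IntentProductSearch:
		return "pgvector-search"
	case IntentNegotiation:
		return "sonnet-negotiation"
	default:
		return "haiku-reply"
	}
}

func main() {
	intents := []Intent{IntentProductSearch, IntentNegotiation, IntentSimpleReply, "unknown"}
	for _, in := range intents {
		fmt.Printf("%s -> %s\n", in, Route(in))
	}
}
```

Keeping the routing decision in one pure function makes it cheap to unit test and easy to change when a new intent is added.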
Kubernetes setup on EKS
# Horizontal autoscaling for message workers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bikroybuddy-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bikroybuddy-workers
  minReplicas: 10
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth
          selector:
            matchLabels:
              queue: bikroybuddy-messages
        target:
          type: AverageValue
          averageValue: "100" # scale up when >100 msgs/worker
The HPA scales workers on SQS queue depth, exposed as an external metric via KEDA. The target is 100 queued messages per worker: above that, the HPA adds pods, and workers scale back down once the backlog falls below about 10 messages per worker. Scale-up takes ~90 seconds (pod scheduling + container start).
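To make the scaling rule concrete, here is a rough sketch of the replica count an HPA derives from an External metric with an `AverageValue` target: the total metric divided by the per-pod target, rounded up and clamped to the configured bounds. The `DesiredReplicas` function is a hypothetical helper for illustration. It ignores the HPA's tolerance band and stabilization windows, so treat it as an approximation rather than the controller's exact behavior.

```go
package main

import (
	"fmt"
	"math"
)

// DesiredReplicas approximates the HPA calculation for an External metric
// with an AverageValue target: ceil(total metric / target per pod),
// clamped to [minReplicas, maxReplicas]. targetPerPod must be > 0.
// Tolerance and stabilization windows are deliberately left out.
func DesiredReplicas(queueDepth, targetPerPod float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(queueDepth / targetPerPod))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// Values taken from the manifest: target 100, min 10, max 50.
	for _, depth := range []float64{500, 2500, 8000} {
		fmt.Printf("queue depth %.0f -> %d workers\n", depth, DesiredReplicas(depth, 100, 10, 50))
	}
}
```

Under this approximation, the floor of 10 workers means the backlog has to pass 1,000 messages before the first extra pod is added, and anything beyond 5,000 queued messages is capped at 50 workers.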
The SQS FIFO queue: per-conversation ordering
WhatsApp webhooks can arrive out of order under load. Without per-conversation ordering, conversation state breaks: a worker can start on message 2 before message 1 has finished processing, so both replies are built from inconsistent state.
SQS FIFO with MessageGroupId = conversation_id ensures messages in the same conversation are processed in order:
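// Enqueue publishes an incoming webhook message to the SQS FIFO queue.
func (h *WebhookHandler) Enqueue(msg WhatsAppMessage) error {
	_, err := h.sqs.SendMessage(&sqs.SendMessageInput{
		QueueUrl:    aws.String(h.queueURL),
		MessageBody: aws.String(encodeMessage(msg)),
		// Messages with the same group ID are delivered in order.
		MessageGroupId: aws.String(msg.ConversationID),
		// Repeated deliveries of the same WhatsApp message are dropped.
		MessageDeduplicationId: aws.String(msg.MessageID),
	})
	return err
}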
FIFO deduplication (MessageDeduplicationId) also absorbs WhatsApp's at-least-once delivery: a duplicate webhook that arrives within SQS's 5-minute deduplication window is dropped, so it doesn't produce a duplicate response.
Redis for conversation state
Each conversation's state (negotiation phase, current offer, product being discussed) lives in Redis with a 24-hour TTL:
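type ConversationState struct {
	Phase       NegotiationPhase
	ProductID   string
	LastOffer   int
	Turns       int
	LastUpdated time.Time
}

// GetState loads a conversation's state. A missing key means a new conversation.
func (r *RedisStore) GetState(ctx context.Context, convID string) (*ConversationState, error) {
	data, err := r.client.Get(ctx, "conv:"+convID).Bytes()
	if err == redis.Nil {
		return &ConversationState{Phase: PhaseOpen}, nil // new conversation
	}
	if err != nil {
		return nil, err // Redis failure: don't mistake it for a new conversation
	}
	var state ConversationState
	if err := json.Unmarshal(data, &state); err != nil {
		return nil, err
	}
	return &state, nil
}

// SaveState writes the state and resets the 24-hour TTL.
func (r *RedisStore) SaveState(ctx context.Context, convID string, state *ConversationState) error {
	data, err := json.Marshal(state)
	if err != nil {
		return err
	}
	return r.client.Set(ctx, "conv:"+convID, data, 24*time.Hour).Err()
}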
Redis Cluster mode with 3 shards handles the ~300k active conversation states. Memory: ~2KB per conversation state × 300k = ~600MB (well within cluster capacity).
Cost at 300k users
| Service | Monthly cost |
|---------|-------------|
| EKS cluster (3 m5.large nodes) | $220 |
| Worker pods (avg 20 replicas, spot) | $180 |
| SQS | $12 |
| ElastiCache Redis Cluster | $130 |
| RDS PostgreSQL (db.r6g.large) | $190 |
| Claude Haiku (intent classify) | $280 |
| Claude Sonnet (negotiations) | $420 |
| WhatsApp API (Meta) | $0 (conversation-based pricing, mostly free tier) |
| ALB + networking | $60 |
| Total | ~$1,492/mo |
At 300k users, ~$0.005/user/month. Subscription revenue covers this with 4× margin.
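The per-user figure follows directly from the table. As a quick check, this sketch sums the line items and divides by the user count. The `LineItem` type and the helper names are hypothetical, and the dollar amounts are copied from the table above.

```go
package main

import "fmt"

// LineItem is one row of the monthly cost table, in USD.
type LineItem struct {
	Service string
	USD     float64
}

// monthlyCosts reproduces the rows of the cost table.
var monthlyCosts = []LineItem{
	{"EKS cluster", 220},
	{"Worker pods", 180},
	{"SQS", 12},
	{"ElastiCache Redis Cluster", 130},
	{"RDS PostgreSQL", 190},
	{"Claude Haiku", 280},
	{"Claude Sonnet", 420},
	{"WhatsApp API", 0},
	{"ALB + networking", 60},
}

// MonthlyTotal sums all line items.
func MonthlyTotal(items []LineItem) float64 {
	total := 0.0
	for _, it := range items {
		total += it.USD
	}
	return total
}

// PerUser returns the monthly cost per user.
func PerUser(total float64, users int) float64 {
	return total / float64(users)
}

func main() {
	total := MonthlyTotal(monthlyCosts)
	fmt.Printf("total: $%.0f/mo, per user: $%.4f/mo\n", total, PerUser(total, 300000))
}
```

The two Claude line items add up to $700, a little under half of the total, which is why routing most traffic to the cheaper model matters more than any infrastructure tuning.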
What broke at each growth stage
10k users: Connection pool exhaustion on PostgreSQL. Fix: PgBouncer in transaction mode.
50k users: Redis single instance OOM. Fix: Redis Cluster with 3 shards.
100k users: Claude Sonnet latency spikes under concurrent load (>10 parallel calls to Claude API). Fix: intent classifier routes only genuine negotiation to Sonnet; 75% of requests now go to Haiku.
200k users: EKS node autoscaler too slow for traffic spikes (WhatsApp usage peaks sharply at 7pm local). Fix: predictive scaling (pre-warm 15 minutes before predicted peak).
FAQ
How many users can a single Go process handle? Depends on workload. For stateless webhook receivers, a single t3.medium handles ~500 concurrent requests comfortably. For LLM-powered workers, each message blocks on an API call and a worker pod processes one message at a time, so parallelism comes from replicas, not concurrency within a pod.
Why Kubernetes instead of ECS for this scale? At 300k users with 50+ pods, ECS Fargate costs became prohibitive. EKS with spot instances on Graviton2 nodes reduced compute costs by ~40%. For smaller scales, ECS Fargate is simpler.
How do you handle WhatsApp rate limits? Meta's WhatsApp Business API has conversation-based pricing and rate limits per phone number. BikroyBuddy uses multiple numbers (one per region) with a router that distributes load across them.
What's the P99 response latency? Haiku-powered responses: 800ms P99. Sonnet-powered negotiations: 2.1s P99. The 2.1s is acceptable for negotiation but would be unacceptable for a simple reply — hence the tiered routing.
Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs.