LLM Cost Optimization: Cut Your AI API Bills 10x
LLM API costs can spiral fast. Here are the techniques I use to keep costs under control across six AI products — from model tiering to prompt caching to batching.
LLM API costs are one of the top concerns for AI product builders. At scale, naive usage patterns cost 10-100× more than necessary. Here are the techniques I use across my products to keep costs reasonable without sacrificing quality.
The baseline problem
A naive AI product calls the most capable model for every request, passes the full conversation history every time, and makes synchronous calls one-by-one. This is:
- Expensive: Claude Sonnet at $3/1M input + $15/1M output is 12× more expensive than Haiku at $0.25/$1.25
- Slow: sequential calls stack latency
- Wasteful: 90% of requests don't need the most capable model
Technique 1: Model tiering
Route requests to the cheapest model that can handle them:
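package router

// TaskType and Task are not shown in the original; these are minimal assumed definitions.
type TaskType int

const (
	IntentClassify TaskType = iota
	SimpleAnswer
	Summarize
	CodeGeneration
	ComplexReasoning
	LongForm
	ArchitectureDesign
	NuancedJudgment
)

// Task is a single request to be routed.
type Task struct {
	Type TaskType
}

type ModelRouter struct {
	tiers []ModelTier
}

// ModelTier maps a model to the task types it should handle.
type ModelTier struct {
	Model     string
	MaxTokens int
	UseFor    []TaskType
}

// DefaultTiers is ordered cheapest first.
var DefaultTiers = []ModelTier{
	{
		Model:  "claude-haiku-4-5",
		UseFor: []TaskType{IntentClassify, SimpleAnswer, Summarize},
	},
	{
		Model:  "claude-sonnet-4-6",
		UseFor: []TaskType{CodeGeneration, ComplexReasoning, LongForm},
	},
	{
		Model:  "claude-opus-4-8",
		UseFor: []TaskType{ArchitectureDesign, NuancedJudgment},
	},
}

// Route returns the model of the first tier that lists the task's type.
func (r *ModelRouter) Route(task Task) string {
	for _, tier := range r.tiers {
		for _, taskType := range tier.UseFor {
			if task.Type == taskType {
				return tier.Model
			}
		}
	}
	return "claude-haiku-4-5" // default to cheapest
}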
In BikroyBuddy: 75% of requests go to Haiku (intent classification, simple replies), 25% to Sonnet (negotiation). This alone reduces costs by ~70% vs all-Sonnet.
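The ~70% figure can be checked from the traffic split and the prices quoted earlier. The sketch below is a rough estimate under one loud assumption that is not in the original: every request uses the same number of tokens regardless of which model serves it. The function name `blended_cost_ratio` is made up for illustration.

```python
# Rough check of the savings from a 75/25 Haiku/Sonnet split.
# Assumption (not from the article): equal token volume per request on both tiers.

def blended_cost_ratio(shares, relative_price):
    """Cost of the tiered setup as a fraction of sending everything to the baseline model."""
    return sum(shares[model] * relative_price[model] for model in shares)

# Prices relative to Sonnet input ($3/1M); Haiku input is $0.25/1M.
# The output prices ($1.25 vs $15) give the same 1/12 ratio.
relative_price = {"haiku": 0.25 / 3.0, "sonnet": 1.0}
shares = {"haiku": 0.75, "sonnet": 0.25}

ratio = blended_cost_ratio(shares, relative_price)
savings = 1 - ratio
print(f"tiered cost is {ratio:.1%} of all-Sonnet, saving {savings:.1%}")
```

If Sonnet requests are longer than Haiku requests in practice, which is likely for negotiation sessions, the real saving will be somewhat lower than this estimate.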
Technique 2: Prompt caching
Claude, OpenAI, and most major providers support prompt caching: a long, stable prompt prefix such as the system prompt is cached server-side, and on subsequent requests the cached tokens are billed at a heavily discounted rate instead of the full input price.
# Claude API with prompt caching
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # 2000 tokens
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
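With caching, the 2000-token system prompt is billed in full only when the cache is written (Anthropic charges 1.25× the normal input price for a write to the default 5-minute cache). On a cache hit it is billed at 10% of the normal input price. For a chatbot with 100 messages/day and a 2000-token system prompt:
- Without caching: 100 × 2000 × $3/1M = $0.60/day
- With caching (90% hit rate): 90 × 2000 × $0.30/1M + 10 × 2000 × $3.75/1M ≈ $0.13/day
Savings: roughly 78% on system prompt tokens at a 90% hit rate, approaching 90% as the hit rate nears 100%.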
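The caching arithmetic is easy to get wrong because misses are not free, so here is a small sketch that models hits and misses separately. It assumes Anthropic-style multipliers (a miss is billed as a cache write at 1.25× the input price, a hit at 0.10×). Both are parameters, and `daily_prompt_cost` is a hypothetical helper name, not part of any SDK.

```python
# Daily cost of the cached system prompt alone, under assumed multipliers.

def daily_prompt_cost(requests, prompt_tokens, input_price_per_mtok,
                      hit_rate, write_multiplier=1.25, read_multiplier=0.10):
    """USD per day for the prompt prefix, splitting requests into cache hits and misses."""
    per_request = prompt_tokens * input_price_per_mtok / 1_000_000
    hits = requests * hit_rate
    misses = requests - hits
    return misses * per_request * write_multiplier + hits * per_request * read_multiplier

uncached = 100 * 2000 * 3.0 / 1_000_000                      # no caching at all
cached_90 = daily_prompt_cost(100, 2000, 3.0, hit_rate=0.90)  # 10 misses, 90 hits
cached_100 = daily_prompt_cost(100, 2000, 3.0, hit_rate=1.0)  # best case: every request hits

print(f"uncached:      ${uncached:.3f}/day")
print(f"90% hit rate:  ${cached_90:.3f}/day")
print(f"100% hit rate: ${cached_100:.3f}/day")
```

The $0.06/day figure corresponds to a 100% hit rate. At a 90% hit rate the ten misses push the total to about $0.13/day, so the hit rate is the number worth monitoring.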
Technique 3: Free models for non-critical paths
Via OpenRouter, several capable open-source models are available for free (rate-limited):
| Model | OpenRouter free? | Use for |
|-------|------------------|---------|
| Llama 3.1 70B | ✅ | Summarization, classification |
| Mistral 7B | ✅ | Simple Q&A, extraction |
| Gemma 2 9B | ✅ | Lightweight tasks |
| Claude Haiku | ❌ | Paid but very cheap |
| Claude Sonnet | ❌ | Paid, for complex work |
import os

from openai import OpenAI

OPENROUTER_KEY = os.environ["OPENROUTER_API_KEY"]  # assumed env var name

# Route to free model for batch processing
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_KEY,
)
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct:free",
    messages=[{"role": "user", "content": classify_prompt}],
)
I use free models for: daily batch jobs, internal tools, non-customer-facing processing, and research tasks. Customer-facing features use paid models.
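Because the free tier is rate-limited, a non-critical job needs somewhere to go when the free call fails. One possible shape for that split is sketched below. It is not the author's implementation: `complete_with_fallback` and the stand-in callables are hypothetical. In real use the two callables would wrap the OpenRouter client and a paid client.

```python
from typing import Callable

def complete_with_fallback(prompt: str,
                           free_call: Callable[[str], str],
                           paid_call: Callable[[str], str],
                           customer_facing: bool = False) -> str:
    """Send customer-facing work to the paid model; otherwise try the free model first."""
    if customer_facing:
        return paid_call(prompt)
    try:
        return free_call(prompt)
    except Exception:
        # Broad catch for brevity: in practice, catch the client's rate-limit error only.
        return paid_call(prompt)

# Stand-in callables so the sketch runs without network access.
def free_ok(prompt: str) -> str:
    return f"free:{prompt}"

def free_rate_limited(prompt: str) -> str:
    raise RuntimeError("rate limited")

def paid(prompt: str) -> str:
    return f"paid:{prompt}"

print(complete_with_fallback("classify A", free_ok, paid))
print(complete_with_fallback("classify B", free_rate_limited, paid))
print(complete_with_fallback("reply to buyer", free_ok, paid, customer_facing=True))
```

For batch jobs that can wait, retrying the free model after a delay may be preferable to paying for the fallback. That choice depends on how time-sensitive the job is.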
Technique 4: Batching
Many LLM providers (including Anthropic) offer batch APIs with 50% discounts for non-real-time processing:
# Anthropic Batch API: process many messages at once
import time

import anthropic

client = anthropic.Anthropic()

# product_list: the product descriptions to classify (defined elsewhere)
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"product_{i}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 100,
                "messages": [{"role": "user", "content": f"Classify: {product}"}],
            },
        }
        for i, product in enumerate(product_list)
    ]
)

# Poll for completion (typically < 1 hour)
while batch.processing_status == "in_progress":
    time.sleep(60)
    batch = client.messages.batches.retrieve(batch.id)

results = client.messages.batches.results(batch.id)
The 50% batch discount means nightly jobs (product classification, report generation, analytics) cost half as much as the same calls made in real time.
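Batch results come back per request, and individual requests can fail or expire even when the batch as a whole finishes, so the results need to be split by outcome. The sketch below assumes each entry exposes `custom_id`, `result.type`, and, for successes, `result.message.content[0].text`, which is my understanding of the Anthropic Python SDK's result objects. `collect_batch_results` is a hypothetical helper, and the stand-in entries and the "electronics" label are made up so the code runs without an API key.

```python
from types import SimpleNamespace

def collect_batch_results(entries):
    """Split batch result entries into successes (text by id) and failures (status by id)."""
    succeeded, failed = {}, {}
    for entry in entries:
        if entry.result.type == "succeeded":
            # The first content block of the returned message holds the text reply.
            succeeded[entry.custom_id] = entry.result.message.content[0].text
        else:
            # Any other status, e.g. "errored", "canceled", or "expired".
            failed[entry.custom_id] = entry.result.type
    return succeeded, failed

def fake_entry(custom_id, kind, text=None):
    """Build a stand-in object shaped like an SDK result entry (illustration only)."""
    message = SimpleNamespace(content=[SimpleNamespace(text=text)]) if text else None
    return SimpleNamespace(custom_id=custom_id,
                           result=SimpleNamespace(type=kind, message=message))

entries = [fake_entry("product_0", "succeeded", "electronics"),
           fake_entry("product_1", "expired")]
ok, bad = collect_batch_results(entries)
print(ok)
print(bad)
```

With the real client, the iterable returned by `client.messages.batches.results(batch.id)` would be passed in place of `entries`, and the failed IDs can be resubmitted in the next nightly run.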
Technique 5: Context window management
Long conversation histories are expensive — every message in history is billed as input tokens. Strategies:
Sliding window: keep only the last N messages:
def trim_history(history: list, max_messages: int = 10) -> list:
    if len(history) > max_messages:
        # Keep system context + last N messages
        return history[:1] + history[-max_messages:]
    return history
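To see why the window matters, it helps to add up what is billed over a whole conversation, since each turn resends everything kept in history. The sketch below is a simplified model with assumptions that are not in the original: every message is the same size, and each turn adds one user message and one assistant reply. `billed_input_tokens` is a made-up name. The cap mirrors `trim_history`, which keeps the first message plus the last `max_messages`.

```python
def billed_input_tokens(turns, tokens_per_message, max_messages=None):
    """Total input tokens billed across a conversation that resends history each turn."""
    total = 0
    for turn in range(1, turns + 1):
        # History at this turn: all earlier user/assistant pairs plus the new user message.
        sent = 2 * turn - 1
        if max_messages is not None and sent > max_messages:
            sent = max_messages + 1  # first message + last max_messages, as in trim_history
        total += sent * tokens_per_message
    return total

full = billed_input_tokens(turns=50, tokens_per_message=100)
windowed = billed_input_tokens(turns=50, tokens_per_message=100, max_messages=10)
print(f"full history: {full:,} input tokens")
print(f"window of 10: {windowed:,} input tokens")
```

Without trimming, billed input grows with the square of the conversation length (250,000 tokens here). With the window it grows linearly (52,000 tokens), about a fifth of the cost for a 50-turn conversation.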
Summarization: periodically summarize old messages:
# `llm` is an async LLM client defined elsewhere
async def summarize_old_context(history: list, threshold: int = 20) -> list:
    if len(history) > threshold:
        old_messages = history[1:threshold // 2]  # skip system message
        summary = await llm.complete(f"Summarize these messages in 100 words:\n{format(old_messages)}")
        # Replace the old messages with a single summary message
        return [history[0], {"role": "assistant", "content": f"[Previous summary]: {summary}"}] + history[threshold // 2:]
    return history
Cost dashboard
I track per-product LLM costs in a simple PostgreSQL table:
CREATE TABLE llm_cost_log (
    id            BIGSERIAL PRIMARY KEY,
    product       TEXT,
    model         TEXT,
    task_type     TEXT,
    input_tokens  INT,
    output_tokens INT,
    cost_usd      NUMERIC(10, 6),
    created_at    TIMESTAMPTZ DEFAULT now()
);
A daily aggregate query shows where costs are concentrated. In BikroyBuddy, this revealed that 15% of negotiation sessions, from users who never converted, were consuming 60% of Sonnet spend. I added a "low intent" classifier that routes obvious non-buyers to Haiku instead.
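The aggregate query itself is not shown, so here is one way it might look. To keep the example runnable without a database server, it uses an in-memory SQLite table as a stand-in for the PostgreSQL one. The rows are made up for illustration, with costs computed from the prices quoted in this article. In PostgreSQL, the `date(created_at)` expression would typically be written `created_at::date`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE llm_cost_log (
        id INTEGER PRIMARY KEY,
        product TEXT, model TEXT, task_type TEXT,
        input_tokens INT, output_tokens INT,
        cost_usd REAL, created_at TEXT
    )
    """
)

# Made-up sample rows, purely illustrative.
sample_rows = [
    ("demo", "claude-haiku-4-5", "intent", 400, 20, 0.000125, "2025-01-01 09:00:00"),
    ("demo", "claude-sonnet-4-6", "negotiation", 3000, 500, 0.0165, "2025-01-01 09:05:00"),
    ("demo", "claude-sonnet-4-6", "negotiation", 5000, 800, 0.027, "2025-01-01 10:00:00"),
    ("demo", "claude-haiku-4-5", "intent", 400, 20, 0.000125, "2025-01-02 09:00:00"),
]
conn.executemany(
    "INSERT INTO llm_cost_log "
    "(product, model, task_type, input_tokens, output_tokens, cost_usd, created_at) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    sample_rows,
)

# Spend per day, product, model, and task type, with the most expensive groups first.
DAILY_COSTS = """
    SELECT date(created_at) AS day, product, model, task_type,
           COUNT(*) AS calls, SUM(cost_usd) AS total_usd
    FROM llm_cost_log
    GROUP BY day, product, model, task_type
    ORDER BY day, total_usd DESC
"""
daily = conn.execute(DAILY_COSTS).fetchall()
for row in daily:
    print(row)
```

Grouping by `task_type` as well as `model` is what makes a pattern like the negotiation-session spend visible. Grouping by model alone would only show that Sonnet is expensive.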
FAQ
What is LLM prompt caching? Prompt caching stores frequently used prompt prefixes (like system prompts) server-side so they don't need to be re-processed on every request. Claude and OpenAI both support it. On a cache hit, the cached input tokens are charged at a discount (10% of the normal price on Claude; the discount varies by provider).
How much cheaper is Claude Haiku than Claude Sonnet? Haiku 4.5 costs $0.25/1M input, $1.25/1M output. Sonnet 4.6 costs $3/1M input, $15/1M output. Sonnet is 12× more expensive for both input and output. Use Haiku for any task where the quality difference isn't user-visible.
What's the Anthropic Batch API? The Anthropic Batch API processes large numbers of requests asynchronously with a 50% discount vs real-time API calls. Results are available within 24 hours (typically much faster). Ideal for nightly processing jobs, data enrichment, and any non-real-time AI task.
Should I self-host an open-source LLM to save money? Self-hosting pays off above ~50M tokens/day for 8B models. Below that, hosted APIs (especially OpenRouter free tier) are cheaper after factoring in GPU instance costs and operational overhead.
Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: Self-Hosting LLMs: Llama 3, Mistral on Your Server · Deploy Always-On AI Agents on AWS for ~$17/mo.