Claude API Cost Optimization Playbook
Token budgets, rate limits, and architecture patterns that reduce Claude API spend by 40-60%
Chapter 1: Where Your Tokens Actually Go
Most teams are surprised by the answer. The three biggest token sinks:
Context length — Every request sends the full conversation context. A session starting at 2,000 tokens might send 50,000 tokens per request by step 50. You pay for the full context on every single call.
Retry loops — When an agent hits an error, it retries. Without a circuit breaker, it retries indefinitely. 100 failed calls cost almost as much as 100 successful ones, and produce nothing.
Parallel sessions — Five agents at 100k tokens each is 500k tokens. If they’re all running long sessions simultaneously, actual costs are higher than the sum.
Use the Gateway’s Cost Analytics to see your actual distribution: what percentage goes to context vs. new completions? Where are the cost spikes?
Chapter 2: Per-Project Budgets and Per-Session Caps
Per-session caps are your primary control:
budget:
max_tokens_per_session: 100000
How to set: run your agent 10 times on typical tasks, measure usage, set cap at 2x the average.
| Agent Type | Typical Session | Recommended Cap |
|---|---|---|
| Code review | 20k-50k | 100k |
| CI/CD pipeline | 30k-80k | 200k |
| Research | 50k-150k | 300k |
| Support ticket | 10k-30k | 50k |
Per-project monthly budgets prevent aggregate cost creep:
project_budget:
monthly_limit: $500
alert_thresholds: [50%, 75%, 90%]
on_exceed: pause_and_alert
The 50% alert is the most useful — it fires with time to investigate before hitting the limit.
Chapter 3: Rate Limiting to Catch Loops
Rate limiting is your circuit breaker for retry loops:
rate_limits:
max_requests_per_minute: 30
burst_allowance: 10
on_exceed: throttle_and_alert
Most legitimate agent work averages 5-15 requests/minute. Thirty is generous for bursts but catches loops firing every second.
The retry loop detector catches the most expensive failure mode:
alerts:
- type: retry_loop_detected
threshold: 10_identical_requests
channel: slack:#agent-alerts
action: throttle
When 10 identical requests appear in sequence, something’s wrong. Gateway throttles + alerts, giving you time to investigate.
Chapter 4: Context Optimization
Shorter system prompts — Sent on every request. A 5,000-token system prompt on 100 requests = 500,000 tokens in system prompt alone. Trim to essentials.
Task chunking — Instead of one long session reviewing 50 files, run five sessions of 10 files each. Each starts with fresh context:
| Approach | Tokens | Cost |
|---|---|---|
| One session, 50 files | ~2,000,000 | $6.00 |
| Five sessions, 10 files | ~500,000 | $1.50 |
Chunking uses 75% fewer tokens because each session starts fresh rather than accumulating all previous context.
Chapter 5: Monitoring and Alerting
alerts:
- type: budget_threshold
level: project
thresholds: [50%, 75%, 90%]
channel: slack:#cost-alerts
- type: session_cost_spike
threshold: 3x_average
channel: slack:#cost-alerts
- type: context_growth_anomaly
threshold: 5x_initial_context
channel: slack:#agent-alerts
Chapter 6: Cost Attribution in Multi-Agent Setups
The Gateway tracks cost at four levels: per-request, per-session, per-agent, per-project.
Use this to find your optimization targets. Common findings:
- Research agent costs 5x more than others (processes large documents)
- CI/CD agent has occasional 10x spikes from large diffs
- Support agent is cheap per session but runs 200/day = highest aggregate
Chapter 7: Expected Results
| Optimization | Typical Savings |
|---|---|
| Per-session caps (catching runaways) | 15-25% |
| Rate limiting (catching retry loops) | 10-20% |
| Task chunking (shorter contexts) | 15-25% |
| System prompt optimization | 5-10% |
| Removing unused agents | 5-15% |
Total: 40-60% cost reduction for teams implementing all optimizations. Start with budgets and rate limits — they catch the worst problems immediately and require minimal configuration.
Put this playbook into practice
Sentrely is the managed control plane this playbook is built around. Get early access and deploy in minutes.