Claude API Cost Optimization Playbook

Token budgets, rate limits, and architecture patterns that reduce Claude API spend by 40-60%

Chapter 1: Where Your Tokens Actually Go

Most teams are surprised by the answer. The three biggest token sinks:

Context length: Every request sends the full conversation context. A session that starts at 2,000 tokens might send 50,000 tokens per request by step 50, and you pay for the full context on every call.

Retry loops: When an agent hits an error, it retries, and without a circuit breaker it can retry indefinitely. 100 failed calls cost almost as much as 100 successful ones and produce nothing.

Parallel sessions: Five agents at 100k tokens each is 500k tokens. Because every long-running session also re-sends its own growing context on each call, running several of them simultaneously can push the combined bill well above that simple estimate.
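To see why growing context dominates the bill, it helps to add up a whole session rather than look at a single request. The sketch below assumes the context grows linearly from 2,000 to 50,000 tokens over 50 requests, which is a simplification. The function name `cumulative_input_tokens` is illustrative and not part of any Gateway API.

```python
def cumulative_input_tokens(start_tokens: int, end_tokens: int, steps: int) -> int:
    """Total input tokens billed across a session.

    Assumptions (not from the playbook): context grows linearly between the
    first and last request, and every request re-sends the full context.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    if steps == 1:
        return start_tokens
    total = 0.0
    for step in range(steps):
        # Linear interpolation between the first and last request size.
        total += start_tokens + (end_tokens - start_tokens) * step / (steps - 1)
    return round(total)


total = cumulative_input_tokens(2_000, 50_000, 50)
print(f"Total input tokens over 50 requests: {total:,}")
print(f"Largest single request: {50_000:,} tokens")
```

Under that assumption the session bills roughly 1.3 million input tokens in total, even though no single request exceeds 50,000.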

Use the Gateway’s Cost Analytics to see your actual distribution: what percentage goes to context vs. new completions? Where are the cost spikes?

Chapter 2: Per-Project Budgets and Per-Session Caps

Per-session caps are your primary control:

budget:
  max_tokens_per_session: 100000

To set the cap, run your agent 10 times on typical tasks, measure token usage per session, and set the cap at 2x the average.

| Agent Type | Typical Session (tokens) | Recommended Cap (tokens) |
|---|---|---|
| Code review | 20k-50k | 100k |
| CI/CD pipeline | 30k-80k | 200k |
| Research | 50k-150k | 300k |
| Support ticket | 10k-30k | 50k |
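The "measure, then set the cap at 2x the average" rule can be sketched as a small helper. The sample figures are hypothetical measurements, and rounding up to the nearest 10,000 tokens is an added convenience rather than something the playbook prescribes.

```python
import math
from statistics import mean


def recommend_session_cap(samples: list[int], multiplier: float = 2.0,
                          round_to: int = 10_000) -> int:
    """Suggest a per-session token cap from measured session totals."""
    if not samples:
        raise ValueError("need at least one measured session")
    raw_cap = mean(samples) * multiplier
    # Round up to a tidy boundary so the cap is never below the 2x target.
    return math.ceil(raw_cap / round_to) * round_to


# Hypothetical token totals from 10 typical code-review sessions.
measured = [22_000, 35_000, 41_000, 28_000, 50_000,
            31_000, 26_000, 44_000, 38_000, 33_000]
print(recommend_session_cap(measured))
```

The result would go into `max_tokens_per_session`. Re-measure after any change to prompts or tooling, since the average shifts with them.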

Per-project monthly budgets prevent aggregate cost creep:

project_budget:
  monthly_limit: $500
  alert_thresholds: [50%, 75%, 90%]
  on_exceed: pause_and_alert

The 50% alert is the most useful — it fires with time to investigate before hitting the limit.
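As a rough illustration of why the early alert leaves room to act, the sketch below checks which thresholds a project has crossed and projects month-end spend from the current run rate. Both function names are hypothetical, and the linear projection is an assumption; the Gateway's own alert logic is not documented here.

```python
def crossed_thresholds(spend: float, monthly_limit: float,
                       thresholds: tuple[float, ...] = (0.50, 0.75, 0.90)) -> list[float]:
    """Return the alert thresholds (as fractions) that current spend has reached."""
    used = spend / monthly_limit
    return [t for t in thresholds if used >= t]


def projected_month_end_spend(spend: float, day_of_month: int, days_in_month: int) -> float:
    """Naive projection: assumes spend continues at the same daily rate."""
    return spend / day_of_month * days_in_month


# Hypothetical project: $260 spent by day 12 of a 30-day month, $500 limit.
print(crossed_thresholds(260, 500))
print(round(projected_month_end_spend(260, 12, 30), 2))
```

In this example only the 50% threshold has fired, yet the projection already exceeds the $500 limit, which is the situation the early alert is meant to surface.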

Chapter 3: Rate Limiting to Catch Loops

Rate limiting is your circuit breaker for retry loops:

rate_limits:
  max_requests_per_minute: 30
  burst_allowance: 10
  on_exceed: throttle_and_alert

Most legitimate agent work averages 5-15 requests/minute. Thirty is generous for bursts but catches loops firing every second.
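To make the 30-per-minute figure concrete, here is a minimal sliding-window limiter. It is a sketch of the general technique, not the Gateway's implementation: the class name is hypothetical, and `burst_allowance` is deliberately not modeled because its exact semantics are not described above.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` in any trailing window of `window_seconds`."""

    def __init__(self, max_requests: int = 30, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps: deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Drop requests that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False
        self._timestamps.append(now)
        return True


# A runaway loop firing once per second for one minute.
loop = SlidingWindowLimiter()
loop_allowed = sum(loop.allow(float(t)) for t in range(60))

# A well-behaved agent making one request every 6 seconds for five minutes.
steady = SlidingWindowLimiter()
steady_allowed = sum(steady.allow(float(t)) for t in range(0, 300, 6))

print(loop_allowed, steady_allowed)
```

The once-per-second loop is cut off after its first 30 requests in the minute, while the 10-requests-per-minute agent is never throttled.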

The retry loop detector catches the most expensive failure mode:

alerts:
  - type: retry_loop_detected
    threshold: 10_identical_requests
    channel: slack:#agent-alerts
    action: throttle

When 10 identical requests appear in sequence, something is wrong. The Gateway throttles the agent and sends an alert, giving you time to investigate.
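One plausible way to detect "10 identical requests in sequence" is to fingerprint each request and count consecutive repeats. The sketch below assumes requests are JSON-serializable dictionaries and that "identical" means byte-for-byte equal after key sorting; the Gateway may define identity differently, and `RetryLoopDetector` is a hypothetical name.

```python
import hashlib
import json


class RetryLoopDetector:
    """Flag a loop once the same request has been seen `threshold` times in a row."""

    def __init__(self, threshold: int = 10):
        self.threshold = threshold
        self._last_fingerprint: str | None = None
        self._run_length = 0

    def observe(self, request: dict) -> bool:
        # Sort keys so logically equal payloads hash to the same fingerprint.
        payload = json.dumps(request, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(payload).hexdigest()
        if fingerprint == self._last_fingerprint:
            self._run_length += 1
        else:
            self._last_fingerprint = fingerprint
            self._run_length = 1
        return self._run_length >= self.threshold


detector = RetryLoopDetector()
request = {"model": "example-model", "prompt": "fix the failing test"}
flags = [detector.observe(request) for _ in range(10)]
print(flags)
```

Any different request resets the count, so normal back-and-forth traffic does not trip the detector.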

Chapter 4: Context Optimization

Shorter system prompts: The system prompt is sent on every request. A 5,000-token system prompt across 100 requests adds up to 500,000 tokens for the system prompt alone. Trim it to the essentials.

Task chunking: Instead of one long session reviewing 50 files, run five sessions of 10 files each. Each session starts with fresh context:

| Approach | Tokens | Cost |
|---|---|---|
| One session, 50 files | ~2,000,000 | $6.00 |
| Five sessions, 10 files each | ~500,000 | $1.50 |

Chunking uses 75% fewer tokens because each session starts fresh rather than accumulating all previous context.
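The effect of chunking can be reproduced with a simple model. The sketch below assumes one request per file, with each request re-sending the system prompt plus every file already in the session. The per-file and system-prompt sizes are hypothetical values chosen for illustration, not measurements from the playbook.

```python
def session_tokens(num_files: int, tokens_per_file: int, system_prompt_tokens: int) -> int:
    """Input tokens for one session that reviews files one request at a time.

    Assumption: request i re-sends the system prompt plus all i files seen so far.
    """
    return sum(system_prompt_tokens + i * tokens_per_file
               for i in range(1, num_files + 1))


def chunked_tokens(total_files: int, chunk_size: int,
                   tokens_per_file: int, system_prompt_tokens: int) -> int:
    """Input tokens when the files are split across fresh sessions of `chunk_size`."""
    full_chunks, remainder = divmod(total_files, chunk_size)
    total = full_chunks * session_tokens(chunk_size, tokens_per_file, system_prompt_tokens)
    if remainder:
        total += session_tokens(remainder, tokens_per_file, system_prompt_tokens)
    return total


# Hypothetical sizes: 1,500 tokens per file, 1,000-token system prompt.
single = session_tokens(50, 1_500, 1_000)
chunked = chunked_tokens(50, 10, 1_500, 1_000)
print(f"single session: {single:,}")
print(f"five sessions:  {chunked:,}")
print(f"reduction:      {1 - chunked / single:.0%}")
```

With these assumed sizes the model gives about 1.96 million tokens for the single session and about 0.46 million for five sessions, a reduction of roughly three quarters, in line with the table. The saving comes from the accumulated context growing quadratically with session length.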

Chapter 5: Monitoring and Alerting

alerts:
  - type: budget_threshold
    level: project
    thresholds: [50%, 75%, 90%]
    channel: slack:#cost-alerts

  - type: session_cost_spike
    threshold: 3x_average
    channel: slack:#cost-alerts

  - type: context_growth_anomaly
    threshold: 5x_initial_context
    channel: slack:#agent-alerts
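The two anomaly rules in the config above reduce to simple comparisons. The sketch below shows one reading of them, assuming `3x_average` compares a session against the mean cost of recent sessions and `5x_initial_context` compares the current context size against the session's first request. Both function names are hypothetical.

```python
from statistics import mean


def session_cost_spike(session_cost: float, recent_costs: list[float],
                       multiple: float = 3.0) -> bool:
    """True when a session costs more than `multiple` times the recent average."""
    if not recent_costs:
        return False  # No baseline yet, so nothing to compare against.
    return session_cost > multiple * mean(recent_costs)


def context_growth_anomaly(initial_context_tokens: int, current_context_tokens: int,
                           multiple: float = 5.0) -> bool:
    """True when context has grown beyond `multiple` times its initial size."""
    return current_context_tokens > multiple * initial_context_tokens


# Hypothetical values for illustration.
print(session_cost_spike(1.60, [0.40, 0.50, 0.60]))
print(context_growth_anomaly(2_000, 12_000))
```

Both checks need a baseline, so they are most reliable after a project has accumulated some normal session history.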

Chapter 6: Cost Attribution in Multi-Agent Setups

The Gateway tracks cost at four levels: per-request, per-session, per-agent, per-project.

Use this to find your optimization targets. Common findings:

  • Research agent costs 5x more than others (processes large documents)
  • CI/CD agent has occasional 10x spikes from large diffs
  • Support agent is cheap per session but runs 200 sessions/day, which gives it the highest aggregate cost
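If you export per-request cost records, the session, agent, and project levels can be derived by rolling them up. The sketch below assumes a hypothetical record shape (`RequestCost`) and is not the Gateway's export format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCost:
    project: str
    agent: str
    session: str
    cost_usd: float


def roll_up(records: list[RequestCost]) -> dict[str, dict]:
    """Aggregate per-request costs into session, agent, and project totals."""
    totals = {
        "session": defaultdict(float),
        "agent": defaultdict(float),
        "project": defaultdict(float),
    }
    for r in records:
        totals["session"][(r.project, r.agent, r.session)] += r.cost_usd
        totals["agent"][(r.project, r.agent)] += r.cost_usd
        totals["project"][r.project] += r.cost_usd
    return totals


# Hypothetical records for illustration.
records = [
    RequestCost("web", "research", "s1", 0.50),
    RequestCost("web", "research", "s1", 0.25),
    RequestCost("web", "support", "s2", 0.25),
]
totals = roll_up(records)
# Rank agents by total cost to find optimization targets.
for key, cost in sorted(totals["agent"].items(), key=lambda kv: kv[1], reverse=True):
    print(key, cost)
```

Sorting the agent totals highlights cases like the support agent above, where a low per-session cost hides a high aggregate.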

Chapter 7: Expected Results

| Optimization | Typical Savings |
|---|---|
| Per-session caps (catching runaways) | 15-25% |
| Rate limiting (catching retry loops) | 10-20% |
| Task chunking (shorter contexts) | 15-25% |
| System prompt optimization | 5-10% |
| Removing unused agents | 5-15% |

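Total: 40-60% cost reduction for teams implementing all optimizations. The rows are not additive: several optimizations remove the same wasted tokens, so the combined figure is lower than their sum. Start with budgets and rate limits, since they catch the worst problems immediately and require minimal configuration.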
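One way to reconcile the per-row savings with the total is to assume each optimization applies to whatever spend remains after the previous ones, so the savings compound rather than add. This is an interpretation for illustration, not a method stated in the playbook.

```python
def combined_savings(savings: list[float]) -> float:
    """Combine fractional savings, assuming each applies to the remaining spend."""
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s
    return 1.0 - remaining


# Low and high ends of each row in the savings table.
low = [0.15, 0.10, 0.15, 0.05, 0.05]
high = [0.25, 0.20, 0.25, 0.10, 0.15]
print(f"low end:  {combined_savings(low):.0%}")
print(f"high end: {combined_savings(high):.0%}")
```

Under this assumption the combined range comes out at roughly 41% to 66%, close to the 40-60% figure, whereas simply adding the rows would suggest 50-95%.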

Put this playbook into practice

Sentrely is the managed control plane this playbook is built around. Get early access and deploy in minutes.
