Claude API Cost Optimizer
Cut Claude API costs by 70–90% using intelligent model selection, caching, and batching.
Quick Start
- 1. Audit your current API calls — identify which tasks use Opus or Sonnet that could use Haiku. Model selection alone saves 10–18x on simple tasks.
- Pick the cheapest model tier for each task: Haiku (cheapest) → Sonnet (mid) → Opus (most expensive, use sparingly). See
references/pricing.md for current rates. - Enable prompt caching for repeated context (system prompts, codebases) by adding
"cache_control": {"type": "ephemeral"} to message blocks - Implement cost reporting — track
input_tokens, output_tokens, and cache metrics from API responses
Key Concepts
- - Model selection — Haiku for simple tasks (formatting, comments) — cheapest tier. Sonnet for medium (refactoring, debugging) — mid tier. Opus for complex only (architecture, security) — most expensive, use sparingly. See
references/pricing.md for current rates. - Prompt caching — Cache large static content (system prompts, codebase context). Cache reads cost 90% less; writes pay off after 1–2 reuses.
- Batching — Combine multiple requests into one API call to eliminate per-request overhead. 80% fewer calls ≈ 80% lower cost.
- Local caching — Cache identical responses locally to skip redundant API calls entirely.
- Context extraction — Send only relevant snippets, not whole files. Smaller inputs = lower costs.
- max_tokens discipline — Set realistic limits; unused token budget is wasted money.
Common Usage
Code examples are in Python but concepts apply to any language or SDK.
Model selection pattern:
CODEBLOCK0
Prompt caching:
CODEBLOCK1
Cost tracking:
CODEBLOCK2
References
- -
references/implementation.md — Full implementation patterns, model routing, caching setup, batching, retry logic, and anti-patterns - INLINECODE6 — Current pricing, cache cost math, savings calculations, and batch API details
Claude API 成本优化器
通过智能模型选择、缓存和批处理,将Claude API成本降低70-90%。
快速入门
- 1. 审计当前API调用 — 识别哪些使用Opus或Sonnet的任务可以使用Haiku。仅模型选择一项即可在简单任务上节省10-18倍成本。
- 为每个任务选择最便宜的模型层级:Haiku(最便宜)→ Sonnet(中等)→ Opus(最昂贵,谨慎使用)。查看references/pricing.md了解当前费率。
- 通过向消息块添加cachecontrol: {type: ephemeral},为重复上下文(系统提示、代码库)启用提示缓存。
- 实施成本报告 — 跟踪API响应中的inputtokens、output_tokens和缓存指标。
关键概念
- - 模型选择 — 简单任务(格式化、注释)使用Haiku — 最便宜层级。中等任务(重构、调试)使用Sonnet — 中等层级。仅复杂任务(架构、安全)使用Opus — 最昂贵,谨慎使用。查看references/pricing.md了解当前费率。
- 提示缓存 — 缓存大型静态内容(系统提示、代码库上下文)。缓存读取成本降低90%;写入成本在1-2次复用后即可收回。
- 批处理 — 将多个请求合并为一个API调用,消除单次请求开销。减少80%的调用 ≈ 降低80%的成本。
- 本地缓存 — 在本地缓存相同响应,完全跳过冗余API调用。
- 上下文提取 — 仅发送相关片段,而非整个文件。输入越小 = 成本越低。
- max_tokens纪律 — 设置合理的限制;未使用的token预算就是浪费的钱。
常见用法
代码示例使用Python,但概念适用于任何语言或SDK。
模型选择模式:
python
def selectmodel(tasktype: str) -> str:
simple_tasks = [formatting, comments, explanation, rename]
complextasks = [architecture, algorithm, securityaudit]
return (claude-haiku-4-5-20251001 if tasktype in simpletasks else
claude-opus-4-6 if tasktype in complextasks else
claude-sonnet-4-6)
提示缓存:
python
response = client.messages.create(
model=claude-sonnet-4-6,
max_tokens=1024,
system=[{
type: text,
text: system_prompt,
cache_control: {type: ephemeral}
}],
messages=[{
role: user,
content: [
{type: text, text: fCode:\n{source_code},
cache_control: {type: ephemeral}},
{type: text, text: query}
]
}]
)
成本跟踪:
python
usage = response.usage
cost = (usage.inputtokens * INPUTRATE +
usage.cachecreationinputtokens * CACHEWRITE_RATE +
usage.cachereadinputtokens * CACHEREAD_RATE +
usage.outputtokens * OUTPUTRATE)
参考资料
- - references/implementation.md — 完整实现模式、模型路由、缓存设置、批处理、重试逻辑和反模式
- references/pricing.md — 当前定价、缓存成本计算、节省计算和批处理API详情