A small AI feature can cost $0.001 per call or $1.00 depending on choices that look the same in code. Budgets are a design problem, not a billing one.
Where the cost actually goes
- Context size. Every retrieved doc, every prior turn, every tool result is paying input tokens. Trim aggressively.
- Output length. Cap
max_tokens. Models will happily ramble.
- Reasoning models. Hidden chain-of-thought is billed. Use them only when the task warrants it.
- Retries. A failed tool call followed by a retry is two full requests. Add a retry budget per session.
- Agent loops. N steps × tokens per step. Long horizons compound fast.
Levers in order of impact
- Prompt caching. Anthropic and OpenAI both support it. For repeated system prompts or RAG context this can cut input cost by 90%.
- Smaller model for sub-tasks. Use Haiku, Mini, or Flash for classification, extraction, routing. Save the frontier model for the hard step. See Routing.
- Shorter context. Summarize history past N turns. Retrieve fewer chunks. Strip whitespace from tool outputs.
- Stop conditions. Hard cap on tool-call iterations. Hard cap on tokens per session.
- Batch API. 50% off if you can tolerate ~24h latency. Good for backfills and evals.
Per-user budgets
For consumer apps, set a per-user daily token budget and surface it. Pick one of three: hard cap, soft cap with a warning, or graceful degradation to a smaller model. Without this, one power user can eat a month of your margin.
Reading