08. Cost and Token Optimization
Why Cost Matters
If you're making one-off requests in ChatGPT, cost is irrelevant. But the moment you build something that makes hundreds or thousands of API calls, cost becomes a real engineering concern.
A poorly optimized pipeline can cost $500/month. The same pipeline, optimized, might cost $30. Same quality, same results. The difference is knowing where the money goes.
How Pricing Works
API pricing is based on tokens. You pay separately for:
- Input tokens: Everything you send (system prompt + conversation history + your message)
- Output tokens: Everything the model generates in response
Output tokens are typically 4 to 5x more expensive than input tokens because generation is more compute-intensive than reading.
Pricing (per 1M tokens):

| Model | Input | Output |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Haiku | $0.25 | $1.25 |

The difference between GPT-4o and GPT-4o-mini is roughly 16x on both input and output. That's why model selection matters so much for cost.
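To make the arithmetic concrete, here's a small Python helper that turns those list prices into a per-request dollar cost. The prices are a snapshot of the table above and will drift; the model keys are just labels for this sketch:

```python
# Snapshot of the list prices above, per 1M tokens; providers change these,
# so treat the numbers as illustrative.
PRICES = {
    "gpt-4o":        {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":   {"input": 0.15, "output": 0.60},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
    "claude-haiku":  {"input": 0.25, "output": 1.25},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the prices above."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A request with a 1,500-token prompt and a 500-token answer:
print(request_cost("gpt-4o", 1500, 500))       # $0.00875
print(request_cost("gpt-4o-mini", 1500, 500))  # $0.000525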
Optimization Strategies
1. Use the Cheapest Model That Works
This is the single biggest cost lever. If GPT-4o-mini handles your task at acceptable quality, you've just cut that part of your bill by roughly 16x compared to GPT-4o.
Before: All requests go to GPT-4o
1000 requests/day × ~2000 tokens avg (1500 input + 500 output) = 2M tokens/day
Cost: (1.5M × $2.50 + 0.5M × $10.00) / 1M = ~$8.75/day = ~$263/month
After: Simple tasks go to GPT-4o-mini, complex tasks stay on GPT-4o
800 simple requests → mini: ~$0.42/day
200 complex requests → 4o: ~$1.75/day
Cost: ~$2.17/day = ~$65/month
Savings: ~75%
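A minimal sketch of that routing split, assuming some way to flag a request as simple or complex. The classify_complexity helper here is a placeholder you'd supply yourself; the length heuristic is purely illustrative:

```python
from openai import OpenAI

client = OpenAI()

def classify_complexity(prompt: str) -> str:
    """Placeholder heuristic: route on whatever signal you have
    (prompt length, task type, a cheap classifier model)."""
    return "simple" if len(prompt) < 2000 else "complex"

def ask(prompt: str) -> str:
    # Cheap model for the easy majority, frontier model for the rest.
    model = "gpt-4o-mini" if classify_complexity(prompt) == "simple" else "gpt-4o"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```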
2. Reduce Input Tokens
Every token you send costs money. Cut the fat:
Shorter system prompts. Does your system prompt need to be 500 tokens? Can you say the same thing in 100?
❌ "You are a highly skilled and experienced software engineer who
specializes in reviewing code for potential bugs, security
vulnerabilities, and performance issues. When you find an issue,
please explain it clearly and suggest a fix."
✅ "You review code for bugs, security issues, and performance problems.
For each issue: explain it, suggest a fix."

Same behavior, 60% fewer tokens.
Trim conversation history. Don't send 20 turns of history when the model only needs the last 3 to answer the current question.
Send less code. If the bug is in one function, send that function, not the entire file.
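A quick way to keep yourself honest about input size is to count tokens before you send them. A sketch using the tiktoken library (o200k_base is the encoding GPT-4o and GPT-4o-mini use; adjust for other models):

```python
import tiktoken

# o200k_base is the encoding GPT-4o and GPT-4o-mini use.
enc = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

verbose = (
    "You are a highly skilled and experienced software engineer who "
    "specializes in reviewing code for potential bugs, security "
    "vulnerabilities, and performance issues. When you find an issue, "
    "please explain it clearly and suggest a fix."
)
terse = (
    "You review code for bugs, security issues, and performance problems. "
    "For each issue: explain it, suggest a fix."
)

print(count_tokens(verbose), count_tokens(terse))  # the terse version is far smaller
```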
3. Reduce Output Tokens
Output tokens cost 4 to 5x more than input. Constrain the response:

- "Respond in under 50 words."
- "Return only the JSON, no explanation."
- "Answer with yes or no."

Setting max_tokens in the API also helps. If you only need a short answer, cap it at 256 tokens so the model doesn't keep generating unnecessary content.
4. Caching
If you're sending the same system prompt and context repeatedly, use prompt caching (available on both the OpenAI and Anthropic APIs). On OpenAI, cached input tokens cost 50% less because the provider doesn't have to reprocess them; Anthropic discounts cache reads even more steeply, in exchange for a small premium when the cache is first written.
Without caching:
1000 requests × 2,000-token system prompt = 2M input tokens billed at full price
With caching:
First request: 2,000 tokens at full price
Next 999 requests: 2,000 tokens at 50% price
Savings: ~1M tokens' worth of cost

This matters most when your system prompt or context is large and doesn't change between requests.
How it works in practice: Caching is automatic on OpenAI (for prefixes ≥1024 tokens). You send the same system prompt across requests and the provider caches the processed representation:
```bash
# Both requests share the same system prompt — the second one hits cache
curl https://api.openai.com/v1/chat/completions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "[your large system prompt — 1024+ tokens]"},
{"role": "user", "content": "User question here"}
]
}'
```

The response usage object shows what was cached:

```json
"usage": {
"prompt_tokens": 1200,
"completion_tokens": 85,
"prompt_tokens_details": {
"cached_tokens": 1024
}
}
```

Key rules: the prefix must be identical across requests (one character difference breaks the cache), and cached prefixes expire after ~5–10 minutes of inactivity. On Anthropic, you explicitly mark cache breakpoints with "cache_control": {"type": "ephemeral"} in the message block.
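For comparison, here is what the explicit Anthropic version might look like. The model id is illustrative, and the usage fields on the response show cache reads and writes:

```python
import anthropic

client = anthropic.Anthropic()

LARGE_SYSTEM_PROMPT = "..."  # your 1024+ token system prompt goes here

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model id
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Cache everything up to and including this block:
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "User question here"}],
)

# usage reports cache activity: cache_creation_input_tokens on the first
# request, cache_read_input_tokens on subsequent hits.
print(response.usage)
```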
5. Batching
If your requests aren't time-sensitive, use batch APIs. OpenAI's batch API gives you 50% off in exchange for results delivered within 24 hours instead of immediately.
Good for:
- Nightly processing jobs
- Bulk classification
- Data extraction from large datasets
- Generating embeddings
Not good for:
- Interactive user-facing features
- Anything that needs a response in seconds
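A sketch of the OpenAI batch flow: write one request per line to a JSONL file, upload it, then create a batch with the 24-hour completion window. The file name and ticket texts are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()

# One request per line, each with a custom_id to match results back later.
tickets = ["Can't log in", "Refund request", "App crashes on startup"]  # illustrative
with open("batch.jsonl", "w") as f:
    for i, ticket in enumerate(tickets):
        f.write(json.dumps({
            "custom_id": f"ticket-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "max_tokens": 5,
                "messages": [
                    {"role": "system",
                     "content": "Classify the ticket into category 1-8. Reply with the number only."},
                    {"role": "user", "content": ticket},
                ],
            },
        }) + "\n")

# Upload the file, then create the batch.
batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the relaxed deadline is what buys the 50% discount
)
print(batch.id, batch.status)
```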
Real Cost Breakdown
Here's a real example of optimizing a support ticket classifier:
Task: Classify 5000 support tickets/day into 8 categories
Version 1 (naive):
Model: GPT-4o
Input: ~300 tokens/ticket (system prompt + ticket text)
Output: ~50 tokens/ticket (category + explanation)
Daily cost: (1.5M × $2.50 + 250K × $10) / 1M = $6.25/day = $187/month
Version 2 (optimized):
Model: GPT-4o-mini
Input: ~150 tokens/ticket (shorter prompt, no explanation requested)
Output: ~5 tokens/ticket (just the category number)
Daily cost: (750K × $0.15 + 25K × $0.60) / 1M = $0.13/day = $3.90/month
Accuracy was essentially unchanged (95% vs. 96%). Cost reduction: 98%.

The optimizations:
- Switched to mini (16x cheaper)
- Shortened the system prompt
- Asked for just the category number instead of an explanation
- Reduced output from 50 tokens to 5
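If you want to verify numbers like these against real traffic, the usage object on each response gives exact token counts you can multiply by the list prices. A sketch using the GPT-4o-mini prices from the table above (the ticket text is illustrative):

```python
from openai import OpenAI

client = OpenAI()

# GPT-4o-mini list prices from the table above, per 1M tokens.
INPUT_PRICE, OUTPUT_PRICE = 0.15, 0.60

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=5,
    messages=[
        {"role": "system",
         "content": "Classify the ticket into category 1-8. Reply with the number only."},
        {"role": "user", "content": "I was charged twice this month."},
    ],
)

u = response.usage
cost = (u.prompt_tokens * INPUT_PRICE + u.completion_tokens * OUTPUT_PRICE) / 1_000_000
print(f"{u.prompt_tokens} in, {u.completion_tokens} out -> ${cost:.6f}")
```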
When Not to Optimize
Don't optimize prematurely. If you're prototyping, use the best model and don't worry about cost. Optimize when:
- You're going to production with real volume
- Your monthly bill is higher than you'd like
- You've validated that the cheaper model produces acceptable quality
The order matters: get it working first, then make it cheap.
Key Takeaways
- Output tokens cost 4 to 5x more than input tokens. Constrain your responses.
- The biggest cost lever is model selection. Mini/Haiku are 10 to 20x cheaper than frontier models.
- Shorter prompts, less history, and less code all reduce input token costs.
- Use prompt caching for repeated system prompts (50% off cached tokens on OpenAI; even larger read discounts on Anthropic).
- Batch APIs give 50% off for non-urgent workloads.
- Don't optimize during prototyping. Get it working first, then make it cheap.
- A well-optimized pipeline can cost 90 to 98% less than a naive one with the same quality.