08. Cost and Token Optimization
Why Cost Matters
If you're making one-off requests in ChatGPT, cost is irrelevant. But the moment you build something that makes hundreds or thousands of API calls, cost becomes a real engineering concern.
A poorly optimized pipeline can cost $500/month. The same pipeline, optimized, might cost $30. Same quality, same results. The difference is knowing where the money goes.
How Pricing Works
API pricing is based on tokens. You pay separately for:
- Input tokens: Everything you send (system prompt + conversation history + your message)
- Output tokens: Everything the model generates in response
Output tokens are typically 4 to 5x more expensive than input tokens because generation is more compute-intensive than reading.
Pricing (per 1M tokens):

| Model | Input | Output |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Haiku | $0.25 | $1.25 |

The difference between GPT-4o and GPT-4o-mini is roughly 16x on both input and output. That's why model selection matters so much for cost.
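To make the arithmetic concrete, here's a small Python helper that turns those list prices into a per-request dollar cost. The prices are a snapshot of the table above and will drift; the model keys are just labels for this sketch:

```python
# Snapshot of the list prices above, per 1M tokens; providers change these,
# so treat the numbers as illustrative.
PRICES = {
    "gpt-4o":        {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":   {"input": 0.15, "output": 0.60},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
    "claude-haiku":  {"input": 0.25, "output": 1.25},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the prices above."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A request with a 1,500-token prompt and a 500-token answer:
print(request_cost("gpt-4o", 1500, 500))       # $0.00875
print(request_cost("gpt-4o-mini", 1500, 500))  # $0.000525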
Optimization Strategies
1. Use the Cheapest Model That Works
This is the single biggest cost lever. If GPT-4o-mini handles your task at acceptable quality, you've just cut that part of your bill by roughly 16x compared to GPT-4o.
Before: All requests go to GPT-4o
1000 requests/day × ~2000 tokens avg (1500 input + 500 output) = 2M tokens/day
Cost: (1.5M × $2.50 + 0.5M × $10.00) / 1M = ~$8.75/day = ~$263/month
After: Simple tasks go to GPT-4o-mini, complex tasks stay on GPT-4o
800 simple requests → mini: ~$0.42/day
200 complex requests → 4o: ~$1.75/day
Cost: ~$2.17/day = ~$65/month
Savings: ~75%
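A minimal sketch of that routing split, assuming some way to flag a request as simple or complex. The classify_complexity helper here is a placeholder you'd supply yourself; the length heuristic is purely illustrative:

```python
from openai import OpenAI

client = OpenAI()

def classify_complexity(prompt: str) -> str:
    """Placeholder heuristic: route on whatever signal you have
    (prompt length, task type, a cheap classifier model)."""
    return "simple" if len(prompt) < 2000 else "complex"

def ask(prompt: str) -> str:
    # Cheap model for the easy majority, frontier model for the rest.
    model = "gpt-4o-mini" if classify_complexity(prompt) == "simple" else "gpt-4o"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```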
2. Reduce Input Tokens
Every token you send costs money. Cut the fat:
Shorter system prompts. Does your system prompt need to be 500 tokens? Can you say the same thing in 100?
❌ "You are a highly skilled and experienced software engineer who
specializes in reviewing code for potential bugs, security
vulnerabilities, and performance issues. When you find an issue,
please explain it clearly and suggest a fix."
✅ "You review code for bugs, security issues, and performance problems.
For each issue: explain it, suggest a fix."

Same behavior, 60% fewer tokens.
Trim conversation history. Don't send 20 turns of history when the model only needs the last 3 to answer the current question.
Send less code. If the bug is in one function, send that function, not the entire file.
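A quick way to keep yourself honest about input size is to count tokens before you send them. A sketch using the tiktoken library (o200k_base is the encoding GPT-4o and GPT-4o-mini use; adjust for other models):

```python
import tiktoken

# o200k_base is the encoding GPT-4o and GPT-4o-mini use.
enc = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

verbose = (
    "You are a highly skilled and experienced software engineer who "
    "specializes in reviewing code for potential bugs, security "
    "vulnerabilities, and performance issues. When you find an issue, "
    "please explain it clearly and suggest a fix."
)
terse = (
    "You review code for bugs, security issues, and performance problems. "
    "For each issue: explain it, suggest a fix."
)

print(count_tokens(verbose), count_tokens(terse))  # the terse version is far smaller
```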
3. Reduce Output Tokens
Output tokens cost 4 to 5x more than input. Constrain the response:

- "Respond in under 50 words."
- "Return only the JSON, no explanation."
- "Answer with yes or no."

Setting max_tokens in the API also helps. If you only need a short answer, cap it at 256 tokens so the model doesn't keep generating unnecessary content.
4. Caching
If you're sending the same system prompt and context repeatedly, use prompt caching (available on both the OpenAI and Anthropic APIs). On OpenAI, cached input tokens cost 50% less because the provider doesn't have to reprocess them; Anthropic discounts cache reads even more steeply, in exchange for a small premium when the cache is first written.
Without caching:
1000 requests × 2,000-token system prompt = 2M input tokens billed at full price
With caching:
First request: 2,000 tokens at full price
Next 999 requests: 2,000 tokens at 50% price
Savings: ~1M tokens' worth of cost

This matters most when your system prompt or context is large and doesn't change between requests.
How it works in practice: Caching is automatic on OpenAI (for prefixes ≥1024 tokens). You send the same system prompt across requests and the provider caches the processed representation:
```bash
# Both requests share the same system prompt — the second one hits cache
curl https://api.openai.com/v1/chat/completions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "[your large system prompt — 1024+ tokens]"},
{"role": "user", "content": "User question here"}
]
}'
```

The response usage object shows what was cached:

```json
"usage": {
"prompt_tokens": 1200,
"completion_tokens": 85,
"prompt_tokens_details": {
"cached_tokens": 1024
}
}
```

Key rules: the prefix must be identical across requests (one character difference breaks the cache), and cached prefixes expire after ~5–10 minutes of inactivity. On Anthropic, you explicitly mark cache breakpoints with "cache_control": {"type": "ephemeral"} in the message block.
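For comparison, here is what the explicit Anthropic version might look like. The model id is illustrative, and the usage fields on the response show cache reads and writes:

```python
import anthropic

client = anthropic.Anthropic()

LARGE_SYSTEM_PROMPT = "..."  # your 1024+ token system prompt goes here

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model id
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Cache everything up to and including this block:
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "User question here"}],
)

# usage reports cache activity: cache_creation_input_tokens on the first
# request, cache_read_input_tokens on subsequent hits.
print(response.usage)
```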
5. Batching
If your requests aren't time-sensitive, use batch APIs. OpenAI's batch API gives you 50% off in exchange for results delivered within 24 hours instead of immediately.
Good for:
- Nightly processing jobs
- Bulk classification
- Data extraction from large datasets
- Generating embeddings
Not good for:
- Interactive user-facing features
- Anything that needs a response in seconds
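A sketch of the OpenAI batch flow: write one request per line to a JSONL file, upload it, then create a batch with the 24-hour completion window. The file name and ticket texts are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()

# One request per line, each with a custom_id to match results back later.
tickets = ["Can't log in", "Refund request", "App crashes on startup"]  # illustrative
with open("batch.jsonl", "w") as f:
    for i, ticket in enumerate(tickets):
        f.write(json.dumps({
            "custom_id": f"ticket-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "max_tokens": 5,
                "messages": [
                    {"role": "system",
                     "content": "Classify the ticket into category 1-8. Reply with the number only."},
                    {"role": "user", "content": ticket},
                ],
            },
        }) + "\n")

# Upload the file, then create the batch.
batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the relaxed deadline is what buys the 50% discount
)
print(batch.id, batch.status)
```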
Real Cost Breakdown
Here's a real example of optimizing a support ticket classifier:
Task: Classify 5000 support tickets/day into 8 categories
Version 1 (naive):
Model: GPT-4o
Input: ~300 tokens/ticket (system prompt + ticket text)
Output: ~50 tokens/ticket (category + explanation)
Daily cost: (1.5M × $2.50 + 250K × $10) / 1M = $6.25/day = $187/month
Version 2 (optimized):
Model: GPT-4o-mini
Input: ~150 tokens/ticket (shorter prompt, no explanation requested)
Output: ~5 tokens/ticket (just the category number)
Daily cost: (750K × $0.15 + 25K × $0.60) / 1M = $0.13/day = $3.90/month
Accuracy was essentially unchanged (95% vs. 96%). Cost reduction: 98%.

The optimizations:
- Switched to mini (16x cheaper)
- Shortened the system prompt
- Asked for just the category number instead of an explanation
- Reduced output from 50 tokens to 5
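If you want to verify numbers like these against real traffic, the usage object on each response gives exact token counts you can multiply by the list prices. A sketch using the GPT-4o-mini prices from the table above (the ticket text is illustrative):

```python
from openai import OpenAI

client = OpenAI()

# GPT-4o-mini list prices from the table above, per 1M tokens.
INPUT_PRICE, OUTPUT_PRICE = 0.15, 0.60

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=5,
    messages=[
        {"role": "system",
         "content": "Classify the ticket into category 1-8. Reply with the number only."},
        {"role": "user", "content": "I was charged twice this month."},
    ],
)

u = response.usage
cost = (u.prompt_tokens * INPUT_PRICE + u.completion_tokens * OUTPUT_PRICE) / 1_000_000
print(f"{u.prompt_tokens} in, {u.completion_tokens} out -> ${cost:.6f}")
```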
When Not to Optimize
Don't optimize prematurely. If you're prototyping, use the best model and don't worry about cost. Optimize when:
- You're going to production with real volume
- Your monthly bill is higher than you'd like
- You've validated that the cheaper model produces acceptable quality
The order matters: get it working first, then make it cheap.
Key Takeaways
- Output tokens cost 4 to 5x more than input tokens. Constrain your responses.
- The biggest cost lever is model selection. Mini/Haiku are 10 to 20x cheaper than frontier models.
- Shorter prompts, less history, and less code all reduce input token costs.
- Use prompt caching for repeated system prompts (50% off cached tokens on OpenAI; even larger read discounts on Anthropic).
- Batch APIs give 50% off for non-urgent workloads.
- Don't optimize during prototyping. Get it working first, then make it cheap.
- A well-optimized pipeline can cost 90 to 98% less than a naive one with the same quality.