02. How to Choose a Model
The Four Dimensions
Every model choice comes down to four tradeoffs:
| Dimension | Question |
|---|---|
| Quality | How good does the output need to be? |
| Speed | How fast do I need the response? |
| Cost | How much am I willing to pay per request? |
| Privacy | Can the data leave my machine? |
No model wins on all four. You're always trading one for another.
The Decision Framework
Ask these questions in order:
1. Does the data need to stay local?
If yes, use a local model (Ollama). Skip cloud APIs entirely. This applies to proprietary code, medical records, legal documents, or anything you can't send to a third party.
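As a sketch, here is what a local call through Ollama's HTTP API can look like, assuming Ollama is running on its default port and a model such as `llama3.1` has already been pulled (the function names are illustrative):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt: str, model: str = "llama3.1") -> dict:
    # Payload for Ollama's /api/generate endpoint; stream=False returns one JSON object.
    return {"model": model, "prompt": prompt, "stream": False}

def generate_locally(prompt: str, model: str = "llama3.1") -> str:
    # The prompt never leaves your machine: the request goes to localhost only.
    data = json.dumps(build_request(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the endpoint is `localhost`, nothing in the prompt or response ever crosses the network boundary.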
2. How complex is the task?
| Complexity | Examples | Model tier | Options |
|---|---|---|---|
| Simple | Classification, extraction, reformatting | Cheapest that passes your quality bar | GPT-4o-mini, Claude Haiku, Gemini Flash |
| Medium | Summarization, code generation, analysis | Mid-tier | GPT-4o, Claude Sonnet, Gemini Pro |
| Hard | Complex reasoning, multi-step logic, novel problems | Thinking / frontier | o3, Claude Opus, DeepSeek-R1 |
3. What's the volume?
| Scenario | Priority | Guidance |
|---|---|---|
| One-off task | Quality | Use the best model available — cost is irrelevant |
| 1,000 req/day | Balance | Use the cheapest model that meets your quality bar |
| 100,000 req/day | Cost | Every cent per request is $1,000/day. Optimize ruthlessly |
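The cost math behind that last row is worth making explicit; a one-line helper shows how quickly per-request cents compound at volume:

```python
def daily_cost(cost_per_request: float, requests_per_day: int) -> float:
    """Total spend per day at a given per-request price."""
    return cost_per_request * requests_per_day

# At 100,000 requests/day, one cent per request is $1,000/day:
print(daily_cost(0.01, 100_000))  # 1000.0
# The same cent is noise at 1,000 requests/day:
print(daily_cost(0.01, 1_000))    # 10.0
```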
4. What's the latency requirement?
| Scenario | Strategy |
|---|---|
| Interactive (user is waiting) | Fast model, streaming enabled |
| Background (batch processing) | Slow model is fine — optimize for cost |
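For the interactive case, a sketch of a streaming helper in the style of the OpenAI Python SDK (the client object and model name are assumptions; swap in whichever SDK you use):

```python
def stream_reply(client, prompt: str, model: str = "gpt-4o-mini"):
    """Yield response text as it arrives so the user sees tokens immediately."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # the key flag: chunks arrive as they are generated
    )
    for chunk in stream:
        piece = chunk.choices[0].delta.content
        if piece:  # the final chunk carries no content
            yield piece
```

With streaming, perceived latency becomes time-to-first-token rather than time-to-full-response, which is what matters when a user is watching the screen.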
Matching Tasks to Models
| Task | Best choice | Why |
|---|---|---|
| Classify sentiment (pos/neg/neutral) | GPT-4o-mini, Haiku | Simple task, cheapest wins |
| Summarize a meeting transcript | Claude Sonnet | Good at long text, follows format instructions well |
| Generate a REST API in Go | Claude Sonnet, GPT-4o | Both strong at code |
| Review architecture for flaws | o3, DeepSeek-R1 | Needs deep reasoning |
| Extract structured data from emails | GPT-4o-mini | Structured output is GPT's strength |
| Analyze a 200-page PDF | Gemini Pro | 1M token context window fits the whole doc |
| Chat with users in production | GPT-4o-mini, Haiku | Fast, cheap, good enough for conversation |
| Translate to 5 languages | Qwen 2.5 | Strong multilingual support, runs locally |
The "Good Enough" Principle
Don't reach for the best model. Use the cheapest model that produces acceptable output for your specific task.
Example: Classify support tickets into 5 categories
| Model | Accuracy | Cost/ticket | Cost at 10K tickets/day |
|---|---|---|---|
| GPT-4o | 98% | $0.0030 | $30/day |
| GPT-4o-mini | 95% | $0.0002 | $2/day |
| Haiku | 94% | $0.0001 | $1/day |
Haiku wins. The 4% accuracy gap doesn't justify 30× the cost.
For most classification, extraction, and formatting tasks, cheap models are good enough. Save expensive models for tasks where quality actually matters — complex reasoning, creative writing, or nuanced analysis.
Two Phases of Model Selection
Prototyping: Start with the best model to establish what "good" looks like. This gives you a quality ceiling and a reference to compare against. Don't worry about cost yet — you're finding the bar.
Production: Downgrade to the cheapest model that still meets that bar. Every cent matters at scale.
When to Upgrade (or Downgrade)
In production, start cheap and upgrade only when you see actual failures:
- Output is wrong or incomplete → try a larger model
- Model doesn't follow your format → try Claude (better at instruction following)
- Reasoning is shallow or misses edge cases → try a thinking model (o3, R1)
- Context is too long for the model's window → try Gemini (1M tokens)
If you prototyped with a frontier model, downgrade methodically:
- Run your test cases against a cheaper model
- Compare output quality side by side
- If the cheaper model passes your quality bar → ship it
- If it doesn't → move one tier up and test again
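That downgrade loop is mechanical enough to script. A minimal sketch, assuming you already have a `passes(model, case)` check that encodes your quality bar (both the check and the tier names here are placeholders):

```python
def cheapest_passing_model(tiers, test_cases, passes, min_pass_rate=0.95):
    """Walk tiers cheapest-first; return the first model that clears the bar.

    tiers: model names ordered cheapest to most expensive.
    passes: callable (model, case) -> bool, your quality check.
    """
    for model in tiers:
        score = sum(passes(model, case) for case in test_cases) / len(test_cases)
        if score >= min_pass_rate:
            return model, score
    return tiers[-1], None  # nothing passed; fall back to the best tier
```

Running this over a fixed test set turns "compare output quality side by side" into a repeatable check instead of a one-time eyeball.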
The Model Tier List (2025)
Tier 1, Frontier (best quality, highest cost):
- GPT-4o, Claude Sonnet, Gemini 2.5 Pro
Tier 2, Reasoning (slow but deep thinking):
- o3, o4-mini, DeepSeek-R1
Tier 3, Fast and Cheap (90% quality, 10% cost):
- GPT-4o-mini, Claude Haiku, Gemini Flash
Tier 4, Local/Open (free, private, runs on your hardware):
- Llama 3.1 8B, Qwen 2.5 7B/14B, DeepSeek-V3
Most tasks need Tier 3. Some need Tier 1. Few need Tier 2. Reach for Tier 4 when privacy matters or you need to eliminate per-request cost entirely.
Cloud vs Local Equivalents
| Cloud model | Local equivalent | Notes |
|---|---|---|
| GPT-4o-mini, Haiku | Llama 3.1 8B, Qwen 2.5 7B | Good for simple tasks, fast |
| GPT-4o, Claude Sonnet | Qwen 2.5 14B, Llama 3.1 70B | 14B runs on 16GB RAM; 70B needs a server |
| o3, Claude Opus | DeepSeek-R1 distills (14B/70B) | Thinking models; run locally but slowly |
| Gemini Flash | Llama 3.2 3B | Ultra-fast, limited quality |
Local models won't match cloud quality at the same parameter count. But they're free, private, and often good enough for simple to medium tasks.
Real Example: Building a Feature
You're building an AI-powered code review feature. Here's how to think through model selection:
Requirements:
- Reviews pull requests (50–500 lines of code)
- Finds bugs, suggests improvements
- Runs on every PR (high volume)
- Users see results in under 10 seconds
Walking through the framework:
| Question | Answer |
|---|---|
| 1. Privacy? | No — code is already on GitHub. Cloud is fine. |
| 2. Complexity? | Medium — needs to understand code, not prove theorems. |
| 3. Volume? | High — every PR across the team. |
| 4. Latency? | Moderate — under 10 seconds is acceptable. |
Conclusion: Start with GPT-4o-mini. It's fast, cheap, and decent at code. If reviews are too shallow, upgrade to Claude Sonnet. Don't start with o3 — it's slow and expensive for this volume.
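The first three questions of the framework (latency mostly selects streaming vs batch rather than a tier) can be condensed into a first-pass heuristic. The tier names and the 10,000 req/day threshold are illustrative, not gospel:

```python
def pick_tier(private: bool, complexity: str, requests_per_day: int) -> str:
    """First-pass model tier from the decision framework.

    complexity: "simple" | "medium" | "hard".
    """
    if private:
        return "local (Ollama)"           # question 1: data stays on-machine
    if complexity == "hard":
        return "reasoning (o3, R1)"       # question 2: deep reasoning required
    if complexity == "simple" or requests_per_day >= 10_000:
        return "fast/cheap (mini, Haiku, Flash)"  # question 3: volume dominates
    return "frontier (GPT-4o, Sonnet)"    # medium complexity, modest volume

# The code-review feature: not private, medium complexity, high volume.
print(pick_tier(False, "medium", 50_000))  # fast/cheap (mini, Haiku, Flash)
```

The code-review case lands on the fast/cheap tier, matching the conclusion above: start with GPT-4o-mini and only upgrade on observed failures.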
Key Takeaways
- Four dimensions: quality, speed, cost, privacy. You always trade between them.
- Prototyping: start with the best model to find your quality ceiling.
- Production: downgrade to the cheapest model that still meets that bar.
- Simple tasks (classification, extraction) → cheap models (mini, Haiku, Flash)
- Complex reasoning → thinking models (o3, R1)
- Long documents → Gemini (1M context)
- Privacy-sensitive data → local models (Ollama)
- Most tasks need a Tier 3 model. Don't default to the most expensive option.