02. How to Choose a Model
The Four Dimensions
Every model choice comes down to four tradeoffs:
| Dimension | Question |
|---|---|
| Quality | How good does the output need to be? |
| Speed | How fast do I need the response? |
| Cost | How much am I willing to pay per request? |
| Privacy | Can the data leave my machine? |
No model wins on all four. You're always trading one for another.
The Decision Framework
Ask these questions in order:
1. Does the data need to stay local?
If yes, use a local model (Ollama). Skip cloud APIs entirely. This applies to proprietary code, medical records, legal documents, or anything you can't send to a third party.
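As a sketch, here is what a local call through Ollama's HTTP API can look like, assuming Ollama is running on its default port and a model such as `llama3.1` has already been pulled (the function names are illustrative):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt: str, model: str = "llama3.1") -> dict:
    # Payload for Ollama's /api/generate endpoint; stream=False returns one JSON object.
    return {"model": model, "prompt": prompt, "stream": False}

def generate_locally(prompt: str, model: str = "llama3.1") -> str:
    # The prompt never leaves your machine: the request goes to localhost only.
    data = json.dumps(build_request(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the endpoint is `localhost`, nothing in the prompt or response ever crosses the network boundary.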
2. How complex is the task?
| Complexity | Examples | Model tier | Options |
|---|---|---|---|
| Simple | Classification, extraction, reformatting | Cheapest that passes your quality bar | GPT-4o-mini, Claude Haiku, Gemini Flash |
| Medium | Summarization, code generation, analysis | Mid-tier | GPT-4o, Claude Sonnet, Gemini Pro |
| Hard | Complex reasoning, multi-step logic, novel problems | Thinking / frontier | o3, Claude Opus, DeepSeek-R1 |
3. What's the volume?
| Scenario | Priority | Guidance |
|---|---|---|
| One-off task | Quality | Use the best model available — cost is irrelevant |
| 1,000 req/day | Balance | Use the cheapest model that meets your quality bar |
| 100,000 req/day | Cost | Every cent per request is $1,000/day. Optimize ruthlessly |
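The cost math behind that last row is worth making explicit; a one-line helper shows how quickly per-request cents compound at volume:

```python
def daily_cost(cost_per_request: float, requests_per_day: int) -> float:
    """Total spend per day at a given per-request price."""
    return cost_per_request * requests_per_day

# At 100,000 requests/day, one cent per request is $1,000/day:
print(daily_cost(0.01, 100_000))  # 1000.0
# The same cent is noise at 1,000 requests/day:
print(daily_cost(0.01, 1_000))    # 10.0
```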
4. What's the latency requirement?
| Scenario | Strategy |
|---|---|
| Interactive (user is waiting) | Fast model, streaming enabled |
| Background (batch processing) | Slow model is fine — optimize for cost |
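For the interactive case, a sketch of a streaming helper in the style of the OpenAI Python SDK (the client object and model name are assumptions; swap in whichever SDK you use):

```python
def stream_reply(client, prompt: str, model: str = "gpt-4o-mini"):
    """Yield response text as it arrives so the user sees tokens immediately."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # the key flag: chunks arrive as they are generated
    )
    for chunk in stream:
        piece = chunk.choices[0].delta.content
        if piece:  # the final chunk carries no content
            yield piece
```

With streaming, perceived latency becomes time-to-first-token rather than time-to-full-response, which is what matters when a user is watching the screen.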
Matching Tasks to Models
| Task | Best choice | Why |
|---|---|---|
| Classify sentiment (pos/neg/neutral) | GPT-4o-mini, Haiku | Simple task, cheapest wins |
| Summarize a meeting transcript | Claude Sonnet | Good at long text, follows format instructions well |
| Generate a REST API in Go | Claude Sonnet, GPT-4o | Both strong at code |
| Review architecture for flaws | o3, DeepSeek-R1 | Needs deep reasoning |
| Extract structured data from emails | GPT-4o-mini | Structured output is GPT's strength |
| Analyze a 200-page PDF | Gemini Pro | 1M token context window fits the whole doc |
| Chat with users in production | GPT-4o-mini, Haiku | Fast, cheap, good enough for conversation |
| Translate to 5 languages | Qwen 2.5 | Strong multilingual support, runs locally |
The "Good Enough" Principle
Don't reach for the best model. Use the cheapest model that produces acceptable output for your specific task.
Example: Classify support tickets into 5 categories
| Model | Accuracy | Cost/ticket | Cost at 10K tickets/day |
|---|---|---|---|
| GPT-4o | 98% | $0.0030 | $30/day |
| GPT-4o-mini | 95% | $0.0002 | $2/day |
| Haiku | 94% | $0.0001 | $1/day |
Haiku wins. The 4% accuracy gap doesn't justify 30× the cost.
For most classification, extraction, and formatting tasks, cheap models are good enough. Save expensive models for tasks where quality actually matters — complex reasoning, creative writing, or nuanced analysis.
Two Phases of Model Selection
Prototyping: Start with the best model to establish what "good" looks like. This gives you a quality ceiling and a reference to compare against. Don't worry about cost yet — you're finding the bar.
Production: Downgrade to the cheapest model that still meets that bar. Every cent matters at scale.
When to Upgrade (or Downgrade)
In production, start cheap and upgrade only when you see actual failures:
- Output is wrong or incomplete → try a larger model
- Model doesn't follow your format → try Claude (better at instruction following)
- Reasoning is shallow or misses edge cases → try a thinking model (o3, R1)
- Context is too long for the model's window → try Gemini (1M tokens)
If you prototyped with a frontier model, downgrade methodically:
- Run your test cases against a cheaper model
- Compare output quality side by side
- If the cheaper model passes your quality bar → ship it
- If it doesn't → move one tier up and test again
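That downgrade loop is mechanical enough to script. A minimal sketch, assuming you already have a `passes(model, case)` check that encodes your quality bar (both the check and the tier names here are placeholders):

```python
def cheapest_passing_model(tiers, test_cases, passes, min_pass_rate=0.95):
    """Walk tiers cheapest-first; return the first model that clears the bar.

    tiers: model names ordered cheapest to most expensive.
    passes: callable (model, case) -> bool, your quality check.
    """
    for model in tiers:
        score = sum(passes(model, case) for case in test_cases) / len(test_cases)
        if score >= min_pass_rate:
            return model, score
    return tiers[-1], None  # nothing passed; fall back to the best tier
```

Running this over a fixed test set turns "compare output quality side by side" into a repeatable check instead of a one-time eyeball.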
The Model Tier List (2025)
Tier 1, Frontier (best quality, highest cost):
- GPT-4o, Claude Sonnet, Gemini 2.5 Pro
Tier 2, Reasoning (slow but deep thinking):
- o3, o4-mini, DeepSeek-R1
Tier 3, Fast and Cheap (90% quality, 10% cost):
- GPT-4o-mini, Claude Haiku, Gemini Flash
Tier 4, Local/Open (free, private, runs on your hardware):
- Llama 3.1 8B, Qwen 2.5 7B/14B, DeepSeek-V3
Most tasks need Tier 3. Some need Tier 1. Few need Tier 2. Reach for Tier 4 when privacy matters or you need to eliminate per-request cost entirely.
Cloud vs Local Equivalents
| Cloud model | Local equivalent | Notes |
|---|---|---|
| GPT-4o-mini, Haiku | Llama 3.1 8B, Qwen 2.5 7B | Good for simple tasks, fast |
| GPT-4o, Claude Sonnet | Qwen 2.5 14B, Llama 3.1 70B | 14B runs on 16GB RAM; 70B needs a server |
| o3, Claude Opus | DeepSeek-R1 distills (14B/70B) | Thinking models; run locally but slowly |
| Gemini Flash | Llama 3.2 3B | Ultra-fast, limited quality |
Local models won't match cloud quality at the same parameter count. But they're free, private, and often good enough for simple to medium tasks.
Real Example: Building a Feature
You're building an AI-powered code review feature. Here's how to think through model selection:
Requirements:
- Reviews pull requests (50–500 lines of code)
- Finds bugs, suggests improvements
- Runs on every PR (high volume)
- Users see results in under 10 seconds
Walking through the framework:
| Question | Answer |
|---|---|
| 1. Privacy? | No — code is already on GitHub. Cloud is fine. |
| 2. Complexity? | Medium — needs to understand code, not prove theorems. |
| 3. Volume? | High — every PR across the team. |
| 4. Latency? | Moderate — under 10 seconds is acceptable. |
Conclusion: Start with GPT-4o-mini. It's fast, cheap, and decent at code. If reviews are too shallow, upgrade to Claude Sonnet. Don't start with o3 — it's slow and expensive for this volume.
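The first three questions of the framework (latency mostly selects streaming vs batch rather than a tier) can be condensed into a first-pass heuristic. The tier names and the 10,000 req/day threshold are illustrative, not gospel:

```python
def pick_tier(private: bool, complexity: str, requests_per_day: int) -> str:
    """First-pass model tier from the decision framework.

    complexity: "simple" | "medium" | "hard".
    """
    if private:
        return "local (Ollama)"           # question 1: data stays on-machine
    if complexity == "hard":
        return "reasoning (o3, R1)"       # question 2: deep reasoning required
    if complexity == "simple" or requests_per_day >= 10_000:
        return "fast/cheap (mini, Haiku, Flash)"  # question 3: volume dominates
    return "frontier (GPT-4o, Sonnet)"    # medium complexity, modest volume

# The code-review feature: not private, medium complexity, high volume.
print(pick_tier(False, "medium", 50_000))  # fast/cheap (mini, Haiku, Flash)
```

The code-review case lands on the fast/cheap tier, matching the conclusion above: start with GPT-4o-mini and only upgrade on observed failures.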
Key Takeaways
- Four dimensions: quality, speed, cost, privacy. You always trade between them.
- Prototyping: start with the best model to find your quality ceiling.
- Production: downgrade to the cheapest model that still meets that bar.
- Simple tasks (classification, extraction) → cheap models (mini, Haiku, Flash)
- Complex reasoning → thinking models (o3, R1)
- Long documents → Gemini (1M context)
- Privacy-sensitive data → local models (Ollama)
- Most tasks need a Tier 3 model. Don't default to the most expensive option.