03. Thinking Models vs Fast Models
Two Modes of AI
Models now come in two fundamentally different modes:
Fast models respond immediately. They generate tokens left-to-right without pausing to plan. GPT-4o, Claude Sonnet, Gemini Flash. You get a response in 1 to 3 seconds.
Thinking models pause before responding. They spend time reasoning internally, exploring approaches, and checking their work before giving you an answer. o3, o4-mini, DeepSeek-R1. Response time is 10 to 60 seconds.
This isn't just a speed difference. It's a fundamentally different approach to problem-solving.
How Thinking Models Work
A fast model sees "What's 17 × 24?" and immediately starts generating: "The answer is..." Sometimes it gets it right, sometimes it doesn't.
A thinking model sees the same question and internally reasons:
Thinking: 17 × 24
= 17 × 20 + 17 × 4
= 340 + 68
= 408
Then outputs: "408."
The thinking happens in hidden tokens you don't see (or in DeepSeek-R1's case, tokens you can see). The model is essentially talking to itself before talking to you.
When Thinking Models Win
Thinking models are dramatically better at:
Math and logic
Fast model: "A farmer has 17 sheep. All but 9 die. How many are left?"
→ Often says "8" (17 - 9, wrong interpretation)
Thinking model: Reasons through the language carefully.
→ "9" (correct: "all but 9" means 9 survive)
Multi-step problems
Prompt: Design a database schema for a ride-sharing app that handles surge pricing, driver ratings, and trip history. Consider the tradeoffs between normalization and query performance.
Fast model: Gives a reasonable but surface-level schema.
Thinking model: Considers multiple approaches, weighs tradeoffs, produces a more thoughtful design with justifications.
Code with complex logic
Prompt: Write a function that determines if a Sudoku board is valid. Handle partial boards (empty cells are valid).
Fast model: Often misses edge cases like duplicate checking within 3x3 boxes.
Thinking model: Systematically checks rows, columns, and boxes.
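The systematic row/column/box checking described above can be sketched directly. This is a minimal Python version of the Sudoku validator from the prompt, using "." for empty cells (an assumption, since the prompt doesn't fix a representation):

```python
def is_valid_sudoku(board):
    """Check a 9x9 board for duplicates in any row, column, or 3x3 box.
    Empty cells ("." here) are skipped, so partial boards can be valid."""
    rows = [set() for _ in range(9)]
    cols = [set() for _ in range(9)]
    boxes = [set() for _ in range(9)]
    for r in range(9):
        for c in range(9):
            v = board[r][c]
            if v == ".":
                continue
            b = (r // 3) * 3 + c // 3  # index of the 3x3 box containing (r, c)
            if v in rows[r] or v in cols[c] or v in boxes[b]:
                return False
            rows[r].add(v)
            cols[c].add(v)
            boxes[b].add(v)
    return True
```

The box check is exactly the edge case fast models tend to miss: two cells like (0, 0) and (1, 1) share no row or column, but they share box 0, so a duplicate there must still be rejected.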
When Fast Models Win
Thinking models are overkill (and wasteful) for:
Simple generation:
"Write a professional email declining a meeting."
→ Fast model handles this perfectly. No reasoning needed.
Classification:
"Is this review positive or negative: 'Great product, love it!'"
→ Trivial. Thinking adds nothing here.
Formatting and extraction:
"Extract the name, email, and phone from this text."
→ Pattern matching, not reasoning.
Conversation and explanations:
"Explain what a goroutine is."
→ This is recall, not reasoning. Fast model is better AND faster.
The Decision Rule
Does the task require reasoning, planning, or multi-step logic?
YES → thinking model (o3, o4-mini, DeepSeek-R1)
NO → fast model (GPT-4o, Claude Sonnet, Haiku)
More specifically:
| Task type | Model type | Examples |
|---|---|---|
| Recall / knowledge | Fast | "What is X?", explanations, definitions |
| Generation | Fast | Emails, code boilerplate, summaries |
| Classification | Fast | Sentiment, categorization, routing |
| Extraction | Fast | Parse data from text, JSON extraction |
| Complex reasoning | Thinking | Math proofs, logic puzzles, architecture |
| Bug finding | Thinking | "Why does this code fail on edge case X?" |
| Planning | Thinking | "Design a system that handles..." |
| Novel problems | Thinking | Anything the model hasn't seen patterns for |
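The table maps naturally to a routing function. A hypothetical sketch, where the task-type labels and model names are illustrative placeholders rather than a real API:

```python
# Coarse task types from the table above, mapped to a model tier.
FAST_TASKS = {"recall", "generation", "classification", "extraction"}
THINKING_TASKS = {"reasoning", "bug_finding", "planning", "novel"}

def pick_model(task_type: str) -> str:
    """Return an illustrative model name for a given task type."""
    if task_type in THINKING_TASKS:
        return "o4-mini"      # thinking tier
    if task_type in FAST_TASKS:
        return "gpt-4o-mini"  # fast tier
    raise ValueError(f"unknown task type: {task_type}")
```

In practice the hard part is classifying the task in the first place; some systems use a cheap fast-model call for that step.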
The Thinking Models
OpenAI o3 / o4-mini
- o3: Most capable reasoning model from OpenAI. Expensive, slow (10 to 30 seconds).
- o4-mini: Faster, cheaper, still strong reasoning. Good default when you need thinking but not the absolute best.
DeepSeek-R1
- Shows its reasoning chain (you can read the thinking tokens).
- Competitive with o3 on math and code benchmarks.
- Much cheaper than OpenAI's reasoning models.
- Available as open weights, so you can run it locally with Ollama.
ollama run deepseek-r1:14b
Claude with Extended Thinking
- Claude Sonnet can be asked to "think step by step." It's not a true thinking model, but it gets some of the benefit by reasoning in its visible output.
- Anthropic is working on dedicated reasoning models.
Cost Comparison
Thinking models cost more because they generate more tokens internally:
Task: "Find the bug in this 50-line function"
GPT-4o: ~500 output tokens, $0.003, 2 seconds
o3: ~3000 thinking tokens + 500 output, $0.02, 15 seconds
DeepSeek-R1: ~2000 thinking tokens + 500 output, $0.005, 12 seconds
The thinking tokens cost money even though you don't see them in the final response (except with R1, where they're visible). A task that costs $0.003 with a fast model might cost $0.02 with a thinking model, roughly 7x more.
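A quick sanity check on those figures. The per-1K price here is back-solved from the example numbers above, not official pricing:

```python
def request_cost(thinking_tokens, output_tokens, price_per_1k):
    # Thinking tokens are billed as output tokens even though
    # most providers hide them from the response.
    return (thinking_tokens + output_tokens) / 1000 * price_per_1k

fast = request_cost(0, 500, 0.006)         # GPT-4o-style: $0.003
thinking = request_cost(3000, 500, 0.006)  # o3-style: $0.021
print(f"fast ${fast:.3f}, thinking ${thinking:.3f}, "
      f"ratio {thinking / fast:.0f}x")
```

At the same per-token price, the cost ratio is just the token ratio: 3500 / 500 = 7x, matching the rough estimate above.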
Hybrid Strategy
The smart approach: use fast models by default, escalate to thinking models only when needed.
User asks a question
→ Try with fast model (GPT-4o-mini)
→ If output quality is poor or task is clearly complex
→ Retry with thinking model (o4-mini or R1)
Some systems do this automatically by detecting when a task needs deeper reasoning and routing accordingly. You can do it manually too: start with the fast model, and if the answer feels shallow or wrong, re-ask with a thinking model.
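The escalation loop above is a few lines of glue code. In this sketch, `call_model` and `looks_shallow` are placeholders for your actual API client and quality heuristic, and the model names are illustrative:

```python
def answer(question, call_model, looks_shallow):
    """Try the fast tier first; escalate to a thinking model on demand."""
    draft = call_model("gpt-4o-mini", question)   # cheap first attempt
    if looks_shallow(draft):
        return call_model("o4-mini", question)    # escalate to thinking tier
    return draft
```

The quality heuristic is the design choice that matters: it can be as crude as a length check or a keyword match, or as involved as a second cheap model call that grades the draft.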
Prompting Differences
Fast models benefit from chain-of-thought prompting ("think step by step") because it forces them to reason in their visible output tokens.
Thinking models already reason internally. Adding "think step by step" is redundant because they're already doing it behind the scenes. Instead, focus on clearly stating the problem and constraints.
Fast model prompt: "Think step by step. What's the time complexity of this function?"
Thinking model prompt: "What's the time complexity of this function?" (It will think step by step automatically)
Key Takeaways
- Fast models respond immediately. Best for generation, classification, extraction, and conversation.
- Thinking models pause to reason. Best for math, logic, complex code, and planning.
- The decision rule: does the task require multi-step reasoning? If yes, use a thinking model.
- Thinking models cost 5 to 10x more because of hidden reasoning tokens.
- DeepSeek-R1 is the budget thinking model. Competitive quality, much cheaper, and it runs locally.
- Don't use thinking models for simple tasks. You're paying for reasoning you don't need.
- Hybrid strategy: start fast, escalate to thinking only when quality demands it.