05. Model-Specific Prompting
Why It Matters
The universal principles from lesson 04 get you 80% of the way. But each model family has quirks, strengths, and specific features that let you get better results when you know how to use them.
This lesson covers the tricks that only work (or work best) on specific models.
GPT (OpenAI)
Structured Output
GPT's biggest edge is structured output. You can force it to return valid JSON matching a schema every single time. No parsing errors, no "here's the JSON wrapped in markdown."
Respond in JSON with this exact schema:
{
"summary": "string",
"sentiment": "positive | negative | neutral",
"confidence": "number between 0 and 1"
}

GPT-4o follows JSON schemas more reliably than any other model. If your task needs structured data extraction, GPT is the default choice.
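Here is what that looks like at the API level. A minimal sketch using the openai Python SDK (v1+); the model name, schema wording, and example input are illustrative:

```python
# Minimal structured-output sketch with the openai SDK.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    # "json_object" guarantees syntactically valid JSON; the prompt
    # still has to spell out the schema you want back.
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "Return JSON with keys: summary (string), "
            "sentiment (positive | negative | neutral), "
            "confidence (number between 0 and 1).",
        },
        {"role": "user", "content": "The release went smoothly and users are happy."},
    ],
)

data = json.loads(response.choices[0].message.content)
print(data["sentiment"], data["confidence"])
```

If you need the schema itself enforced, not just valid JSON, recent API versions also accept a json_schema response format with strict mode; check the current OpenAI docs for the exact shape.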
Function Calling
GPT models support function calling natively. You describe available functions (name, parameters, types) in your API request, and the model decides when to call them and with what arguments. The model doesn't execute anything — it returns a structured JSON object saying "call this function with these values," and your code handles the actual execution. This is how most GPT-based agents work under the hood.
{
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {
"city": { "type": "string" },
"unit": { "enum": ["celsius", "fahrenheit"] }
}
}
}

The model won't hallucinate function names or parameters if you define them clearly.
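A minimal round trip in the openai Python SDK looks like this; get_weather is a stand-in for whatever your application actually exposes:

```python
# Function-calling sketch: the model picks the function and arguments,
# your code does the execution.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

# A structured request comes back instead of prose.
call = response.choices[0].message.tool_calls[0]
print(call.function.name)                   # get_weather
print(json.loads(call.function.arguments))  # {"city": "Oslo", ...}
```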
GPT Tips
- Use the response_format parameter to enforce JSON output at the API level
- GPT-4o-mini follows the same structured output format as GPT-4o, so prototype with mini first
- For classification tasks, give GPT a numbered list of categories and ask it to return just the number
Claude (Anthropic)
Instruction Following
Claude's strength is doing exactly what you tell it. If you say "respond in exactly 3 bullet points," it will. If you say "do not mention X," it won't. This makes Claude excellent for tasks where format precision matters.
XML Tags
Claude responds particularly well to XML-style tags for structuring input. This is an Anthropic-specific trick that helps Claude parse complex prompts with multiple sections.
<context>
You are reviewing a pull request for a Go microservice.
The service handles payment processing.
</context>
<code>
func processPayment(amount float64) error {
// ... code here
}
</code>
<task>
Find security vulnerabilities in this code.
Return each issue as a bullet point with severity (high/medium/low).
</task>

Claude treats content inside XML tags as distinct sections. This reduces confusion when your prompt has multiple parts that could bleed into each other.
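On the API side this is plain string assembly. A sketch with the anthropic Python SDK; the model name is illustrative and the code under review is truncated:

```python
# XML-tagged prompt sketch for Claude.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

code_under_review = "func processPayment(amount float64) error { /* ... */ }"

prompt = f"""<context>
You are reviewing a pull request for a Go microservice.
The service handles payment processing.
</context>

<code>
{code_under_review}
</code>

<task>
Find security vulnerabilities in this code.
Return each issue as a bullet point with severity (high/medium/low).
</task>"""

message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```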
Long Document Handling
Claude handles long context well (200K tokens). When working with long documents, put the document first and your question last. Claude pays more attention to the end of the prompt.
Here is the full contract:
[... 50 pages of legal text ...]
Based on the above, what are the termination clauses?

Claude Tips
- Put instructions at the end of the prompt when working with long documents
- Use XML tags to separate different parts of complex prompts
- Claude is conservative by default. If you want creative or bold output, say so explicitly
- Claude tends to add more comments and explanations in code than GPT does. Add "no comments in the code" if you want clean output
Gemini (Google)
Massive Context
Gemini 2.5 Pro handles 1 million tokens. That's roughly 700,000 words or an entire codebase. The key is knowing how to use that much context effectively.
Here is my entire project (47 files):
[... all source code ...]
Questions:
1. What does the auth middleware do?
2. Where is the database connection configured?
3. Are there any unused imports across the project?

You can dump everything in and ask multiple questions in one shot. With other models you'd need to split this into multiple requests.
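A sketch of how you might assemble that kind of request with the google-genai Python SDK; the file-matching pattern and model name are assumptions for illustration:

```python
# Pack a whole project into one Gemini request, file tree first.
from pathlib import Path
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

root = Path("my-project")
files = sorted(root.rglob("*.go"))  # adjust the glob to your codebase

# A file tree up front helps the model navigate the huge context.
tree = "\n".join(str(p.relative_to(root)) for p in files)
sources = "\n\n".join(
    f"--- {p.relative_to(root)} ---\n{p.read_text()}" for p in files
)

prompt = (
    f"Here is my project file tree:\n{tree}\n\n"
    f"Full sources:\n{sources}\n\n"
    "Questions:\n"
    "1. What does the auth middleware do?\n"
    "2. Where is the database connection configured?\n"
    "3. Are there any unused imports across the project?\n"
)

response = client.models.generate_content(model="gemini-2.5-pro", contents=prompt)
print(response.text)
```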
Multimodal
Gemini handles images natively. You can pass screenshots, diagrams, or photos alongside text.
[image: screenshot of error in browser]
What is causing this error? Here is the relevant component code:
[... code ...]

Gemini Tips
- For large codebases, include a file tree at the top so Gemini can navigate the context
- Gemini sometimes gives longer responses than needed. Add "be concise" or set a word limit
- When using multimodal, describe what the image shows if the model seems to miss details (see the sketch after this list)
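Putting the multimodal pattern into code: a sketch passing an image alongside text with the google-genai SDK, which accepts PIL images directly in contents. The file names here are hypothetical:

```python
# Multimodal sketch: screenshot plus component code in one request.
from pathlib import Path
from PIL import Image
from google import genai

client = genai.Client()

screenshot = Image.open("error_screenshot.png")    # hypothetical file
component = Path("src/LoginForm.tsx").read_text()  # hypothetical path

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        screenshot,
        "What is causing this error? Here is the relevant component code:\n"
        + component,
    ],
)
print(response.text)
```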
DeepSeek
Visible Reasoning
DeepSeek-R1 shows its thinking process. You can read the reasoning chain, which is useful for understanding why the model reached a specific conclusion and whether it went down the wrong path.
User: Is this function thread-safe? [code]
DeepSeek-R1 thinking:
"Let me check for shared state... The counter variable is accessed
without a mutex... Multiple goroutines could increment simultaneously...
This is a race condition."
Answer: No, this function is not thread-safe. The counter variable
is accessed without synchronization...

If the model gives a wrong answer, you can read the thinking and see exactly where the reasoning broke down.
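DeepSeek's cloud API is OpenAI-compatible, so the openai SDK works with a swapped base_url. Per DeepSeek's API docs, deepseek-reasoner returns the thinking separately as reasoning_content; a sketch:

```python
# Read R1's thinking and its answer as separate fields.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="...",  # your DeepSeek API key
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Is this function thread-safe? [code]"}],
)

message = response.choices[0].message
print(message.reasoning_content)  # the visible chain of thought
print(message.content)            # the final answer
```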
Cost Advantage
DeepSeek-V3 is one of the cheapest cloud APIs available. For tasks where you need decent quality at high volume, it's worth testing against GPT-4o-mini to see which gives you better results per dollar.
DeepSeek Tips
- R1's visible thinking helps you debug bad outputs. You can see exactly where the reasoning went wrong
- DeepSeek models are strong at math and code but can be weaker at nuanced English writing
- For local use, deepseek-r1:14b is a good balance of quality and speed on consumer hardware
Llama and Qwen (Local Models)
Prompt Format Matters
Local models are sensitive to prompt format. Each model family expects a specific chat template. Ollama handles this automatically, but if you're using raw inference, you need the right format.
Llama 3 format:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is 2+2?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

If you use Ollama, you don't need to worry about this. Ollama applies the correct template automatically.
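If you are doing raw inference with Hugging Face transformers, the tokenizer can apply the template for you instead of hand-writing the special tokens. A sketch (the model ID is the gated official Llama 3 8B Instruct repo):

```python
# Let the tokenizer build the chat-formatted prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string
    add_generation_prompt=True,  # end with the assistant header
)
print(prompt)  # the <|begin_of_text|>... format shown above
```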
Quantization Tradeoffs
Local models come in different quantization levels (Q4, Q5, Q8, F16). Fewer bits per weight means a smaller file and faster inference, but slightly lower quality.
| Quantization | Size (7B model) | Quality | Speed |
|---|---|---|---|
| Q4_K_M | ~4 GB | Good for most tasks | Fast |
| Q5_K_M | ~5 GB | Slightly better | Moderate |
| Q8_0 | ~7 GB | Near full quality | Slower |
| F16 | ~14 GB | Full quality | Slowest |
For most tasks, Q4_K_M is the sweet spot. You only notice quality loss on complex reasoning tasks.
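With Ollama you pick the quantization through the model tag. A sketch with the ollama Python package; the exact tag is an assumption, so check the model library for the tags that actually exist:

```python
# Run a Q4-quantized model deterministically for a classification task.
import ollama

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",  # assumed Q4_K_M build of an 8B model
    messages=[{"role": "user", "content": "Classify this ticket as billing, "
               "bug, or feature request: 'Refund not received after 10 days.'"}],
    options={"temperature": 0},  # deterministic output for classification
)
print(response["message"]["content"])
```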
Local Model Tips
- Start with Q4 quantization. Only upgrade if you see quality issues on your specific task
- Qwen 2.5 7B performs surprisingly well for code tasks relative to its size
- For simple extraction and classification, even a 3B model works fine
- Set temperature to 0 for deterministic output on classification tasks
Key Takeaways
- GPT excels at structured output and function calling. Use it when you need reliable JSON.
- Claude excels at instruction following and long documents. Use XML tags for complex prompts.
- Gemini excels at massive context and multimodal. Dump entire codebases in one request.
- DeepSeek-R1 shows its reasoning, which helps you debug wrong answers. Cheapest thinking model available.
- Local models need the right quantization. Q4 is usually good enough for everything except complex reasoning.
- Learn the strengths of each model and route tasks accordingly instead of using one model for everything.