05. Model-Specific Prompting
Why It Matters
The universal principles from lesson 04 get you 80% of the way. But each model family has quirks, strengths, and specific features that let you get better results when you know how to use them.
This lesson covers the tricks that only work (or work best) on specific models.
GPT (OpenAI)
Structured Output
GPT's biggest edge is structured output. You can force it to return valid JSON matching a schema every single time. No parsing errors, no "here's the JSON wrapped in markdown."
Respond in JSON with this exact schema:
{
"summary": "string",
"sentiment": "positive | negative | neutral",
"confidence": "number between 0 and 1"
}

GPT-4o follows JSON schemas more reliably than any other model. If your task needs structured data extraction, GPT is the default choice.
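Here is what that looks like at the API level. A minimal sketch using the openai Python SDK (v1+); the model name, schema wording, and example input are illustrative:

```python
# Minimal structured-output sketch with the openai SDK.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    # "json_object" guarantees syntactically valid JSON; the prompt
    # still has to spell out the schema you want back.
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "Return JSON with keys: summary (string), "
            "sentiment (positive | negative | neutral), "
            "confidence (number between 0 and 1).",
        },
        {"role": "user", "content": "The release went smoothly and users are happy."},
    ],
)

data = json.loads(response.choices[0].message.content)
print(data["sentiment"], data["confidence"])
```

If you need the schema itself enforced, not just valid JSON, recent API versions also accept a json_schema response format with strict mode; check the current OpenAI docs for the exact shape.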
Function Calling
GPT models support function calling natively. You describe available functions (name, parameters, types) in your API request, and the model decides when to call them and with what arguments. The model doesn't execute anything — it returns a structured JSON object saying "call this function with these values," and your code handles the actual execution. This is how most GPT-based agents work under the hood.
{
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {
"city": { "type": "string" },
"unit": { "enum": ["celsius", "fahrenheit"] }
}
}
}

The model won't hallucinate function names or parameters if you define them clearly.
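A minimal round trip in the openai Python SDK looks like this; get_weather is a stand-in for whatever your application actually exposes:

```python
# Function-calling sketch: the model picks the function and arguments,
# your code does the execution.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

# A structured request comes back instead of prose.
call = response.choices[0].message.tool_calls[0]
print(call.function.name)                   # get_weather
print(json.loads(call.function.arguments))  # {"city": "Oslo", ...}
```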
GPT Tips
- Use the response_format parameter to enforce JSON output at the API level
- GPT-4o-mini follows the same structured output format as GPT-4o, so prototype with mini first
- For classification tasks, give GPT a numbered list of categories and ask it to return just the number
Claude (Anthropic)
Instruction Following
Claude's strength is doing exactly what you tell it. If you say "respond in exactly 3 bullet points," it will. If you say "do not mention X," it won't. This makes Claude excellent for tasks where format precision matters.
XML Tags
Claude responds particularly well to XML-style tags for structuring input. This is an Anthropic-specific trick that helps Claude parse complex prompts with multiple sections.
<context>
You are reviewing a pull request for a Go microservice.
The service handles payment processing.
</context>
<code>
func processPayment(amount float64) error {
// ... code here
}
</code>
<task>
Find security vulnerabilities in this code.
Return each issue as a bullet point with severity (high/medium/low).
</task>

Claude treats content inside XML tags as distinct sections. This reduces confusion when your prompt has multiple parts that could bleed into each other.
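On the API side this is plain string assembly. A sketch with the anthropic Python SDK; the model name is illustrative and the code under review is truncated:

```python
# XML-tagged prompt sketch for Claude.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

code_under_review = "func processPayment(amount float64) error { /* ... */ }"

prompt = f"""<context>
You are reviewing a pull request for a Go microservice.
The service handles payment processing.
</context>

<code>
{code_under_review}
</code>

<task>
Find security vulnerabilities in this code.
Return each issue as a bullet point with severity (high/medium/low).
</task>"""

message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```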
Long Document Handling
Claude handles long context well (200K tokens). When working with long documents, put the document first and your question last. Claude pays more attention to the end of the prompt.
Here is the full contract:
[... 50 pages of legal text ...]
Based on the above, what are the termination clauses?

Claude Tips
- Put instructions at the end of the prompt when working with long documents
- Use XML tags to separate different parts of complex prompts
- Claude is conservative by default. If you want creative or bold output, say so explicitly
- Claude tends to add more comments and explanations in code than GPT does. Add "no comments in the code" if you want clean output
Gemini (Google)
Massive Context
Gemini 2.5 Pro handles 1 million tokens. That's roughly 700,000 words or an entire codebase. The key is knowing how to use that much context effectively.
Here is my entire project (47 files):
[... all source code ...]
Questions:
1. What does the auth middleware do?
2. Where is the database connection configured?
3. Are there any unused imports across the project?

You can dump everything in and ask multiple questions in one shot. With other models you'd need to split this into multiple requests.
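A sketch of how you might assemble that kind of request with the google-genai Python SDK; the file-matching pattern and model name are assumptions for illustration:

```python
# Pack a whole project into one Gemini request, file tree first.
from pathlib import Path
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

root = Path("my-project")
files = sorted(root.rglob("*.go"))  # adjust the glob to your codebase

# A file tree up front helps the model navigate the huge context.
tree = "\n".join(str(p.relative_to(root)) for p in files)
sources = "\n\n".join(
    f"--- {p.relative_to(root)} ---\n{p.read_text()}" for p in files
)

prompt = (
    f"Here is my project file tree:\n{tree}\n\n"
    f"Full sources:\n{sources}\n\n"
    "Questions:\n"
    "1. What does the auth middleware do?\n"
    "2. Where is the database connection configured?\n"
    "3. Are there any unused imports across the project?\n"
)

response = client.models.generate_content(model="gemini-2.5-pro", contents=prompt)
print(response.text)
```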
Multimodal
Gemini handles images natively. You can pass screenshots, diagrams, or photos alongside text.
[image: screenshot of error in browser]
What is causing this error? Here is the relevant component code:
[... code ...]

Gemini Tips
- For large codebases, include a file tree at the top so Gemini can navigate the context
- Gemini sometimes gives longer responses than needed. Add "be concise" or set a word limit
- When using multimodal, describe what the image shows if the model seems to miss details (see the sketch after this list)
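Putting the multimodal pattern into code: a sketch passing an image alongside text with the google-genai SDK, which accepts PIL images directly in contents. The file names here are hypothetical:

```python
# Multimodal sketch: screenshot plus component code in one request.
from pathlib import Path
from PIL import Image
from google import genai

client = genai.Client()

screenshot = Image.open("error_screenshot.png")    # hypothetical file
component = Path("src/LoginForm.tsx").read_text()  # hypothetical path

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        screenshot,
        "What is causing this error? Here is the relevant component code:\n"
        + component,
    ],
)
print(response.text)
```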
DeepSeek
Visible Reasoning
DeepSeek-R1 shows its thinking process. You can read the reasoning chain, which is useful for understanding why the model reached a specific conclusion and whether it went down the wrong path.
User: Is this function thread-safe? [code]
DeepSeek-R1 thinking:
"Let me check for shared state... The counter variable is accessed
without a mutex... Multiple goroutines could increment simultaneously...
This is a race condition."
Answer: No, this function is not thread-safe. The counter variable
is accessed without synchronization...

If the model gives a wrong answer, you can read the thinking and see exactly where the reasoning broke down.
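DeepSeek's cloud API is OpenAI-compatible, so the openai SDK works with a swapped base_url. Per DeepSeek's API docs, deepseek-reasoner returns the thinking separately as reasoning_content; a sketch:

```python
# Read R1's thinking and its answer as separate fields.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="...",  # your DeepSeek API key
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Is this function thread-safe? [code]"}],
)

message = response.choices[0].message
print(message.reasoning_content)  # the visible chain of thought
print(message.content)            # the final answer
```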
Cost Advantage
DeepSeek-V3 is one of the cheapest cloud APIs available. For tasks where you need decent quality at high volume, it's worth testing against GPT-4o-mini to see which gives you better results per dollar.
DeepSeek Tips
- R1's visible thinking helps you debug bad outputs. You can see exactly where the reasoning went wrong
- DeepSeek models are strong at math and code but can be weaker at nuanced English writing
- For local use, deepseek-r1:14b is a good balance of quality and speed on consumer hardware
Llama and Qwen (Local Models)
Prompt Format Matters
Local models are sensitive to prompt format. Each model family expects a specific chat template. Ollama handles this automatically, but if you're using raw inference, you need the right format.
Llama 3 format:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is 2+2?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

If you use Ollama, you don't need to worry about this. Ollama applies the correct template automatically.
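If you are doing raw inference with Hugging Face transformers, the tokenizer can apply the template for you instead of hand-writing the special tokens. A sketch (the model ID is the gated official Llama 3 8B Instruct repo):

```python
# Let the tokenizer build the chat-formatted prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string
    add_generation_prompt=True,  # end with the assistant header
)
print(prompt)  # the <|begin_of_text|>... format shown above
```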
Quantization Tradeoffs
Local models come in different quantization levels (Q4, Q5, Q8, F16). Fewer bits per weight means a smaller file and faster inference, but slightly lower quality.
| Quantization | Size (7B model) | Quality | Speed |
|---|---|---|---|
| Q4_K_M | ~4 GB | Good for most tasks | Fast |
| Q5_K_M | ~5 GB | Slightly better | Moderate |
| Q8_0 | ~7 GB | Near full quality | Slower |
| F16 | ~14 GB | Full quality | Slowest |
For most tasks, Q4_K_M is the sweet spot. You only notice quality loss on complex reasoning tasks.
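With Ollama you pick the quantization through the model tag. A sketch with the ollama Python package; the exact tag is an assumption, so check the model library for the tags that actually exist:

```python
# Run a Q4-quantized model deterministically for a classification task.
import ollama

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",  # assumed Q4_K_M build of an 8B model
    messages=[{"role": "user", "content": "Classify this ticket as billing, "
               "bug, or feature request: 'Refund not received after 10 days.'"}],
    options={"temperature": 0},  # deterministic output for classification
)
print(response["message"]["content"])
```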
Local Model Tips
- Start with Q4 quantization. Only upgrade if you see quality issues on your specific task
- Qwen 2.5 7B performs surprisingly well for code tasks relative to its size
- For simple extraction and classification, even a 3B model works fine
- Set temperature to 0 for deterministic output on classification tasks
Key Takeaways
- GPT excels at structured output and function calling. Use it when you need reliable JSON.
- Claude excels at instruction following and long documents. Use XML tags for complex prompts.
- Gemini excels at massive context and multimodal. Dump entire codebases in one request.
- DeepSeek-R1 shows its reasoning, which helps you debug wrong answers. Cheapest thinking model available.
- Local models need the right quantization. Q4 is usually good enough for everything except complex reasoning.
- Learn the strengths of each model and route tasks accordingly instead of using one model for everything.