07. Context Management


The Problem

Every model has a context window: the maximum amount of text it can see at once. Everything in that window (system prompt, conversation history, your current message, and the model's response) has to fit.

When the context fills up, the request either fails with an error or the oldest messages get truncated, so the model effectively forgets earlier parts of the conversation. Managing context well means getting better results while using fewer tokens.

Context Window Sizes (2025)

| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| GPT-4o-mini | 128K tokens |
| Claude Sonnet | 200K tokens |
| Gemini 2.5 Pro | 1M tokens |
| Llama 3.1 8B | 128K tokens |
| Qwen 2.5 7B | 32K tokens |
| DeepSeek-V3 | 128K tokens |

One token is roughly 4 characters or 3/4 of a word in English. So 128K tokens is about 96,000 words, and 1M tokens is about 750,000 words.
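That rule of thumb is enough for quick budgeting. Here's a minimal sketch of a character-based estimator; real tokenizers (tiktoken, SentencePiece) vary by model and language, so use one of those when you need an exact count:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb.

    This is an approximation for budgeting only; a model's actual tokenizer
    can differ significantly, especially for code and non-English text.
    """
    return max(1, len(text) // 4)


print(estimate_tokens("Hello, world!"))  # 13 characters -> ~3 tokens
```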

What Eats Your Context

In a typical API call, context is consumed by:

System prompt:        ~500 tokens (your instructions)
Conversation history: ~2,000 to 20,000 tokens (previous messages)
Current input:        ~100 to 5,000 tokens (what you're asking now)
Model response:       ~500 to 4,000 tokens (the answer)
─────────────────────────────────────
Total:                Varies, but fills up fast in long conversations

Note: ~500 tokens is typical for a system prompt you write for your own API calls. AI coding agents (Cursor, Claude Code, etc.) use much larger system prompts — often 4,000 to 10,000+ tokens — because they include tool definitions, rules, and detailed behavioral instructions.

The conversation history is usually the biggest consumer. Every message you've sent and every response the model gave stays in context until you clear it or it gets truncated.

Strategies for Managing Context

1. Trim Conversation History

Don't send the entire conversation every time. Keep only what's relevant.

Strategy: Sliding window
- Keep the system prompt (always)
- Keep the last N messages (e.g., last 10 turns)
- Drop everything older
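A sliding window takes only a few lines. This sketch assumes OpenAI-style message dicts (`{"role": ..., "content": ...}`); adapt it to whatever format your client library uses:

```python
def sliding_window(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep the system prompt (always) plus the last `max_turns` messages.

    Everything older than the window is dropped before the next API call.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```

Call this on your message list right before each request, rather than mutating the stored history, so the full transcript stays available for logging or later summarization.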

Strategy: Summarize and compress
- After every 10 turns, summarize the conversation so far
- Replace the full history with the summary
- Continue with summary + recent messages
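Summarize-and-compress can be sketched the same way. The `summarize` callable below is a placeholder for an LLM call (it's injected as a parameter here so the trimming logic is testable on its own); the threshold and window sizes are illustrative, not prescribed:

```python
def compress_history(messages: list[dict], summarize, every: int = 10) -> list[dict]:
    """Replace older history with a summary once it exceeds `every` messages.

    `summarize` is any callable mapping a transcript string to a short
    summary -- in practice, an LLM call. The system prompt and the most
    recent `every` messages are kept verbatim.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= every:
        return messages  # nothing to compress yet
    old, recent = rest[:-every], rest[-every:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summarize(transcript)}",
    }
    return system + [summary] + recent
```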

2. Be Selective About What You Include

Don't paste entire files when you only need a function. Don't include background context the model doesn't need for the current question.

❌ "Here's my entire 3000-line file. What does line 47 do?"
✅ "Here's lines 40 to 55. What does the processOrder function do?"

3. Use Retrieval Instead of Stuffing

Instead of putting everything in context, retrieve only what's relevant for each query. This is what RAG (Retrieval-Augmented Generation) does:

User asks: "How do I configure the database?"

Instead of: Stuffing all documentation into context
Do: Search your docs for "database configuration",
    retrieve the 2 to 3 most relevant sections,
    include only those in context
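The core idea can be shown with a toy keyword-overlap retriever. Production RAG systems use embeddings and a vector index instead, but the shape is the same: score every section against the query, keep the top few. The example docs below are made up for illustration:

```python
import re


def retrieve(query: str, sections: list[str], k: int = 3) -> list[str]:
    """Return the k sections sharing the most words with the query.

    A toy stand-in for real retrieval (embeddings + vector search);
    the ranking structure -- score all, keep top k -- is what carries over.
    """
    terms = set(re.findall(r"\w+", query.lower()))
    return sorted(
        sections,
        key=lambda s: len(terms & set(re.findall(r"\w+", s.lower()))),
        reverse=True,
    )[:k]


docs = [
    "Database configuration: set DB_HOST and DB_PORT in .env",
    "Deploying to production with Docker",
    "Configuring database connection pooling",
]
print(retrieve("How do I configure the database?", docs, k=2))
# -> the two database-related sections, not the Docker one
```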

4. Structure Context with Clear Boundaries

When you do include multiple pieces of context, label them clearly so the model knows what's what and doesn't confuse one section with another.

## Current File: src/auth/middleware.go
[code here]

## Error Log:
[error output here]

## Question:
Why is the middleware returning 401 for valid tokens?
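A small helper can enforce this structure so every request is labeled consistently. This is a sketch using the `## ` header style from the example above; the labels are whatever you choose:

```python
def build_context(sections: dict[str, str], question: str) -> str:
    """Join labeled context sections under clear '## ' headers, question last."""
    parts = [f"## {label}\n{content}" for label, content in sections.items()]
    parts.append(f"## Question\n{question}")
    return "\n\n".join(parts)


prompt = build_context(
    {
        "Current File: src/auth/middleware.go": "[code here]",
        "Error Log": "[error output here]",
    },
    "Why is the middleware returning 401 for valid tokens?",
)
print(prompt)
```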

The Attention Problem

Even within the context window, models don't pay equal attention to everything. Research shows that attention is strongest at:

  1. The beginning of the context (system prompt)
  2. The end of the context (your current question)

The middle of the context receives noticeably less attention.

This is called the "lost in the middle" problem. If you bury important information in the middle of a long context, the model is more likely to miss it or give it less weight.

What this means in practice: Put the most important information at the beginning or end. Put reference material in the middle.

Good structure for long context:
1. System prompt (beginning, high attention)
2. Reference material / documents (middle, lower attention)
3. Your specific question (end, high attention)
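The three-part structure above maps directly onto a chat request. A minimal sketch, assuming the OpenAI-style messages format (the document labels are an illustrative choice):

```python
def assemble_messages(system_prompt: str, documents: list[str], question: str) -> list[dict]:
    """Order content to match the attention pattern: system prompt first
    (high attention), reference documents in the middle (lower attention),
    and the specific question last (high attention).
    """
    reference = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{reference}\n\nQuestion: {question}"},
    ]
```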

Context in Multi-Turn Conversations

In a chat interface, every turn adds to the context. A 20-message conversation might use 15,000 tokens of context just for history. Here's how to manage it:

Start fresh when the topic changes. If you were discussing database design and now want to talk about CSS, start a new conversation. The old context is just noise that dilutes the model's focus.

Summarize when conversations get long. After 15 to 20 turns, the model's quality degrades because it's juggling too much information. Ask it to summarize the key decisions so far, then start a new conversation with that summary.

Front-load context in the first message. Give the model everything it needs upfront rather than drip-feeding information across multiple turns.

❌ Turn 1: "I have a bug"
   Turn 2: "It's in Go"
   Turn 3: "Here's the function"
   Turn 4: "The error is..."

✅ Turn 1: "I have a bug in this Go function. Here's the code and the error:
   [code] [error]. Expected behavior: [X]. Actual behavior: [Y]."

Context for Agents

AI agents (Kiro, Cursor, etc.) manage context automatically, but understanding how helps you work with them better:

  • File context: The agent reads files into context. More files means less room for conversation.
  • Tool results: When an agent runs a command or reads a file, the output goes into context.
  • Conversation history: Your previous messages and the agent's responses accumulate.

When an agent starts giving worse answers deep into a session, it's usually a context problem. The window is full of old information and there's less room for the model to reason about your current question.

Fix: Start a new session, or be explicit about what context matters for your current question.

Key Takeaways

  • Every model has a finite context window. Everything (system prompt, history, input, output) must fit.
  • Conversation history is the biggest context consumer. Trim or summarize it regularly.
  • Models pay most attention to the beginning and end of context. Put important info there.
  • Don't paste entire files. Include only what's relevant to the current question.
  • Use retrieval (RAG) instead of stuffing all possible context into every request.
  • Start fresh conversations when the topic changes. Old context is noise.
  • When agents give worse answers over time, it's usually a context problem. Start a new session.
