02 - Tokens and Context Windows

What Is a Token

LLMs don't read characters or words. They read tokens. A token is a chunk of text, usually a common word, part of a word, or a punctuation mark. The word "hello" is one token. The word "unbelievable" might be three tokens: "un", "believ", "able".

Text              Tokens                         Count
"Hello, world!"   "Hello", ",", " world", "!"    4
"Go is great"     "Go", " is", " great"          3
"tokenization"    "token", "ization"             2
"fmt.Println"     "fmt", ".", "Pr", "intln"      4

Tokenization varies by model. GPT-4 and Llama use different tokenizers, so the same text produces different token counts. But the concept is the same everywhere.

Why Tokens Matter

Tokens are the unit of everything in LLM engineering. You pay per token (cloud APIs). You're limited by tokens (context window). The model thinks in tokens, not words.

Measure                  Estimate
1 token                  ~4 characters
1 token                  ~0.75 words
100 tokens               ~75 words
Go function (10 lines)   50-80 tokens
Full page of text        400-500 tokens

Code is more expensive than prose. Special characters, indentation, and short variable names all consume tokens. A 200-line Go file might use 800-1200 tokens.
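
For rough budgeting before you call anything, the 4-characters rule is usually enough. A minimal Go sketch of that heuristic (an estimate only; the real count comes from the model's tokenizer):

package main

import "fmt"

// estimateTokens applies the rough 1 token ≈ 4 characters heuristic.
// Use it for budgeting, not billing; real counts come from the model.
func estimateTokens(text string) int {
    return len(text) / 4
}

func main() {
    prompt := "Explain Go's error handling in one short paragraph."
    fmt.Printf("~%d tokens\n", estimateTokens(prompt))
}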

Counting Tokens with Ollama

Ollama's API returns token counts in every response. You can see exactly how many tokens your prompt and the response used.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is Go?"}],
  "stream": false
}' | jq '{prompt_tokens: .prompt_eval_count, response_tokens: .eval_count}'

The response includes token counts:

{
  "prompt_tokens": 32,
  "response_tokens": 156
}

The prompt tokens include the system prompt (if any) plus your message. A system prompt is an instruction you send with every request to control the model's behavior, like "You are a helpful assistant. Be concise." We'll cover system prompts in detail in lesson 05. The response tokens are what the model generated. Both count toward the context window.

Your numbers will differ. Token counts vary by model, model version, and even Ollama version. The exact numbers don't matter. What matters is that you can see them.
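
Here's the same call from Go using only the standard library, a minimal sketch against a local Ollama at localhost:11434. The struct fields map to the prompt_eval_count and eval_count fields shown above:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// chatResponse picks out only the token-count fields from /api/chat.
type chatResponse struct {
    PromptEvalCount int `json:"prompt_eval_count"` // tokens in the prompt (system prompt + messages)
    EvalCount       int `json:"eval_count"`        // tokens the model generated
}

func main() {
    body := []byte(`{
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "What is Go?"}],
        "stream": false
    }`)

    resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var r chatResponse
    if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
        panic(err)
    }
    fmt.Printf("prompt tokens: %d, response tokens: %d\n", r.PromptEvalCount, r.EvalCount)
}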

The Context Window

Every model has a maximum number of tokens it can handle in a single request. This is the context window. It includes everything: system prompt, conversation history, your message, and the model's response.

Model           Context Window
llama3.2 (3B)   128K tokens
llama3.1 (8B)   128K tokens
GPT-4o          128K tokens
Claude Sonnet   200K tokens
Gemini Pro      1M tokens

These numbers grow with every model release. Check the provider's docs for current limits.

128K tokens sounds like a lot, but it fills up fast in real applications. A chat conversation with 50 back-and-forth messages, each with a paragraph of context, can easily hit 10-20K tokens. A RAG system that injects retrieved documents can use 5-10K tokens per query just for context.

What Happens When You Exceed It

If your input exceeds the context window, the API returns an error. The model can't process it.

Request: 150K tokens → Model limit: 128K tokens → Error

You have three options:
1. Truncate: Cut older messages from the conversation
2. Summarize: Replace old messages with a summary
3. Chunk: Split the input and process it in parts

The model never manages the context window. It just processes whatever you send. Chat apps like ChatGPT and coding agents handle truncation and summarization behind the scenes. When you call the API directly, it's your job.
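
Truncation is the simplest of the three. A minimal sketch of the idea in Go: keep the system prompt and drop the oldest messages until a rough estimate (the 4-characters rule again) fits the budget. A real app would use the actual token counts the API returns:

package main

import "fmt"

type message struct {
    Role    string
    Content string
}

// truncate drops the oldest non-system messages until the rough token
// estimate (1 token ≈ 4 characters) fits within budget.
func truncate(msgs []message, budget int) []message {
    estimate := func(ms []message) int {
        total := 0
        for _, m := range ms {
            total += len(m.Content) / 4
        }
        return total
    }
    for estimate(msgs) > budget && len(msgs) > 1 {
        // msgs[0] is the system prompt; msgs[1] is the oldest chat message.
        msgs = append(msgs[:1], msgs[2:]...)
    }
    return msgs
}

func main() {
    history := []message{
        {Role: "system", Content: "You are a helpful assistant. Be concise."},
        {Role: "user", Content: "First question ..."},
        {Role: "assistant", Content: "First answer ..."},
        {Role: "user", Content: "Latest question ..."},
    }
    fmt.Println(len(truncate(history, 15)), "messages kept")
}

Summarization is structurally the same move: instead of dropping the old messages, you replace them with a single message containing a summary.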

Input vs Output Tokens

Most APIs distinguish between input tokens (what you send) and output tokens (what the model generates). This matters for two reasons.

Cost: Output tokens are typically 3-4x more expensive than input tokens on cloud APIs. A response that generates 1,000 tokens costs more than a prompt that sends 1,000 tokens.

Limits: You can set a maximum on output tokens to control response length and cost.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is Go?"}],
  "stream": false,
  "options": {"num_predict": 500}
}'
# num_predict limits the response to 500 tokens
# OpenAI equivalent: "max_tokens": 500

The Lost-in-the-Middle Problem

Models pay more attention to the beginning and end of the context window. Information buried in the middle often gets ignored. This is a known limitation, not a bug.

Context: [Document A] [Document B] [Document C] [Document D] [Document E]
                                        ↑
                          This one gets less attention

If Document C has the answer, the model might miss it.

This matters for RAG systems (lesson 08). When you inject retrieved documents into the prompt, put the most relevant ones first and last, not in the middle.
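
One way to apply this when you assemble the prompt: if your retriever returns documents sorted most-relevant-first, interleave them so the strongest ones land at the edges of the context. A minimal sketch (the function name and documents are placeholders):

package main

import "fmt"

// orderForContext takes documents sorted most-relevant-first and arranges
// them so the strongest sit at the start and end of the context, pushing
// the weakest toward the middle.
func orderForContext(docs []string) []string {
    out := make([]string, len(docs))
    left, right := 0, len(docs)-1
    for i, d := range docs {
        if i%2 == 0 {
            out[left] = d // even ranks fill from the front
            left++
        } else {
            out[right] = d // odd ranks fill from the back
            right--
        }
    }
    return out
}

func main() {
    ranked := []string{"doc1 (best)", "doc2", "doc3", "doc4", "doc5 (worst)"}
    fmt.Println(orderForContext(ranked))
    // [doc1 (best) doc3 doc5 (worst) doc4 doc2]
}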

Practical Token Budgeting

Every request to an LLM packs multiple things into one context window. You need to know how much space each part takes so you don't run out.

Example: Chat app with RAG (128K window)

Component        Tokens
System prompt    ~500
Retrieved docs   ~3,000
History          ~2,000
User message     ~100
Response         ~1,000
Total            ~6,600

That leaves ~121K tokens of room.
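
A quick sketch of that arithmetic in Go, assuming a 128,000-token window and the component estimates above:

package main

import "fmt"

func main() {
    const window = 128_000 // assumed context window, in tokens

    // Rough per-request budget for the chat-with-RAG example above.
    budget := map[string]int{
        "system prompt":  500,
        "retrieved docs": 3_000,
        "history":        2_000,
        "user message":   100,
        "response":       1_000,
    }

    total := 0
    for _, tokens := range budget {
        total += tokens
    }
    fmt.Printf("used: %d tokens, remaining: %d tokens\n", total, window-total)
    // used: 6600 tokens, remaining: 121400 tokens
}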

That looks comfortable. Now watch what happens when you scale up.

Same app, busier:

Component        Tokens
System prompt    ~500
Retrieved docs   ~12,000
History          ~20,000
User message     ~500
Response         ~2,000
Total            ~35,000

Still fits in 128K, but you're now spending real money on cloud APIs. At $2.50 per million input tokens, that's about $0.08 per request. At 10,000 requests a day, that's roughly $800 a day, or about $25,000 a month, just on input.

The system prompt is the sneaky one. It's small, but it ships with every single request. A 500-token system prompt across 10,000 daily requests is 5 million tokens per day.
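
The arithmetic behind those numbers, as a quick sketch. The $2.50 per million rate is the example price used above, not any particular provider's current pricing:

package main

import "fmt"

func main() {
    const (
        pricePerMillionInput = 2.50   // assumed example rate, USD per million input tokens
        inputTokens          = 33_000 // system prompt + docs + history + user message
        requestsPerDay       = 10_000
    )

    perRequest := float64(inputTokens) / 1_000_000 * pricePerMillionInput
    perDay := perRequest * requestsPerDay
    fmt.Printf("per request: $%.2f, per day: $%.0f, per 30 days: $%.0f\n",
        perRequest, perDay, perDay*30)
    // per request: $0.08, per day: $825, per 30 days: $24750
}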

Key Takeaways

  • Tokens are chunks of text, roughly 4 characters or 0.75 words each
  • Code uses more tokens than prose due to special characters and formatting
  • The context window is the total token limit for input + output combined
  • Output tokens cost more than input tokens on cloud APIs
  • Models pay less attention to information in the middle of long contexts
  • Budget your tokens: system prompt + context + history + response must fit the window
  • Ollama returns token counts in every API response so you can track usage
