01 - What Are LLMs
A Prediction Machine
A Large Language Model is a program that predicts the next word. You give it "The capital of France is" and it predicts "Paris." That's it. Every chat response, every code suggestion, every essay is just next-word prediction running in a loop.
Input: "The capital of France is"
Output: "Paris"
Input: "Write a function that reverses a string"
Output: "func reverse(s string) string { ... }"
The model doesn't "know" anything. It learned statistical patterns from massive amounts of text during training. Correct answers come from matching patterns in the training data; hallucinations come from plausible but wrong pattern completions.
How They Generate Text
The model doesn't produce the entire response at once. It generates one token at a time, feeds that token back in, and generates the next one. This autoregressive loop continues until the model produces a stop token or hits a length limit.
Step 1: "The best way to learn Go is" → "to"
Step 2: "The best way to learn Go is to" → "write"
Step 3: "The best way to learn Go is to write" → "code"
Step 4: "The best way to learn Go is to write code" → "."
Step 5: "The best way to learn Go is to write code." → [STOP]
At each step, the model considers every possible next token and assigns it a probability: "to" might get 0.35, "by" 0.25, "through" 0.15. The model samples from this distribution, which is why the same prompt can produce different responses.
Training Phases
LLMs go through distinct training phases. Each one shapes different capabilities.
Pre-training is the expensive part. The model reads billions of pages of text and learns language patterns, facts, reasoning structures, and code. This takes weeks on thousands of GPUs and costs millions of dollars. You don't do this yourself.
Fine-tuning takes a pre-trained model and trains it further on a smaller, curated dataset. A base model becomes a chat model through fine-tuning. The model learns to follow instructions, answer questions, and refuse harmful requests.
RLHF (Reinforcement Learning from Human Feedback) is the polish step. Humans rank model outputs from best to worst, and the model learns to prefer the higher-ranked responses. ChatGPT feels helpful and conversational instead of just completing text because of this step.
Base model: "The president of the US" → "is the head of state..."
(continues like Wikipedia)
Chat model: "Who is the president?" → "As of my last update..."
(answers like a helpful assistant)
Models You Can Run Locally
You don't need an API key or a cloud account to work with LLMs. Ollama lets you run models on your own machine.
# Install Ollama
brew install ollama
# Pull a model
ollama pull llama3.2
# Chat with it
ollama run llama3.2
Popular local models:
- llama3.2 (3B) ~2GB. General tasks, fast responses
- llama3.1 (8B) ~4.7GB. Better reasoning, still fast
- deepseek-r1 (7B) ~4.7GB. Code and complex reasoning
- mistral (7B) ~4.1GB. Good all-rounder
- nomic-embed-text ~274MB. Embeddings (lesson 07)
The number in parentheses is the parameter count. More parameters generally means better quality but slower responses and more memory.
Cloud Providers
For production or when you need the best models, cloud providers offer API access.
- OpenAI: GPT-4o, GPT-4o-mini
- Anthropic: Claude Sonnet, Claude Haiku
- Google: Gemini Pro, Gemini Flash
All of them use the same basic interface: send messages in, get text out. The code patterns you learn with Ollama transfer directly to any cloud provider. We'll use Ollama throughout this course, with comments showing the cloud equivalent where relevant.
What LLMs Are Bad At
Understanding what LLMs can't do is as important as knowing what they can.
Math: They predict tokens; they don't compute. "What is 7,391 × 4,208?" will often produce a wrong answer because the model is pattern-matching digits, not calculating.
Recent events: The model's knowledge stops at its training cutoff. It doesn't know what happened yesterday.
Consistency: Ask the same question twice, get different answers. The sampling process is inherently random.
Long documents: Models have a context window (covered in the next lesson). Information in the middle of a long input often gets ignored.
Citing sources: Models generate plausible text, not verified facts. They'll confidently cite papers that don't exist.
Key Takeaways
- LLMs predict the next token, one at a time, in a loop
- Training has three phases: pre-training (language patterns), fine-tuning (instruction following), RLHF (quality polish)
- Ollama lets you run models locally with no API keys or costs
- More parameters generally means better quality but slower and more memory
- LLMs are bad at math, recent events, consistency, and citing real sources
- The same code patterns work for local models (Ollama) and cloud providers (OpenAI, Anthropic)