10 - Indexing and Denormalization

Why Indexing Matters

Without an index, the database scans every row to find matches. This is called a full table scan. With 10 rows, that's fine. With 10 million rows, it takes seconds.

An index is a separate data structure that maps column values to row locations. Think of it like a book's index: instead of reading every page to find "caching," you look up "caching" in the back and jump to page 47.

B-Tree Indexes

The default index type in PostgreSQL and MySQL. B-trees keep data sorted, which makes them good for:

Equality queries: WHERE email = '[email protected]'
Range queries: WHERE created_at > '2026-01-01'
Sorting: ORDER BY created_at DESC

The database walks the tree from root to leaf, narrowing down candidates at each level. A table with 10 million rows needs roughly 4 tree levels, so any lookup takes about 4 disk reads instead of millions.

Hash Indexes

Hash indexes compute a hash of the column value and store the row location. They're faster than B-trees for exact equality lookups but can't handle range queries or sorting.

Use case: looking up a user by their exact API key or session token. You only ever query by exact match, never by range, so a hash index is ideal.

Most databases default to B-tree because it's more versatile. Use hash indexes only when you're certain you'll never need range queries on that column.

Composite Indexes

An index on multiple columns. The order matters.

An index on (country, city) helps these queries:

WHERE country = 'US' ✅
WHERE country = 'US' AND city = 'Austin' ✅
WHERE city = 'Austin' ❌ (can't skip the first column — the query will still work, the database will happily scan every row to find your answer, it just won't use the index to get there faster)

Think of it like a phone book sorted by last name, then first name. You can find all "Smiths" easily, or "John Smith" specifically, but finding all "Johns" regardless of last name requires scanning everything.

Rule of thumb: put columns used in equality conditions first, then columns used in range conditions. Among equality columns, the most selective one (the one that narrows results the most) first gives the best performance.

The Cost of Indexes

Every index slows down writes. When you insert or update a row, every index on that table must also be updated. More indexes means slower writes.

Indexes also consume disk space. A table with 5 indexes might use more disk for indexes than for the actual data.

The balance:

Index columns you filter or sort by frequently
Don't index columns you rarely query
Monitor slow queries and add indexes where they help
Remove unused indexes (most databases can tell you which indexes are never used)

Denormalization

Normalization splits data across tables to eliminate duplication. A user's name lives in the users table only. The orders table references it via user_id. Clean, but every read requires a join.

Denormalization duplicates data to avoid joins. Store the user's name directly in the orders table. Now reading an order is a single table lookup, no join needed.

Normalized:
  orders: [order_id, user_id, total]
  users:  [user_id, name, email]
  → SELECT orders.*, users.name FROM orders JOIN users ...

Denormalized:
  orders: [order_id, user_id, user_name, total]
  → SELECT * FROM orders WHERE order_id = ...

When to Denormalize

Denormalize when:

Read performance matters more than write complexity
You have a high read-to-write ratio (100:1 or more)
Joins are becoming a bottleneck
You're using a NoSQL database that doesn't support joins

Don't denormalize when:

The duplicated data changes frequently (you'll need to update it everywhere)
Consistency is critical (duplicated data can drift out of sync)
Your dataset is small enough that joins are fast

Most read-heavy web applications denormalize aggressively. Social media feeds, product catalogs, and dashboards all benefit from pre-joined data.

Materialized Views

A middle ground between normalization and denormalization. A materialized view is a precomputed query result stored as a table. The database refreshes it on a schedule or on demand.

You keep your normalized source tables (clean writes) but read from the materialized view (fast reads). The tradeoff: data in the view may be slightly stale between refreshes.

Key Takeaways

Indexes speed up reads but slow down writes and use disk space
B-tree is the default; use hash only for exact-match lookups
Composite index column order matters — equality columns first, then range columns
Denormalization trades write complexity for read performance
Materialized views give you fast reads without fully denormalizing your source data
Don't index everything; don't denormalize everything. Measure first.