04 - Scaling and Bottlenecks

Vertical vs Horizontal Scaling

Every system hits a limit. When your single server can't handle the load, you have two choices.

Vertical scaling means getting a bigger machine. More CPU, more RAM, more disk. It's simple — no code changes needed. But there's a ceiling. You can't buy a server with 10,000 cores.

Horizontal scaling means adding more machines. Instead of one powerful server, you run 50 smaller ones. This is harder (your app needs to handle distributed state), but there's no fixed hardware ceiling. Need more capacity? Add more boxes.

Most large production systems scale horizontally. It's usually cheaper at scale, and it gives you redundancy as a side effect: if one machine dies, the others keep running.
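
As a rough sketch of the capacity math behind "add more boxes", the function below estimates how many instances a given peak load needs. The function name, the target utilization (headroom), and the spare instance are illustrative assumptions, not figures from this lesson. Real per-instance throughput has to be measured.

```python
import math


def instances_needed(peak_rps, per_instance_rps, target_utilization=0.5, spare=1):
    """Estimate the instance count for a peak load (hypothetical helper).

    peak_rps: expected peak requests per second across the whole system.
    per_instance_rps: measured requests per second one instance can sustain.
    target_utilization: fraction of that capacity we are willing to use (assumption).
    spare: extra instances kept so one failure does not cause overload (assumption).
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(peak_rps / usable_rps) + spare
```

With a peak of 10,000 requests per second, 500 requests per second per instance, and 50% target utilization, this gives 40 instances plus one spare, so 41.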

Identifying Bottlenecks

A bottleneck is the part of your system that limits overall throughput. Everything else waits for it.

Common bottlenecks:

CPU — computation-heavy tasks (video encoding, ML inference)
Memory — large datasets held in RAM (caches, in-memory databases)
Disk I/O — reading/writing to storage (database queries, file uploads)
Network — data transfer between services or to clients (API calls, streaming)
Database — the most common bottleneck in monolithic applications
Service dependencies — in microservices, the most common bottleneck. Your service calls another service, which calls another. If any service in the chain is slow or at capacity, everything upstream blocks. Three services chained at 100ms each means 300ms minimum, and if one hits its limit, your service queues up waiting

The fix depends on what's bottlenecked. CPU-bound? Add more compute nodes. Database-bound? Add read replicas or caching. Network-bound? Use CDNs or compress payloads. Blocked on downstream services? Add timeouts, circuit breakers (mechanisms that stop calling a failing service — covered in Lesson 21), or cache their responses.
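
A minimal sketch of two of those fixes for a slow downstream service: a timeout on the call, plus a fallback to the last good response. The helper name call_with_timeout, the fetch callable, and the in-process dictionary used as the cache are all assumptions for illustration. A real service would more likely use its HTTP client's own timeout setting and a shared cache.

```python
import concurrent.futures

# Small worker pool used only to run downstream calls with a deadline.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

# Last successful response per key (hypothetical in-process cache).
_last_good = {}


def call_with_timeout(key, fetch, timeout_s):
    """Run fetch() with a deadline and fall back to the last good value.

    Returns (value, source), where source is "live" or "stale-cache".
    Re-raises the timeout if nothing is cached for this key.
    """
    future = _pool.submit(fetch)
    try:
        value = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker thread keeps running; we only stop waiting for it.
        if key in _last_good:
            return _last_good[key], "stale-cache"
        raise
    _last_good[key] = value
    return value, "live"
```

The point of the deadline is that the caller stops blocking after timeout_s instead of queuing behind a slow dependency. The stale value is only acceptable for data that tolerates being slightly out of date.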

A Concrete Example

You launch a product page. It works fine with 100 users. Then a blog post goes viral and you get 10,000 concurrent users. The page takes 8 seconds to load. Where's the bottleneck?

Step 1: Check the application server. CPU is at 30%, memory is fine, disk I/O is low, network bandwidth isn't saturated. The app server isn't the bottleneck.

Step 2: Check the database. Connection pool is maxed out. Queries are queuing. Average query time jumped from 5ms to 2 seconds. Found it.

Step 3: Fix it. Short-term: add a cache (Redis) in front of the database for product data that doesn't change often. 90% of requests now hit the cache and never reach the database.

Step 4: Plan for next time. Add read replicas so the remaining 10% of queries are spread across multiple database instances.
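
The cache from Step 3 can be sketched as a cache-aside lookup: check the cache first, and only query the database on a miss. This is a simplified in-memory stand-in, not a Redis client. The class name ProductCache, the load_from_db callable, and the 60-second TTL are assumptions for illustration.

```python
import time


class ProductCache:
    """Cache-aside lookup with a time-to-live (in-memory stand-in for Redis)."""

    def __init__(self, load_from_db, ttl_s=60.0, clock=time.monotonic):
        self._load = load_from_db   # hypothetical function that queries the database
        self._ttl = ttl_s           # how long a cached product stays fresh (assumption)
        self._clock = clock         # injectable so expiry can be tested without sleeping
        self._entries = {}          # product_id -> (expires_at, value)

    def get(self, product_id):
        now = self._clock()
        entry = self._entries.get(product_id)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit: no database query
        value = self._load(product_id)           # cache miss: go to the database
        self._entries[product_id] = (now + self._ttl, value)
        return value
```

Repeated requests for the same product within the TTL reach the database once. That is the effect the example relies on when most traffic is for product data that rarely changes.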

The lesson: don't guess. Measure each layer, find the actual bottleneck, then fix that specific layer.
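
One hedged way to "measure each layer" from inside the application is to time each layer separately and compare the averages. The names timed, timings, and slowest_layer are hypothetical. In practice this job usually belongs to a metrics or tracing system rather than hand-rolled timers.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# layer name -> list of observed durations in seconds
timings = defaultdict(list)


@contextmanager
def timed(layer):
    """Record how long the wrapped block takes under the given layer name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[layer].append(time.perf_counter() - start)


def slowest_layer(samples):
    """Return the layer whose average duration is highest."""
    return max(samples, key=lambda layer: sum(samples[layer]) / len(samples[layer]))
```

Wrapping the database call in "with timed('db'):" and the template rendering in "with timed('render'):" would, under the scenario above, show the database layer dominating.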

Throughput vs Latency

Two metrics that matter for every system:

Throughput is how many requests your system handles per unit of time. "10,000 requests per second."

Latency is how long a single request takes. "p99 latency is 200ms" means 99% of requests complete within 200ms.
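
To make the p99 definition concrete, here is a small sketch that computes a percentile from recorded latencies using the nearest-rank method. The function name and the sample data are assumptions. Monitoring tools may use slightly different interpolation rules.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 98 fast requests, one slow, one very slow.
latencies_ms = [10] * 98 + [200, 3000]
p99 = percentile(latencies_ms, 99)   # 200
```

Here p99 is 200ms even though the slowest request took 3000ms, which is why percentiles are reported alongside (or instead of) the maximum and the average.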

They're related but not the same. You can have high throughput with high latency (batch processing). Or low latency with low throughput (a single fast server that can't handle many concurrent users).

In system design, you usually optimize for one while keeping the other within acceptable bounds.

Stateless vs Stateful Services

A stateless service doesn't remember anything between requests. Every request contains all the information needed to process it. These are easy to scale horizontally — just add more instances behind a load balancer.

A stateful service stores data between requests (sessions, connections, caches). These are harder to scale because you need to route users to the same instance, or share state across instances.

The pattern: keep your application servers stateless. Push state into dedicated stores (databases, Redis, object storage). This lets you scale the compute layer independently from the storage layer.

What about user sessions? Store them in Redis, not in server memory. Any server can then handle any user's request by reading the session from Redis (typically a sub-millisecond lookup on a local network). If you store sessions on the server itself, you need sticky routing (same user always hits the same server), and if that server dies, the session is gone.
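
The sketch below illustrates the pattern with two stateless application servers sharing one session store. SessionStore is an in-memory stand-in for Redis, and the class names, the cart field, and the round-robin rotation are assumptions for illustration only.

```python
import itertools


class SessionStore:
    """In-memory stand-in for a shared store such as Redis."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def set(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """Stateless server: keeps no per-user data, only a reference to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def handle(self, session_id, item=None):
        # Load the session from the shared store, or start an empty one.
        session = self._store.get(session_id) or {"cart": []}
        if item is not None:
            session["cart"].append(item)
        self._store.set(session_id, session)
        return {"served_by": self.name, "cart": list(session["cart"])}


store = SessionStore()
servers = [AppServer("app-1", store), AppServer("app-2", store)]
round_robin = itertools.cycle(servers)   # simplest possible load balancer

first = next(round_robin).handle("session-abc", "book")
second = next(round_robin).handle("session-abc", "pen")
```

The second request lands on a different server yet still sees the item added by the first, because the cart lives in the store rather than in either server's memory. Removing one server from the list loses no session data.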

Key Takeaways

Vertical scaling is simpler but has a hard ceiling; horizontal scaling is harder but has no fixed ceiling
Find the bottleneck before optimizing — don't guess
Throughput and latency are related but distinct; optimize for one while keeping the other within acceptable bounds
Stateless services scale easily; push state into dedicated stores