19 - Rate Limiting

Why Rate Limit

Without rate limiting, a single user (or bot) can overwhelm your system. Rate limiting protects your services from:

Abuse — bots scraping your data or spamming endpoints
DDoS — distributed attacks flooding your servers
Noisy neighbors — one heavy user degrading performance for everyone else
Cost control — preventing runaway API usage that inflates your bill

Common Algorithms

Token Bucket — imagine a bucket that holds up to 100 tokens. Tokens refill at 10 per second. Each request costs one token. If the bucket has a token, the request goes through; if it is empty, the request is rejected.

The bucket has a maximum capacity: tokens stop accumulating once it is full. A user idle for 10 seconds or 10 days still has at most 100 tokens, so the capacity caps the burst size. The user can fire 100 requests at once (the burst), after which they are limited to the refill rate of 10 per second. To allow smaller bursts, lower the bucket capacity. This matches real user behavior: humans are bursty (load a page, click around, then idle).
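
The token bucket described above can be sketched in a few lines. This is a minimal single-process illustration, not a production implementation; the class name TokenBucket, the allow method, and the injectable clock parameter are assumptions made for the example.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity=100, refill_rate=10.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock              # injectable so the behavior can be tested
        self.tokens = float(capacity)   # assumption: the bucket starts full
        self.last_refill = clock()

    def allow(self, cost=1):
        # Refill lazily based on elapsed time, never beyond capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # bucket empty: reject the request
```

Refilling lazily on each call avoids a background timer: the bucket only needs the current token count and the time of the last refill.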

Leaky Bucket — requests enter a queue and are processed at a fixed rate, like water leaking from a hole in the bottom. If the queue is full, new requests are dropped. The queue absorbs input bursts, but the output rate is always smooth and constant — no bursts reach the downstream service.
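
A rough sketch of the leaky bucket, under the assumption that some external fixed-rate scheduler (a timer or worker loop, not shown) calls leak once per tick. The names LeakyBucket, offer, and leak are hypothetical.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue drained one request per tick by a fixed-rate scheduler."""

    def __init__(self, queue_size):
        self.queue = deque()
        self.queue_size = queue_size

    def offer(self, request):
        # A full queue means the request is dropped.
        if len(self.queue) >= self.queue_size:
            return False
        self.queue.append(request)
        return True

    def leak(self):
        # Called once per tick; releases at most one request, so the output
        # rate never exceeds the tick rate however bursty the input was.
        return self.queue.popleft() if self.queue else None
```

For example, with queue_size=5 a burst of 7 requests leaves 5 queued and 2 dropped, and the 5 queued requests then leave one per tick.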

Use token bucket when you want to allow bursts (most APIs). Use leaky bucket when you need a smooth, predictable output rate (e.g., sending requests to a third-party API that can't handle spikes).

Fixed Window — divide time into windows (e.g., 1-minute blocks) and count requests per window, with a limit such as 100 per window. It is simple to implement (one counter per window in Redis), but it has a boundary problem: a user sends 100 requests at 0:59 and 100 more at 1:00. Each window sees 100 (within the limit), but the user effectively sent 200 requests within a couple of seconds, twice the intended limit.
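
A minimal in-memory sketch of the fixed window counter. The class name FixedWindowLimiter and its parameters are assumptions for illustration, and old windows are never evicted here, which a real implementation would need to handle (for example with a TTL).

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per (key, window index) and rejects past `limit`."""

    def __init__(self, limit=100, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key):
        # All timestamps inside the same block map to the same window index.
        window = int(self.clock() // self.window_seconds)
        bucket = (key, window)
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True
```

Driving this with a fake clock reproduces the boundary problem: 100 requests are accepted at second 59 and another 100 at second 60, because the two instants fall into different windows.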

Sliding Window Log — store the timestamp of every request. To check the limit, count timestamps in the last 60 seconds. No boundary problem (it's always "the last 60 seconds," not "this minute"). But storing every timestamp uses a lot of memory at high traffic.
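
The log approach might look like the following sketch for a single client. SlidingWindowLog and its parameters are hypothetical names; a per-user limiter would keep one such log per key.

```python
import time
from collections import deque


class SlidingWindowLog:
    """Stores one timestamp per accepted request and counts the last window."""

    def __init__(self, limit=100, window_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.timestamps = deque()  # oldest first

    def allow(self):
        now = self.clock()
        cutoff = now - self.window_seconds
        # Drop timestamps that have slid out of the window.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True
```

The memory cost is visible in the structure: the deque can hold up to `limit` timestamps for every client being tracked.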

Sliding Window Counter — a hybrid. Keep counters for the current and previous window, and estimate the count over the sliding window as a weighted sum: previous_window_count × overlap_percentage + current_window_count, where overlap_percentage is the fraction of the previous window that the sliding window still covers. The estimate assumes the previous window's requests were spread evenly, so it is almost as accurate as the log but uses only two counters instead of thousands of timestamps. Example: the previous minute had 80 requests, we're 30 seconds into the current minute (50% overlap), and the current minute has 40 requests → estimated count = 80 × 0.5 + 40 = 80.
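
A sketch of the two-counter estimate, assuming a single client and windows aligned to multiples of window_seconds. The class name SlidingWindowCounter and the estimate and allow methods are illustrative, not taken from any particular library.

```python
import time


class SlidingWindowCounter:
    """Approximates a sliding window with a previous and a current counter."""

    def __init__(self, limit=100, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.window = int(clock() // window_seconds)  # current window index
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now):
        window = int(now // self.window_seconds)
        if window == self.window + 1:
            # Entered the next window: current becomes previous.
            self.previous_count = self.current_count
            self.current_count = 0
        elif window > self.window + 1:
            # Idle for a full window or more: nothing overlaps any longer.
            self.previous_count = 0
            self.current_count = 0
        self.window = window

    def estimate(self):
        now = self.clock()
        self._roll(now)
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        overlap = 1.0 - elapsed_fraction  # share of previous window still covered
        return self.previous_count * overlap + self.current_count

    def allow(self):
        if self.estimate() + 1 > self.limit:
            return False
        self.current_count += 1
        return True
```

With 80 requests in the previous minute and 40 so far at 30 seconds into the current one, estimate returns 80 × 0.5 + 40 = 80, matching the worked example.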

Token bucket is the most widely used. It's simple, memory-efficient, and handles bursts naturally.

Where to Rate Limit

At the API gateway — the most common placement. Catches abuse before it reaches your services. Centralized configuration.

Per-service — each service enforces its own limits. Useful when different services have different capacity.

At the client — the client self-throttles. Cooperative but not enforceable (malicious clients ignore it).

At the load balancer — basic connection-level limiting. Good for DDoS protection but too coarse for application-level limits.

Layer them. Gateway handles global limits. Individual services handle their own capacity limits.

Rate Limit Keys

What do you rate limit by?

User/API key — most common. Each authenticated user gets their own quota.
IP address — useful for unauthenticated endpoints. But shared IPs (corporate NATs, VPNs) can cause false positives.
Endpoint — different limits for different endpoints. Login gets stricter limits than read-only endpoints.
Combination — per-user per-endpoint. "User X can call /search 10 times per minute."
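
One way to express these keys in code is sketched below. The key format, the function names, and the /login limit of 5 are hypothetical; only the /search limit of 10 per minute comes from the example above.

```python
# Hypothetical per-endpoint limits (requests per minute).
ENDPOINT_LIMITS = {"/login": 5, "/search": 10}
DEFAULT_LIMIT = 100


def limit_for(endpoint):
    """Stricter limits for sensitive endpoints, a default for the rest."""
    return ENDPOINT_LIMITS.get(endpoint, DEFAULT_LIMIT)


def rate_limit_key(endpoint, user_id=None, ip=None):
    """Prefer the authenticated identity; fall back to IP for anonymous calls."""
    if user_id is not None:
        return f"rl:user:{user_id}:{endpoint}"
    if ip is not None:
        return f"rl:ip:{ip}:{endpoint}"
    raise ValueError("need a user_id or an ip to build a rate limit key")
```

For instance, rate_limit_key("/search", user_id="X") yields "rl:user:X:/search", which can then be used as the counter name in whichever limiter is in place.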

Distributed Rate Limiting

With multiple servers, you need shared state for rate limit counters. Options:

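Redis — the standard choice. Atomic increment operations, TTL for window expiration, fast enough for the hot path. Use INCR with EXPIRE for fixed windows, setting the expiry when the counter is first created; because INCR and EXPIRE are separate commands, run them in a transaction or script so a crash between them cannot leave a counter without a TTL. Use Lua scripts for token bucket, where reading, refilling, and decrementing must happen as one atomic step.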

Local + sync — each server tracks locally and periodically syncs. Faster (no network hop) but less accurate. Acceptable when approximate limiting is fine.
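
The Redis fixed-window option could be sketched as follows. The function assumes a client object exposing incr and expire, as redis-py clients do; the FakeRedis class is an in-memory stand-in written for this example so that it runs without a server, and the key naming is hypothetical.

```python
import time


def allow_fixed_window(client, key, limit=100, window_seconds=60, now=None):
    """Fixed-window check backed by a shared counter store."""
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    counter_key = f"{key}:{window}"

    count = client.incr(counter_key)  # atomic in Redis; creates the key at 1
    if count == 1:
        # First hit in this window: make the counter expire eventually.
        # Caveat: INCR and EXPIRE are two commands here, so a crash between
        # them would leave a key with no TTL. A Lua script closes that gap.
        client.expire(counter_key, window_seconds * 2)
    return count <= limit


class FakeRedis:
    """In-memory stand-in with only the two methods the sketch needs."""

    def __init__(self):
        self.data = {}
        self.ttl = {}

    def incr(self, name):
        self.data[name] = self.data.get(name, 0) + 1
        return self.data[name]

    def expire(self, name, seconds):
        self.ttl[name] = seconds
        return True
```

Because every server increments the same counter, the limit holds across the fleet; the price is one network round trip per request, which is the cost the local + sync option trades away.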

Response Headers

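Tell clients about their rate limit status. The X-RateLimit-* names are a widely used convention rather than a formal standard; in this example X-RateLimit-Reset is a Unix timestamp and Retry-After is a number of seconds: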

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1684395600
Retry-After: 30

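When a request is rejected, return 429 Too Many Requests with a Retry-After header telling the client how long to wait. Well-behaved clients will back off. Clients that ignore the response and keep sending requests should be blocked at a lower layer, such as the load balancer.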
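
A small sketch of how a server might assemble the status code and headers. The function name rate_limit_response and its parameters are assumptions for illustration; it is framework-independent and returns a plain tuple.

```python
import math


def rate_limit_response(allowed, limit, remaining, reset_epoch, now):
    """Return (status_code, headers) for a request that was just checked."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),  # Unix timestamp
    }
    if allowed:
        return 200, headers
    # Rejected: tell the client how many seconds to wait before retrying.
    headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now)))
    return 429, headers
```

With a reset time of 1684395600 and a current time 30 seconds earlier, a rejected request gets status 429 and Retry-After: 30.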

Key Takeaways

Token bucket is the go-to algorithm: simple, efficient, allows bursts
Rate limit at the gateway for global protection, per-service for capacity
Use Redis for distributed rate limit counters
Return clear headers so clients know their quota status
Layer rate limiting: gateway + service + IP-based for defense in depth