21 - Designing for Failure
Everything Fails
Servers crash. Networks partition. Disks corrupt. Cloud regions go down. The question isn't whether failures happen, but how your system behaves when they do.
A well-designed system degrades gracefully. A poorly designed one cascades into a full outage from a single component failure.
Redundancy
The simplest defense: have more than one of everything.
- Multiple servers behind a load balancer (one dies, others serve traffic)
- Multiple database replicas (primary fails, promote a replica)
- Multiple availability zones (one AZ goes down, others keep running)
- Multiple regions (for critical systems that can't tolerate regional outages)
Redundancy costs money. The level of redundancy should match the cost of downtime.
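A back-of-the-envelope calculation can help match redundancy to the cost of downtime. The sketch below assumes replicas fail independently, which is rarely fully true in practice (shared power, shared deploys, shared bugs). The 99% per-instance availability is a made-up figure, and both function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(instance_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent instances is up."""
    return 1 - (1 - instance_availability) ** replicas


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year with no instance available."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} replica(s): {a:.6f} available, "
              f"~{expected_downtime_hours(a):.3f} h/year down")
```

Under these assumptions, a second replica cuts expected downtime from about 87.6 hours a year to under one hour. Whether a third replica is worth paying for depends on what an hour of downtime costs.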
Circuit Breakers
Like an electrical breaker in your house — when there's a surge, it trips to protect everything downstream.
The Problem
Your checkout service calls the payment service. Payment starts timing out (30s per request instead of 200ms). Without a circuit breaker, here's the cascade:
- Requests pile up — each checkout request opens a connection to payment and waits 30s for a response that never comes
- Resources exhaust — your checkout service has limited threads/connections (say 100). After 100 requests are stuck waiting, there are zero threads left to handle new requests
- Unrelated features break — a user viewing their order history is served by the same checkout service, but every thread is blocked waiting on payment. The request is rejected even though it never touches payment
- Failure spreads upstream — the API gateway sees checkout timing out, so it starts queuing requests too. Now the homepage, search, and account pages all slow down
- Full outage — one slow dependency (payment) has taken down the entire platform
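The resource-exhaustion step is simple arithmetic, and running the numbers shows how quickly it happens. The sketch below uses the figures from the scenario (100 threads, 200ms healthy latency, 30s timeouts). The arrival rate of 50 requests per second is an assumed value for illustration, and the function names are hypothetical.

```python
def max_throughput(threads: int, seconds_per_request: float) -> float:
    """Upper bound on requests/second a fixed-size pool can complete."""
    return threads / seconds_per_request


def seconds_until_exhausted(threads: int, arrival_rate: float) -> float:
    """Time until every thread is blocked, assuming no request finishes meanwhile."""
    return threads / arrival_rate


if __name__ == "__main__":
    print("healthy capacity :", max_throughput(100, 0.2), "req/s")
    print("degraded capacity:", round(max_throughput(100, 30.0), 2), "req/s")
    print("pool exhausted in:", seconds_until_exhausted(100, 50.0), "s")
```

With these numbers, capacity drops from 500 requests per second to roughly 3.3, and at 50 incoming requests per second the pool is fully blocked within 2 seconds.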
Three States
Closed (normal) — everything is healthy. Requests pass through to the downstream service as usual. Behind the scenes, the breaker is counting failures (timeouts, 5xx errors). As long as failures stay below the threshold, you'd never know the breaker exists.
Open (tripped) — the failure count crossed the threshold (e.g., 5 failures in 10 seconds). The breaker stops calling the downstream service entirely. Every incoming request gets an instant error or fallback response in 2ms instead of waiting 30s for a timeout. This protects your thread pool and prevents the cascade described above.
Half-Open (testing recovery) — after a cooldown period (e.g., 30 seconds), the breaker lets a few test requests through (e.g., 1-3) to check whether the downstream service has recovered. If those test requests succeed → the breaker closes and normal traffic resumes. If they fail → the breaker reopens and waits another cooldown period before trying again.
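The three states map onto a small state machine. The following is a minimal sketch, not a production implementation: the class and exception names are hypothetical, it is not thread-safe (a real breaker would guard its state with a lock), it treats every exception as a failure, and it allows a single trial request while half-open. The clock is injectable so the behavior can be exercised without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, window_seconds: float = 10.0,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown_seconds:
                raise CircuitOpenError("dependency unavailable, failing fast")
            self.state = "half_open"  # cooldown over: let one trial request through
        try:
            result = func()
        except Exception:
            self._record_failure(self._clock())
            raise
        if self.state == "half_open":
            # Trial request succeeded: resume normal traffic.
            self._failures.clear()
            self.state = "closed"
        return result

    def _record_failure(self, now: float) -> None:
        if self.state == "half_open":
            self._trip(now)  # trial failed: reopen for another cooldown
            return
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
```

A caller would wrap each downstream call, for example `breaker.call(lambda: charge(order))` where `charge` is whatever function talks to the payment service, and catch `CircuitOpenError` to return an error or fallback.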
Example Timeline
Payment service starts failing at 2:00 PM
2:00:01 — request 1 → timeout (30s). Failure count: 1
2:00:02 — request 2 → timeout (30s). Failure count: 2
...
2:00:05 — request 5 → timeout (30s). Failure count: 5 → THRESHOLD HIT
🔴 Circuit OPENS
2:00:06 — request 6 → instant failure (2ms). No call to payment.
2:00:07 — request 7 → instant failure (2ms). "Try again later."
...thousands of requests fail fast instead of waiting 30s each...
⏱️ 30s cooldown expires → 🟡 HALF-OPEN
2:00:35 — test request → success! Payment recovered.
🟢 Circuit CLOSES → normal traffic resumes
What Happens When Open?
- Return an error: "Payment temporarily unavailable"
- Use a fallback: queue the order for later processing
- Serve cached data (for read operations)
Typical Config
| Setting | Value | Purpose |
|---|---|---|
| Failure threshold | 5 in 10s | When to trip |
| Open duration | 30-60s | How long to stay open |
| Half-open requests | 1-3 | Test requests before closing |
The circuit breaker is what prevents "one bad service takes down everything."
Timeouts and Retries
Every network call needs a timeout. Without one, a hung connection blocks resources indefinitely.
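Timeouts — set aggressive timeouts, sized to the dependency's normal latency rather than to a generous default. If a call normally returns in 200ms, a timeout of a second or two leaves ample headroom. It's better to fail fast and retry than to wait 30 seconds for a response that may never come.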
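As an illustration of failing fast, here is a minimal sketch using `asyncio.wait_for`, which cancels the pending call once the deadline passes. `slow_payment_call` is a hypothetical stand-in for a dependency that has stopped responding, and the 0.5s default is an arbitrary value chosen for the example.

```python
import asyncio


async def slow_payment_call() -> str:
    """Hypothetical stand-in for a dependency that has stopped responding."""
    await asyncio.sleep(30)
    return "charged"


async def charge_with_timeout(timeout_seconds: float = 0.5) -> str:
    try:
        # wait_for cancels the awaited call when the timeout expires.
        return await asyncio.wait_for(slow_payment_call(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return "payment temporarily unavailable"


if __name__ == "__main__":
    print(asyncio.run(charge_with_timeout()))
```

The caller gets an answer within the timeout instead of holding a connection for 30 seconds. Most HTTP and database clients expose an equivalent timeout parameter, which is usually the simpler place to set it.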
Retries — retry failed requests, but with:
- Exponential backoff — wait 1s, then 2s, then 4s. Don't hammer a struggling service.
- Jitter — add randomness to backoff. Prevents all clients from retrying at the same instant.
- Max retries — cap the number of attempts. After 3 failures, give up and return an error.
Without backoff and jitter, retries can cause a retry storm that makes the outage worse.
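The three rules combine into a short helper. This is a sketch under stated assumptions: the function name is hypothetical, it retries on any exception (real code should retry only transient errors, and only operations that are safe to repeat), and it uses the variant often called "full jitter", where each wait is a random time between zero and the exponential ceiling. The `sleep` parameter is injectable so the logic can be exercised without real delays.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T], max_attempts: int = 3,
                       base_delay: float = 1.0, max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func up to max_attempts times, backing off between failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential ceiling: 1s, 2s, 4s, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time in [0, ceiling].
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Because every client draws its own random delay, clients that failed at the same moment spread their retries out instead of arriving together.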
Bulkheads
Named after ship compartments that prevent a hull breach from sinking the entire vessel. In software, bulkheads isolate failures to one part of the system.
Examples:
- Separate thread pools per dependency. If the payment service is slow, it exhausts its own thread pool without affecting the search service.
- Separate databases per service. One service's database overload doesn't affect others.
- Separate deployment groups — deploy to a canary group first. If it fails, only a small percentage of users are affected.
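The first example can be approximated without literal separate thread pools. The sketch below swaps in a semaphore per dependency, which caps how many calls to that dependency may be in flight at once; the isolation idea is the same, but it is a different mechanism from dedicated pools. The class name, exception name, and the limit of 20 are all hypothetical.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a dependency's compartment has no free slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func: Callable[[], T]) -> T:
        # Non-blocking acquire: reject immediately rather than queue behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func()
        finally:
            self._slots.release()


# One compartment per dependency: a slow payment service can fill only its own.
payment_bulkhead = Bulkhead("payment", max_concurrent=20)
search_bulkhead = Bulkhead("search", max_concurrent=20)
```

If payment calls hang, at most 20 threads are tied up in `payment_bulkhead`; further payment calls are rejected at once, and calls through `search_bulkhead` are unaffected.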
Graceful Degradation
When a component fails, serve a reduced experience instead of a complete failure.
Examples:
- Recommendation service is down? Show popular items instead of personalized ones.
- Search is slow? Return cached results from the last successful query.
- Payment processor is down? Let users add to cart but disable checkout with a clear message.
The user gets a worse experience, but they get an experience. That's better than a 500 error page.
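In code, graceful degradation is often just a guarded call with a cheaper fallback. The sketch below follows the recommendation example; `POPULAR_ITEMS`, `recommendations_for`, and the `fetch_personalized` callable are hypothetical names, and catching every exception is a simplification.

```python
from typing import Callable, List

# Hypothetical precomputed list, refreshed by some offline job.
POPULAR_ITEMS: List[str] = ["item-1", "item-2", "item-3"]


def recommendations_for(user_id: str,
                        fetch_personalized: Callable[[str], List[str]]) -> List[str]:
    """Return personalized items, or popular ones if the service fails."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        # Degraded path: generic but still useful content instead of an error page.
        return POPULAR_ITEMS
```

The fallback should not depend on the component that just failed, which is why the popular list here is a local, precomputed value.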
Chaos Engineering
Test your failure handling by intentionally breaking things in production.
Netflix pioneered this with Chaos Monkey, which randomly killed production instances. They have since moved on to more sophisticated tooling such as ChAP (the Chaos Automation Platform). The principle remains: if the system handles random failures gracefully, your redundancy works. If it doesn't, you found a gap before your users did.
Start small and widen the blast radius as confidence grows:
- Kill a single instance and verify traffic reroutes
- Introduce network latency between services
- Simulate a database failover
- Take down an entire availability zone
Tools: Gremlin, Litmus (Kubernetes-native), AWS Fault Injection Service (formerly Fault Injection Simulator). Start with non-critical services in staging before running experiments in production.
Only do this if you have confidence in your monitoring and rollback capabilities. You need to detect the failure quickly and stop the experiment if it causes unexpected impact.
Key Takeaways
- Assume everything will fail. Design for it from the start.
- Circuit breakers prevent cascading failures from slow dependencies.
- Always set timeouts. Always use exponential backoff with jitter for retries.
- Bulkheads isolate failures so one bad component doesn't sink the system.
- Degrade gracefully: a reduced experience beats a broken one.
- Test failure handling with chaos engineering before real failures test it for you.