01 - The System Design Process

Why a Process Matters

System design questions are open-ended. There's no single correct answer. What interviewers (and real projects) care about is how you think through the problem. A structured process keeps you from rambling or jumping to solutions too early.

The Four Steps

Every system design follows the same loop:

Step 1: Clarify Requirements

Don't start drawing boxes. Ask questions first.

Functional requirements — what does the system do? What are the core features? What can you leave out?

Non-functional requirements — how many users? What latency is acceptable? Does it need to be globally distributed? What's the consistency model?

Note: The consistency model defines whether all users see the same data at the same time (strong consistency) or whether the system allows temporary staleness for better performance and availability (eventual consistency). We'll learn more about this in the CAP Theorem lesson.

Constraints — budget, team size, existing infrastructure, timeline.

Spend 3-5 minutes here in an interview. In real life, spend days. Getting requirements wrong means building the wrong system.

Step 2: Estimate Scale

Before choosing components, understand the numbers. How many requests per second? How much storage? How much bandwidth?

These numbers tell you whether a single PostgreSQL instance is enough or you need sharding, whether you need a CDN, and whether caching is critical or optional.

We cover estimation in detail in the next lesson.

Step 3: Design the High-Level Architecture

Draw the major components and how they connect. Start with the simplest design that meets the requirements:

Client → Load Balancer → Application Servers → Database

Then add complexity only where the requirements demand it. Need low latency for reads? Add a cache. Need to handle write spikes? Add a queue. Need global reach? Add a CDN.

Every component you add should solve a specific problem. If you can't explain why it's there, remove it.
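As one illustration of adding a component only to solve a named problem, here is a minimal sketch of a read cache placed in front of a database. It uses the cache-aside pattern, which is one common choice and an assumption here, not something the design above prescribes. SlowDatabase and CachedReader are hypothetical names, and a plain dict stands in for a real cache.

```python
class SlowDatabase:
    """Hypothetical stand-in for the primary database; counts reads."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._rows.get(key)


class CachedReader:
    """Cache-aside reads: check the cache, fall back to the database on a
    miss, then store the result so the next read skips the database."""

    def __init__(self, db):
        self._db = db
        self._cache = {}  # stand-in for an external cache

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        value = self._db.get(key)
        if value is not None:
            self._cache[key] = value
        return value


db = SlowDatabase({"user:1": "Ada"})
reader = CachedReader(db)
first = reader.get("user:1")   # miss: goes to the database
second = reader.get("user:1")  # hit: served from the cache
```

The sketch has no expiry or invalidation, so a cached value can go stale after the database changes. That cost is the price of the lower read latency and is worth stating when you add the cache.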

Step 4: Deep Dive

Pick the most interesting or challenging component and design it in detail. This is where you show depth — database schema design, REST API endpoints, data models, indexing strategies, or retry logic.

In an interview, the interviewer will often guide you: "Tell me more about how you'd handle the database layer" or "What happens when this service fails?"

In real life, this is where you prototype, benchmark, and validate assumptions.

Tips for deep dives:

Pick the hardest or riskiest component — that's what interviewers want to hear about
Be concrete: name specific data structures, algorithms, or tools
Discuss failure cases: what happens when this component goes down?
Use numbers if relevant to show you understand the scale
Show depth on one thing rather than surface-level coverage of everything

Worked Example: Design a Notification System

Let's walk through the four steps with a real problem. You're asked: "Design a system that sends notifications to users (push, email, SMS)."

Step 1: Requirements

Functional:

Send push notifications, emails, and SMS to users
Users can set preferences (opt out of email, mute at night)
Support scheduled notifications (send at 9am local time)
Track delivery status (sent, delivered, failed)

Non-functional:

10 million notifications per day
Notifications should arrive within 30 seconds of being triggered
99.9% delivery rate (retry failed deliveries, fall back to alternate providers; accept that some — invalid numbers, uninstalled apps — will fail permanently)
Must not send duplicates

What we're NOT building: the notification content itself. Other services trigger notifications. We just deliver them.

Step 2: Estimate

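There are 86,400 seconds in a day, which rounds to roughly 100K. This is a handy shortcut for back-of-envelope calculations. It understates per-second rates by about 14%, which is well within the precision an estimate needs.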

10M notifications/day ÷ 100K seconds = ~100 notifications/sec average
Peak (assuming 3x average): ~300/sec
Storage per notification: ~500 bytes
Daily storage: 10M × 500B = 5 GB/day

Per-notification size breakdown:

Field	Approximate Size
notification_id (UUID)	16 bytes
user_id	16 bytes
channel (push/email/sms)	5 bytes
status (sent/delivered/failed)	10 bytes
title/subject	~100 bytes
body/message snippet	~200 bytes
timestamps (created, sent, delivered)	24 bytes
metadata (device token, email address, etc.)	~100 bytes

Total: ~470 bytes → rounded to ~500 bytes

300/sec is modest. A single server can handle this. But we need reliability (retries, multiple channels), so the architecture is about correctness, not raw throughput.
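The arithmetic above can be reproduced in a few lines. This is a sketch using only the figures already given (10M notifications per day, a 3x peak factor, and the field sizes from the table); the function and variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400            # exact
SECONDS_PER_DAY_ROUNDED = 100_000   # the back-of-envelope shortcut

NOTIFICATIONS_PER_DAY = 10_000_000
PEAK_FACTOR = 3                     # rule-of-thumb multiplier from the text
ROUNDED_RECORD_BYTES = 500

# Field sizes copied from the table above, in bytes.
FIELD_SIZES_BYTES = {
    "notification_id": 16,
    "user_id": 16,
    "channel": 5,
    "status": 10,
    "title": 100,
    "body": 200,
    "timestamps": 24,
    "metadata": 100,
}


def estimate(per_day, bytes_per_record):
    """Return average QPS (shortcut and exact), peak QPS, and daily storage."""
    avg_qps_shortcut = per_day / SECONDS_PER_DAY_ROUNDED
    return {
        "avg_qps_shortcut": avg_qps_shortcut,
        "avg_qps_exact": round(per_day / SECONDS_PER_DAY, 1),
        "peak_qps": avg_qps_shortcut * PEAK_FACTOR,
        "daily_storage_gb": per_day * bytes_per_record / 1e9,
    }


record_bytes = sum(FIELD_SIZES_BYTES.values())
numbers = estimate(NOTIFICATIONS_PER_DAY, ROUNDED_RECORD_BYTES)
print(record_bytes, numbers)
```

The exact average comes out a little higher than the shortcut value, but both lead to the same conclusion: the load is small enough that throughput is not the hard part of this design.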

Step 3: High-Level Design

      Trigger
         ↓
  Notification Service
         ↓
       Queue
         │
 ┌───────┼───────┐
 ▼       ▼       ▼
Push   Email    SMS
 │       │       │
 ▼       ▼       ▼
FCM  SendGrid Twilio

FCM (Firebase Cloud Messaging), SendGrid, and Twilio are third-party delivery APIs for push notifications, email, and SMS respectively.

Why a queue? Because delivery is slow and unreliable (third-party APIs). The queue decouples "decide to send" from "actually send." If Twilio is down, SMS messages wait in the queue instead of being lost.
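A minimal sketch of that decoupling follows. Python's in-process queue.Queue stands in for a durable message broker, and FakeSmsProvider is a hypothetical stand-in for a third-party API such as Twilio; neither is meant as the real integration.

```python
import queue


class ProviderDown(Exception):
    """Raised by the stand-in provider while it is unavailable."""


class FakeSmsProvider:
    """Hypothetical stand-in for a third-party SMS API."""

    def __init__(self):
        self.available = False
        self.delivered = []

    def send(self, message):
        if not self.available:
            raise ProviderDown("SMS provider unavailable")
        self.delivered.append(message)


def drain_once(q, provider):
    """Try each currently queued message once; requeue the ones that fail.

    Returns the number of messages delivered in this pass.
    """
    delivered = 0
    for _ in range(q.qsize()):
        message = q.get()
        try:
            provider.send(message)
            delivered += 1
        except ProviderDown:
            q.put(message)  # keep it for a later pass instead of losing it
    return delivered


sms_queue = queue.Queue()
provider = FakeSmsProvider()

# The notification service only enqueues; it never calls the provider itself.
for text in ["code 1234", "order shipped"]:
    sms_queue.put(text)

first_pass = drain_once(sms_queue, provider)   # provider is down
provider.available = True
second_pass = drain_once(sms_queue, provider)  # provider is back
```

While the provider is down, nothing is delivered and nothing is dropped. Once it recovers, the same messages go out in their original order. A production worker would also need backoff between passes and a limit on retries, which this sketch leaves out.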

Step 4: Deep Dive

The interviewer asks: "How do you prevent duplicates?"

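Each notification gets a unique ID. Before sending, the worker tries to claim that ID in a deduplication store (for example, Redis with the notification ID as the key and a TTL of 24 hours). The check and the write must be a single atomic operation, such as Redis SET with the NX option; a separate read followed by a write would let two workers both see the key as missing and both send. If the claim fails because the key already exists, the worker skips the message. If it succeeds, the worker sends.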

This makes the system idempotent. Even if a message is delivered to the worker twice (at-least-once queue semantics), the user only gets one notification.
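The deduplication check can be sketched as below. To keep the example self-contained, InMemoryDedupStore is a hypothetical in-memory stand-in for Redis; its set_if_absent method combines the check and the write in one step, which in Redis would correspond to SET with the NX and EX options. The handle function and the "notif:" key prefix are also illustrative assumptions.

```python
import time
from typing import Callable, Optional

DEDUP_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as in the text


class InMemoryDedupStore:
    """In-memory stand-in for Redis, for illustration only."""

    def __init__(self) -> None:
        self._expires_at = {}  # key -> expiry time in seconds

    def set_if_absent(self, key: str, ttl_seconds: float,
                      now: Optional[float] = None) -> bool:
        """Claim the key if it is missing or expired; return True on success."""
        now = time.monotonic() if now is None else now
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # key is still live: already claimed
        self._expires_at[key] = now + ttl_seconds
        return True


def handle(notification_id: str, send: Callable[[str], None],
           store: InMemoryDedupStore, now: Optional[float] = None) -> bool:
    """Send at most once per notification ID within the TTL window."""
    key = f"notif:{notification_id}"
    if not store.set_if_absent(key, DEDUP_TTL_SECONDS, now):
        return False  # redelivered message; skip it
    send(notification_id)
    return True


store = InMemoryDedupStore()
sent = []
results = [
    handle("n-1", sent.append, store, now=0.0),
    handle("n-1", sent.append, store, now=5.0),  # queue redelivery
    handle("n-2", sent.append, store, now=6.0),
]
```

Claiming the key before sending favors "no duplicates" over "never lost": if a worker crashes after the claim but before the send, the redelivered message is skipped. Whether that is acceptable depends on the requirements, and it is worth naming as a tradeoff.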

That's the process. Requirements told us what to build. Estimation told us the scale is manageable. The high-level design introduced a queue for reliability. The deep dive solved a specific hard problem (deduplication).

Common Mistakes

Jumping to solutions — "We'll use Kafka and Redis and Kubernetes." Why? What problem does each solve? Start with the problem, not the tools.

Over-engineering — adding components for problems you don't have. If you have 1,000 users, you don't need sharding. Design for current scale with a path to grow.

Ignoring tradeoffs — every decision has a cost. Caching adds complexity and staleness risk. Microservices add network overhead and operational burden. State the tradeoff explicitly.

No numbers — "it should be fast" isn't a requirement. "p99 latency under 200ms for 10K concurrent users" is. Numbers drive decisions.
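To show why a numeric target is more useful than "fast", here is a small sketch that checks a p99 latency target against measured samples. It uses the nearest-rank percentile definition, which is one of several in use, and the sample data is made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples_ms, target_ms=200, pct=99):
    """True if the pct-th percentile latency is within the target."""
    return percentile(samples_ms, pct) <= target_ms


# Hypothetical measurements: 98 fast requests and 2 slow ones.
latencies_ms = [50] * 98 + [900, 1200]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
ok = meets_latency_target(latencies_ms)
```

With this data the mean and median both look healthy, yet the p99 is far above 200 ms, so the requirement is not met. A vague goal like "it should be fast" would not have caught that.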

Tradeoff Analysis

System design is about tradeoffs. There's rarely a "best" answer. There's only "best for these constraints."

When evaluating options, ask:

What does this optimize for? (latency? throughput? consistency? cost?)
What does it sacrifice?
What happens when it fails?
How complex is it to operate?

State your reasoning. For example: "I'm choosing eventual consistency here because the use case tolerates stale data for a few seconds, and it lets us scale reads horizontally without coordination."

A Template for Any Design

The four steps above are the core loop. Here's an expanded checklist that adds data modeling, API design, and failure handling for completeness:

1. Requirements (functional + non-functional)
2. Estimation (QPS, storage, bandwidth)
3. High-level design
   a. Architecture (boxes and arrows)
   b. Data model (what's stored, how it's accessed)
   c. API design (endpoints, request/response)
4. Deep dive
   a. The hardest part (scale, consistency, etc.)
   b. Failure handling (what breaks, how you recover)

You don't always cover all of these in an interview. But thinking through each one makes your design complete.

Key Takeaways

Always start with requirements, never with components
Estimate scale before choosing technology
Add complexity only when the requirements demand it
Every design decision is a tradeoff — state it explicitly
A structured process prevents rambling and missed requirements