03 - Networking Fundamentals
Understanding networking protocols is essential for system design. Every decision, from choosing TCP vs UDP to picking an HTTP version, affects latency, reliability, and scalability.
TCP vs UDP
TCP (Transmission Control Protocol) — reliable, ordered, connection-oriented.
- Uses a 3-way handshake (SYN → SYN-ACK → ACK) before data flows. After that, data packets flow back and forth with just ACKs to confirm receipt. The connection stays open until one side sends a FIN to close it.
- Guarantees delivery through acknowledgments and retransmissions
- Flow control: receiver tells sender how much data it can handle (sliding window)
- Congestion control: sender slows down when the network is saturated (slow start, AIMD¹)
- Adds overhead: connection setup latency, header size, retransmission delays
¹ AIMD (Additive Increase, Multiplicative Decrease): slowly increase sending rate while things are fine, cut in half when packet loss is detected. This prevents network collapse while probing for available bandwidth.
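To make the AIMD footnote concrete, here is a minimal sketch of how a congestion window might evolve. It deliberately ignores slow start and models time as whole round trips; the function name `aimd_trace` and its parameters are illustrative assumptions, not part of any real TCP stack.

```python
def aimd_trace(loss_rounds, rounds, start=1.0, increase=1.0, decrease=0.5):
    """Return the congestion window (in segments) at the start of each round.

    loss_rounds: set of round indexes in which packet loss is detected.
    """
    cwnd = start
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            # Multiplicative decrease: cut the window in half on loss.
            cwnd = max(1.0, cwnd * decrease)
        else:
            # Additive increase: probe for bandwidth one segment at a time.
            cwnd += increase
    return trace


print(aimd_trace(loss_rounds={5}, rounds=8))
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 3.0, 4.0]
```

The window climbs linearly, halves after the loss in round 5, then starts climbing again, which is the sawtooth pattern usually associated with AIMD.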
UDP (User Datagram Protocol) — connectionless, no delivery guarantees.
- No handshake — send packets immediately (saves 1 round trip vs TCP)
- "Unreliable" means no guarantee at the protocol level, not "broken" — it's a deliberate tradeoff for speed
- No ordering, no congestion control, no retransmissions — a lost packet doesn't block subsequent ones
- Minimal overhead: 8-byte header vs TCP's 20+ bytes
- The application handles retries if needed (e.g., DNS retries after a timeout, video streaming skips the lost frame)
- UDP has no concept of a "connection": the sender has no idea whether the receiver is still there. Applications solve this with heartbeats or timeouts.
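Since UDP itself tracks no connection state, liveness has to live in the application. The sketch below shows one possible heartbeat-with-timeout tracker. The class name `LivenessTracker` is hypothetical, and timestamps are passed in explicitly to keep the example deterministic; real code would more likely read `time.monotonic()`.

```python
class LivenessTracker:
    """Tracks the last heartbeat per peer and reports peers that went quiet."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # peer id -> timestamp of last heartbeat

    def heartbeat(self, peer, now):
        # Called whenever any datagram (or explicit heartbeat) arrives from peer.
        self.last_seen[peer] = now

    def dead_peers(self, now):
        # A peer is presumed gone once it has been silent longer than the timeout.
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


tracker = LivenessTracker(timeout_s=5.0)
tracker.heartbeat("player-a", now=0.0)
tracker.heartbeat("player-b", now=0.0)
tracker.heartbeat("player-a", now=4.0)
print(tracker.dead_peers(now=6.0))  # ['player-b']
```

The timeout is a tradeoff: a short value detects dead peers quickly but may misfire on a few lost heartbeats, since UDP can drop those too.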
When to Choose Which
| Criteria | TCP | UDP |
|---|---|---|
| Reliability required | ✅ Guaranteed delivery | ❌ Best-effort |
| Latency-sensitive | ❌ Handshake + retransmits add delay | ✅ Fire and forget |
| Ordering matters | ✅ In-order delivery | ❌ Out-of-order possible |
| Use cases | APIs, web pages, file transfers, databases, email | Video streaming, gaming, DNS lookups, VoIP, metrics |
| System design signal | "We can't lose data" | "Speed > reliability, or app handles loss" |
Design rule: default to TCP unless you have a specific reason to use UDP (real-time streaming, high-frequency telemetry, or you're building your own reliability layer on top).
Example: Video Streaming — On-Demand vs Live
| | On-Demand (Netflix) | Live (Twitch, Zoom) |
|---|---|---|
| Protocol | TCP (HTTP/HTTPS), increasingly QUIC | UDP (or UDP-based like WebRTC, QUIC) |
| Why | Buffering hides latency. A 30-60 second buffer means retransmissions are invisible to the user. Reliability matters — you don't want glitches in a movie. | Real-time matters. A lost frame from 1 second ago is useless — just show the next one. Retransmitting would add delay and make the stream feel laggy. |
| Tradeoff | Higher latency (buffering), perfect quality | Lower latency, occasional visual artifacts |
On-demand can afford TCP's overhead because it buffers ahead. Live streaming can't wait for retransmissions — stale data is worse than missing data.
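The buffering argument can be sketched with a toy playout model. It assumes frame `i` is due for display at `buffer_ms + i * frame_interval_ms` and that a frame arriving after its deadline is skipped; the function name and the arrival times are made up for illustration.

```python
def playable_frames(arrivals_ms, frame_interval_ms, buffer_ms):
    """Return indexes of frames that arrive before their playout deadline."""
    return [
        i for i, arrival in enumerate(arrivals_ms)
        if arrival <= buffer_ms + i * frame_interval_ms
    ]


# Frame 2 was lost and retransmitted, so it arrives late (400 ms).
arrivals = [0, 40, 400, 120]

# Live: a 100 ms buffer cannot absorb the retransmission, so frame 2 is skipped.
print(playable_frames(arrivals, frame_interval_ms=40, buffer_ms=100))     # [0, 1, 3]

# On-demand: a 30 s buffer makes the same delay invisible.
print(playable_frames(arrivals, frame_interval_ms=40, buffer_ms=30_000))  # [0, 1, 2, 3]
```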
DNS
DNS translates human-readable domain names into IP addresses.
How Resolution Works
When you type api.stripe.com in a browser, your machine doesn't know its IP. Here's how it finds it:
- Recursive resolver (your ISP or 8.8.8.8²) does the work on your behalf
- Root nameserver: knows where to find the .com, .io, .org TLD servers (13 root server identities, each replicated worldwide)
- TLD nameserver³: knows which authoritative server handles stripe.com
- Authoritative nameserver: Stripe's own DNS server (hosted on Route 53, Cloudflare, etc.). "Authoritative" means "source of truth for this domain", not authentication. It stores the record api.stripe.com → 13.35.67.89 because Stripe configured it there.
- Result is cached at every level (browser, OS, resolver) for the duration of the TTL
² 8.8.8.8 is Google's public DNS resolver. Other popular options: 1.1.1.1 (Cloudflare) and 9.9.9.9 (Quad9).
³ TLD = Top-Level Domain (.com, .org, .io, etc.). The TLD nameserver knows which authoritative server handles each domain under that TLD, but doesn't know the final IP itself.
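The resolution chain can be sketched as three lookups, each returning a referral to the next level. This is a toy in-memory model, not a real resolver: the server names (`tld-com`, `ns-stripe`) are hypothetical, the IP is the example value used above, and the "last two labels" rule for finding the domain is a simplification that would fail for suffixes like .co.uk.

```python
# Each level only knows who to ask next; only the last one knows the IP.
ROOT = {"com": "tld-com"}
TLD = {"tld-com": {"stripe.com": "ns-stripe"}}
AUTHORITATIVE = {"ns-stripe": {"api.stripe.com": "13.35.67.89"}}


def resolve(name):
    """Walk root -> TLD -> authoritative, as a recursive resolver would."""
    labels = name.split(".")
    tld = labels[-1]
    domain = ".".join(labels[-2:])  # simplification: registered domain = last two labels

    steps = []
    tld_server = ROOT[tld]                     # root: "ask the .com servers"
    steps.append(("root", tld_server))
    auth_server = TLD[tld_server][domain]      # TLD: "ask Stripe's nameserver"
    steps.append(("tld", auth_server))
    ip = AUTHORITATIVE[auth_server][name]      # authoritative: the actual record
    steps.append(("authoritative", ip))
    return ip, steps


ip, steps = resolve("api.stripe.com")
print(ip)  # 13.35.67.89
for level, answer in steps:
    print(level, "->", answer)
```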
DNS Record Types
| Type | Purpose | Example |
|---|---|---|
| A | Maps domain → IPv4 address | api.example.com → 93.184.216.34 |
| AAAA | Maps domain → IPv6 address | api.example.com → 2606:2800:... |
| CNAME | Alias to another domain | www.example.com → example.com |
| MX | Mail server for the domain | example.com → mail.example.com |
| NS | Authoritative nameservers | example.com → ns1.provider.com |
TTL and Caching
- TTL (Time to Live): how long a DNS record is cached before re-querying
- Caching happens at: browser, OS, recursive resolver, and intermediate servers
- High TTL (hours/days): fewer lookups, but slow to update during incidents
- Low TTL (30-60s): fast failover, but more DNS traffic
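A minimal sketch of TTL-based caching, roughly what each caching layer does. The class name `TtlCache` is an assumption for illustration, and the clock is passed in as `now` so the behavior is easy to follow; a real cache would read the system clock.

```python
class TtlCache:
    """Caches DNS answers until their TTL expires."""

    def __init__(self):
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl_s, now):
        self._entries[name] = (value, now + ttl_s)

    def get(self, name, now):
        entry = self._entries.get(name)
        if entry is None:
            return None          # never cached: caller must query upstream
        value, expires_at = entry
        if now >= expires_at:
            del self._entries[name]
            return None          # expired: caller must re-query upstream
        return value


cache = TtlCache()
cache.put("api.example.com", "93.184.216.34", ttl_s=60, now=0)
print(cache.get("api.example.com", now=59))  # 93.184.216.34
print(cache.get("api.example.com", now=60))  # None
```

With a 60-second TTL, a changed record can keep being served from this cache for up to a minute, which is the failover delay the tradeoff above refers to.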
DNS in System Design
DNS-based load balancing:
- Round-robin DNS: return multiple IPs, clients pick one. Basic round-robin has no health checks — the DNS server doesn't know if a backend is down (though providers like Route 53 and Cloudflare add health-checked variants).
- GeoDNS: return different IPs based on client location. Routes users to nearest region.
- Weighted DNS: distribute traffic unevenly (e.g., 90/10 for canary deploys).
DNS-based load balancing is coarse-grained: it steers traffic per DNS lookup (and cached answers are reused until the TTL expires), not per request. It can't do session affinity or fine-grained traffic splitting. Use it for global routing; use a real load balancer (L4/L7) for request-level distribution.
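The round-robin and weighted strategies can be sketched in a few lines. This is an illustration of the selection logic only, not how any particular DNS provider implements it; the function names are hypothetical and the addresses come from the reserved documentation range.

```python
import random
from collections import Counter


def round_robin(ips):
    """Yield the record list rotated by one position per query."""
    i = 0
    while True:
        yield ips[i:] + ips[:i]
        i = (i + 1) % len(ips)


def pick_weighted(records, rng):
    """Pick one IP from (ip, weight) pairs, proportionally to weight."""
    ips = [ip for ip, _ in records]
    weights = [weight for _, weight in records]
    return rng.choices(ips, weights=weights, k=1)[0]


answers = round_robin(["192.0.2.10", "192.0.2.20", "192.0.2.30"])
print(next(answers))  # ['192.0.2.10', '192.0.2.20', '192.0.2.30']
print(next(answers))  # ['192.0.2.20', '192.0.2.30', '192.0.2.10']

# 90/10 canary split: stable fleet vs canary fleet.
canary_split = [("192.0.2.10", 90), ("192.0.2.20", 10)]
rng = random.Random(42)  # seeded so runs are repeatable
counts = Counter(pick_weighted(canary_split, rng) for _ in range(10_000))
print(counts)
```

Note that the split applies to lookups, not requests: one resolver caching the canary answer can send far more than 10% of real traffic there until the TTL expires.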
DNS as a failure point:
- If your nameserver goes down, nobody can resolve your domain
- Propagation delay: changing a DNS record doesn't take effect instantly — cached entries persist until TTL expires, and some resolvers ignore TTL
- Mitigation: multiple redundant nameservers across providers, low TTL for quick failover
- Services like Route 53, Cloudflare DNS provide built-in redundancy and health checks
HTTP Versions
HTTP/1.1
- Persistent connections: reuse TCP connections across requests (keep-alive)
- Head-of-line (HOL) blocking: one slow response blocks all subsequent requests on that connection (e.g., a large image blocks a small JSON call behind it)
- Browsers open up to 6 parallel connections per domain (the typical limit) to work around HOL blocking
- Workarounds: domain sharding (serve from multiple subdomains to bypass 6-connection limit), sprite sheets (combine images into one file), bundling (combine JS/CSS into one file) — all hacks to reduce request count. Obsolete with HTTP/2.
HTTP/2
- Multiplexing: multiple requests and responses interleaved on a single TCP connection — eliminates the need for multiple connections and makes HTTP/1.1 workarounds unnecessary
- Header compression (HPACK): reduces redundant header bytes across requests
- Server push: server sends resources before client requests them. Rarely used — hard to avoid pushing resources the client already cached, and Chrome dropped support in 2022.
- Stream prioritization: client hints which resources matter most
- Still suffers from TCP-level HOL blocking — TCP treats all streams as one byte stream, so one lost packet stalls all streams even if they're independent
Two kinds of HOL blocking:
- HTTP-level (HTTP/1.1) — requests queue up; request B waits for response A to finish. Fixed by HTTP/2 multiplexing.
- TCP-level (HTTP/2) — TCP guarantees in-order byte delivery, so a lost packet holds up all streams until retransmitted. Fixed by HTTP/3/QUIC, where each stream is delivered independently.
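HTTP-level HOL blocking can be illustrated with a toy timing model. It assumes one connection that transfers one unit of data per tick, with HTTP/1.1 sending responses back to back and HTTP/2 interleaving streams round-robin. Packet loss (the TCP-level case) is not modeled, and the function names are illustrative.

```python
def sequential_finish(sizes):
    """HTTP/1.1 on one connection: each response waits for the previous one."""
    finish_times, t = [], 0
    for size in sizes:
        t += size
        finish_times.append(t)
    return finish_times


def multiplexed_finish(sizes):
    """HTTP/2: streams share the connection, one unit each in round-robin order."""
    remaining = list(sizes)
    finish_times = [0] * len(sizes)
    t = 0
    while any(remaining):
        for i in range(len(remaining)):
            if remaining[i] > 0:
                t += 1
                remaining[i] -= 1
                if remaining[i] == 0:
                    finish_times[i] = t
    return finish_times


# A large image (10 units) requested just before a small JSON call (1 unit).
print(sequential_finish([10, 1]))   # [10, 11]  JSON is stuck behind the image
print(multiplexed_finish([10, 1]))  # [11, 2]   JSON finishes almost immediately
```

Total transfer time is the same in both cases (11 ticks); multiplexing only changes who waits, which is why it helps small, latency-sensitive responses most.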
HTTP/3
- Built on QUIC⁴ — a UDP-based transport that reimplements TCP's reliability (ordering, retransmission) but per-stream instead of per-connection
- Eliminates TCP HOL blocking — packet loss in one stream doesn't affect others, because each stream is independent at the transport level
- 0-RTT connection setup: resume a previous connection and send data immediately, versus 2 RTTs (round-trip times) for TCP + TLS 1.3 or 3 RTTs for TCP + TLS 1.2. A first-time QUIC connection still needs 1 RTT. Think of it like this: normally you knock, wait for an answer, show ID, wait for approval, then talk. With 0-RTT, you walk in and start talking because the server remembers you from last time.
- Built-in encryption (TLS 1.3 integrated into the protocol — no separate TLS handshake)
- Connection migration: switch networks (e.g., WiFi → cellular) without dropping the connection, because QUIC identifies connections by ID, not by IP+port
- Better performance on unreliable networks (mobile, high-latency)
⁴ QUIC originally stood for "Quick UDP Internet Connections" (Google), but is now just a name (IETF RFC 9000).
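The handshake counts above translate directly into setup delay before the first request can be sent. A small worked example, using the RTT counts from this section and an assumed 80 ms round trip (a made-up but plausible mobile latency):

```python
# Round trips spent on connection setup before application data can flow.
SETUP_RTTS = {
    "TCP + TLS 1.2": 3,
    "TCP + TLS 1.3": 2,
    "QUIC (first connection)": 1,
    "QUIC (0-RTT resumption)": 0,
}


def setup_delay_ms(rtt_ms):
    """Return the setup delay per protocol stack for a given round-trip time."""
    return {name: rtts * rtt_ms for name, rtts in SETUP_RTTS.items()}


for name, delay in setup_delay_ms(rtt_ms=80).items():
    print(f"{name}: {delay} ms")
# TCP + TLS 1.2: 240 ms
# TCP + TLS 1.3: 160 ms
# QUIC (first connection): 80 ms
# QUIC (0-RTT resumption): 0 ms
```

The savings scale with RTT, so they matter far more for distant or mobile users than inside a datacenter, where a round trip is typically well under a millisecond.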
What's Used Today
HTTP/2 is the default. Most web servers, CDNs, and browsers use HTTP/2 today. If you deploy behind Cloudflare or an AWS ALB with an HTTPS listener, clients are already served over HTTP/2 without extra work; Nginx supports it too, but it has to be enabled in the server config.
HTTP/3 is growing fast. Google, Cloudflare, and Meta serve a large share of their traffic over HTTP/3. Browsers upgrade automatically when the server advertises support (via the Alt-Svc header). By 2025, roughly 30% of web traffic is estimated to use HTTP/3.
HTTP/1.1 is still everywhere for backend-to-backend calls, legacy APIs, and simple tools (curl, webhooks). It works fine when latency is already low.
For system design interviews: assume HTTP/2 for client-facing traffic and mention HTTP/3 as an optimization for mobile/global users.
When Each Matters in System Design
| Scenario | Best Choice | Why |
|---|---|---|
| Internal microservices (same datacenter) | HTTP/2 (often via gRPC, which runs on it) | Low latency already, multiplexing helps |
| Public-facing web app | HTTP/2 minimum, HTTP/3 ideal | Reduces page load time for users |
| Mobile clients on spotty networks | HTTP/3 | Connection migration, 0-RTT, no HOL blocking |
| Legacy system integration | HTTP/1.1 | Compatibility, simplicity |
| Real-time bidirectional | WebSockets | Full-duplex communication |
Key Takeaways
- Default to TCP for most system design scenarios; choose UDP only when latency trumps reliability (streaming, gaming, telemetry)
- DNS is infrastructure — use GeoDNS for global routing, low TTLs for failover, and redundant nameservers to avoid SPOF
- HTTP/2 is the baseline for modern services — multiplexing eliminates most HTTP/1.1 workarounds
- HTTP/3 (QUIC) shines on unreliable networks — mention it when designing for mobile or global users
- DNS-based load balancing is the simplest global traffic distribution but lacks health-awareness without a managed DNS service
- Know the tradeoffs at each layer — interviewers want to hear you reason about latency, reliability, and failure modes