โ† Back to Blog

How to Design ChatGPT: A System Design Walkthrough

May 18, 2026 ยท 7 min read system-designinterviewaiarchitecture

I got asked this question in a mock interview last week. "Design ChatGPT." Two words. Thirty minutes on the clock. And I completely blanked.

Not because I don't understand LLMs. I use them every day. But there's a gap between using a system and explaining how to build one from scratch. I fumbled through some vague answer about "GPUs and a load balancer" and felt like an idiot afterward.

So I went home and actually worked through it. Here's what I came up with.

Why This Question Is Tricky

Most system design questions have well-known patterns. Design a URL shortener? Hash function, database, redirect. Design Twitter? Fan-out, timeline cache, pub/sub.

But "design ChatGPT" is different. The bottleneck isn't your typical web service problem. It's not disk I/O or database queries. It's GPU compute. Every single response requires running billions of matrix multiplications on specialized hardware that costs $30k per chip.

That changes everything about how you think about the architecture.

Start With Requirements

Before drawing boxes, I'd clarify what we're actually building.

Functional requirements:

  • Users send text messages, get text responses back
  • Conversations have memory (multi-turn context)
  • Responses stream token by token (not one big blob)
  • Different model tiers (GPT-4, GPT-3.5, etc.)

Non-functional requirements:

  • First token should appear in under a second
  • Support millions of concurrent users
  • 99.9% availability
  • Graceful degradation under load (queue, don't crash)

You know what's interesting? The streaming part isn't optional. If ChatGPT waited until the full response was generated before showing anything, you'd be staring at a blank screen for 10-30 seconds. That's unusable. Streaming is a core UX requirement.

The High-Level Architecture

Here's how the pieces fit together:

Client (Browser / Mobile / API)
        โ”‚
        โ–ผ
   Load Balancer (L7, WebSocket-aware)
        โ”‚
        โ–ผ
   API Gateway (auth, rate limiting, validation)
        โ”‚
        โ”œโ”€โ”€โ–บ Conversation Service
        โ”‚         โ”‚
        โ”‚         โ–ผ
        โ”‚    Conversation DB (PostgreSQL / DynamoDB)
        โ”‚
        โ–ผ
   Inference Router
        โ”‚
        โ–ผ
   GPU Cluster (model serving)
        โ”‚
        โ–ผ
   Token Stream (SSE back to client)

Nothing here is revolutionary on its own. The magic is in how these pieces handle the GPU bottleneck.

The Conversation Service

This one's straightforward. It stores message history and assembles the context window for each request.

Here's the catch though. GPT-4 has a context window of, say, 128K tokens. But most conversations don't use anywhere near that. The service needs to:

  1. Store the full conversation history
  2. When a new message comes in, assemble the context (system prompt + conversation history + new message)
  3. If the assembled context exceeds the token limit, truncate or summarize older messages

The storage is simple. A conversations table, a messages table, foreign key relationship. PostgreSQL handles this fine at scale with proper indexing.

But the context assembly logic? That's where it gets interesting. Do you truncate from the beginning? Summarize old messages into a condensed form? Keep the first message (often contains important instructions) and drop the middle?

OpenAI appears to use a sliding window with the system message always pinned. That's a reasonable default.

The Inference Router

This is the brain of the operation. When a request comes in, the router decides:

  • Which model to use (based on the user's subscription tier)
  • Which GPU cluster has capacity
  • Whether to queue the request or serve it immediately
  • Priority (paid users jump the line)

Think of it like a smart load balancer, but for GPU resources instead of web servers.

The router maintains a view of cluster health and capacity. It knows which nodes are overloaded, which have free slots in their batch, and which are unhealthy. It routes accordingly.

GPU Serving: Where It Gets Hard

Here's where this design diverges from a typical web service. A normal API server can handle thousands of requests per second on a single machine. A GPU running inference? Maybe 10-50 concurrent requests, depending on the model size and batch configuration.

Batching is everything. If you process one request at a time, you're wasting 90% of your GPU capacity. The model weights are already loaded in memory. Running one sequence through them vs. running 20 sequences costs almost the same in wall-clock time.

So the serving layer batches incoming requests together. It waits a few milliseconds to collect requests, then processes them as a batch. This is called continuous batching, and it's how you get from "1 request per GPU" to "50 concurrent requests per GPU."

KV-cache is the other big optimization. In a multi-turn conversation, you don't want to recompute attention for all previous tokens every time the user sends a new message. You cache the key-value pairs from previous turns. This turns a 10-second computation into a 1-second one for follow-up messages.

But KV-cache eats GPU memory. A single conversation's cache can be several gigabytes for long contexts. So you need a strategy: evict caches for idle conversations, keep hot ones in GPU memory, maybe spill to CPU memory or even disk for conversations that might resume.

Streaming: SSE vs WebSocket

How do tokens get back to the client? Two options:

Server-Sent Events (SSE) is simpler. It's a one-way stream over HTTP. The client opens a connection, the server pushes tokens as they're generated. This is what OpenAI's API actually uses.

WebSockets are bidirectional but more complex to manage at scale. You need sticky sessions or a pub/sub layer to route messages to the right connection.

I'd go with SSE for this design. It's simpler, works through CDNs and proxies more reliably, and you don't need bidirectional communication. The client sends a request, then listens for the stream. Done.

Each token gets sent as a small JSON event:

data: {"token": "Hi"}
data: {"token": ","}
data: {"token": " how"}
data: {"token": " can"}
data: {"token": " I"}
data: {"token": " assist"}
data: {"token": " you"}
data: {"token": "?"}
data: [DONE]

Rate Limiting and Queuing

You can't just throw infinite requests at a GPU cluster. You need multiple layers of protection:

Per-user rate limits. Free tier gets 3 requests per minute. Paid tier gets 60. Implemented with token buckets in Redis.

Global admission control. If all GPU clusters are at capacity, new requests go into a queue. The queue has a max depth. If it's full, return a 503 with a retry-after header.

Priority queuing. Paid users get a separate, higher-priority queue. Their requests get picked up first when GPU slots free up.

This is why ChatGPT sometimes says "we're at capacity." It's not broken. It's the system doing exactly what it should: protecting GPU resources from overload.

Back-of-Envelope Numbers

Let's sanity check this design.

  • 100M daily active users
  • Average 10 messages per day per user
  • That's 1 billion messages per day, or about 12,000 requests per second

Each response takes roughly 10 seconds to generate (500 tokens at 50 tokens/sec). With batching, each GPU handles about 10 concurrent requests. So at peak:

  • 12,000 requests/sec ร— 10 sec average duration = 120,000 concurrent requests
  • 120,000 / 10 per GPU = 12,000 GPUs needed at peak

That's... a lot of GPUs. And that's why OpenAI's compute bill is measured in billions. It's also why the free tier has aggressive rate limits and why there's a paid tier at all.

What About Fine-Tuning and Updates?

Model updates need to be seamless. You can't take the whole system down to deploy a new model version.

The approach: blue-green deployment at the GPU cluster level. Spin up new clusters with the updated model. Route a percentage of traffic to them. Monitor quality metrics. Gradually shift all traffic over. Tear down old clusters.

The inference router makes this possible. It already knows how to route to different clusters. Adding version-aware routing is a small extension.

Storage Summary

Data Store Why
Conversations & messages PostgreSQL Structured, transactional
User accounts & billing PostgreSQL Relational
Rate limit counters Redis Fast, ephemeral
Request queue Kafka Durable, ordered, handles backpressure
Model weights S3 Large blobs, versioned
KV-cache GPU VRAM / CPU RAM Ultra-low latency access

What I'd Mention in an Interview

If I got this question again, I'd focus on three things:

  1. The GPU bottleneck shapes everything. This isn't a typical stateless web service. The expensive resource is GPU compute, and the entire architecture exists to maximize GPU utilization.

  2. Batching and KV-cache are the key optimizations. Without them, you'd need 10x more GPUs. They're not nice-to-haves. They're what makes the economics work.

  3. Streaming is a UX requirement, not a feature. Without token streaming, the product is unusable. It changes how you think about the connection layer.

Everything else (the conversation service, rate limiting, queuing) is standard distributed systems stuff. The GPU layer is what makes this question unique.

The Takeaway

System design interviews aren't about memorizing architectures. They're about reasoning through constraints. And for ChatGPT, the constraint is clear: GPUs are expensive, slow, and scarce. Every design decision flows from that.

Next time someone asks you to "design ChatGPT," don't start with the load balancer. Start with the GPU. That's where the interesting tradeoffs live.

Got thoughts on this post?

I'd love to hear from you. Reach out on any of these:

Want to learn by doing?

ByteLearn.dev has free courses with interactive quizzes for developers.

Browse courses โ†’
ยฉ 2026 ByteLearn.dev. Free courses for developers. ยท Privacy