The thing that makes a chat system hard is real-time delivery: how do you get a message to a recipient instantly when they're online, reliably when they're offline, and to everyone when it's a group? Persistent connections (WebSocket) plus a connection registry and a durable message store answer all three. Everything else (receipts, ordering, history) hangs off that. This walkthrough applies the standard system design framework to a real-time problem.
Unlike a URL shortener (request/response, read-heavy), chat is stateful and push-based — the server has to reach out to clients. That's the whole challenge.
1. Requirements
Functional: 1:1 messaging, group messaging, online/last-seen presence, delivery + read receipts, message history, and push notifications when the recipient is offline. Out of scope (say so explicitly): voice/video calls and media handling. For media, mention that files are stored in an object store + CDN like any blob, and only the URL is sent in the message.
Non-functional: low latency (messages feel instant), highly available, durable (never lose a message), ordered within a conversation, and able to handle hundreds of millions of concurrent connections.
2. The core challenge: connections
HTTP request/response can't push. So clients hold a persistent WebSocket to a fleet of connection (gateway) servers. The problem this creates: when User A sends to User B, which server holds B's connection?
- Maintain a session registry (e.g., Redis): userId → {gatewayServerId, connectionId}, updated on connect/disconnect.
- To deliver, look up B's gateway and route the message there (server-to-server), which pushes it down B's socket.
This connection layer is the heart of the design — call it out first.
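To make the registry concrete, here is a minimal sketch of the userId → {gatewayServerId, connectionId} mapping. It uses a plain in-process dict as a stand-in for Redis, and the names `Session` and `SessionRegistry` are hypothetical. The one design choice worth noticing is the guard in `on_disconnect`, which is an assumption about how you might stop a late disconnect event from wiping a newer session.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Session:
    gateway_server_id: str
    connection_id: str


class SessionRegistry:
    """In-memory stand-in for a Redis-backed registry (hypothetical names)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def on_connect(self, user_id: str, gateway_server_id: str, connection_id: str) -> None:
        # Latest connection wins in this sketch; a real system may track
        # several devices per user.
        self._sessions[user_id] = Session(gateway_server_id, connection_id)

    def on_disconnect(self, user_id: str, connection_id: str) -> None:
        # Remove the entry only if it still belongs to the closing connection,
        # so a stale disconnect cannot erase a newer session.
        current = self._sessions.get(user_id)
        if current is not None and current.connection_id == connection_id:
            del self._sessions[user_id]

    def lookup(self, user_id: str) -> Optional[Session]:
        # None means the user has no live connection (treat as offline).
        return self._sessions.get(user_id)
```

A `lookup` that returns `None` is the signal to fall back to the offline path.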
3. Sending a message (the happy path)
- A's client sends the message over its WebSocket to its gateway.
- The gateway persists the message (durability before delivery — so a crash doesn't lose it) and assigns a per-conversation sequence number.
- Look up B in the session registry.
- B online: route to B's gateway → push down B's socket → B's client acks → mark delivered.
- B offline: the message is already stored; enqueue a push notification (APNs/FCM). B pulls undelivered messages on reconnect.
- Receipts (sent → delivered → read) are just small ack messages flowing back the same way.
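The steps above can be sketched in a few lines. This is a sketch under stated assumptions, not a production design: the store is an in-memory dict, the `online` map stands in for the session registry, and `ChatService`, `push_to_socket`, and `enqueue_push_notification` are hypothetical names. What it shows is the order of operations: assign a sequence number, persist, then either push to the recipient's gateway or enqueue a notification.

```python
import itertools
import uuid
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Message:
    message_id: str
    conversation_id: str
    sender_id: str
    body: str
    seq: int
    status: str = "sent"


class ChatService:
    def __init__(
        self,
        push_to_socket: Callable[[str, str, Message], None],
        enqueue_push_notification: Callable[[str, Message], None],
    ) -> None:
        self.store: Dict[str, List[Message]] = defaultdict(list)  # durable store stand-in
        self.online: Dict[str, str] = {}  # user_id -> gateway id (registry stand-in)
        self._counters = defaultdict(lambda: itertools.count(1))
        self._push_to_socket = push_to_socket
        self._enqueue_push = enqueue_push_notification

    def send(self, conversation_id: str, sender_id: str, recipient_id: str, body: str) -> Message:
        # Per-conversation sequence number, assigned server-side.
        seq = next(self._counters[conversation_id])
        msg = Message(str(uuid.uuid4()), conversation_id, sender_id, body, seq)
        # Persist before any delivery attempt.
        self.store[conversation_id].append(msg)
        gateway = self.online.get(recipient_id)
        if gateway is not None:
            self._push_to_socket(gateway, recipient_id, msg)
        else:
            # Already stored; the recipient pulls it on reconnect.
            self._enqueue_push(recipient_id, msg)
        return msg

    def ack_delivered(self, conversation_id: str, message_id: str) -> None:
        # Called when the recipient's client acks the push.
        for msg in self.store[conversation_id]:
            if msg.message_id == message_id:
                msg.status = "delivered"
```

Because the write happens before the registry lookup, the offline branch needs no extra storage step.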
4. Data model & storage
Chat is write-heavy (every message is a write) with time-ordered reads per conversation — a textbook fit for a wide-column store (Cassandra/HBase):
| Data | Store | Key |
|---|---|---|
| Messages | Wide-column (Cassandra) | partition by conversationId, clustered by sequence/timestamp |
| Session registry | In-memory (Redis) | userId → gateway/connection |
| User/group metadata | Relational or KV | userId, groupId → members |
Partitioning messages by conversationId keeps a conversation's history together and time-ordered, which is exactly the read pattern.
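To see why that key layout matches the read pattern, here is a toy in-memory model of it. It is not Cassandra code: `MessageTable` is a hypothetical class where the dict key plays the role of the partition key (conversationId) and each partition is kept sorted by sequence number, the way a clustering column would. Fetching history after a known sequence number touches one partition and one contiguous range.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (seq, body)


class MessageTable:
    """Toy model: partition by conversation_id, rows sorted by seq."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Row]] = defaultdict(list)

    def insert(self, conversation_id: str, seq: int, body: str) -> None:
        # insort keeps the partition ordered by seq even if writes arrive out of order.
        bisect.insort(self._partitions[conversation_id], (seq, body))

    def read_after(self, conversation_id: str, after_seq: int, limit: int = 50) -> List[Row]:
        rows = self._partitions.get(conversation_id, [])
        # (after_seq + 1,) sorts before any (after_seq + 1, body) row, so this
        # finds the first row whose seq is strictly greater than after_seq.
        start = bisect.bisect_left(rows, (after_seq + 1,))
        return rows[start:start + limit]
```

The same call serves both "load recent history" and "pull what I missed while offline": the client only has to remember the last sequence number it saw.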
5. Ordering & delivery semantics
- Ordering: guarantee it per conversation with a monotonic sequence number assigned server-side; don't trust client clocks. Global ordering across all conversations isn't needed.
- Delivery: aim for at-least-once + idempotent client dedup (each message has a unique id), which is simpler and safer than exactly-once. The client drops duplicates by id.
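On the client side, at-least-once delivery plus dedup can be as small as the sketch below. `ConversationView` is a hypothetical name, and the sketch assumes every message carries a unique id and the server-assigned sequence number described above. Duplicates are dropped by id, and rendering sorts by sequence number rather than arrival order.

```python
from typing import Dict, List, Set


class ConversationView:
    """Client-side view of one conversation (hypothetical sketch)."""

    def __init__(self) -> None:
        self._seen_ids: Set[str] = set()
        self._by_seq: Dict[int, str] = {}

    def receive(self, message_id: str, seq: int, body: str) -> bool:
        """Return True if the message was new, False if it was a duplicate."""
        if message_id in self._seen_ids:
            # Redelivery after a lost ack: safe to ignore.
            return False
        self._seen_ids.add(message_id)
        self._by_seq[seq] = body
        return True

    def render(self) -> List[str]:
        # Order by the server-assigned sequence number, not arrival time.
        return [self._by_seq[seq] for seq in sorted(self._by_seq)]
```

With this in place the server can retry freely whenever an ack goes missing, which is what makes at-least-once the simpler guarantee to build.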
6. Group messaging (the fan-out)
A group message is delivered to every member. For small groups (WhatsApp caps group size), fan out on send: write once, then deliver to each online member via their gateway and store-for-later for offline members. For very large broadcast groups you'd shift toward a pull model — but for typical group sizes, fan-out-on-send is fine. Naming the size threshold is the senior signal.
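A minimal sketch of fan-out on send follows, assuming the message has already been written once to the store. The function name, the callbacks, and especially `FAN_OUT_ON_SEND_LIMIT` are hypothetical: the value 1024 is an arbitrary placeholder to show where a threshold would live, not a figure from any real product.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Arbitrary placeholder threshold, NOT a number from the article or any real product.
FAN_OUT_ON_SEND_LIMIT = 1024


def fan_out_on_send(
    message: str,
    members: Iterable[str],
    sender_id: str,
    registry: Dict[str, str],  # user_id -> gateway id for online users
    deliver: Callable[[str, str, str], None],
    store_for_later: Callable[[str, str], None],
) -> Tuple[List[str], List[str]]:
    """Deliver to online members now; record offline members for later pull."""
    member_list = list(members)
    if len(member_list) > FAN_OUT_ON_SEND_LIMIT:
        # Past the threshold, a pull model is the better fit.
        raise ValueError("group too large for fan-out on send")
    online: List[str] = []
    offline: List[str] = []
    for user_id in member_list:
        if user_id == sender_id:
            continue  # the sender already has the message
        gateway = registry.get(user_id)
        if gateway is not None:
            deliver(gateway, user_id, message)
            online.append(user_id)
        else:
            store_for_later(user_id, message)
            offline.append(user_id)
    return online, offline
```

The cost is one delivery attempt per member per message, which is why the size cap matters.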
7. Bottlenecks & trade-offs to name unprompted
- Connection scale: hundreds of millions of long-lived sockets → many gateway servers + a fast session registry; connections are the main capacity constraint, not CPU.
- Durability vs latency: persist-before-deliver guarantees no message loss at the cost of a write on the hot path; this is the right trade for a messaging app.
- Presence cost: true real-time presence for everyone is expensive; most systems use periodic heartbeats and show "last seen" rather than exact status.
- Thundering herd on reconnect: when a gateway dies, its clients reconnect en masse — spread them across servers and rehydrate undelivered messages lazily.
- Hot conversations: a huge group is a hot partition; cap group size or shard.
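The presence trade-off is easy to show in code. Below is a sketch of heartbeat-based presence, assuming clients send a heartbeat periodically and a user counts as online if one arrived within a timeout. `PresenceTracker` and the 30-second default are hypothetical choices for illustration; the clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Dict, Optional


class PresenceTracker:
    """Heartbeat-based presence with a 'last seen' fallback (hypothetical sketch)."""

    def __init__(
        self,
        timeout_seconds: float = 30.0,  # assumed value, tune per product
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_heartbeat: Dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        # One cheap write per interval instead of pushing every status change.
        self._last_heartbeat[user_id] = self._clock()

    def is_online(self, user_id: str) -> bool:
        last = self._last_heartbeat.get(user_id)
        return last is not None and self._clock() - last <= self._timeout

    def seconds_since_last_seen(self, user_id: str) -> Optional[float]:
        # None means no heartbeat was ever recorded for this user.
        last = self._last_heartbeat.get(user_id)
        return None if last is None else self._clock() - last
```

Status is computed on read, so nothing is broadcast when a user goes quiet; they simply age past the timeout and the UI falls back to "last seen".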
Why interviewers use this one
It moves you off the comfortable request/response model into stateful, push-based, real-time territory — connection management, delivery guarantees, and ordering — which a CRUD-only candidate hasn't thought about. It's the same 7-step framework, applied to a problem where the server must reach the client.
Written by Amit Singh — Senior SDE at Amazon, Claude Certified Architect, and founder of AlgoEngineer. We run live mock system-design interviews on exactly these problems in our System Design course.