A rate limiter interview hinges on two answers: which algorithm you pick (and why), and how you keep the count correct across many servers. Get those right and you've shown the depth they're probing. This applies the standard system design framework to a deceptively small problem.
A rate limiter caps how many requests a client can make in a window (e.g., 100 requests/minute per user). It protects services from abuse, accidental overload, and runaway costs — and it's a favorite because it's small enough to finish yet rich enough to expose whether you understand distributed state.
1. Requirements
Functional: allow N requests per client per time window; reject excess with HTTP 429 Too Many Requests (and a Retry-After header). Non-functional: low latency (it sits in front of every request — it must be fast), highly available (if the limiter is down, decide: fail-open for availability or fail-closed for protection — call this out), and accurate enough across a distributed fleet.
Clarify the key: per user? per IP? per API key? per endpoint? It changes the design surface, so ask.
2. The algorithm (the crux)
| Algorithm | How it works | Trade-off |
|---|---|---|
| Fixed window counter | Count requests per fixed clock window (e.g., per minute) | Simple, but allows a 2× burst at the window boundary |
| Sliding window log | Store a timestamp per request; count those within the last window | Exact, but memory-heavy (one entry per request) |
| Sliding window counter | Weighted blend of current + previous fixed window | Smooths the boundary burst; small memory; a great default |
| Token bucket | Tokens refill at a fixed rate; each request spends one; empty = reject | Allows controlled bursts; O(1) memory; the usual interview answer |
| Leaky bucket | Requests queue and drain at a constant rate | Smooths output, but adds queuing/latency |
Lead with token bucket for most APIs: it's O(1) state per client (just tokens + lastRefillTime), naturally allows short bursts, and is what most production limiters (and interviewers) expect. Mention sliding window counter as the alternative when you want to forbid bursts. Knowing why fixed-window is flawed (boundary burst) is the detail that signals depth.
3. Where it lives
At the edge / API gateway or as middleware before your application logic — you want to reject excess traffic before it consumes real resources. In a microservices setup, a shared gateway limiter is common; a sidecar/library limiter is the alternative.
4. The hard part: distributed correctness
A single server can rate-limit in memory trivially. The interview's real question is: with 50 app servers, how do you enforce one global limit per user?
- Centralized store (Redis): keep each client's counter/tokens in Redis so all servers share one source of truth. The catch is atomicity — a naive read-modify-write races under concurrency and lets requests slip through. Fix it with an atomic operation:
INCR+EXPIREfor counters, or a small Lua script (or Redis functions) that checks-and-decrements tokens atomically in one round trip. - Latency vs accuracy trade-off: a Redis hop per request adds latency. For extreme scale, allow a slightly approximate limit — each server keeps a local token allowance synced periodically — trading perfect accuracy for speed. Name the trade.
- Availability: if Redis is unreachable, fail-open (serve traffic, lose limiting) is usually safer than failing closed and taking an outage — but say it's a product decision.
5. Bottlenecks & trade-offs to name unprompted
- Boundary burst (fixed window) → sliding window counter or token bucket.
- Race conditions on the shared counter → atomic INCR / Lua script, never read-then-write.
- Hot keys (one abusive client) → still O(1) per key; Redis handles it, but mention sharding for extreme cases.
- Clock skew across servers → prefer the store's time or token-bucket math over per-server wall clocks.
- Response semantics → return 429 +
Retry-Afterso well-behaved clients back off.
Why it shows up so often
It's compact but it forces a real algorithm choice and a genuine distributed-systems problem (atomic shared state). Master the token-bucket-on-Redis answer and you've covered the version asked in the vast majority of interviews. For the broader method, see the framework; for a fuller read-heavy example, the URL shortener.
Written by Amit Singh — Senior SDE at Amazon, Claude Certified Architect, and founder of AlgoEngineer. We drill these with live mock interviews in our System Design course.