Back to Blog
System DesignInterview TipsDistributed SystemsFAANG

Design a Rate Limiter — System Design Interview Walkthrough

Amit Singh

Amit Singh

Author

June 25, 2026
11 min read

A rate limiter interview hinges on two answers: which algorithm you pick (and why), and how you keep the count correct across many servers. Get those right and you've shown the depth they're probing. This applies the standard system design framework to a deceptively small problem.

A rate limiter caps how many requests a client can make in a window (e.g., 100 requests/minute per user). It protects services from abuse, accidental overload, and runaway costs — and it's a favorite because it's small enough to finish yet rich enough to expose whether you understand distributed state.

1. Requirements

Functional: allow N requests per client per time window; reject excess with HTTP 429 Too Many Requests (and a Retry-After header). Non-functional: low latency (it sits in front of every request — it must be fast), highly available (if the limiter is down, decide: fail-open for availability or fail-closed for protection — call this out), and accurate enough across a distributed fleet.

Clarify the key: per user? per IP? per API key? per endpoint? It changes the design surface, so ask.

2. The algorithm (the crux)

AlgorithmHow it worksTrade-off
Fixed window counterCount requests per fixed clock window (e.g., per minute)Simple, but allows a 2× burst at the window boundary
Sliding window logStore a timestamp per request; count those within the last windowExact, but memory-heavy (one entry per request)
Sliding window counterWeighted blend of current + previous fixed windowSmooths the boundary burst; small memory; a great default
Token bucketTokens refill at a fixed rate; each request spends one; empty = rejectAllows controlled bursts; O(1) memory; the usual interview answer
Leaky bucketRequests queue and drain at a constant rateSmooths output, but adds queuing/latency

Lead with token bucket for most APIs: it's O(1) state per client (just tokens + lastRefillTime), naturally allows short bursts, and is what most production limiters (and interviewers) expect. Mention sliding window counter as the alternative when you want to forbid bursts. Knowing why fixed-window is flawed (boundary burst) is the detail that signals depth.

3. Where it lives

At the edge / API gateway or as middleware before your application logic — you want to reject excess traffic before it consumes real resources. In a microservices setup, a shared gateway limiter is common; a sidecar/library limiter is the alternative.

4. The hard part: distributed correctness

A single server can rate-limit in memory trivially. The interview's real question is: with 50 app servers, how do you enforce one global limit per user?

  • Centralized store (Redis): keep each client's counter/tokens in Redis so all servers share one source of truth. The catch is atomicity — a naive read-modify-write races under concurrency and lets requests slip through. Fix it with an atomic operation: INCR + EXPIRE for counters, or a small Lua script (or Redis functions) that checks-and-decrements tokens atomically in one round trip.
  • Latency vs accuracy trade-off: a Redis hop per request adds latency. For extreme scale, allow a slightly approximate limit — each server keeps a local token allowance synced periodically — trading perfect accuracy for speed. Name the trade.
  • Availability: if Redis is unreachable, fail-open (serve traffic, lose limiting) is usually safer than failing closed and taking an outage — but say it's a product decision.

5. Bottlenecks & trade-offs to name unprompted

  • Boundary burst (fixed window) → sliding window counter or token bucket.
  • Race conditions on the shared counter → atomic INCR / Lua script, never read-then-write.
  • Hot keys (one abusive client) → still O(1) per key; Redis handles it, but mention sharding for extreme cases.
  • Clock skew across servers → prefer the store's time or token-bucket math over per-server wall clocks.
  • Response semantics → return 429 + Retry-After so well-behaved clients back off.

Why it shows up so often

It's compact but it forces a real algorithm choice and a genuine distributed-systems problem (atomic shared state). Master the token-bucket-on-Redis answer and you've covered the version asked in the vast majority of interviews. For the broader method, see the framework; for a fuller read-heavy example, the URL shortener.


Written by Amit Singh — Senior SDE at Amazon, Claude Certified Architect, and founder of AlgoEngineer. We drill these with live mock interviews in our System Design course.

Ready to Ace Your Interviews?

Join thousands of students who have successfully landed their dream jobs at FAANG companies.