A web crawler is a distributed breadth-first traversal of the web graph — and the entire interview is about three constraints that make that BFS hard at scale: politeness (don't hammer a site), deduplication (don't crawl the same content twice), and progress (don't get trapped or starved). Nail the URL frontier design and those three constraints, and the rest is a standard pipeline. This applies the standard system design framework.
1. Requirements
Functional: start from seed URLs, fetch pages, extract links, and keep crawling outward; store page content for downstream use (e.g., a search index); re-crawl for freshness. Out of scope (say so): the search index/ranking itself, JavaScript rendering (mention it's a heavier variant).
Non-functional: scalable (billions of pages), polite (respect robots.txt and per-host rate limits), robust (handle timeouts, traps, malformed HTML), and fresh (re-crawl important pages over time). Add extensibility — the content type mix changes.
2. Scale (back-of-envelope)
To crawl ~1B pages/month at ~100 KB/page → ~100 TB/month of raw content (→ object storage), at ~400 pages/sec sustained. The numbers say two things: storage goes to a blob store, and the frontier and politeness scheduling are the real engineering, not the fetching.
3. Core components
- URL Frontier — the heart: a queue of URLs to crawl, which must enforce priority (crawl important pages first) and politeness (cap requests per host).
- Fetchers — a worker pool that downloads pages (respecting
robots.txt+ timeouts). - Parser / link extractor — pulls out links and content.
- Dedup — drop already-seen URLs and already-seen content.
- Storage — raw pages in an object store; metadata (URL → last-crawled, hash) in a DB.
4. The crux: the URL frontier
This is where the interview is won. The frontier is usually two layers of queues:
- Front queues (priority): assign each URL a priority (e.g., by page importance/freshness) → route to a priority band.
- Back queues (politeness): one logical queue per host, each with a timestamp of when that host may next be hit. A worker only pulls from a host queue whose "next allowed" time has passed.
This two-layer design lets you crawl aggressively overall while staying polite per host — naming the front/back-queue split (priority × politeness) is the senior signal. The frontier is distributed (partition hosts across nodes) so no single host's politeness limit throttles the whole crawl.
5. Deduplication (two kinds)
- URL dedup: maintain a massive "seen URLs" set. Exact storage of billions of URLs is expensive, so a Bloom filter (probabilistic, tiny, allows rare false positives) in front of a durable store is the classic answer.
- Content dedup: different URLs often serve identical content (mirrors, session params). Hash the page content (e.g., a fingerprint/SimHash for near-duplicates) and skip content you've already stored.
6. The fetch loop
- Pull a ready URL from the frontier (host politeness satisfied).
- Check
robots.txt(cached per host); skip if disallowed. - Fetch with a timeout; handle non-200s, redirects, and retries with backoff.
- Store raw content in the object store; record
url → contentHash, fetchedAt. - Parse links; for each new link, run URL + content dedup, then enqueue into the frontier with a priority.
7. Bottlenecks & trade-offs to name unprompted
- Politeness vs throughput: per-host limits cap speed for big sites → spread across many hosts in parallel; the back-queue design is exactly this trade.
- Crawler traps: infinite calendars, session-id URLs, deep dynamic links → cap URL depth/length, normalize URLs, and detect content dedup to escape.
- Freshness vs cost: you can't re-crawl everything constantly → prioritize re-crawl by change frequency/importance.
- Bloom-filter false positives: a tiny chance of skipping a new URL — acceptable trade for the memory savings; back it with a durable set if exactness matters.
- Robustness: timeouts, malformed HTML, huge pages, and rate-limit/ban responses all need graceful handling.
Why interviewers use this one
It moves the candidate into distributed scheduling territory — a priority-and-politeness queue, probabilistic dedup, and robustness — rather than CRUD. It's the same 7-step framework; the crux just shifts to the frontier and dedup instead of the database (as in the URL shortener) or fan-out (as in Instagram).
Written by Amit Singh — Senior SDE at Amazon, Claude Certified Architect, and founder of AlgoEngineer. We run live mock system-design interviews on problems like this in our System Design course.