Design a Web Crawler — System Design Interview Walkthrough

A web crawler is a distributed breadth-first traversal of the web graph — and the entire interview is about three constraints that make that BFS hard at scale: politeness (don't hammer a site), deduplication (don't crawl the same content twice), and progress (don't get trapped or starved). Nail the URL frontier design and those three constraints, and the rest is a standard pipeline. This applies the standard system design framework.

1. Requirements

Functional: start from seed URLs, fetch pages, extract links, and keep crawling outward; store page content for downstream use (e.g., a search index); re-crawl for freshness. Out of scope (say so): the search index/ranking itself, JavaScript rendering (mention it's a heavier variant).

Non-functional: scalable (billions of pages), polite (respect robots.txt and per-host rate limits), robust (handle timeouts, traps, malformed HTML), and fresh (re-crawl important pages over time). Add extensibility — the content type mix changes.

2. Scale (back-of-envelope)

To crawl ~1B pages/month at ~100 KB/page → ~100 TB/month of raw content (→ object storage), at ~400 pages/sec sustained. The numbers say two things: storage goes to a blob store, and the frontier and politeness scheduling are the real engineering, not the fetching.

3. Core components

URL Frontier — the heart: a queue of URLs to crawl, which must enforce priority (crawl important pages first) and politeness (cap requests per host).
Fetchers — a worker pool that downloads pages (respecting robots.txt + timeouts).
Parser / link extractor — pulls out links and content.
Dedup — drop already-seen URLs and already-seen content.
Storage — raw pages in an object store; metadata (URL → last-crawled, hash) in a DB.

4. The crux: the URL frontier

This is where the interview is won. The frontier is usually two layers of queues:

Front queues (priority): assign each URL a priority (e.g., by page importance/freshness) → route to a priority band.
Back queues (politeness): one logical queue per host, each with a timestamp of when that host may next be hit. A worker only pulls from a host queue whose "next allowed" time has passed.

This two-layer design lets you crawl aggressively overall while staying polite per host — naming the front/back-queue split (priority × politeness) is the senior signal. The frontier is distributed (partition hosts across nodes) so no single host's politeness limit throttles the whole crawl.

5. Deduplication (two kinds)

URL dedup: maintain a massive "seen URLs" set. Exact storage of billions of URLs is expensive, so a Bloom filter (probabilistic, tiny, allows rare false positives) in front of a durable store is the classic answer.
Content dedup: different URLs often serve identical content (mirrors, session params). Hash the page content (e.g., a fingerprint/SimHash for near-duplicates) and skip content you've already stored.

6. The fetch loop

Pull a ready URL from the frontier (host politeness satisfied).
Check robots.txt (cached per host); skip if disallowed.
Fetch with a timeout; handle non-200s, redirects, and retries with backoff.
Store raw content in the object store; record url → contentHash, fetchedAt.
Parse links; for each new link, run URL + content dedup, then enqueue into the frontier with a priority.

7. Bottlenecks & trade-offs to name unprompted

Politeness vs throughput: per-host limits cap speed for big sites → spread across many hosts in parallel; the back-queue design is exactly this trade.
Crawler traps: infinite calendars, session-id URLs, deep dynamic links → cap URL depth/length, normalize URLs, and detect content dedup to escape.
Freshness vs cost: you can't re-crawl everything constantly → prioritize re-crawl by change frequency/importance.
Bloom-filter false positives: a tiny chance of skipping a new URL — acceptable trade for the memory savings; back it with a durable set if exactness matters.
Robustness: timeouts, malformed HTML, huge pages, and rate-limit/ban responses all need graceful handling.

Why interviewers use this one

It moves the candidate into distributed scheduling territory — a priority-and-politeness queue, probabilistic dedup, and robustness — rather than CRUD. It's the same 7-step framework; the crux just shifts to the frontier and dedup instead of the database (as in the URL shortener) or fan-out (as in Instagram).

Written by Amit Singh — Senior SDE at Amazon, Claude Certified Architect, and founder of AlgoEngineer. We run live mock system-design interviews on problems like this in our System Design course.