System Design Notes

Web Crawler Design

If you have a major software engineering interview coming up, one of the most popular system design questions you should prepare for is how to build a web crawler. Also known as a spider, spiderbot, or simply crawler, a web crawler is a preliminary step in most applications that draw on multiple sources across the World Wide Web. Google and the other search engines all use web crawlers to collect data from websites.

What Is A Web Crawler?

A web crawler (the process itself is called web crawling or web indexing) is a program that collects web pages from the internet and stores them, making them easier to access. Fed an initial set of reference pages, or 'seed URLs', it indexes the web links on those pages. Next, the indexed pages are traversed and the links within them are extracted for further traversal. The crawler discovers new links by recursively visiting and indexing the links found on pages it has already indexed.
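The loop described above can be sketched in a few lines. This is a minimal breadth-first version, not a production design: the `fetch` function is injected (so it could be a real HTTP client or a stub), and links are pulled out with a simple regex rather than a proper HTML parser.

```python
from collections import deque
import re
from urllib.parse import urljoin, urlparse

def extract_links(base_url, html):
    """Pull absolute http(s) URLs out of a page's href attributes.
    (A real crawler would use a proper HTML parser.)"""
    links = set()
    for href in re.findall(r'href="([^"]+)"', html):
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            links.add(absolute)
    return links

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns the page body, or None
    if the page is unreachable."""
    frontier = deque(seed_urls)      # URLs waiting to be fetched
    seen = set(seed_urls)            # URLs already discovered
    pages = {}                       # url -> fetched content
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue                 # skip unreachable pages
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:     # don't queue a URL twice
                seen.add(link)
                frontier.append(link)
    return pages
```

Because `fetch` is a parameter, the same loop can be exercised against an in-memory "web" (a dict of URL to HTML) before being pointed at a real HTTP client.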

Most Popular Applications

Search engines, including Google, DuckDuckGo, Bing, and Yahoo, run their own web crawlers to find and display pages. Other applications include price comparison portals, data mining, malware detection, and web analysis tools. Automated website maintenance also relies on web crawlers.

What Features Are Involved?

Before actually building a crawler, there are some considerations involved. Let's take a look:

  1. Crawling Frequency: Also known as the crawl rate, this is how often you want to crawl a website. You can use different crawl rates for different websites. For example, news websites might need to be crawled more often than mostly static ones.
  2. Dedup: When multiple crawlers run in parallel, they may add duplicate links to the shared URL pool. Dedup, or duplicate detection, uses a space-efficient structure such as a Bloom filter to detect duplicate links, so your crawler isn't fetching the same pages repeatedly.
  3. Protocols: Think about which protocols your crawler will support. A basic crawler handles HTTP(S) links, but you can also extend the application to work over SMTP or FTP.
  4. Capacity: Each crawled page yields several more URLs to index. Assume an estimate of around 50 billion pages and an average page size of 100 KB:

    50 × 10⁹ pages × 100 KB/page = 5 × 10¹⁵ bytes = 5 petabytes

    So you would need on the order of 5 petabytes of storage to hold the crawled content.

    You can compress the documents to save storage, since you won't need the full content on every access. For some applications, such as search engines, you may only need to extract the metadata before compressing a page. When you do need the entire content, you can read it from the cached copy.
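To make the dedup point concrete, here is a small Bloom filter sketch. The sizing (2²⁰ bits, 5 hash functions) and the double-hashing construction over a SHA-256 digest are illustrative choices, not a recommendation; real deployments size the filter from the expected URL count and an acceptable false-positive rate.

```python
import hashlib

class BloomFilter:
    """Space-efficient, probabilistic set membership: false positives
    are possible, false negatives are not."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from two halves of one digest
        # (double hashing) -- a common, simple construction.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))
```

A "no" answer from the filter is always trustworthy, so a URL that tests negative is safely new; the occasional false "yes" only means a page is skipped that could have been crawled, which is an acceptable trade for the memory saved.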

Design Diagram

As you can see in the system design diagram, the loop is initiated by a set of 'seed URLs' fed into the URL frontier. The URL frontier applies algorithms to build URL queues based on certain constraints, prioritization and politeness, which we'll discuss in more detail later in the post.

Module 3, the URL fetcher, takes the queued URLs one by one, obtains the address for each from the DNS resolver, and downloads the content of that page. The content is cached by module 5 for easier access by the processors. It is also compressed and stored after passing the document de-dup test at module 6, which checks whether the same content has already been crawled.
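The document de-dup test at module 6 can be as simple as fingerprinting each page body and refusing to store a body seen before. This sketch does exact-match dedup via SHA-256; near-duplicate detection (e.g. SimHash) is a common refinement not shown here.

```python
import hashlib

class DocumentDedup:
    """Exact-match document de-dup: remember a fingerprint of every
    stored page and reject bodies that were already seen."""
    def __init__(self):
        self.fingerprints = set()

    def is_new(self, content: bytes) -> bool:
        digest = hashlib.sha256(content).digest()
        if digest in self.fingerprints:
            return False             # identical body already stored
        self.fingerprints.add(digest)
        return True
```

Storing 32-byte digests instead of whole pages keeps the test's memory cost a tiny fraction of the multi-petabyte document store it protects.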

The cached pages are then processed through modules 8a, 8b and 8c: all the links on a page are extracted, filtered according to the supported protocols, and passed through the URL de-dup test, which checks whether multiple URLs point to the same document and discards the repetitions. The resulting set of unique URLs is fed back into the URL frontier for the next crawl cycle.
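The filter-and-dedup step for extracted URLs can be sketched as a normalization pass plus a seen-set. The normalization rules here (lowercase scheme and host, drop the fragment, default the path to "/") and the allowed-scheme policy are illustrative assumptions; real crawlers apply richer canonicalization.

```python
from urllib.parse import urlsplit, urlunsplit

ALLOWED_SCHEMES = {"http", "https"}   # protocol filter (assumed policy)

def normalize(url):
    """Canonicalize a URL so trivially different spellings of the same
    address compare equal: lowercase scheme/host, drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def filter_and_dedup(candidates, seen):
    """Keep only allowed, not-yet-seen URLs; record them in `seen`."""
    unique = []
    for url in candidates:
        canon = normalize(url)
        if urlsplit(canon).scheme not in ALLOWED_SCHEMES:
            continue                  # unsupported protocol
        if canon in seen:
            continue                  # duplicate of an earlier URL
        seen.add(canon)
        unique.append(canon)
    return unique
```

The `seen` set is passed in so it can persist across crawl cycles; at scale it would be replaced by something like the Bloom filter discussed earlier.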
