If you have a major software engineering interview coming up, one of the most popular system design questions you should be preparing for is "how to build a web crawler". Also known as a spider or spiderbot, a web crawler is a preliminary component of most applications that draw on sources across the World Wide Web. It is no wonder that Google and other search engines rely on web crawlers to collect data from websites.
A web crawler is a program that performs web crawling, or web indexing: it collects web pages from the internet and stores them, making them easier to access. Once it is fed an initial set of reference pages, or 'seed URLs', it indexes the web links on those pages. The indexed pages are then traversed and the links within them are extracted for further traversal. In this way, the crawler discovers new pages by recursively visiting and indexing the links found in pages it has already indexed.
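To make that loop concrete, here is a minimal sketch of the recursive crawl in Python. It assumes the third-party `requests` library for fetching pages and uses a simple regex for link extraction; a real crawler would replace both with the dedicated components described later in this post.

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests  # third-party HTTP client, assumed to be installed

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # URLs waiting to be visited
    visited = set()               # URLs already indexed

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        visited.add(url)
        # Extract href targets and resolve relative links against the current page.
        for link in re.findall(r'href="([^"]+)"', page):
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)
    return visited
```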
Search engines, including Google, DuckDuckGo, Bing, and Yahoo, have their own web crawlers to find and display pages. Other applications include price comparison portals, data mining, malware detection, and web analysis tools. Automated maintenance of websites will also require a web crawler.
Before actually building a crawler, there are some considerations involved. Let's take a look:
Capacity: Each crawled page will contain several URLs to index. Assume an estimate of around 50 billion pages and an average page size of 100 KB:

50 billion x 100 KB = 5 petabytes

You would need around 5 petabytes of storage, give or take a couple of petabytes, to hold the information on the web.
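The same back-of-the-envelope arithmetic, written out in a few lines of Python:

```python
pages = 50 * 10**9            # ~50 billion pages
avg_page_size = 100 * 10**3   # ~100 KB per page, in bytes

total_bytes = pages * avg_page_size
print(total_bytes / 10**15, "petabytes")   # -> 5.0 petabytes
```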
You can compress the documents to save storage, since you won't need to refer to the full content every time. For certain applications, such as search engines, you may only need to extract the metadata before compressing a page. When you do need the entire content of the page, you can retrieve it from the cached, compressed file.
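As an illustration, a stored page record might keep lightweight metadata uncompressed for fast lookup and compress the full HTML using Python's built-in `zlib`. The field names below are assumptions made for this sketch, not a fixed schema.

```python
import hashlib
import zlib

def store_page(url, html, store):
    """Keep lightweight metadata uncompressed; compress the full HTML for archival."""
    body = html.encode("utf-8")
    store[url] = {
        "url": url,
        "size": len(body),
        "checksum": hashlib.sha256(body).hexdigest(),  # reusable for de-dup checks
        "content": zlib.compress(body),                # compressed cached copy
    }

# Usage (illustrative): pages = {}; store_page("https://example.com", html_text, pages)
```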
As you can see in the system design diagram, the loop is initiated by a set of 'seed URLs' that is created and fed into the URL frontier. The URL frontier applies algorithms to build URL queues based on certain constraints, namely prioritization and politeness, which we'll discuss in detail later in the post.
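A toy version of such a frontier could keep one FIFO queue per host and enforce a politeness delay between requests to the same host, with prioritization handled crudely by letting important URLs jump to the front of their queue. The class and parameter names below are illustrative; production frontiers (for example, the Mercator-style design) are considerably more elaborate.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class URLFrontier:
    """Toy frontier: one FIFO queue per host, plus a per-host politeness delay."""

    def __init__(self, politeness_delay=1.0):
        self.delay = politeness_delay      # seconds between requests to one host
        self.host_queues = {}              # host -> deque of pending URLs
        self.ready_heap = []               # (earliest_fetch_time, host)

    def add(self, url, high_priority=False):
        host = urlparse(url).netloc
        if host not in self.host_queues:
            self.host_queues[host] = deque()
            heapq.heappush(self.ready_heap, (time.time(), host))
        # Prioritization, crudely: important URLs jump to the front of their queue.
        if high_priority:
            self.host_queues[host].appendleft(url)
        else:
            self.host_queues[host].append(url)

    def next_url(self):
        while self.ready_heap:
            fetch_time, host = heapq.heappop(self.ready_heap)
            queue = self.host_queues.get(host)
            if not queue:
                continue
            # Politeness: wait until this host may be contacted again.
            wait = fetch_time - time.time()
            if wait > 0:
                time.sleep(wait)
            url = queue.popleft()
            if queue:
                heapq.heappush(self.ready_heap, (time.time() + self.delay, host))
            else:
                del self.host_queues[host]
            return url
        return None                        # frontier is empty
```

The crawl loop would then call `add()` for every newly discovered URL and `next_url()` whenever the fetcher is ready for more work.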
Module 3, the URL fetcher, takes the URLs waiting in the queue one by one, obtains the corresponding IP address from the DNS resolver, and downloads the content of that page. The content is cached by module 5 for easier access by the processors. It is also compressed and stored after going through the document de-dup test at module 6, which checks whether the content has already been crawled.
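One simple way to implement the document de-dup test is to hash each downloaded page and skip content whose hash has already been seen. The sketch below uses an exact SHA-256 checksum; real crawlers often use fingerprints such as simhash to catch near-duplicates as well.

```python
import hashlib

seen_checksums = set()

def is_duplicate_document(content: bytes) -> bool:
    """Return True if this exact content has been crawled before."""
    checksum = hashlib.sha256(content).hexdigest()
    if checksum in seen_checksums:
        return True
    seen_checksums.add(checksum)
    return False
```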
The cached pages are then processed, passing through modules 8a, 8b, and 8c. All the links on each page are extracted, filtered according to certain protocols, and passed through the URL de-dup test, which checks whether multiple URLs point to the same document and discards repetitions. The unique set of URLs produced by the processing modules is fed back to the URL frontier for the next crawl cycle.
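The URL de-dup test can be sketched as normalizing each extracted URL and filtering it against a set of URLs already seen. The normalization rules below are only illustrative; production crawlers apply many more.

```python
from urllib.parse import urlparse, urlunparse

seen_urls = set()

def normalize(url: str) -> str:
    """Collapse trivially different forms of the same address into one key."""
    parts = urlparse(url)
    # Lower-case scheme and host, drop fragments, strip trailing slash from the path.
    path = parts.path.rstrip("/") or "/"
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(), path,
                       parts.params, parts.query, ""))

def dedup_urls(extracted_urls):
    """Return only the URLs that have not been seen before."""
    fresh = []
    for url in extracted_urls:
        key = normalize(url)
        if key not in seen_urls:
            seen_urls.add(key)
            fresh.append(url)
    return fresh
```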