To understand the principles of entry-level crawler technology, reading this article is enough

General-purpose search engines process web pages on the Internet. At present, the number of web pages has reached tens of billions. The first problem a search engine faces is therefore how to design an efficient download system that transfers this massive volume of web page data to local storage, forming a local mirror backup of the Internet's web pages.

A web crawler plays exactly this role and completes this arduous task. It is a critical and fundamental component of a search engine system.

This article mainly introduces the technologies related to web crawlers. Although the overall framework of crawlers has become relatively mature after decades of development, the continuing evolution of the Internet also presents them with new challenges.

2. General crawler technology framework

The crawler then hands it, together with the relative path of the web page, to the page downloader, which is responsible for downloading the page.
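To request a page, a downloader needs to know which server to contact and which path on that server to ask for. The following is a minimal sketch of how a URL could be split into those two pieces with Python's standard `urllib.parse`. The function name `split_for_downloader` and the example URL are hypothetical and used only for illustration; the article does not say how its downloader receives this information.

```python
from urllib.parse import urlsplit


def split_for_downloader(url):
    """Split an absolute URL into (host, relative path).

    Hypothetical helper: the relative path includes the query string,
    and an empty path is treated as the site root "/".
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.hostname, path


host, path = split_for_downloader("http://example.com/news/index.html?page=2")
print(host, path)  # prints: example.com /news/index.html?page=2
```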

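A web page that has been downloaded to the local machine is handled in two ways. On the one hand, it is stored in the page library, where it waits for subsequent processing such as indexing. On the other hand, its URL is placed in the downloaded URL queue, so that the system does not crawl the same page again.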
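The bookkeeping for a downloaded page can be sketched as follows. This is an in-memory illustration only, not the storage layer of any real crawler: the class name `PageStore` and its methods are hypothetical, and a real page library would live on disk or in a database.

```python
class PageStore:
    """In-memory stand-in for the page library and the downloaded URL queue."""

    def __init__(self):
        self.page_library = {}        # URL -> page content, waiting for indexing
        self.downloaded_urls = set()  # URLs that have already been fetched

    def save(self, url, html):
        """Store the page and record its URL as downloaded."""
        self.page_library[url] = html
        self.downloaded_urls.add(url)

    def already_downloaded(self, url):
        """Return True if this URL was fetched before."""
        return url in self.downloaded_urls


store = PageStore()
store.save("http://example.com/", "<html>home</html>")
print(store.already_downloaded("http://example.com/"))       # prints: True
print(store.already_downloaded("http://example.com/about"))  # prints: False
```

Although the text speaks of a queue, this sketch keeps the downloaded URLs in a set, because the only operation needed here is a fast membership check.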

For each web page that has just been downloaded, the crawler extracts all the links it contains and checks each of them against the downloaded URL queue to determine whether it has already been crawled.
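Link extraction and the check against already-downloaded URLs might look like the sketch below. It uses only the Python standard library (`html.parser` and `urllib.parse`). The names `LinkExtractor` and `new_links` are hypothetical, and dropping the `#fragment` part of each link is an assumption made here so that the same page is not counted twice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def new_links(page_url, html, downloaded_urls):
    """Return links found in html that are not in downloaded_urls, in order."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    seen = set()
    result = []
    for link in parser.links:
        if link not in downloaded_urls and link not in seen:
            seen.add(link)
            result.append(link)
    return result


html = '<a href="/a.html">A</a> <a href="b.html#top">B</a>'
print(new_links("http://example.com/dir/page.html", html,
                {"http://example.com/a.html"}))
# prints: ['http://example.com/dir/b.html']
```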

Figure: Common crawler architecture

The above is the overall workflow of a general crawler.
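Put together, the workflow can be sketched as a small loop. This is an illustration under stated assumptions, not a production crawler: the pending queue `to_crawl`, the seed list, the `max_pages` limit, and the injected `fetch` and `extract_links` callables are all hypothetical names introduced here, and the "web" in the usage example is a plain dictionary so that the code runs without network access.

```python
from collections import deque


def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Download pages breadth-first and return the page library (URL -> content).

    fetch(url) returns the page content, or None if the download failed.
    extract_links(url, content) returns the URLs linked from that page.
    """
    to_crawl = deque(seed_urls)  # assumption: FIFO queue of pending URLs
    page_library = {}            # downloaded pages, waiting for indexing
    downloaded = set()           # URLs already attempted, to avoid repeats

    while to_crawl and len(page_library) < max_pages:
        url = to_crawl.popleft()
        if url in downloaded:
            continue
        content = fetch(url)
        downloaded.add(url)      # recorded even on failure, so it is not retried
        if content is None:
            continue
        page_library[url] = content
        for link in extract_links(url, content):
            if link not in downloaded:
                to_crawl.append(link)
    return page_library


# Usage with a fake three-page web: each "page" is just a list of links.
fake_web = {
    "http://example.com/": "http://example.com/a http://example.com/b",
    "http://example.com/a": "http://example.com/",
    "http://example.com/b": "",
}

def fake_extract(url, content):
    return content.split()

pages = crawl(["http://example.com/"], fake_web.get, fake_extract)
print(list(pages))
# prints: ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

Passing `fetch` and `extract_links` in as parameters keeps the loop itself independent of how pages are downloaded or parsed.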


  1. Set of downloaded web pages: the collection of web pages that the crawler has already downloaded from the Internet to local storage for indexing.
  2. Set of expired web pages: because the number of web pages is so large, it takes the crawler a long time to complete one full round of crawling. Web pages on the Internet are constantly changing during that time, so inconsistencies easily arise between the content of the local copies and the real pages on the Internet.
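One simple way to notice that a local copy no longer matches the live page is to compare content fingerprints. The sketch below is an assumption about how such a check could be done, not a method described in this article: `fingerprint`, `find_expired`, and the injected `fetch` callable are hypothetical names, and a page that can no longer be fetched is treated as expired.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def find_expired(local_fingerprints, fetch):
    """Return URLs whose live content no longer matches the stored fingerprint.

    local_fingerprints maps URL -> fingerprint recorded at download time.
    fetch(url) returns the current content, or None if the page is gone.
    """
    expired = []
    for url, old_print in local_fingerprints.items():
        live = fetch(url)
        if live is None or fingerprint(live) != old_print:
            expired.append(url)
    return expired


# Usage: page "b" changed on the live site after it was downloaded.
local = {"a": fingerprint("hello"), "b": fingerprint("old text")}
live_web = {"a": "hello", "b": "new text"}
print(find_expired(local, live_web.get))  # prints: ['b']
```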



