Python Search Engine - Crawler Part 3
written on Tuesday, February 22, 2011
Introduction
Because the crawling process is I/O bound, it is very useful to fetch pages in threads. I chose an architecture with an administrator class and a variable number of worker threads doing the I/O work.
The multi-threaded architecture
This post does not cover how to implement multi-threaded software in Python; that topic is covered in many articles and in the Python documentation. The most significant step in developing multi-threaded software is not the implementation but the design. You need a well-considered architecture if you want to benefit from threading.
As mentioned above, I chose an architecture with an administrative class that delegates the crawling to a number of worker threads, because the bottleneck of the application is network I/O. Fetching pages in multiple threads lets the waiting times overlap, which reduces the impact of network latency.
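The administrator/worker split described above can be sketched roughly as follows. This is a minimal illustration written for Python 3, not the code from the repository: the class names `Administrator` and `Worker`, the helper `fetch_page`, and the use of `None` as a stop signal are all assumptions made for this example.

```python
import queue
import threading
import urllib.request


def fetch_page(url, timeout=10):
    """Hypothetical fetch helper: download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


class Worker(threading.Thread):
    """Worker thread: takes URLs from the task queue and fetches them."""

    def __init__(self, tasks, results, fetch):
        super().__init__(daemon=True)
        self.tasks = tasks
        self.results = results
        self.fetch = fetch

    def run(self):
        while True:
            url = self.tasks.get()
            if url is None:  # stop signal (an assumption of this sketch)
                self.tasks.task_done()
                break
            try:
                self.results.put((url, self.fetch(url)))
            except Exception as exc:
                # Store the error instead of the page so one bad URL
                # does not kill the worker.
                self.results.put((url, exc))
            finally:
                self.tasks.task_done()


class Administrator:
    """Hands URLs to the workers and collects what they fetched."""

    def __init__(self, fetch=fetch_page, num_workers=4):
        self.fetch = fetch
        self.num_workers = num_workers

    def crawl(self, urls):
        tasks = queue.Queue()
        results = queue.Queue()
        workers = [Worker(tasks, results, self.fetch)
                   for _ in range(self.num_workers)]
        for worker in workers:
            worker.start()
        for url in urls:
            tasks.put(url)
        for _ in workers:
            tasks.put(None)  # one stop signal per worker
        for worker in workers:
            worker.join()
        pages = {}
        while not results.empty():
            url, page = results.get()
            pages[url] = page
        return pages
```

Calling `Administrator(num_workers=8).crawl(list_of_urls)` would return a dictionary that maps each URL to its content, or to the exception raised while fetching it. The administrator itself never blocks on the network; only the workers do, which is the point of the design.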
Choosing an optimal number of crawlers
To find out how many worker threads should be started for the best performance, I wrote a simple script that counts how many pages can be fetched within fifteen minutes using a given number of crawling threads. See the source code for this experiment here. The result is shown in the following figure.
Note: The result depends on the environment the crawler runs in. If you want the optimal number for your environment, don't hesitate to run the experiment script yourself.
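The idea behind such an experiment can be sketched in a few lines. This is not the experiment script linked above but a simplified stand-in: the function names `count_fetches` and `simulated_fetch` are made up for this example, and the network request is replaced by a short sleep so the sketch runs without a connection.

```python
import threading
import time


def count_fetches(fetch, num_threads, duration):
    """Count how many fetch() calls complete when num_threads threads
    keep fetching for roughly `duration` seconds."""
    deadline = time.monotonic() + duration
    lock = threading.Lock()
    count = 0

    def worker():
        nonlocal count
        # A fetch that started before the deadline is still counted,
        # so the run may overshoot by the length of one fetch.
        while time.monotonic() < deadline:
            fetch()
            with lock:
                count += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return count


def simulated_fetch():
    """Stand-in for a page download (assumption: about 10 ms of latency)."""
    time.sleep(0.01)
```

Looping over several thread counts, for example `for n in (1, 2, 4, 8, 16): print(n, count_fetches(simulated_fetch, n, 1.0))`, gives one data point per thread count. With a simulated fetch the count simply keeps growing with the number of threads; with real downloads it should level off at some point, and that point is what the experiment is looking for.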
Source code
Find the relevant parts of this article in the repository (tagged as "blog3").
Outlook
The next blog entry covers the usage of the NoSQL database MongoDB to store pages and fetch search results with high performance.