ms4py.blog

Python Search Engine - Crawler Part 3

written on Tuesday, February 22, 2011

Introduction

Because the crawling process is I/O bound, it is very useful to fetch pages in threads. I chose an architecture with an administrator class and a variable number of worker threads doing the I/O work.

Identifying the shared data

In a multi-threaded environment you have to identify the shared data. In this case these are the sets of URLs and hosts. In fact, there are three different sets of URLs: the first holds the URLs that still have to be processed, the second contains all fetched URLs, and the third stores all invalid links so they won't be checked again. You have to guarantee that these sets are never accessed concurrently. A simple but powerful way to achieve this is a lock mechanism.
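To make the locking idea concrete, here is a minimal sketch of how the three URL sets could be guarded by a single lock. The class name `UrlStore` and its methods are hypothetical and not taken from the repository. The extra `_in_flight` set is my own assumption: it keeps a URL from being queued a second time while a worker is still fetching it.

```python
import threading


class UrlStore:
    """Hypothetical container for the shared URL sets (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()    # URLs that should be processed
        self._fetched = set()    # URLs that were fetched successfully
        self._invalid = set()    # invalid links that must not be checked again
        self._in_flight = set()  # assumption: URLs currently being fetched

    def add(self, url):
        """Queue a URL unless it is already known. Return True if queued."""
        with self._lock:
            known = (url in self._pending or url in self._in_flight
                     or url in self._fetched or url in self._invalid)
            if not known:
                self._pending.add(url)
            return not known

    def take(self):
        """Hand out one pending URL, or None if nothing is left."""
        with self._lock:
            if not self._pending:
                return None
            url = self._pending.pop()
            self._in_flight.add(url)
            return url

    def mark(self, url, ok):
        """Record the outcome of a fetch attempt."""
        with self._lock:
            self._in_flight.discard(url)
            (self._fetched if ok else self._invalid).add(url)
```

Every method takes the lock for the whole check-and-modify step, so two threads can never both decide that the same URL is new.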

The multi-threaded architecture

This post will not cover how to implement multi-threaded software in Python. That topic is covered in a lot of articles and in the Python documentation. The most significant step in developing multi-threaded software is not the implementation but the design. You need a well-considered software architecture if you want to benefit from threading.

As previously mentioned, I have chosen an architecture with an administrative class which delegates the crawling part to a number of worker threads, because the bottleneck of the application is the network I/O. Fetching pages in multiple threads hides much of the network latency, since one thread can work while the others wait for a response.
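The following is a rough sketch of such an architecture, not the code from the repository. The class name `Administrator`, the `fetch` callable and the `None` sentinel are assumptions chosen for illustration. The standard `queue.Queue` is used to hand URLs to the workers because it is already thread-safe.

```python
import queue
import threading


class Administrator:
    """Hypothetical coordinator that delegates fetching to worker threads."""

    def __init__(self, fetch, num_workers=4):
        self._fetch = fetch            # callable doing the blocking network I/O
        self._tasks = queue.Queue()    # thread-safe work queue
        self._results = {}             # url -> page content, or None on failure
        self._lock = threading.Lock()  # guards self._results
        self._workers = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(num_workers)
        ]

    def _work(self):
        while True:
            url = self._tasks.get()
            if url is None:            # sentinel: no more work for this thread
                break
            try:
                page = self._fetch(url)
            except Exception:
                page = None            # assumption: any error marks the link invalid
            with self._lock:
                self._results[url] = page

    def crawl(self, urls):
        """Fetch all URLs once and return the results (single use)."""
        for worker in self._workers:
            worker.start()
        for url in urls:
            self._tasks.put(url)
        for _ in self._workers:
            self._tasks.put(None)      # one sentinel per worker
        for worker in self._workers:
            worker.join()
        return self._results
```

Passing the fetch function in from outside keeps the sketch independent of any HTTP library; for a quick check you can call `Administrator(some_function, num_workers=3).crawl(list_of_urls)` with a function that just returns a string.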

Choosing an optimal number of crawlers

To find out how many worker threads should be started to get the best performance, I wrote a simple script which counts how many pages can be fetched within fifteen minutes using a given number of crawling threads. See the source code for this experiment here. The result is shown in the following figure.

Performance plot

Note: The result depends on the running environment. If you want the optimal number for your environment, don't hesitate to run the experiment script on your own.
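If you just want to see the shape of such an experiment without touching the network, a sketch along these lines may help. It is not the experiment script linked above: `count_fetches` and `simulated_fetch` are hypothetical names, the fetch is simulated with a short sleep, and the time window is a fraction of a second instead of fifteen minutes.

```python
import threading
import time


def count_fetches(fetch, num_threads, duration):
    """Return how many fetch() calls were started within `duration` seconds."""
    deadline = time.monotonic() + duration
    lock = threading.Lock()
    total = 0

    def worker():
        nonlocal total
        while time.monotonic() < deadline:
            fetch()
            with lock:         # the counter is shared, so guard the update
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


def simulated_fetch():
    """Stand-in for a page download: only waits, like a slow network call."""
    time.sleep(0.01)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "threads:", count_fetches(simulated_fetch, n, duration=0.2))
```

With a simulated fetch the count grows almost linearly with the number of threads. A real crawl will flatten out at some point, which is exactly what the experiment is meant to find.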

Source code

Find the relevant parts of this article in the repository (tagged as "blog3").

Outlook

The next blog entry covers the usage of the NoSQL database MongoDB to store pages and fetch search results with high performance.

This entry was tagged crawler and python

