Python Search Engine - Crawler Part 2
written on Thursday, May 6, 2010
Introduction
In this part of the blog series about a search engine in Python, I want to show how to automate the crawling process and how to handle the robots.txt file. Here is a log of the resulting crawler's performance:
Total runtime: 25 min
Pages processed: 1963
Average: 1.265 Pages/s 75.927 Pages/min
Handling robots.txt
The standard library contains a module for parsing the robots.txt file, robotparser. Simple usage:
from robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://%s/robots.txt' % hostname)
rp.read()  # download and parse robots.txt; set_url() alone does not fetch it
url_allowed = rp.can_fetch(USER_AGENT, url)
I put this code into a class Host, which gives me more flexibility. The Crawler can now handle the hosts very simply by putting Host instances into a dictionary with the hostname as key, so each robots.txt only has to be handled once per host. Here is an extract from the Crawler.add_urls method:
hostname = urlparse.urlparse(url).hostname
try:
    host = self.hosts[hostname]
except KeyError:
    self.hosts[hostname] = host = Host(hostname)
if host.url_allowed(url):
    self.urls.add(url)
Note: As you can see, the URLs to process are stored in a set, so duplicates are dropped automatically and a membership lookup takes constant time, O(1), on average.
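The Host class itself is not shown in this post, so here is a minimal sketch of what such a wrapper could look like. It is written for Python 3, where the module is named urllib.robotparser instead of robotparser. The optional robots_lines argument and the USER_AGENT value are assumptions added for illustration: they let the rules be supplied directly so the example runs without network access. The class in the repository may differ.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical user agent name


class Host:
    """Keeps the parsed robots.txt rules for one hostname (illustrative sketch)."""

    def __init__(self, hostname, robots_lines=None):
        self.hostname = hostname
        self.rp = RobotFileParser()
        self.rp.set_url("http://%s/robots.txt" % hostname)
        if robots_lines is None:
            # Normal case: download and parse robots.txt (needs network access).
            self.rp.read()
        else:
            # Assumption for this sketch: the caller passes the rules as lines.
            self.rp.parse(robots_lines)

    def url_allowed(self, url):
        return self.rp.can_fetch(USER_AGENT, url)


# One Host per hostname, stored in a dictionary as described above.
hosts = {}
hosts["example.com"] = Host(
    "example.com", ["User-agent: *", "Disallow: /private/"]
)
allowed = hosts["example.com"].url_allowed("http://example.com/index.html")
blocked = hosts["example.com"].url_allowed("http://example.com/private/data.html")
```

With these rules, allowed is True and blocked is False, because only paths starting with /private/ are disallowed.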
The crawler loop
Since the set now contains only URLs that we are allowed to fetch, we can start a loop that performs the following actions:
- Take one URL at a time from the set.
- Extract all links from that page with our parse function.
- Add all URLs found to our set.
url = self.get_url_to_process()
while url is not None:
    try:
        title, content, links = self.parse_page(url)
    except (URLError, HTTPError, httplib.InvalidURL,
            UnicodeDecodeError):
        self.invalid_urls.add(url)
        url = self.get_url_to_process()
        continue
    self.handled_urls.add(url)
    if links is not None:
        self.add_urls(links)
    url = self.get_url_to_process()
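To make the control flow of this loop easier to follow, here is a self-contained toy version in Python 3 that runs without network access. Everything in it is an assumption made for illustration: the MiniCrawler class, the in-memory pages dictionary standing in for the web, and the implementation of get_url_to_process, which is not shown in this post. The robots.txt check is left out to keep the sketch short.

```python
class MiniCrawler:
    """Toy crawler over an in-memory link graph (hypothetical, no network)."""

    def __init__(self, pages, start_urls):
        self.pages = pages          # url -> list of linked URLs (stand-in for the web)
        self.urls = set()           # URLs still to process
        self.handled_urls = set()   # URLs parsed successfully
        self.invalid_urls = set()   # URLs that could not be fetched
        self.add_urls(start_urls)

    def add_urls(self, urls):
        for url in urls:
            # Skip URLs we have already seen, otherwise the loop would never end.
            if url not in self.handled_urls and url not in self.invalid_urls:
                self.urls.add(url)

    def get_url_to_process(self):
        # set.pop() removes an arbitrary element; None signals "nothing left".
        return self.urls.pop() if self.urls else None

    def parse_page(self, url):
        # Stand-in for fetching and parsing; unknown URLs fail like a bad request.
        if url not in self.pages:
            raise LookupError(url)
        return "title", "content", self.pages[url]

    def crawl(self):
        url = self.get_url_to_process()
        while url is not None:
            try:
                title, content, links = self.parse_page(url)
            except LookupError:
                self.invalid_urls.add(url)
            else:
                self.handled_urls.add(url)
                self.add_urls(links)
            url = self.get_url_to_process()


pages = {"a": ["b", "c"], "b": ["a"], "c": ["d"]}
crawler = MiniCrawler(pages, ["a"])
crawler.crawl()
```

After the run, handled_urls contains "a", "b" and "c", while "d" ends up in invalid_urls because it is linked but not present in pages. Using try/except/else here avoids repeating the get_url_to_process call in the error branch; the behaviour is the same as with continue.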
Putting it all together
You can find the complete code in the GitHub repository (tagged as "blog2").
Note: Some code has been moved into the new module utils.
Outlook
In the next blog entry I will show you how to make the crawling process more than three times faster with a multi-threaded architecture.