Python Search Engine - Crawler Part 1
written on Saturday, April 10, 2010
Introduction
I'm currently working on an experimental search engine in Python named PySeeek and want to share my experience. In this blog post I present the basics of a simple web crawler that fetches resources with the built-in urllib2 library and parses the HTML with lxml.
Fetching resources
As mentioned above, I use urllib2 to fetch the content of a URL. Since a crawler has to fetch a lot of pages, it makes sense to create one urllib2.OpenerDirector and reuse it for every request. The opener also lets you provide a custom User-Agent that is sent with each request.
from urllib2 import build_opener
USER_AGENT = 'PySeeek-Bot'
opener = build_opener()
opener.addheaders = [('User-Agent', USER_AGENT)]
response = opener.open('http://ms4py.org/')
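A crawler will regularly hit dead links and malformed URLs, so it helps to wrap the call. The following is only a sketch, written with the Python 3 module names (there, urllib2 is split into urllib.request and urllib.error). The helpers make_opener and fetch and the 10 second timeout are assumptions for illustration, not part of PySeeek.

```python
from urllib.error import URLError
from urllib.request import build_opener

USER_AGENT = 'PySeeek-Bot'


def make_opener(user_agent=USER_AGENT):
    """Build one opener that sends the given User-Agent with every request."""
    opener = build_opener()
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


def fetch(opener, url, timeout=10):
    """Return the open response, or None if the request could not be made.

    URLError covers network and HTTP errors (HTTPError is a subclass);
    ValueError is raised for strings that are not valid URLs.
    """
    try:
        return opener.open(url, timeout=timeout)
    except (URLError, ValueError):
        return None


opener = make_opener()
```

The same opener object is then passed to fetch for every URL, so the User-Agent only has to be configured once.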
Normalize URLs
Two different string representations of a URL do not necessarily point to two different pages. For example, the URL "http://ms4py.org/my page/" refers to the same resource as the URL "http://ms4py.org/my%20page/".
The Werkzeug library contains a function that normalizes a URL in this way. I modified it to also remove the anchor (the fragment after "#") of the given URL, because I want exactly one unique URL for each web page.
import urllib
import urlparse
def normalize_url(url):
    ''' Modified from `werkzeug.utils.url_fix`. '''
    scheme, netloc, path, qs, _ = urlparse.urlsplit(url)
    path = urllib.quote(path, '/%')
    qs = urllib.quote_plus(qs, ':&=')
    return urlparse.urlunsplit((scheme, netloc, path, qs, ''))
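A quick way to check the behaviour is to run the function on a few spellings of the same page. The sketch below is a Python 3 port of the function, not the original code (in Python 3, quote, quote_plus, urlsplit and urlunsplit live in urllib.parse). The logic is unchanged, and the example URLs other than the two from the text are made up.

```python
from urllib.parse import quote, quote_plus, urlsplit, urlunsplit


def normalize_url(url):
    """Quote unsafe characters and drop the fragment (anchor) of a URL."""
    scheme, netloc, path, qs, _ = urlsplit(url)
    # '%' is kept as a safe character so already-quoted URLs are not quoted twice
    path = quote(path, '/%')
    qs = quote_plus(qs, ':&=')
    # the empty string in the last position discards the fragment
    return urlunsplit((scheme, netloc, path, qs, ''))


a = normalize_url('http://ms4py.org/my page/')
b = normalize_url('http://ms4py.org/my%20page/#comments')
c = normalize_url('http://ms4py.org/search?q=web crawler')
```

Here a and b end up as the same string, so a set of normalized URLs stores the page only once.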
Analyze the Content-Type
As I want to process only resources with HTML content, I have to analyze the Content-Type of each fetched resource. The following function parses the Content-Type and the encoding from the HTTP headers of the response.
from urllib2 import URLError
DEFAULT_ENCODING = 'utf-8'
def parse_content_type(response):
    try:
        ctype = response.info()['Content-Type']
    except KeyError:
        raise URLError('No Content-Type defined.')
    try:
        ctype, encoding = ctype.split(';')
        # encoding is now "charset=enc"
        _, encoding = encoding.split('=')
        encoding = encoding.strip()
    except ValueError:
        # no or wrong encoding definition, use default
        encoding = DEFAULT_ENCODING
    # keep only the media type, even if the header carried extra parameters
    ctype = ctype.split(';')[0].strip()
    return ctype, encoding
For an HTML page, this function returns "text/html" as the Content-Type. Every other type is rejected.
ctype, encoding = parse_content_type(response)
if ctype != 'text/html':
    raise URLError('Wrong Content-Type: "%s"' % ctype)
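The header handling can also be tried without a network connection. The following is a minimal sketch, not the code used in PySeeek: split_content_type is a hypothetical helper that works on the raw header string. Unlike the function above, it also reads the charset when the header carries several parameters or a quoted value.

```python
DEFAULT_ENCODING = 'utf-8'


def split_content_type(header, default_encoding=DEFAULT_ENCODING):
    """Split a raw Content-Type header value into (media type, charset)."""
    parts = [part.strip() for part in header.split(';')]
    # media types are case-insensitive, so compare them in lower case
    ctype = parts[0].lower()
    encoding = default_encoding
    for param in parts[1:]:
        name, _, value = param.partition('=')
        if name.strip().lower() == 'charset' and value:
            # the charset value may be wrapped in double quotes
            encoding = value.strip().strip('"')
    return ctype, encoding


plain = split_content_type('text/html')
latin = split_content_type('text/html; charset=ISO-8859-1')
quoted = split_content_type('Text/HTML; foo=bar; charset="utf-8"')
```

With a helper like this, the check for 'text/html' works the same way whether or not the server sends a charset.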
Parsing HTML
With lxml, parsing is straightforward, because the library supports XPath and provides a function to make all links of an HTML page absolute. Note that the text content of the page is encoded to a byte string with the encoding supplied by parse_content_type.
from lxml import html
doc = html.parse(response).getroot()
title = doc.xpath("//title/text()")[0]
content = doc.text_content().encode(encoding)
links = set()
doc.make_links_absolute()
for _, _, link, _ in doc.iterlinks():
    url = normalize_url(link)
    links.add(url)
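To experiment with the parsing step without fetching anything, the same lxml calls work on an in-memory page. This is a sketch under assumptions: the markup and the base URL are made up, and it is written for Python 3. The title lookup falls back to an empty string, because xpath returns an empty list for pages without a title element and indexing it with [0] would raise an IndexError.

```python
from lxml import html

BASE_URL = 'http://ms4py.org/blog/'  # assumed base URL for this example
PAGE = b'''<html><head><title>Example</title></head>
<body><p>See <a href="/about">about</a> and
<a href="page.html#top">this page</a>.</p></body></html>'''

doc = html.fromstring(PAGE)

# xpath returns a list of matches, which is empty if there is no <title>
titles = doc.xpath('//title/text()')
title = titles[0] if titles else ''

# resolve relative links against the base URL before collecting them
doc.make_links_absolute(BASE_URL)
# iterlinks yields (element, attribute, link, pos) tuples
links = set(link for _, _, link, _ in doc.iterlinks())
```

The collected links still contain the "#top" anchor. Passing each link through normalize_url, as in the loop above, removes it.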
Putting it all together
You can find the complete code, put together, in the GitHub repository.