ms4py.blog

Python Search Engine - Crawler Part 1

written on Saturday, April 10, 2010

Introduction

I'm currently working on an experimental search engine in Python named PySeeek and want to share my experience. This post presents the basics of a simple web crawler: fetching resources with the built-in library urllib2 and parsing the HTML with lxml.

Fetching resources

As mentioned above, I use urllib2 to fetch the content of a URL. Since a crawler needs to fetch a lot of pages, it makes sense to create one urllib2.OpenerDirector and use it for every request. This also lets you set a custom User-Agent in one place.

from urllib2 import build_opener

USER_AGENT = 'PySeeek-Bot'

opener = build_opener()
opener.addheaders = [('User-Agent', USER_AGENT)]
response = opener.open('http://ms4py.org/')
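A crawler will regularly run into URLs that cannot be opened, so it can help to wrap the opener in a small function. The following is only a sketch: the helper name fetch and its timeout default are assumptions and do not come from PySeeek, and the imports fall back to the Python 3 module names so the snippet runs on either version.

```python
try:
    from urllib.request import build_opener      # Python 3
    from urllib.error import URLError
except ImportError:
    from urllib2 import build_opener, URLError   # Python 2

USER_AGENT = 'PySeeek-Bot'

# one opener for the whole crawl, so every request carries the same header
opener = build_opener()
opener.addheaders = [('User-Agent', USER_AGENT)]


def fetch(url, timeout=10):
    ''' Hypothetical helper: return the response, or None if the URL cannot be opened. '''
    try:
        return opener.open(url, timeout=timeout)
    except URLError:
        # covers HTTP errors as well, because HTTPError is a subclass of URLError
        return None


# an unknown scheme fails without any network access
result = fetch('foo://not-a-real-scheme/')
```

Returning None keeps the calling loop simple. A real crawler would probably also want to log the reason for the failure.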

Normalize URLs

Two different string representations of a URL do not necessarily refer to two different pages. For example, the URL "http://ms4py.org/my page/" points to the same resource as the URL "http://ms4py.org/my%20page/".

The Werkzeug library contains a function that normalizes a URL in this way. I modified it to also remove the anchor (the fragment after "#") from the given URL, because I want exactly one URL per web page.

import urllib
import urlparse

def normalize_url(url):
    ''' Modified from `werkzeug.utils.url_fix`. '''

    scheme, netloc, path, qs, _ = urlparse.urlsplit(url)
    path = urllib.quote(path, '/%')
    # keep '%' safe, otherwise already escaped values like "%20" become "%2520"
    qs = urllib.quote_plus(qs, ':&%=')
    return urlparse.urlunsplit((scheme, netloc, path, qs, ''))
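To see the effect, the example URLs from above can be run through the function. The snippet repeats normalize_url so that it runs on its own, and the imports fall back to the Python 3 module names. The "#comments" anchor is made up for the example.

```python
try:
    from urllib.parse import quote, quote_plus, urlsplit, urlunsplit  # Python 3
except ImportError:
    from urllib import quote, quote_plus                               # Python 2
    from urlparse import urlsplit, urlunsplit


def normalize_url(url):
    scheme, netloc, path, qs, _ = urlsplit(url)
    path = quote(path, '/%')
    # '%' is kept safe so existing escapes are not encoded twice
    qs = quote_plus(qs, ':&%=')
    # the empty last element drops the anchor
    return urlunsplit((scheme, netloc, path, qs, ''))


a = normalize_url('http://ms4py.org/my page/')
b = normalize_url('http://ms4py.org/my%20page/')
c = normalize_url('http://ms4py.org/my%20page/#comments')

# all three spellings collapse to one string
assert a == b == c == 'http://ms4py.org/my%20page/'
```

Note that the function does not touch the host name, so differences in letter case there still produce distinct strings.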

Analyze the Content-Type

As I only want to process resources with HTML content, I have to analyze the Content-Type of each fetched resource. The following function parses the content type and the encoding from the HTTP headers of the response.

from urllib2 import URLError

DEFAULT_ENCODING = 'utf-8'

def parse_content_type(response):
    ''' Return the content type and the encoding of a response. '''
    try:
        header = response.info()['Content-Type']
    except KeyError:
        raise URLError('No Content-Type defined.')

    # "text/html; charset=utf-8" -> ['text/html', 'charset=utf-8']
    parts = [part.strip() for part in header.split(';')]
    ctype = parts[0].lower()
    if not ctype:
        raise URLError('Could not parse Content-Type: "%s"' % header)

    # no or wrong encoding definition, use default
    encoding = DEFAULT_ENCODING
    for param in parts[1:]:
        name, _, value = param.partition('=')
        if name.strip().lower() == 'charset' and value.strip():
            encoding = value.strip().strip('"')

    return ctype, encoding

For an HTML page this function returns "text/html" as the content type, so everything else can be rejected.

ctype, encoding = parse_content_type(response)

if ctype != 'text/html':
    raise URLError('Wrong Content-Type: "%s"' % ctype)
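As an alternative to splitting the header by hand, the standard library's email package can parse the same header format. This is a sketch of a different technique, not what PySeeek does: the helper name split_content_type is made up, and it works on the raw header string instead of the response object.

```python
from email.message import Message

DEFAULT_ENCODING = 'utf-8'


def split_content_type(header):
    ''' Hypothetical helper: parse a raw Content-Type header value. '''
    msg = Message()
    msg['Content-Type'] = header
    # both getters return lower-cased values; the charset is None if missing
    ctype = msg.get_content_type()
    encoding = msg.get_content_charset() or DEFAULT_ENCODING
    return ctype, encoding


with_charset = split_content_type('text/html; charset=ISO-8859-1')
without_charset = split_content_type('text/html')
```

One difference to keep in mind: get_content_type falls back to "text/plain" for a missing or malformed value instead of raising an error, so the check against "text/html" is still needed.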

Parsing HTML

With lxml, parsing is about as easy as it gets, because the library supports XPath and provides a function to make all links of an HTML page absolute. Note that the text content of the page is encoded to a byte string using the encoding returned by parse_content_type.

from lxml import html

doc = html.parse(response).getroot()
title = doc.xpath("//title/text()")[0]
content = doc.text_content().encode(encoding)

links = set()
doc.make_links_absolute()
for _, _, link, _ in doc.iterlinks():
    url = normalize_url(link)
    links.add(url)
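To make clear what "absolute links" means without fetching a page, here is a small standard-library sketch that resolves href values against the URL of the page. It only illustrates the idea for plain <a> tags and is not how lxml works internally. The class LinkCollector, the base URL path and the sample markup are made up for the example.

```python
try:
    from html.parser import HTMLParser   # Python 3
    from urllib.parse import urljoin
except ImportError:
    from HTMLParser import HTMLParser    # Python 2
    from urlparse import urljoin


class LinkCollector(HTMLParser):
    ''' Hypothetical helper: collects absolute targets of <a href="..."> tags. '''

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # relative references are resolved against the page URL
                self.links.add(urljoin(self.base_url, value))


page = ('<a href="/about/">About</a> '
        '<a href="post.html">Post</a> '
        '<a href="http://example.com/">External</a>')

collector = LinkCollector('http://ms4py.org/blog/')
collector.feed(page)
links = sorted(collector.links)
```

Here "/about/" is resolved against the host, "post.html" against the directory of the page, and the external link is kept unchanged.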

Putting it all together

You can find the complete code, put together, in the GitHub repository.
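The repository code is not reproduced here. As a rough idea of how the steps above could be chained, the following sketch visits pages breadth-first and skips URLs it has already seen. The function crawl, its get_links parameter and the in-memory link table are assumptions for illustration and are not taken from PySeeek. In a real run, get_links would fetch the page, check the Content-Type and return the normalized links.

```python
from collections import deque


def crawl(start_url, get_links, limit=100):
    ''' Visit pages breadth-first; get_links(url) returns the links of a page. '''
    seen = set([start_url])
    queue = deque([start_url])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            # normalized URLs make this membership test meaningful
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# stand-in for fetching and parsing: a tiny in-memory link table
site = {
    'http://ms4py.org/': ['http://ms4py.org/a/', 'http://ms4py.org/b/'],
    'http://ms4py.org/a/': ['http://ms4py.org/', 'http://ms4py.org/b/'],
    'http://ms4py.org/b/': [],
}

visited = crawl('http://ms4py.org/', lambda url: site.get(url, []))
```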

This entry was tagged crawler and python

