The sitescraper module
Posted 29 Jan 2010 in opensource and sitescraper

As a student I was fortunate to have the opportunity to learn about web scraping, guided by Professor Timothy Baldwin. Out of frustration with a previous project, I set out to build a tool that would make scraping web pages easier.

My goal was a tool that could be trained to scrape a website just by giving it the desired output for a few example webpages. It would build a model of how to extract that content, and the model could then be applied to other webpages that use the same template.
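To make the train-by-example idea concrete, here is a minimal sketch using only the Python standard library. It is not how sitescraper itself works (the paper covers the real implementation). It assumes that a piece of content can be located by the chain of tags leading to it, and the names `TextPathParser`, `text_paths`, `train` and `scrape` are hypothetical helpers invented for this illustration.

```python
from html.parser import HTMLParser


class TextPathParser(HTMLParser):
    """Collect (path, text) pairs, where path is the chain of open tags."""

    # Tags that never get a closing tag, so they must not stay on the stack.
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def text_paths(html):
    """Return every non-empty text node together with its tag path."""
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def train(html, examples):
    """The 'model': the tag path at which each example string was found."""
    paths = []
    for wanted in examples:
        for path, text in text_paths(html):
            if text == wanted:
                paths.append(path)
                break
        else:
            raise ValueError("example not found: %r" % wanted)
    return paths


def scrape(html, paths):
    """Apply the learned paths to another page built from the same template."""
    found = text_paths(html)
    return [[text for path, text in found if path == p] for p in paths]


# Made-up pages that share one template (assumption for the demo).
train_page = ("<html><head><title>Shop: python</title></head><body><ul>"
              "<li>Learning Python</li><li>Python in a Nutshell</li>"
              "</ul></body></html>")
other_page = ("<html><head><title>Shop: linux</title></head><body><ul>"
              "<li>Linux Pocket Guide</li><li>Linux Bible</li>"
              "</ul></body></html>")

model = train(train_page, ["Shop: python", "Learning Python"])
print(scrape(other_page, model))
# [['Shop: linux'], ['Linux Pocket Guide', 'Linux Bible']]
```

A single example list item is enough here because every item shares the same tag path. Real pages are much messier, which is why several training examples help.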

The tool was eventually called sitescraper and is available for download on bitbucket. For more information have a browse of this paper, which covers the implementation and results in detail.

I use sitescraper for much of my scraping work and sometimes make updates based on experience gained from a project. Here is some example usage:

>>> from sitescraper import sitescraper
>>> ss = sitescraper()  
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition",
...   "Programming in Python 3: A Complete Introduction to the Python Language",
...   "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)  
>>> # we can add multiple example cases,
>>> # but this is a simple example so one will do (I generally use 3)  
>>> # ss.add(url2, data2)   
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", [
    "A Practical Guide to Linux(R) Commands, Editors, and Shell Programming", 
    "Linux Pocket Guide", 
    "Linux in a Nutshell (In a Nutshell (O'Reilly))", 
    'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 
    'Linux Bible, 2008 Edition'
]]
