As a student I was fortunate to have the opportunity to learn about web scraping, guided by Professor Timothy Baldwin. Frustration with scraping on a previous project led me to build a tool to make scraping web pages easier.
My goal for this tool was that it should be possible to train a program to scrape a website just by giving the desired outputs for a few example webpages. The tool would build a model of how to extract that content, and the model could then be applied to scrape other webpages that use the same template.
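To make the idea concrete, here is a toy sketch of this training approach (not sitescraper's actual implementation): locate each example output in one page's parse tree, record the path to the element that holds it, and then reuse those paths on another page built from the same template. It uses well-formed XML and `xml.etree.ElementTree` purely for illustration; the helper names are my own.

```python
# Toy illustration of template-based scraping (hypothetical helpers,
# not the real sitescraper internals).
import xml.etree.ElementTree as ET

def find_paths(root, examples):
    """Map each example string to the (tag, index) path of its element."""
    paths = {}
    def walk(elem, path):
        if elem.text and elem.text.strip() in examples:
            paths[elem.text.strip()] = path
        for i, child in enumerate(elem):
            walk(child, path + [(child.tag, i)])
    walk(root, [])
    return [paths[e] for e in examples]

def apply_path(root, path):
    """Follow a recorded (tag, index) path in a new tree."""
    elem = root
    for tag, i in path:
        elem = list(elem)[i]
    return elem.text.strip()

# "Train" on one page, then scrape a second page with the same layout.
train_html = ("<html><h1>Python books</h1>"
              "<ul><li>Learning Python</li><li>Python in a Nutshell</li></ul></html>")
test_html = ("<html><h1>Linux books</h1>"
             "<ul><li>Linux Bible</li><li>Linux Pocket Guide</li></ul></html>")

model = find_paths(ET.fromstring(train_html), ["Python books", "Learning Python"])
test_root = ET.fromstring(test_html)
print([apply_path(test_root, p) for p in model])  # -> ['Linux books', 'Linux Bible']
```

A real implementation has to cope with messy HTML, repeated elements (returning a list rather than one match), and minor template variations, which is why sitescraper accepts multiple training examples.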
I use sitescraper for much of my scraping work and sometimes update it based on experience gained from a project. Here is some example usage:
>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python",
...         ["Learning Python, 3rd Edition",
...          "Programming in Python 3: A Complete Introduction to the Python Language",
...          "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases,
>>> # but this is a simple example so one will do (I generally use 3)
>>> # ss.add(url2, data2)
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux",
 ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming",
  "Linux Pocket Guide",
  "Linux in a Nutshell (In a Nutshell (O'Reilly))",
  "Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)",
  "Linux Bible, 2008 Edition"]]