As a student I was fortunate to have the opportunity to learn about web scraping, guided by Professor Timothy Baldwin. Frustrated by a previous project, I set out to build a tool that would make scraping web pages easier.
My goal was a tool that could be trained to scrape a website just by giving it the desired outputs for some example webpages. From these examples it would build a model of how to extract the content, and that model could then be applied to scrape other webpages that use the same template.
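To make the idea concrete, here is a minimal sketch of training by example. It is not sitescraper's actual implementation or API: the names `PathTextParser`, `train` and `scrape` are hypothetical, the "model" is simply a list of element paths, and it assumes well-formed HTML where every tag is explicitly closed.

```python
from html.parser import HTMLParser


class PathTextParser(HTMLParser):
    """Map each text node to a path such as 'html[0]/body[0]/h1[0]'."""

    def __init__(self):
        super().__init__()
        self.stack = []     # path steps of the currently open elements
        self.counts = [{}]  # per-depth counters of child tags seen so far
        self.texts = {}     # path -> first non-empty text found there

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        index = seen.get(tag, 0)
        seen[tag] = index + 1
        self.stack.append(f"{tag}[{index}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Assumes every open tag is explicitly closed (no bare <br> or <img>).
        if self.stack:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.setdefault("/".join(self.stack), text)


def text_by_path(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def train(html, desired):
    """Learn, for each desired output, the path of the element containing it."""
    texts = text_by_path(html)
    model = []
    for value in desired:
        matches = [path for path, text in texts.items() if text == value]
        if not matches:
            raise ValueError(f"{value!r} not found in the example page")
        model.append(matches[0])
    return model


def scrape(model, html):
    """Apply the learned paths to another page built from the same template."""
    texts = text_by_path(html)
    return [texts.get(path) for path in model]


example = "<html><body><h1>Red shoes</h1><div><span>$20</span></div></body></html>"
other = "<html><body><h1>Blue hat</h1><div><span>$15</span></div></body></html>"

model = train(example, ["Red shoes", "$20"])
print(scrape(model, other))  # ['Blue hat', '$15']
```

A single example page is enough here because both pages share exactly the same structure. With real pages the template varies, so a practical tool would need several examples and a more forgiving notion of path.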
The tool was eventually called sitescraper and is available for download on Bitbucket. For more information see this paper, which covers the implementation and results in detail.
I use sitescraper for much of my scraping work and sometimes make updates based on experience gained from a project.
Here is some example usage: