I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automatically adapt when the structure was updated but the content remained static.
However this approach is not much help to me these days, because most of my work involves scraping a website once-off. It is quicker to just specify the required XPaths than to collect and test training cases.
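To illustrate what that manual alternative looks like, here is a minimal sketch of a once-off scraper built from hand-written XPaths, assuming Python with the lxml package. The URL and XPath expressions are hypothetical placeholders for whatever the target site actually uses.

```python
# A once-off scrape is often just a few hand-picked XPaths.
# The URL and XPaths below are hypothetical placeholders.
import lxml.html

# lxml can fetch and parse a URL directly
tree = lxml.html.parse('http://example.com/listing/123').getroot()

title = tree.xpath('//h1/text()')[0]
price = tree.xpath('//span[@class="price"]/text()')[0]
print(title, price)
```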
I would still like an automated approach to help me work more efficiently. Ideally I would have a solution that, given a website URL:
- crawls the website
- organizes the webpages into groups that share the same template (a directory page will have a different HTML structure than a listing page)
- takes the group with the largest number of webpages as the listings
- compares these listing webpages to find what is static (the template) and what changes
- treats the parts that change as the dynamic data, such as descriptions and reviews (a rough sketch of this pipeline follows the list)
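Here is a minimal sketch of that pipeline, assuming Python with the third-party requests and lxml packages. Every function name and the tag-sequence fingerprint heuristic are my own assumptions for illustration, not an existing library's API.

```python
from collections import defaultdict
from urllib.parse import urljoin, urlparse

import lxml.html
import requests


def crawl(start_url, max_pages=50):
    """Breadth-first crawl of same-domain links, returning {url: html}."""
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        pages[url] = html
        for href in lxml.html.fromstring(html).xpath('//a/@href'):
            link = urljoin(url, href).split('#')[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def fingerprint(html):
    """Approximate a page's template as its sequence of tag names,
    ignoring all text, so pages built from the same template collide."""
    tree = lxml.html.fromstring(html)
    return tuple(el.tag for el in tree.iter() if isinstance(el.tag, str))


def largest_group(pages):
    """Group pages by template fingerprint; the biggest group should
    be the listing pages."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[fingerprint(html)].append(url)
    return max(groups.values(), key=len)


def dynamic_parts(pages, urls):
    """Compare text nodes position-by-position across the grouped pages;
    positions whose text differs between pages hold the dynamic data."""
    texts = [
        [t.strip() for t in lxml.html.fromstring(pages[u]).itertext()]
        for u in urls
    ]
    length = min(len(t) for t in texts)
    return [
        [t[i] for t in texts]                # one value per page
        for i in range(length)
        if len({t[i] for t in texts}) > 1    # text varies -> dynamic
    ]


if __name__ == '__main__':
    pages = crawl('http://example.com/')     # hypothetical start URL
    listings = largest_group(pages)
    for values in dynamic_parts(pages, listings):
        print(values)
```

The tag-sequence fingerprint is deliberately crude: it only matches pages whose markup is structurally identical, so a real implementation would need a fuzzier similarity measure to tolerate small per-page variations.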
Apparently this process of automatically scraping data is known in academia as wrapper induction. Unfortunately there do not seem to be any good open source solutions yet. The most commonly referenced one is Templatemaker, which is aimed at small blocks of text and crashes on my test cases of real webpages. The author stopped development in 2007.
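For reference, Templatemaker works on small example strings rather than whole pages. The snippet below is my recollection of the usage shown in its documentation and may not match the library exactly:

```python
from templatemaker import Template

# Learn a template from example strings; the library infers which
# parts are static and which are holes.
t = Template()
t.learn('<b>this and that</b>')
t.learn('<b>alex and sue</b>')

print(t.as_text('!'))                       # '<b>! and !</b>'
print(t.extract('<b>larry and curly</b>'))  # ('larry', 'curly')
```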
Some commercial groups have developed their own solutions, so this is certainly technically possible.
If I do not find an open source solution, I plan to attempt building my own later this year.