Typical web scraping job
Posted 30 Dec 2009 in big picture and business

In this post I will try to clarify what web scraping is all about by walking through a typical (though fictional) project.

First, a client contacted me through my quote form, requesting US demographic data from the official census website in a spreadsheet. I spent some time getting to know the website and found that it follows a simple hierarchy, with navigation performed by choosing options from select boxes:

Overview page / state pages / county pages | city pages

I viewed the source of these webpages and found the content I was after embedded directly in the HTML, which meant it did not rely on JavaScript and would be easier to scrape.

I emailed the client back to say that the census website was relatively small and easy to navigate, and that I could provide a spreadsheet of the census data within 3 days for $200. The client was satisfied with this arrangement, so it was time to get started.

The first step was to collect all the state page URLs from the select box using an XPath expression. I use Firefox's Firebug extension to identify the appropriate XPath. I found that the county and city pages followed the same structure, so the same XPath could be used to extract URLs from them too. Now I have all the location URLs. These URLs could have been collected manually, but that would have taken longer, been boring, and been harder to update if the website changed in future.
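As a rough illustration of this step, here is a minimal sketch that pulls the option values out of a select box with an XPath expression. The markup, the select box name "location", and the census.example base URL are all made up for the example, and it uses the limited XPath subset in Python's standard library on a well-formed snippet. Real-world HTML is usually messier and would likely need a more forgiving parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed snippet standing in for the real navigation markup.
SAMPLE_PAGE = """
<html><body>
  <form>
    <select name="location">
      <option value="">Select a state</option>
      <option value="/states/alabama.html">Alabama</option>
      <option value="/states/alaska.html">Alaska</option>
    </select>
  </form>
</body></html>
""".strip()


def extract_location_urls(page_source, base_url):
    """Return absolute URLs for every option in the select box that has a value."""
    root = ET.fromstring(page_source)
    # The XPath is an assumption about the page layout, not the real site's markup.
    options = root.findall(".//select[@name='location']/option")
    # Skip the empty placeholder option and resolve relative links.
    return [urljoin(base_url, o.get("value")) for o in options if o.get("value")]


urls = extract_location_urls(SAMPLE_PAGE, "http://census.example/")
for url in urls:
    print(url)
```

Because the county and city pages share the same structure, the same function could be applied to each downloaded state page to collect the next level of URLs.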

I set the script to download all these locations and meanwhile start work on the scraper part. Here is a sample location page with a large table for the demographic details. Again I craft a set of XPaths to extract the content.

Now I am on the home stretch. I combine these various parts together into a single script that iterates the location pages, extracts the content with XPath, and writes out the result to a CSV spreadsheet file.
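To give a feel for what such a combined script might look like, here is a small sketch that iterates over some location pages, extracts label/value pairs from a table with XPath, and writes them out as CSV. The table id "demographics", the field names, and the figures are invented placeholders rather than real census data, and the pages are assumed to be well-formed so that the standard library parser can handle them.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up pages keyed by location name; the values are placeholders, not real data.
PAGES = {
    "Exampleville": (
        "<html><body><table id='demographics'>"
        "<tr><th>Population</th><td>1000</td></tr>"
        "<tr><th>Median age</th><td>35.5</td></tr>"
        "</table></body></html>"
    ),
    "Sampletown": (
        "<html><body><table id='demographics'>"
        "<tr><th>Population</th><td>2500</td></tr>"
        "</table></body></html>"
    ),
}


def extract_row(name, page_source):
    """Turn each table row (label in th, value in td) into a dict entry."""
    root = ET.fromstring(page_source)
    row = {"location": name}
    for tr in root.findall(".//table[@id='demographics']/tr"):
        label, value = tr.findtext("th"), tr.findtext("td")
        if label and value:
            row[label.strip()] = value.strip()
    return row


def write_csv(pages, out_file, fields):
    """Write one CSV line per location; fields missing from a page are left blank."""
    writer = csv.DictWriter(out_file, fieldnames=["location"] + fields,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    for name, source in pages.items():
        writer.writerow(extract_row(name, source))


buffer = io.StringIO()
write_csv(PAGES, buffer, ["Population", "Median age"])
print(buffer.getvalue())
```

In a real job the in-memory buffer would be replaced by a file opened with newline="", as the csv module documentation recommends.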

While the webpages are still downloading I provide a sample to the client for feedback. They request separate spreadsheets for state, county, and city, which is fine. Providing updated formats is straightforward because all downloaded webpages are cached.
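The reason caching makes reformatting cheap can be sketched as follows: each page is written to disk the first time it is fetched, so later runs of the scraper read the local copy instead of hitting the website again. This is only one simple way to do it. The function names, the cache layout, and the example URL are assumptions for illustration, and a stub fetcher is used in place of a real network call.

```python
import hashlib
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_from_network(url):
    """Default fetcher: download the raw bytes of a page."""
    with urlopen(url) as response:
        return response.read()


def cached_download(url, cache_dir, fetch=fetch_from_network):
    """Return the page for url, downloading only if it is not already on disk."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filename that is safe on any filesystem.
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_bytes()
    content = fetch(url)
    path.write_bytes(content)
    return content


# Demonstration with a stub fetcher so that no real request is made.
calls = []

def fake_fetch(url):
    calls.append(url)
    return b"<html>stub</html>"

with tempfile.TemporaryDirectory() as tmp:
    url = "http://census.example/states/alabama.html"  # hypothetical URL
    first = cached_download(url, tmp, fake_fetch)
    second = cached_download(url, tmp, fake_fetch)  # served from the cache

print(len(calls))
```

With the pages stored locally, changing the output format only means rerunning the extraction and CSV steps over the cached files.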

Once the download has completed, I send the final version and an invoice. QED.
