Sometimes people ask why I use Python instead of something faster like C/C++. For me the raw speed of a language is a low priority because in my work the overwhelming majority of execution time is spent waiting for data to download rather than for instructions to execute. So it makes sense to use whatever language I can write good code in fastest, which is currently Python because of its high-level syntax and excellent ecosystem of libraries. ESR wrote an article on why he likes Python that I expect resonates with many.
Additionally Python is an interpreted language, so it is easier for me to distribute my solutions to clients than it would be with a compiled language like C. Most of my scraping jobs are relatively small, so distribution overhead matters.
A few people have suggested I use Ruby instead. I have used Ruby and like it, but found it lacks the depth of libraries available in Python.
However Python is by no means perfect - for example there are limitations with threading, Unicode handling is awkward, and distributing applications on Windows can be difficult. There are also many redundant or poorly designed built-in libraries. Some of these issues are being addressed in Python 3, others are not.
If I were ever to change language I expect it would be to something better equipped for parallel programming, like Erlang or Haskell.
As a student I was fortunate to have the opportunity to learn about web scraping, guided by Professor Timothy Baldwin. I aimed to build a tool to make scraping web pages easier, an idea born of frustration with a previous project.
My goal for this tool was that it should be possible to train a program to scrape a website simply by giving the desired outputs for some example webpages. The idea was to build a model of how to extract this content; the model could then be applied to scrape other webpages that use the same template.
I use sitescraper for much of my scraping work and sometimes make updates based on experience gained from a project. Here is some example usage:
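The library's interface has evolved over time, so rather than show possibly stale calls, here is a self-contained toy sketch of the train-by-example idea. The pages and outputs are made up, and the real library builds a far more robust model than a bare tag path:

```python
from html.parser import HTMLParser

class PathRecorder(HTMLParser):
    """Record the tag path at which each piece of text occurs."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_paths = {}
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_paths[text] = tuple(self.stack)

def train(example_html, outputs):
    """Build a model: the tag path to each desired output."""
    recorder = PathRecorder()
    recorder.feed(example_html)
    return [recorder.text_paths[output] for output in outputs]

def scrape(model, html):
    """Apply the model to another page built from the same template."""
    recorder = PathRecorder()
    recorder.feed(html)
    return [text for text, path in recorder.text_paths.items() if path in model]

page1 = '<html><body><h1>Shop</h1><ul><li>Apple</li><li>Orange</li></ul></body></html>'
model = train(page1, ['Apple', 'Orange'])
page2 = '<html><body><h1>Shop</h1><ul><li>Pear</li><li>Plum</li></ul></body></html>'
print(scrape(model, page2))  # ['Pear', 'Plum']
```

The point is the workflow: show the desired outputs once, then reuse the learned structure on every other page from the same template.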
Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let’s say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions:
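The benchmark looks roughly like this; the sample document is a stand-in (so expect different absolute numbers), and the parser libraries are imported inside their test functions so the comparison degrades gracefully if one is not installed:

```python
import re
import timeit

# A stand-in document: a small <title> followed by a large body, since the
# relative costs only show up when there is plenty of content to parse.
HTML = '<html><head><title>Example Page</title></head><body><p>' \
       + 'filler ' * 1000 + '</p></body></html>'

def regex_test():
    return re.search('<title>(.*?)</title>', HTML).group(1)

def lxml_test():
    import lxml.html
    return lxml.html.fromstring(HTML).findtext('.//title')

def bs_test():
    from bs4 import BeautifulSoup
    return BeautifulSoup(HTML, 'html.parser').title.string

if __name__ == '__main__':
    for test in (regex_test, lxml_test, bs_test):
        try:
            ms = timeit.timeit(test, number=100) * 10  # average ms per call
            print('%s took %.3f ms' % (test.__name__, ms))
        except ImportError:
            print('%s skipped (library not installed)' % test.__name__)
```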
The results are:
```
regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms
```
That means for this use case lxml takes over 45 times longer than regular expressions and BeautifulSoup over 1,300 times longer! This is because lxml and BeautifulSoup parse the entire document into their internal representation when only the title is required.
XPaths are very useful for most web scraping tasks, but there is still a use case for regular expressions.
In an earlier post I referred to XPaths but did not explain how to use them.
Say we have the following HTML document:
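For illustration, a minimal page with a list nested inside a div:

```html
<html>
 <body>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>
```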
To access the list elements we follow the HTML structure from the root tag down to the li’s:
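With the list nested inside a div, the traversal is:

```
html > body > div > ul > li
```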
An XPath to represent this traversal is:
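For example, selecting the first list item (XPath indices start at 1):

```
/html/body/div/ul/li[1]
```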
If a tag has no index then every tag of that type will be selected:
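Dropping the index selects every list item:

```
/html/body/div/ul/li
```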
XPaths can also use attributes to select nodes:
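For example, selecting the div by its id attribute rather than its position:

```
/html/body/div[@id="content"]/ul/li
```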
And instead of using an absolute XPath from the root the XPath can be relative to a particular node by using double slash:
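For example, matching the div wherever it appears in the document:

```
//div[@id="content"]/ul/li
```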
This is more reliable than an absolute XPath because it can still locate the correct content after the surrounding structure is changed.
There are other features in the XPath standard but the above are all I use regularly.
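These expressions can be tried from Python with the lxml library; the sample document (a list inside a div) is embedded so the snippet is self-contained:

```python
import lxml.html

DOC = """
<html>
 <body>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>"""

tree = lxml.html.fromstring(DOC)
# Absolute XPath with an index: only the first list item.
print(tree.xpath('/html/body/div/ul/li[1]/text()'))
# Relative XPath anchored on the id attribute: both list items.
print(tree.xpath('//div[@id="content"]/ul/li/text()'))
```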
A handy way to find the XPath of a tag is with Firefox’s Firebug extension. To do this open the HTML tab in Firebug, right click the element you are interested in, and select “Copy XPath”. (Alternatively use the “Inspect” button to select the tag.)
This will give you an XPath with indices only where there are multiple tags of the same type, such as:
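For example, a made-up path of the kind Firebug produces:

```
/html/body/div[2]/div/table/tbody/tr[3]/td[1]
```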
One thing to keep in mind is that Firefox always creates a tbody tag within tables, whether or not one existed in the original HTML. This has tripped me up a few times!
For one-off scrapes the above XPath should be fine. But for long term repeat scrapes it is better to use a relative XPath around an ID element with attributes instead of indices. From my experience such an XPath is more likely to survive minor modifications to the layout. However for a more robust solution see my SiteScraper library, which I will introduce in a later post.
HTML is a tree structure: at the root is a <html> tag followed by the <head> and <body> tags and then more tags before the content itself. However when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.
Unfortunately the HTML of many webpages around the internet is invalid - for example a list may be missing closing tags:
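For example:

```html
<ul>
 <li>First item
 <li>Second item
</ul>
```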
but it still needs to be interpreted as a proper list:
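That is, as if it had been written:

```html
<ul>
 <li>First item</li>
 <li>Second item</li>
</ul>
```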
This means we can’t naively parse HTML by assuming a tag ends when we find the next closing tag. Instead it is best to use one of the many HTML parsing libraries available, such as BeautifulSoup, lxml, html5lib, and libxml2dom.
Seemingly the best known and most used such library is BeautifulSoup. A Google search for "Python web scraping module" currently returns BeautifulSoup as the first result.
However I instead use lxml because I find it more robust when parsing bad HTML. Additionally Ian Bicking found lxml more efficient than the other parsing libraries, though my priority is accuracy over speed.
You will need version 2 or later of lxml, which includes the html module. This meant compiling lxml myself on Ubuntu releases up to 8.10, which shipped with an earlier version.
Here is an example of how to parse the previous broken HTML with lxml:
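A minimal version with the broken list embedded inline rather than downloaded; lxml repairs the missing closing tags automatically:

```python
import lxml.html

# The same malformed list as above: the </li> closing tags are missing.
BROKEN = '<ul><li>First item<li>Second item</ul>'

tree = lxml.html.fromstring(BROKEN)
# lxml has inferred where each list item ends, so both are separate elements.
print([li.text for li in tree.xpath('//li')])  # ['First item', 'Second item']
# Serializing the tree shows the repaired markup with closing tags added.
print(lxml.html.tostring(tree))
```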
In this post I will try to clarify what web scraping is all about by walking through a typical (though fictional) project.
Firstly a client contacted me through my quote form requesting US demographic data in a spreadsheet, taken from the official census website. I spent some time getting to know this website and found it followed a simple hierarchy, with navigation performed through selecting options from select boxes:
overview page → state pages → county pages → city pages
I emailed the client back that the census website was relatively small and easily navigable, and that I would be able to provide a spreadsheet of the census data within 3 days for $200. The client was satisfied with this arrangement, so it was time to get started.
The first step was to collect all the state page URLs from the select box using an XPath expression. I used Firefox's Firebug extension to identify the appropriate XPath. I found the county and city pages followed the same structure, so this XPath could be used to extract URLs from them too. Now I had all the location URLs. These URLs could have been collected manually, but that would take longer, be boring, and be harder to update if the website changed in future.
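The URL collection step looks roughly like this. The census site's real markup and URLs are not shown in this post, so the select box name, paths, and domain below are invented for illustration:

```python
import lxml.html
from urllib.parse import urljoin

# A made-up navigation page standing in for the real census site.
PAGE = """
<form>
 <select name="state">
  <option value="/state/alabama">Alabama</option>
  <option value="/state/alaska">Alaska</option>
 </select>
</form>"""

tree = lxml.html.fromstring(PAGE)
# Pull the value attribute of every option in the select box and resolve
# each relative path against the (hypothetical) site's base URL.
urls = [urljoin('http://census.example.com/', value)
        for value in tree.xpath('//select[@name="state"]/option/@value')]
print(urls)
```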
I set the script downloading all these locations and meanwhile started work on the scraper part. Here is a sample location page with a large table for the demographic details. Again I crafted a set of XPaths to extract the content.
Now I am on the home stretch. I combined these various parts into a single script that iterates over the location pages, extracts the content with XPath, and writes out the results to a CSV spreadsheet file.
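The shape of that final script can be sketched as follows. The field names and page markup are invented, since the real pages are not reproduced here, and the cached pages are represented by inline strings:

```python
import csv
import io
import lxml.html

# Stand-ins for the cached location pages (normally read back from disk).
PAGES = [
    '<html><body><h1>Springfield</h1><table>'
    '<tr><th>Population</th><td>30720</td></tr></table></body></html>',
    '<html><body><h1>Shelbyville</h1><table>'
    '<tr><th>Population</th><td>21215</td></tr></table></body></html>',
]

output = io.StringIO()  # in a real run this would be open('census.csv', 'w')
writer = csv.writer(output)
writer.writerow(['location', 'population'])
for html in PAGES:
    tree = lxml.html.fromstring(html)
    writer.writerow([
        tree.xpath('//h1/text()')[0],   # location name
        tree.xpath('//td/text()')[0],   # population figure
    ])
print(output.getvalue())
```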
While the webpages are still downloading I provide a sample to the client for feedback. They request separate spreadsheets for state, county, and city, which is fine. Providing updated formats is straightforward because all downloaded webpages are cached.
When downloading has completed I send the final version and an invoice. QED