Posted 10 Aug 2011 in big picture

I am often asked whether I can extract data from a particular website.

And the answer is always yes - if the data is publicly available then it can be extracted. The majority of websites are straightforward to scrape; however, some are more difficult and may not be practical to scrape if you have time or budget restrictions.

For example if the website restricts how many pages each IP address can access then it could take months to download the entire website. In that case I can use proxies to provide me with multiple IP addresses and download the data faster, but this can get expensive if many proxies are required.
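
For example, a minimal sketch of how requests could be rotated across a set of proxies with urllib2 (the proxy addresses here are just placeholders):

    import random
    import urllib2

    # placeholder proxy addresses - these would be replaced with real proxies
    proxies = ['1.2.3.4:8000', '5.6.7.8:8000', '9.10.11.12:8000']

    def download(url):
        # pick a proxy at random so requests are spread across multiple IP addresses
        opener = urllib2.build_opener(urllib2.ProxyHandler({'http': random.choice(proxies)}))
        return opener.open(url).read()

    html = download('http://example.com/page1')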

If the website uses JavaScript and AJAX to load its data then I usually use a tool like Firebug to reverse engineer how the website works, and then call the appropriate AJAX URLs directly. And if the JavaScript is obfuscated or particularly complicated I can use a browser renderer like WebKit to execute the JavaScript and provide me with the final HTML.
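
For example, once the relevant AJAX URL has been found with Firebug it can be requested directly and the JSON response parsed, skipping the rendered page entirely. A rough sketch with a made-up endpoint and parameters:

    import json
    import urllib2

    # made-up AJAX endpoint, just for illustration
    url = 'http://example.com/ajax/search.json?q=test&page=1'
    request = urllib2.Request(url, headers={'X-Requested-With': 'XMLHttpRequest'})
    data = json.loads(urllib2.urlopen(request).read())
    for record in data.get('results', []):
        print record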

Another difficulty is if the website uses CAPTCHAs or stores its data in images. Then I would need to try parsing the images with OCR, or hire people (with cheaper hourly costs) to manually interpret the images.
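
For straightforward images (rather than CAPTCHAs designed specifically to defeat OCR), a first attempt could be running the tesseract OCR tool over each downloaded image. A minimal sketch, assuming tesseract is installed:

    import subprocess

    def ocr(image_filename):
        # tesseract writes its result to <output base>.txt
        subprocess.check_call(['tesseract', image_filename, 'result'])
        return open('result.txt').read()

    # 'price.png' is a placeholder image filename
    print ocr('price.png')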

In summary I can always extract publicly available data from a website, but the time and cost required will vary.


Posted 20 Jul 2011 in user-agent

Your web browser will send what is known as a “User Agent” for every page you access. This is a string to tell the server what kind of device you are accessing the page with. Here are some common User Agent strings:

  • Firefox on Windows XP: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
  • Chrome on Linux: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
  • Internet Explorer on Windows XP: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
  • Opera on Windows XP: Opera/9.00 (Windows NT 5.1; U; en)
  • Android: Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3
  • iPhone: Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3
  • BlackBerry: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, Like Gecko) Version/6.0.0.141 Mobile Safari/534.1+
  • Python urllib: Python-urllib/2.1
  • Old Google Bot: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
  • New Google Bot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • MSN Bot: msnbot/1.1 (+http://search.msn.com/msnbot.htm)
  • Yahoo Bot: Yahoo! Slurp/Site Explorer

You can find your own current User Agent by searching online for “what is my user agent”.

Some webpages will use the User Agent to display content that is customized to your particular browser. For example if your User Agent indicates you are using an old browser then the website may return the plain HTML version without any AJAX features, which may be easier to scrape.

Some websites will automatically block certain User Agents, for example if your User Agent indicates you are accessing their server with a script rather than a regular web browser.

Fortunately it is easy to set your User Agent to whatever you like:

  • For Firefox you can use the User Agent Switcher extension.
  • For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: chromium-browser --user-agent="my custom user agent"
  • For Internet Explorer you can use the UAPick extension.
  • And for Python scripts you can set the User-Agent header with:

    import urllib2

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'my custom user agent')]
    opener.open('http://www.google.com')

Using the default User Agent for your scraper is a common reason to be blocked, so don’t forget to change it.


Posted 05 Jul 2011 in ajax and mobile

Sometimes a website will have multiple versions: one for regular users with a modern browser, an HTML version for browsers that don’t support JavaScript, and a simplified version for mobile users.

For example Gmail has:

  • the standard AJAX interface at gmail.com
  • a basic HTML interface for browsers that don’t support JavaScript
  • a simplified mobile interface

All three of these interfaces will display the content of your emails but use different layouts and features. The main entrance at gmail.com is well known for its use of AJAX to load content dynamically without refreshing the page. This leads to a better user experience but makes web automation or scraping harder.

On the other hand the static HTML interface has fewer features and is less efficient for users, but much easier to automate or scrape because all the content is available when the page loads.

So before scraping a website, check whether it has an HTML or mobile version, which, when they exist, are usually easier to scrape.

To find the HTML version try disabling JavaScript in your browser and see what happens.
To find the mobile version try adding the “m” subdomain (domain.com -> m.domain.com) or using a mobile user-agent.
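
Both checks can also be scripted. A rough sketch with urllib2 (the domain is a placeholder):

    import urllib2

    domain = 'example.com'  # placeholder domain

    # check whether an "m" subdomain exists
    try:
        print urllib2.urlopen('http://m.' + domain).geturl()
    except urllib2.URLError, e:
        print 'No mobile subdomain:', e

    # request the main site with an iPhone User Agent to see if a mobile version is served
    request = urllib2.Request('http://' + domain, headers={
        'User-Agent': 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3'})
    print len(urllib2.urlopen(request).read()), 'bytes returned'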


Posted 30 Jun 2011 in flash

Google has released a tool called Swiffy for converting Flash files into HTML5. This is relevant to web scraping because content embedded in Flash is a pain to extract, as I wrote about earlier.

I tried some test files and found the results no more useful for parsing text content than the output produced by swf2html (Linux version). Some neat example conversions are available on the Swiffy site. Currently Swiffy supports ActionScript 2.0 and works best with Flash 5, which was released back in 2000, so there is still a lot of work to do.


Posted 19 Jun 2011 in gae and google

Most of the discussion about Google App Engine seems to focus on how it allows you to scale your app; however, I find it most useful for small client apps, where we want a reliable platform while avoiding any ongoing hosting fee. For large apps paying for hosting would not be a problem.

These are some of the downsides I have found using Google App Engine:

  • Slow - if your app has not been accessed recently (within the last minute or so) then it can take up to 10 seconds to load for the user
  • Pure Python/Java code only - this prevents using a lot of good libraries, most importantly for me lxml
  • CPU quota easily gets exhausted when uploading data
  • Proxies not supported, which makes apps that rely on external websites risky. For example the Twitter API has a per IP quota which you would be sharing with all other GAE apps.
  • Blocked in some countries, such as Turkey
  • Indexes - the free quota is 1 GB but often over half of this is taken up by indexes
  • Maximum 1000 records per query
  • 20 second request limit, so longer jobs often need the overhead of using Task Queues (see the sketch after this list)
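
For example, a minimal sketch of pushing slow work onto a Task Queue (the worker URL and parameters are made up for illustration):

    from google.appengine.api import taskqueue

    # pages that would take too long to download within a single request
    urls = ['http://example.com/page%d' % i for i in range(100)]

    # queue one task per page; each task is handled by a separate /worker
    # request with its own deadline
    for url in urls:
        taskqueue.add(url='/worker', params={'url': url})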

Despite these problems I still find Google App Engine a fantastic platform and a pleasure to develop on.


Posted 29 May 2011 in cache, crawling, and google

I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage so it is helpful to have backup options.

One option is using Google Translate, which lets you translate a webpage into another language. If the source language is selected as something you know it is not (e.g. Dutch) then no translation will take place and you will just get back the original content.

I added a function to download a URL via Google Translate and Google Cache to the webscraping library. Here is an example:

from webscraping import download, xpath  
  
D = download.Download()  
url = 'http://webscraping.com/faq'  
html1 = D.get(url) # download directly  
html2 = D.gcache_get(url) # download via Google Cache  
html3 = D.gtrans_get(url) # download via Google Translate  
for html in (html1, html2, html3):  
    print xpath.get(html, '//title')

This example downloads the same webpage directly, via Google Cache, and via Google Translate. Then it parses the title to show the same webpage has been downloaded. The output when run is:

Frequently asked questions | webscraping
Frequently asked questions | webscraping
Frequently asked questions | webscraping

The same title was extracted from each source, which shows that the correct result was downloaded from Google Cache and Google Translate.