Posted 20 Jul 2011 in user-agent

Your web browser sends what is known as a “User Agent” with every page you request. This string tells the server what kind of browser and device you are accessing the page with. Here are some common User Agent strings:

Browser                              User Agent
Firefox on Windows XP                Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
Chrome on Linux                      Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
Internet Explorer on Windows XP      Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
Opera on Windows XP                  Opera/9.00 (Windows NT 5.1; U; en)
Android                              Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3
iPhone                               Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3
BlackBerry                           Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.141 Mobile Safari/534.1+
Python urllib                        Python-urllib/2.1
Old Google Bot                       Googlebot/2.1 (+http://www.googlebot.com/bot.html)
New Google Bot                       Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
MSN Bot                              msnbot/1.1 (+http://search.msn.com/msnbot.htm)
Yahoo Bot                            Yahoo! Slurp/Site Explorer

You can find your own current User Agent here.

Some webpages will use the User Agent to display content that is customized to your particular browser. For example, if your User Agent indicates you are using an old browser then the website may return the plain HTML version without any AJAX features, which may be easier to scrape.

Some websites will automatically block certain User Agents, for example if your User Agent indicates you are accessing their server with a script rather than a regular web browser.

Fortunately it is easy to set your User Agent to whatever you like:

  • For Firefox you can use the User Agent Switcher extension.
  • For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: chromium-browser --user-agent="my custom user agent"
  • For Internet Explorer you can use the UAPick extension.
  • And for Python scripts you can set the User-Agent header when building your opener:

    import urllib2

    # replace the default Python-urllib User Agent with a custom one
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'my custom user agent')]
    opener.open('http://www.google.com')

Using the default User Agent is a common reason for a scraper to be blocked, so don’t forget to change it.


Posted 05 Jul 2011 in mobileajax

Sometimes a website will have multiple versions: one for regular users with a modern browser, an HTML version for browsers that don’t support JavaScript, and a simplified version for mobile users.

For example, Gmail has three versions:

  • the standard AJAX interface at gmail.com
  • a basic HTML interface for browsers without JavaScript
  • a simplified mobile interface

All three of these interfaces will display the content of your emails but use different layouts and features. The main entrance at gmail.com is well known for its use of AJAX to load content dynamically without refreshing the page. This leads to a better user experience but makes web automation or scraping harder.

On the other hand, the static HTML interface has fewer features and is less efficient for users, but it is much easier to automate or scrape because all the content is available when the page loads.

So before scraping a website, check for an HTML or mobile version, which, when it exists, should be easier to scrape.

To find the HTML version, try disabling JavaScript in your browser and see what happens.
To find the mobile version, try adding the “m” subdomain (domain.com -> m.domain.com) or using a mobile User Agent.
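
For the second check, here is a minimal sketch in Python 2 (domain.com is a placeholder, and the iPhone User Agent string is taken from the table in the user-agent post above) of requesting a page with a mobile User Agent to see whether a mobile version is served:

import urllib2

# iPhone User Agent string from the table in the user-agent post
MOBILE_UA = ('Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ '
             '(KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3')

request = urllib2.Request('http://domain.com', headers={'User-Agent': MOBILE_UA})
response = urllib2.urlopen(request)
# a redirect to m.domain.com or a much smaller page suggests a mobile version
print response.geturl()
print len(response.read())

If the final URL changes to the “m” subdomain or the returned HTML is much smaller, the site probably has a mobile version worth scraping instead.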


Posted 30 Jun 2011 in flash

Google has released a tool called Swiffy that converts Flash files into HTML5. This is relevant to web scraping because content embedded in Flash is a pain to extract, as I wrote about earlier.

I tried some test files and found the results no more useful for extracting text content than the output produced by swf2html (Linux version). Some neat example conversions are available here. Currently Swiffy supports ActionScript 2.0 and works best with Flash 5, which was released back in 2000, so there is still a lot of work to do.


Posted 19 Jun 2011 in googlegae

Most of the discussion about Google App Engine seems to focus on how it allows you to scale your app; however, I find it most useful for small client apps where we want a reliable platform while avoiding any ongoing hosting fee. For large apps, paying for hosting would not be a problem.

These are some of the downsides I have found using Google App Engine:

  • Slow - if your app has not been accessed recently (within the last minute) then it can take up to 10 seconds to load for the user
  • Pure Python/Java code only - this prevents using a lot of good libraries, most importantly for me lxml
  • CPU quota is easily exhausted when uploading data
  • Proxies are not supported, which makes apps that rely on external websites risky. For example, the Twitter API has a per-IP quota, which you would be sharing with all other GAE apps.
  • Blocked in some countries, such as Turkey
  • Indexes - the free quota is 1 GB, but often over half of this is taken up by indexes
  • Maximum of 1000 records per query
  • 30 second request limit, so you often need the overhead of using Task Queues (see the sketch after this list)
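
To illustrate the last two points, here is a minimal sketch (the Record model is hypothetical) that batches datastore work through query cursors and the deferred task queue library, so no single request hits the 1000-record limit or the request deadline:

from google.appengine.ext import db, deferred

class Record(db.Model):
    # hypothetical model used only to illustrate batching
    processed = db.BooleanProperty(default=False)

def process_records(cursor=None):
    query = Record.all()
    if cursor:
        # resume where the previous batch stopped
        query.with_cursor(cursor)
    records = query.fetch(100)  # small batches stay well inside the deadline
    for record in records:
        record.processed = True
    db.put(records)
    if len(records) == 100:
        # more records may remain, so queue the next batch as its own task
        deferred.defer(process_records, query.cursor())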

Despite these problems I still find Google App Engine a fantastic platform and a pleasure to develop on.


Posted 29 May 2011 in googlecrawlingcache

I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage, so it is helpful to have backup options.

One option is using Google Translate, which lets you translate a webpage into another language. If you select a source language that you know the page is not written in (e.g. Dutch) then no translation will take place and you will just get back the original content.

I added a function to download a URL via Google Translate and Google Cache to the webscraping library. Here is an example:

from webscraping import download, xpath  
  
D = download.Download()  
url = 'http://webscraping.com/faq'  
html1 = D.get(url) # download directly  
html2 = D.gcache_get(url) # download via Google Cache  
html3 = D.gtrans_get(url) # download via Google Translate  
for html in (html1, html2, html3):  
    print xpath.get(html, '//title')

This example downloads the same webpage directly, via Google Cache, and via Google Translate, then parses the title of each result to show that the same webpage has been downloaded. The output when run is:

Frequently asked questions | webscraping
Frequently asked questions | webscraping
Frequently asked questions | webscraping

The same title was extracted from each source, which shows that the correct result was downloaded from Google Cache and Google Translate.


Posted 15 May 2011 in googlecachecrawling

Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data then downloading it quickly would require an expensive number of proxies.

Fortunately there is an alternative - Google.

If a website doesn’t exist in Google’s search results then for most people it doesn’t exist at all. Websites want visitors, so they will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want, and after downloading, Google makes much of the content available through its cache.

So instead of downloading a URL we want directly, we can download it indirectly via http://www.google.com/search?q=cache%3Ahttp%3A//webscraping.com. Then the source website cannot block you and does not even know you are crawling its content.
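
Here is a minimal hand-rolled sketch of this idea in Python 2 (the same idea as the gcache_get function in the webscraping library from the previous post, though not its actual implementation; the Firefox User Agent string is reused from the table in the user-agent post):

import urllib
import urllib2

def gcache_get(url):
    # build the Google Cache URL for the page we actually want
    cache_url = 'http://www.google.com/search?q=' + urllib.quote('cache:' + url)
    request = urllib2.Request(cache_url)
    # use a browser-like User Agent, since Google blocks the default Python one
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; '
        'en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6')
    return urllib2.urlopen(request).read()

html = gcache_get('http://webscraping.com/faq')

Of course Google will itself block clients that request pages too quickly, so the requests still need to be throttled.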