Posted 02 Mar 2010 in chickenfoot and javascript

The data from most webpages can be scraped by simply downloading the HTML and then parsing out the desired content. However, some webpages load their content dynamically with JavaScript after the page loads, so the desired data is not found in the original HTML. This is usually done for legitimate reasons, such as making the page load faster, but in some cases it is designed solely to inhibit scrapers. Either way it makes scraping a little tougher, but not impossible.

The easiest case is when the content is stored in JavaScript data structures that are inserted into the DOM at page load. This means the content is still embedded in the HTML, but it needs to be scraped from the JavaScript code rather than from the HTML tags.
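
For example, if a page embeds its records in a JavaScript array, a regular expression can pull that structure out of the downloaded HTML and, if it happens to be valid JSON, the json module can parse it. Here is a minimal sketch in Python; the variable name products and the URL are illustrative, not from any particular site:

import re
import json
import urllib2

def extract_js_data(html):
    # look for a JavaScript assignment such as: var products = [...];
    # 'products' is an illustrative name - adjust the pattern to the page you are scraping
    match = re.search(r'var\s+products\s*=\s*(\[.*?\]);', html, re.DOTALL)
    if match:
        return json.loads(match.group(1))

html = urllib2.urlopen('http://example.com/page').read()
print extract_js_data(html)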

A trickier case is when websites encode their content in the HTML and then use JavaScript to decode it on page load. It is possible to port such functions to Python and run them over the downloaded HTML, but often an easier and quicker alternative is to execute the original JavaScript. One tool for doing this is the Chickenfoot Firefox extension. Chickenfoot provides a Firefox panel where you can execute arbitrary JavaScript code within a webpage and across multiple webpages. It also comes with a number of high-level functions to make interaction and navigation easier.

To get a feel for Chickenfoot, here is an example that crawls a website:

// crawl the given website url recursively to the given depth
function crawl(website, max_depth, links) {
  if(!links) {
    // first call: load the starting page and record it as visited
    links = {};
    go(website);
    links[website] = 1;
  }

  // TODO: insert code to act on current webpage here

  if(max_depth > 0) {
    // iterate over the links found on the current page
    for(var link=find("link"); link.hasMatch; link=link.next) {
      var url = link.element.href;
      if(!links[url] && url.indexOf(website) == 0) {
        // an unvisited link on the same domain - follow it
        go(url);
        links[url] = 1;
        crawl(website, max_depth - 1, links);
      }
    }
  }
  back(); wait(); // return to the previous page
}

This is part of a script I built on my Linux machine for a client on Windows and it worked fine for both of us. To find out more about Chickenfoot check out their video.

Chickenfoot is a useful weapon in my web scraping arsenal, particularly for quick jobs with a low to medium amount of data. For larger websites there is a more suitable alternative, which I will cover in the next post.


Posted 08 Feb 2010 in crawling, proxies, and user-agent

Websites want users who will purchase their products and click on their advertising. They want to be crawled by search engines so their users can find them; however, they don't (generally) want to be crawled by anyone else. One such company, ironically, is Google.

Some websites will actively try to stop scrapers, so here are some suggestions to help you stay beneath their radar.

Speed

If you download 1 webpage a day then you will not be blocked, but your crawl would take too long to be useful. If you instead use threading to crawl multiple URLs asynchronously then they might mistake you for a DoS attack and blacklist your IP. So what is the happy medium? The Wikipedia article on web crawlers currently states that "anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3-4 minutes". This is a little slow and I have found 1 download every 5 seconds is usually fine. If you don't need the data quickly then use a longer delay to reduce your risk and be kinder to their server.
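
As a rough sketch of what this looks like in practice with urllib2 and a fixed 5 second delay (the URL list here is illustrative):

import time
import urllib2

DELAY = 5  # seconds to wait between downloads

urls = ['http://example.com/page1', 'http://example.com/page2']
for url in urls:
    html = urllib2.urlopen(url).read()
    # ... extract the data you need from html here ...
    time.sleep(DELAY)  # pause before requesting the next page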

Identity

Websites do not want to block genuine users, so you should try to look like one. Set your user-agent to that of a common web browser instead of using the library default (such as wget/version or urllib/version). You could even pretend to be the Google Bot (only for the brave): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
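
With urllib2 this just means sending a User-Agent header with each request. A minimal sketch; the user-agent string below is one example of a common Firefox browser string:

import urllib2

# pretend to be a regular browser rather than the urllib2 default
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6'}
request = urllib2.Request('http://example.com', headers=headers)
html = urllib2.urlopen(request).read()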

If you have access to multiple IP addresses (for example via proxies) then distribute your requests among them so that your downloading appears to come from multiple users.
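
For example, with urllib2 each request can be routed through a different proxy by cycling through whatever proxies you have available (the addresses below are placeholders):

import itertools
import urllib2

# placeholder proxy addresses - replace with proxies you have access to
proxies = itertools.cycle(['1.2.3.4:8080', '5.6.7.8:8080'])

def download(url):
    # send this request through the next proxy in the rotation
    proxy = proxies.next()
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
    return opener.open(url).read()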

Consistency

Avoid accessing webpages sequentially (/product/1, /product/2, etc.) and don't download a new webpage exactly every N seconds. Both of these patterns attract attention to your downloading because a real user browses more randomly. So crawl webpages in an unordered manner and add a random offset to the delay between downloads.
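
A small sketch of both ideas together, shuffling the crawl order and adding a random offset to the delay (the URLs are illustrative):

import time
import random
import urllib2

urls = ['http://example.com/product/%d' % i for i in range(1, 101)]
random.shuffle(urls)  # avoid requesting the pages in sequential order

for url in urls:
    html = urllib2.urlopen(url).read()
    # ... extract the data you need from html here ...
    time.sleep(5 + random.uniform(-2, 2))  # roughly every 5 seconds, but never exactly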

Following these recommendations will allow you to crawl most websites without being detected.


Posted 05 Feb 2010 in CAPTCHA, IP, OCR, and google

You spent time and money collecting the data on your website, so you want to prevent someone else from downloading and reusing it. However, you still want Google to index your website so that people can find you. This is a common dilemma. Below I outline some strategies to protect your data.

Restrict

Firstly, if your data really is valuable then perhaps it shouldn't all be publicly available. Websites often display the basic data to standard users and search engines, and show the more valuable data (such as email addresses) only to logged-in users. The website can then easily track and control how much valuable data each account is accessing.

If requiring accounts isn’t practical and you want search engines to crawl your content then realistically you can’t prevent it being scraped, but you can discourage scrapers by setting a high enough barrier.

Obfuscate

Scrapers typically work by downloading the HTML for a URL and then extracting the desired content. To make this process harder you can obfuscate your valuable data.

The simplest way to obfuscate your data is to have it encoded on the server and then dynamically decoded with JavaScript in the client's browser. The scraper then needs to replicate this decoding to extract the original data. This is not difficult for an experienced scraper, but it at least provides a small barrier.
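
For instance, a site might store email addresses base64-encoded in its HTML and decode them with JavaScript when the page loads. The scraper's counter-move is simply to perform the same decoding in Python; this is only an illustration of the idea, not any particular site's scheme:

import base64

# the HTML might contain something like: <span class="email">am9obkBleGFtcGxlLmNvbQ==</span>
encoded = 'am9obkBleGFtcGxlLmNvbQ=='
print base64.b64decode(encoded)  # prints: john@example.com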

A better way is to encapsulate the key data within images or Flash. Optical Character Recognition (OCR) techniques would then be needed to extract the original data, which requires a lot of effort to do accurately. (Make sure the URL of the image does not reveal the original data, as one website did!) The free OCR tools that I have tested are at best 80% accurate, which makes the resulting data useless.

The tradeoff with encoding data in images is that there is more data for the client to download, and it prevents genuine users from conveniently copying the text. For example, people often display their email address within an image to combat spammers, which then forces everyone else to type it out manually.

Challenge

A popular way to prevent automated scraping is to force users to pass a CAPTCHA. For example, Google does this when it gets too many search requests from the same IP within a time frame. To avoid the CAPTCHA the scraper could proceed slowly, but they probably can't afford to wait. To speed up the rate they may purchase multiple anonymous proxies to provide multiple IP addresses, but that is expensive: 10 anonymous proxies cost around $30 / month to rent. The CAPTCHA can also be solved automatically by a service like deathbycaptcha.com. This takes some effort to set up, so it would only be attempted by experienced scrapers going after valuable data.

CAPTCHAs are not a good solution for protecting your content: they annoy genuine users, can be bypassed by a determined scraper, and also make it difficult for the Google Bot to index your website properly. They are only a good option when being indexed by Google is not a priority and you want to stop most scrapers.

Corrupt

If you are suspicious of an IP that is accessing your website you could block it, but then the scraper would know they have been detected and try a different approach. Instead you could let the IP continue downloading but return incorrect text or figures. This should be done subtly, so that it is not clear which data is correct and their entire data set becomes suspect. Perhaps they won't notice, and you will be able to track them down later by searching for purple monkey dishwasher or whatever other content was inserted!
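
A sketch of the idea on the server side, independent of any web framework; the blacklist and the corruption rule here are purely illustrative:

import random

SUSPICIOUS_IPS = set(['1.2.3.4'])  # addresses you have flagged as likely scrapers

def display_price(ip, price):
    # genuine users get the real figure; suspected scrapers get a subtly wrong one
    if ip in SUSPICIOUS_IPS:
        return price * random.uniform(0.9, 1.1)
    return price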

Structure

Another factor that makes sites easy to scrape is when they use a URL structure like:
domain/product/product_title/product_id
For example, on Amazon two URLs that share the same product ID but have different title segments point to the same content.

The title segment is just there to make the URL look pretty. This makes the site easy to crawl, because a scraper can simply iterate through all the IDs (in this case ISBNs). If the title had to be consistent with the product ID then it would take more work to scrape.

Google

All of the above strategies can be relaxed for the Google Bot to ensure your website is properly indexed. Be aware that anyone can pretend to be the Google Bot by setting their user-agent to Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html), so to be confident you should also verify the IP address with a reverse DNS lookup. Be warned that Google has been known to punish websites that display different content to its bot than to regular users.
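
Here is a minimal sketch of that reverse DNS check using Python's standard socket module; the IP address would come from your access logs:

import socket

def is_googlebot(ip):
    # reverse lookup: a genuine Google Bot resolves to a googlebot.com / google.com hostname
    hostname = socket.gethostbyaddr(ip)[0]
    if not hostname.endswith('.googlebot.com') and not hostname.endswith('.google.com'):
        return False
    # forward-confirm that the hostname resolves back to the same IP
    return socket.gethostbyname(hostname) == ip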

In the next post I will take the opposite point of view of someone trying to scrape a website.


Posted 02 Feb 2010 in python

Sometimes people ask why I use Python instead of something faster like C/C++. For me the speed of the language is a low priority, because in my work the overwhelming majority of execution time is spent waiting for data to be downloaded rather than for instructions to finish executing. So it makes sense to use whatever language I can write good code in fastest, which is currently Python because of its high-level syntax and excellent ecosystem of libraries. ESR wrote an article on why he likes Python that I expect resonates with many.

Additionally, Python is an interpreted language, so it is easier for me to distribute my solutions to clients than it would be with a compiled language like C. Most of my scraping jobs are relatively small, so the overhead of distribution matters.

A few people have suggested I use Ruby instead. I have used Ruby and like it, but found it lacks the depth of libraries available in Python.

However, Python is by no means perfect: there are limitations with threading, working with Unicode is awkward, and distributing on Windows can be difficult. There are also many redundant or poorly designed builtin libraries. Some of these issues are being addressed in Python 3, some are not.

If I were ever to change languages I expect it would be to something better equipped for parallel programming, like Erlang or Haskell.


Posted 29 Jan 2010 in opensource and sitescraper

As a student I was fortunate to have the opportunity to learn about web scraping, guided by Professor Timothy Baldwin. Out of frustration with a previous project, I aimed to build a tool that would make scraping webpages easier.

My goal for this tool was that it should be possible to train a program to scrape a website just by giving the desired outputs for some example webpages. The idea was to build a model of how to extract this content, which could then be applied to scrape other webpages that use the same template.

The tool was eventually called sitescraper and is available for download on bitbucket. For more information have a browse of this paper, which covers the implementation and results in detail.

I use sitescraper for much of my scraping work and sometimes make updates based on experience gained from a project. Here is some example usage:

>>> from sitescraper import sitescraper
>>> ss = sitescraper()  
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition",   
  "Programming in Python 3: A Complete Introduction to the Python Language",
  "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]  
>>> ss.add(url, data)  
>>> # we can add multiple example cases,
>>> # but this is a simple example so one will do (I generally use 3)  
>>> # ss.add(url2, data2)   
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", [
    "A Practical Guide to Linux(R) Commands, Editors, and Shell Programming", 
    "Linux Pocket Guide", 
    "Linux in a Nutshell (In a Nutshell (O'Reilly))", 
    'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 
    'Linux Bible, 2008 Edition'
]]


Posted 20 Jan 2010 in python and regex

Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let’s say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions:

import re
import time
import urllib2
from BeautifulSoup import BeautifulSoup
from lxml import html as lxmlhtml


def timeit(fn, *args):
    # time 100 calls of fn and report the total in milliseconds
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2-t1)*1000.0)
    
    
def bs_test(html):
    soup = BeautifulSoup(html)
    return soup.html.head.title
    
def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()
    
def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)[0]
    
    
if __name__ == '__main__':
    url = 'http://webscraping.com/blog/Web-scraping-with-regular-expressions/'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)

The results are:

regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms

That means for this use case lxml takes over 40 times longer than regular expressions, and BeautifulSoup over 1000 times longer! This is because lxml and BeautifulSoup parse the entire document into their internal formats, when only the title is required.

XPaths are very useful for most web scraping tasks, but there is still a use case for regular expressions.