Blog

  • Google App Engine limitations

    Google Gae

    Most of the discussion about Google App Engine seems to focus on how it allows you to scale your app. However, I find it most useful for small client apps, where we want a reliable platform while avoiding any ongoing hosting fee. For large apps, paying for hosting would not be a problem.

    These are some of the downsides I have found using Google App Engine:

  • Using Google Translate to crawl a website

    Google Crawling Cache

    I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage so it is helpful to have backup options.

    One option is Google Translate, which lets you translate a webpage into another language. If you set the source language to one you know the page is not written in (e.g. Dutch), then no translation takes place and you just get back the original content:
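    As a rough illustration of the trick, the sketch below builds such a Translate URL. The `translate.google.com/translate` endpoint and its `sl`, `tl` and `u` parameters are assumptions based on the commonly seen URL format, and `translate_url` is a hypothetical helper name, so check them against the live service before relying on this.

```python
from urllib.parse import urlencode


def translate_url(page_url, source_lang="nl", target_lang="en"):
    """Build a Google Translate URL that fetches page_url on our behalf.

    The endpoint and parameter names (sl, tl, u) are assumptions.
    source_lang is deliberately a language the page is NOT written in
    (Dutch here), so the content should come back untranslated.
    """
    query = urlencode({"sl": source_lang, "tl": target_lang, "u": page_url})
    return "http://translate.google.com/translate?" + query


if __name__ == "__main__":
    print(translate_url("http://example.com/page"))
```

    The returned URL can then be downloaded in place of the original page.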

  • Using Google Cache to crawl a website

    Google Cache Crawling

    Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data, then downloading it quickly would require a large and expensive pool of proxies.

    Fortunately there is an alternative - Google.

    If a website doesn’t exist in Google’s search results then for most people it doesn’t exist at all. Websites want visitors, so they will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want. After downloading, Google makes much of the content available through its cache.
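    A minimal sketch of how a cached copy might be requested instead of the live page. The `webcache.googleusercontent.com/search?q=cache:` address is an assumption based on the URL format Google has historically used for cached pages and may no longer be served, and `cache_url` is a hypothetical helper name.

```python
from urllib.parse import quote


def cache_url(page_url):
    """Return the (assumed) Google Cache address for page_url."""
    # Percent-encode the whole target URL so it survives as a single query value
    return "http://webcache.googleusercontent.com/search?q=cache:" + quote(page_url, safe="")


if __name__ == "__main__":
    print(cache_url("http://example.com/page"))
```

    The crawler then downloads the cache URL rather than the original, so the target website never sees the request.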

  • Crawling with threads

    Concurrent Python

    The bottleneck for web scraping is generally the network - the time spent waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.
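    One way this could look, using a thread pool from the Python standard library. This is a sketch rather than the original implementation: `download`, `threaded_crawl` and the default of 5 threads are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def download(url, timeout=30):
    """Fetch a single URL and return the raw response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def threaded_crawl(urls, fetch=download, num_threads=5):
    """Download urls concurrently and return {url: content or exception}."""
    urls = list(urls)

    def worker(url):
        # Capture errors so one failed page does not abort the whole crawl
        try:
            return fetch(url)
        except Exception as error:
            return error

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # executor.map yields results in the same order as the input urls
        return dict(zip(urls, executor.map(worker, urls)))


if __name__ == "__main__":
    pages = threaded_crawl(["http://example.com/"])
    for url, content in pages.items():
        print(url, type(content).__name__)
```

    While one thread waits on a response the others keep downloading, so total time approaches that of the slowest batch rather than the sum of all requests.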

  • Google Storage

    Google

    Often the data sets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to Google Storage.
    Here is an example snippet to create a bucket on GS, upload a file, and then download it:

    $ gsutil mb gs://bucket_name
    $ gsutil ls
    gs://bucket_name
    $ gsutil cp path/to/file.ext gs://bucket_name
    $ gsutil ls gs://bucket_name
    file.ext
    $ gsutil cp gs://bucket_name/file.ext file_copy.ext
    
  • Automating CAPTCHAs

    Captcha

    By now you would be used to entering the text for an image like this:

  • Automated quote tool

    Business

    An ongoing problem for my web scraping work is how much to quote for a job. I prefer fixed fees to hourly rates, so I need to consider the complexity upfront. My initial strategy was simply to quote low to ensure I got business and hopefully build up some regular clients.

    Through experience I found the following factors most affected the time required for a job:

  • Increase your Google App Engine quotas for free

    Gae

    Google App Engine provides generous free quotas for your app and additional paid quotas.
    I always enable billing for my GAE apps even though I rarely exhaust the free quotas. Enabling billing and setting paid quotas does not mean you have to pay anything and in fact increases what you get for free.

    Here is a screenshot of the billing panel:

  • Extracting article summaries

    Information retrieval

    I made my own version of this technique to extract article summaries.
    Source code can be found here.

    The idea is simple - extract the biggest text block - but it performs well.
    Here are some test results:

    http://www.nytimes.com/2010/03/23/technology/23google.html?_r=1
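    To make the biggest-text-block idea concrete, here is a minimal sketch. It is not the linked source code: the regex-based splitting, the list of block-level tags and the name `biggest_text_block` are all assumptions chosen for illustration.

```python
import re
from html import unescape

# Tags assumed to delimit blocks of text; this list is an illustrative choice
BLOCK_SPLIT = re.compile(
    r"</?(?:p|div|td|li|br|h[1-6]|article|section|blockquote)\b[^>]*>", re.I
)
# Script and style contents are never article text, so drop them first
IGNORED = re.compile(r"<(script|style)\b.*?</\1>", re.I | re.S)
TAG = re.compile(r"<[^>]+>")


def biggest_text_block(html):
    """Return the longest run of text between block-level tags."""
    html = IGNORED.sub(" ", html)
    blocks = []
    for chunk in BLOCK_SPLIT.split(html):
        # Strip remaining inline tags, decode entities, normalize whitespace
        text = unescape(TAG.sub(" ", chunk))
        blocks.append(" ".join(text.split()))
    return max(blocks, key=len, default="")


if __name__ == "__main__":
    sample = "<div>Menu</div><p>The <b>main</b> article text goes here.</p>"
    print(biggest_text_block(sample))
```

    Regular expressions keep the sketch dependency-free; a proper HTML parser would be more robust on malformed pages.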

  • Image efficiencies

    Image

    I needed to store a large quantity of images, so I took the following measurements: