In the previous post I covered three alternative approaches to regularly scrape a website for a client, with the most common one being in the form of a web application. However hosting the web application on either my own or the clients server has problems.

My solution is to host the application on a neutral third party platform - Google App Engine (GAE). Here is my overview of deploying on GAE:

Pros:

  • provides a stable and consistent platform that I can use for multiple applications
  • both the customer and I can login and manage it, so we do not need to expose our servers
  • has generous free quotas, which I rarely exhaust

Cons:

  • only supports pure Python (or Java), so libraries that rely on C such as lxml are not supported (yet)
  • limitations on maximum job time and interacting with the database
  • have to trust Google with storing our scraped data

Often deploying on GAE works well for both the client and me, but it is not always practical/possible. I am still looking for a silver bullet!