In the previous post I covered three alternative approaches to regularly scraping a website for a client, the most common being a web application. However, hosting the web application on either my own or the client's server has problems.
My solution is to host the application on a neutral third-party platform: Google App Engine (GAE). Here is my overview of deploying on GAE:
Pros:
- provides a stable and consistent platform that I can use for multiple applications
- both the customer and I can log in and manage it, so we do not need to expose our servers
- has generous free quotas, which I rarely exhaust
Cons:
- only supports pure Python (or Java), so libraries that rely on C such as lxml are not supported (yet)
- imposes limits on maximum job time and on database interaction
- requires trusting Google to store our scraped data
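To illustrate the pure-Python constraint listed above, here is a minimal sketch of extracting links from a page using only the standard library's `html.parser` instead of a C-backed library such as lxml. The class name `LinkExtractor`, the helper `extract_links`, and the sample HTML are hypothetical and exist only for this example. The sketch assumes Python 3 and is not how my scrapers are actually written.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags using only the standard library."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every non-empty href found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample input, not taken from a real site
sample = '<p><a href="/page1">One</a> <a href="/page2">Two</a></p>'
print(extract_links(sample))  # prints ['/page1', '/page2']
```

A parser like this is slower and less forgiving of malformed markup than lxml, but it has no compiled dependencies, which is what matters on a platform restricted to pure Python.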
Deploying on GAE often works well for both the client and me, but it is not always practical or possible. I am still looking for a silver bullet!