Google recently released the Arc Welder extension for Chrome, which allows an Android app to be run on the desktop. The aim of Arc Welder is to make testing Android apps easier, but conveniently it also makes scraping them easier.
Most of the discussion about Google App Engine seems to focus on how it allows you to scale your app. However, I find it most useful for small client apps where we want a reliable platform while avoiding any ongoing hosting fee. For large apps, paying for hosting would not be a problem.
These are some of the downsides I have found using Google App Engine:
I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage so it is helpful to have backup options.
One option is using Google Translate, which lets you translate a webpage into another language. If the source language is set to something you know it is not (e.g. Dutch), then no translation will take place and you will just get back the original content:
Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data, then downloading it quickly would require an expensive number of proxies.
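As a rough sketch, the request URL for this trick could be built as follows. The `translate.google.com/translate` endpoint and its `sl`, `tl`, and `u` parameters are assumptions about how the service is exposed and may change, and `translate_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def translate_url(page_url, source_lang="nl", target_lang="en"):
    """Build a Google Translate URL that fetches page_url on our behalf.

    Assumption: the endpoint and the sl/tl/u parameter names below
    reflect how the service is exposed; they are not guaranteed.
    """
    # Deliberately claim a source language the page is NOT written in
    # (Dutch by default) so that no translation is applied.
    query = urlencode({"sl": source_lang, "tl": target_lang, "u": page_url})
    return "https://translate.google.com/translate?" + query


if __name__ == "__main__":
    print(translate_url("http://example.com/"))
```

The resulting URL can then be downloaded in place of the original page.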
Fortunately there is an alternative - Google.
If a website doesn’t exist in Google’s search results then for most people it doesn’t exist at all. Websites want visitors so will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want. And after downloading, Google makes much of the content available through its cache.
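A minimal sketch of mapping a page to its cached copy is below. The `webcache.googleusercontent.com` address and the `cache:` query prefix are assumptions about how the cache was exposed when this was written and may no longer be available, and `cache_url` is a hypothetical helper name.

```python
from urllib.parse import quote


def cache_url(page_url):
    """Return the assumed Google Cache address for page_url."""
    # Percent-encode the whole target URL so that its own query string
    # is not confused with the cache request's parameters.
    return ("http://webcache.googleusercontent.com/search?q=cache:"
            + quote(page_url, safe=""))


if __name__ == "__main__":
    print(cache_url("http://example.com/products?page=2"))
```

A crawler would request this address instead of the original one, so the target website never sees the traffic.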
Google March 30, 2011
Often the data sets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to Google Storage.
Here is an example snippet to create a bucket on GS, upload a file, and then download it:
>>> gsutil mb gs://bucket_name
>>> gsutil ls gs://bucket_name
>>> gsutil cp path/to/file.ext gs://bucket_name
>>> gsutil ls gs://bucket_name
file.ext
>>> gsutil cp gs://bucket_name/file.ext file_copy.ext
You spent time and money collecting the data in your website, so you want to prevent someone else from downloading and reusing it. However, you still want Google to index your website so that people can find you. This is a common problem. Below I will outline some strategies to protect your data.