I tried some test files and found the results no more useful for parsing text content than the output produced by swf2html (Linux version). Some neat example conversions are available here. Currently Swiffy supports ActionScript 2.0 and works best with Flash 5, which was released back in 2000, so there is still a lot of work to do.
Most of the discussion about Google App Engine seems to focus on how it lets you scale your app. However, I find it most useful for small client apps, where we want a reliable platform while avoiding any ongoing hosting fee. For large apps, paying for hosting would not be a problem.
These are some of the downsides I have found using Google App Engine:
- Slow - if your app has not been accessed recently (within the last minute) then it can take up to 10 seconds to load for the user
- Pure Python/Java code only - this rules out many useful libraries, most importantly for me lxml
- CPU quota is easily exhausted when uploading data
- Proxies are not supported, which makes apps that rely on external websites risky. For example, the Twitter API has a per-IP quota, which you would be sharing with all other GAE apps.
- Blocked in some countries, such as Turkey
- Indexes - the free quota is 1 GB, but often over half of this is taken up by indexes
- Maximum 1000 records per query
- 20 second request limit, so you often need the overhead of using Task Queues
Despite these problems I still find Google App Engine a fantastic platform and a pleasure to develop on.
I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage, so it is helpful to have backup options.
One option is using Google Translate, which lets you translate a webpage into another language. If the source language is set to something you know it is not (e.g. Dutch) then no translation will take place and you will just get back the original content:
I added a function to download a URL via Google Translate and Google Cache to the webscraping library. Here is an example:
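The library call itself is not shown here, but the underlying approach can be sketched with the standard library alone. The helper names below are my own, not the webscraping library's actual API, and the title extraction is a deliberately naive regex:

```python
import re
import urllib.parse

def gcache_url(url):
    # Google Cache copies are reachable through a search for "cache:URL"
    return 'http://www.google.com/search?&q=' + urllib.parse.quote('cache:' + url)

def gtranslate_url(url, sl='nl', tl='en'):
    # ask Google Translate to translate from a language the page is not
    # actually written in (e.g. Dutch), so the text passes through unchanged
    return ('http://translate.google.com/translate?sl=%s&tl=%s&u=%s'
            % (sl, tl, urllib.parse.quote(url, safe='')))

def extract_title(html):
    # naive title extraction, enough to compare the three downloads
    match = re.search(r'<title>(.*?)</title>', html, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None

# usage (requires network access):
#   import urllib.request
#   url = 'http://webscraping.com/faq'
#   for source in (url, gcache_url(url), gtranslate_url(url)):
#       html = urllib.request.urlopen(source).read().decode('utf-8', 'ignore')
#       print(extract_title(html))
```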
This example downloads the same webpage directly, via Google Cache, and via Google Translate. Then it parses the title to show that the same webpage has been downloaded each time. The output when run is:
Frequently asked questions | webscraping
Frequently asked questions | webscraping
Frequently asked questions | webscraping
The same title was extracted from each source, which shows that the correct result was downloaded from Google Cache and Google Translate.
Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data then downloading it quickly would require an expensive number of proxies.
Fortunately there is an alternative - Google.
If a website doesn’t exist in Google’s search results then for most people it doesn’t exist at all. Websites want visitors, so they will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want. And after downloading, Google makes much of the content available through its cache.
So instead of downloading a URL directly we can download it indirectly via http://www.google.com/search?&q=cache%3Ahttp%3A//webscraping.com. Then the source website cannot block you and does not even know you are crawling its content.
The bottleneck for web scraping is generally bandwidth - the time waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.
Here are examples of both approaches:
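The original scripts are not reproduced here, but the concurrent version can be sketched with the standard library (the sequential version is just a plain for loop over the same URLs; the function name and structure below are illustrative, not the original code):

```python
import threading
import urllib.request

def threaded_download(urls, num_threads=10):
    # a minimal sketch of the concurrent approach: a shared list of URLs
    # consumed by a pool of worker threads
    results = {}
    lock = threading.Lock()
    pending = list(urls)

    def worker():
        while True:
            with lock:
                if not pending:
                    return  # no URLs left, so this thread is done
                url = pending.pop()
            try:
                data = urllib.request.urlopen(url).read()
            except OSError:
                data = None  # record failed downloads rather than crash
            results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each thread spends most of its time blocked waiting on the network, which is why adding threads helps despite Python's GIL.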
Here are the results:
$ time python sequential.py
4m25.602s
$ time python concurrent.py 10
1m7.430s
$ time python concurrent.py 100
0m31.528s
As expected, threading the downloads makes a big difference. You may have noticed the time saved is not linearly proportional to the number of threads. That is primarily because my web server struggles to keep up with all the requests. When crawling websites with threads be careful not to overload their web server by downloading too fast. Otherwise the website will become slower for other users and your IP risks being blacklisted.
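One simple safeguard is to enforce a minimum delay between requests to the same domain. This sketch is illustrative rather than any library's actual implementation:

```python
import time
import urllib.parse

class Throttle:
    """Sleep so that successive requests to the same domain are spaced
    at least `delay` seconds apart (an illustrative sketch)."""

    def __init__(self, delay):
        self.delay = delay
        self.last_accessed = {}  # domain -> timestamp of last request

    def wait(self, url):
        domain = urllib.parse.urlsplit(url).netloc
        last = self.last_accessed.get(domain)
        if last is not None:
            sleep_secs = self.delay - (time.time() - last)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.last_accessed[domain] = time.time()
```

Each worker would call wait() before downloading; with multiple threads you would additionally guard the dictionary with a lock.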
Often the data sets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to Google Storage.
Here is an example snippet to create a bucket on GS, upload a file, and then download a copy of it:
>>> gsutil mb gs://bucket_name
>>> gsutil ls gs://bucket_name
>>> gsutil cp path/to/file.ext gs://bucket_name
>>> gsutil ls gs://bucket_name
file.ext
>>> gsutil cp gs://bucket_name/file.ext file_copy.ext