Posted 30 Jun 2011 in flash

Google has released a tool called Swiffy for converting Flash files into HTML5. This is relevant to web scraping because content embedded in Flash is a pain to extract, as I wrote about earlier.

I tried some test files and found the results no more useful for extracting text content than the output produced by swf2html (Linux version). Some neat example conversions are available here. Currently Swiffy supports ActionScript 2.0 and works best with Flash 5, which was released back in 2000, so there is still a lot of work to do.


Posted 19 Jun 2011 in gae and google

Most of the discussion about Google App Engine seems to focus on how it allows you to scale your app. However, I find it most useful for small client apps, where we want a reliable platform while avoiding any ongoing hosting fee. For large apps paying for hosting would not be a problem.

These are some of the downsides I have found using Google App Engine:

  • Slow - if your app has not been accessed recently (within the last minute or so) then it can take up to 10 seconds to load for the user
  • Pure Python/Java code only - this prevents using a lot of good libraries, most importantly for me lxml
  • CPU quota easily gets exhausted when uploading data
  • Proxies not supported, which makes apps that rely on external websites risky. For example the Twitter API has a per IP quota which you would be sharing with all other GAE apps.
  • Blocked in some countries, such as Turkey
  • Indexes - the free quota is 1 GB but often over half of this is taken up by indexes
  • Maximum 1000 records per query
  • 30 second request limit, so you often need the overhead of using Task Queues (see the sketch below)
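
For work that cannot finish within that deadline, the deferred library can push it onto the Task Queue so the web request itself returns quickly. Below is a minimal sketch of that pattern - import_rows() is a hypothetical helper, and deferred also needs to be enabled in app.yaml:

from google.appengine.ext import webapp
from google.appengine.ext import deferred

def import_rows(rows):
    # hypothetical helper that stores a chunk of uploaded records -
    # this runs later on the Task Queue, not inside the web request
    pass

class UploadHandler(webapp.RequestHandler):
    def post(self):
        rows = self.request.get('rows')
        # hand the slow work to a background task so this request
        # finishes well within the request deadline
        deferred.defer(import_rows, rows)
        self.response.out.write('Import queued')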

Despite these problems I still find Google App Engine a fantastic platform and a pleasure to develop on.


Posted 29 May 2011 in cache, crawling, and google

I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage so it is helpful to have backup options.

One option is using Google Translate, which lets you translate a webpage into another language. If the source language is selected as something you know it is not (e.g. Dutch) then no translation will take place and you will just get back the original content.
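
Under the hood this just means requesting the page through a translate.google.com URL with a mismatched source language. Here is a rough sketch of that idea - the sl/tl/u parameter names are my recollection of the URL format and may have changed, so treat it as an illustration only:

import urllib

def gtrans_url(url, from_lang='nl', to_lang='en'):
    # build a Google Translate proxy URL for a page that is not
    # actually written in from_lang, so the content passes through
    return 'http://translate.google.com/translate?sl=%s&tl=%s&u=%s' % (
        from_lang, to_lang, urllib.quote(url, ''))

print gtrans_url('http://webscraping.com/faq')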

I added a function to download a URL via Google Translate and Google Cache to the webscraping library. Here is an example:

from webscraping import download, xpath  
  
D = download.Download()  
url = 'http://webscraping.com/faq'  
html1 = D.get(url) # download directly  
html2 = D.gcache_get(url) # download via Google Cache  
html3 = D.gtrans_get(url) # download via Google Translate  
for html in (html1, html2, html3):  
    print xpath.get(html, '//title')

This example downloads the same webpage directly, via Google Cache, and via Google Translate. Then it parses the title of each result to show that the same webpage has been downloaded. The output when run is:

Frequently asked questions | webscraping
Frequently asked questions | webscraping
Frequently asked questions | webscraping

The same title was extracted from each source, which shows that the correct result was downloaded from Google Cache and Google Translate.


Posted 15 May 2011 in cache, crawling, and google

Occasionally I come across a website that blocks my IP after only a few requests. If the website contains a lot of data then downloading it quickly would require an expensive number of proxies.

Fortunately there is an alternative - Google.

If a website doesn’t exist in Google’s search results then for most people it doesn’t exist at all. Websites want visitors, so they will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want. And after downloading, Google makes much of this content available through its cache.

So instead of downloading the URL we want directly, we can download it indirectly via http://www.google.com/search?&q=cache%3Ahttp%3A//webscraping.com. Then the source website cannot block us and does not even know we are crawling its content.
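
As a rough sketch, the cache URL is just the target address URL-encoded into that query string. The snippet below illustrates the idea with the standard library - note that Google tends to reject the default Python user agent, so a browser-like User-Agent header is set:

import urllib, urllib2

def google_cache_url(url):
    # build the Google Cache query for a webpage, e.g.
    # http://www.google.com/search?&q=cache%3Ahttp%3A//webscraping.com
    return 'http://www.google.com/search?&q=cache%3A' + urllib.quote(url)

# request the cached copy instead of hitting the website directly
request = urllib2.Request(google_cache_url('http://webscraping.com'),
    headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()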


Posted 10 Apr 2011 in concurrent and python

The bottleneck for web scraping is generally bandwidth - the time waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.

Here are examples of both approaches - downloading sequentially and downloading concurrently with threads:

# a list of 100 webpage URLs to download
urls = [...]

# first try downloading sequentially
import urllib
for url in urls:
    urllib.urlopen(url).read()

# now try concurrently
import sys
from webscraping import download
num_threads = int(sys.argv[1])
download.threaded_get(urls=urls, delay=0, num_threads=num_threads, 
    read_cache=False, write_cache=False) # disable cache

Here are the results:

$ time python sequential.py
4m25.602s
$ time python concurrent.py 10
1m7.430s
$ time python concurrent.py 100
0m31.528s

As expected, threading the downloads makes a big difference. You may have noticed the time saved is not linearly proportional to the number of threads. That is primarily because my web server struggles to keep up with all the requests. When crawling websites with threads, be careful not to overload their web server by downloading too fast. Otherwise the website will become slower for other users and your IP risks being blacklisted.
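
In the benchmark above the delay argument was set to 0 to measure raw speed against my own server. When crawling someone else's website, a politer configuration is a non-zero delay and fewer threads. Here is a sketch of the same call (the exact throttling behaviour of the delay parameter depends on the webscraping library version):

from webscraping import download

# the list of URLs to crawl
urls = [...]

# same helper as before, but throttled: pause between requests
# and keep the number of concurrent downloads modest
download.threaded_get(urls=urls, delay=5, num_threads=5)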


Posted 30 Mar 2011 in google

Often the data sets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to Google Storage.
Here is an example snippet to create a bucket on Google Storage, upload a file, and then download a copy of it:

$ gsutil mb gs://bucket_name          # make a new bucket
$ gsutil ls                           # list my buckets
gs://bucket_name
$ gsutil cp path/to/file.ext gs://bucket_name         # upload a file
$ gsutil ls gs://bucket_name          # list the bucket contents
file.ext
$ gsutil cp gs://bucket_name/file.ext file_copy.ext   # download a local copy
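
To run the upload automatically at the end of a scrape, the same command can be driven from Python. Here is a minimal sketch using the standard subprocess module - the bucket and file names are placeholders, and it assumes gsutil is installed and already authenticated:

import subprocess

def upload_to_gs(local_path, bucket='bucket_name'):
    # shell out to gsutil to copy a local file into the given bucket
    subprocess.check_call(['gsutil', 'cp', local_path, 'gs://%s' % bucket])

upload_to_gs('path/to/file.ext')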