Posted 10 Apr 2011 in concurrent, python

The bottleneck for web scraping is generally bandwidth - the time spent waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.

Here are example scripts for both the sequential and the threaded approach:

# a list of 100 webpage URLs to download
urls = [...]

# sequential.py: download one webpage at a time
import urllib
for url in urls:
    urllib.urlopen(url).read()

# concurrent.py: download the webpages in multiple threads
import sys
from webscraping import download
num_threads = int(sys.argv[1])  # number of threads is passed on the command line
download.threaded_get(urls=urls, delay=0, num_threads=num_threads,
    read_cache=False, write_cache=False)  # disable caching so every page is downloaded

Here are the results:

$ time python sequential.py
4m25.602s
$ time python concurrent.py 10
1m7.430s
$ time python concurrent.py 100
0m31.528s

As expected, threading the downloads makes a big difference. You may have noticed the time saved is not linearly proportional to the number of threads - that is primarily because my web server struggles to keep up with all the requests. When crawling websites with threads, be careful not to overload the web server by downloading too fast. Otherwise the website will become slower for other users and your IP risks being blacklisted.
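
For reference, here is a minimal sketch of how such a threaded downloader can be built with only the standard library. This is my own illustration of the idea, not how the webscraping library implements threaded_get; the delay parameter mirrors the one in the call above.

import time
import urllib
import threading
import Queue

def threaded_download(urls, num_threads=10, delay=5):
    # share the pending URLs between the worker threads
    queue = Queue.Queue()
    for url in urls:
        queue.put(url)

    def worker():
        while True:
            try:
                url = queue.get_nowait()
            except Queue.Empty:
                return  # no URLs left to download
            try:
                urllib.urlopen(url).read()
            except IOError:
                pass  # ignore failed downloads in this sketch
            time.sleep(delay)  # pause so the web server is not overloaded

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every page has been downloaded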


Posted 30 Mar 2011 in google

Often the data sets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to Google Storage.
Here is an example snippet to create a bucket on GS, upload a file, and then download it:

$ gsutil mb gs://bucket_name
$ gsutil ls
gs://bucket_name
$ gsutil cp path/to/file.ext gs://bucket_name
$ gsutil ls gs://bucket_name
file.ext
$ gsutil cp gs://bucket_name/file.ext file_copy.ext
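
gsutil is itself built on top of the boto library, so the same steps can also be scripted from Python. Below is a rough sketch; the storage_uri interface is from my memory of boto, so verify the names against its documentation before relying on them:

import boto

# upload a local file to the bucket
uri = boto.storage_uri('bucket_name/file.ext', 'gs')
uri.new_key().set_contents_from_filename('path/to/file.ext')

# download the file again to a local copy
uri.get_key().get_contents_to_filename('file_copy.ext')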


Posted 20 Feb 2011 in CAPTCHA

By now you would be used to entering the text for an image like this:

[CAPTCHA image]

The idea is that this will prevent bots, because only a real user can interpret the image.

However this is not an obstacle for a determined scraper, because services like deathbycaptcha will solve the CAPTCHA for you. These services use cheap labor to manually interpret the images and send the result back through an API.
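
As an illustration, here is roughly how such a service is called from Python. The class and method names below follow deathbycaptcha's published client, but treat them as assumptions and check the current API:

import deathbycaptcha

# authenticate with the account that pays for the solving
client = deathbycaptcha.SocketClient('your_username', 'your_password')
# upload the image and block until a human worker solves it
result = client.decode('captcha.png')
if result:
    print result['text']  # the solved CAPTCHA text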

CAPTCHAs are still useful because they deter most bots. However they cannot prevent a determined scraper and are annoying to genuine users.


Posted 06 Nov 2010 in business

An ongoing problem in my web scraping work is how much to quote for a job. I prefer a fixed fee to hourly rates, so I need to consider the complexity upfront. My initial strategy was simply to quote low to ensure I got the business and hopefully build up some regular clients.

Through experience I found the following factors most affected the time required for a job:

  • Website size
  • Login protected
  • IP restrictions
  • HTML quality
  • JavaScript/AJAX

I developed a formula based on these factors and have now built an interface that lets potential clients estimate the costs involved with different kinds of web scraping jobs. Additionally I hope this will reduce the communication overhead by helping clients provide the necessary information upfront.
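
The formula itself is not published here, but as a toy illustration of the approach, a quote estimator built on those factors might look something like this. The base fee, per-page rate, and multipliers are all invented for the example:

# hypothetical quote estimator - the factors come from the list above,
# but all the numbers are invented for illustration
def estimate_quote(num_pages, needs_login, ip_restricted, messy_html, uses_ajax):
    quote = 100.0 + 0.05 * num_pages  # base fee plus a per-page rate
    # each complicating factor multiplies the time required
    for applies, multiplier in [(needs_login, 1.2), (ip_restricted, 1.5),
                                (messy_html, 1.3), (uses_ajax, 1.4)]:
        if applies:
            quote *= multiplier
    return quote

print estimate_quote(num_pages=10000, needs_login=True,
    ip_restricted=False, messy_html=True, uses_ajax=False)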


Posted 27 Oct 2010 in gae

Google App Engine provides generous free quotas for your app and additional paid quotas.
I always enable billing for my GAE apps even though I rarely exhaust the free quotas. Enabling billing and setting paid quotas does not mean you have to pay anything, and in fact it increases what you get for free.

Here is a screenshot of the billing panel:

[screenshot of the GAE billing panel]

GAE lets you allocate a daily budget to the various resources, with the minimum permitted budget being USD $1. When you exhaust a free quota you will only be charged up to the budget allocated to it. In the above screenshot I have allocated all my budget to emailing, but since my app does not use the Mail API I can be confident this free quota will never be exhausted and I will never pay a cent. For another app that does use Mail I have allocated all the budget to Bandwidth Out instead.

Now with billing enabled my app:

  • can access the Blobstore API to store larger amounts of data
  • enjoys much higher free limits for the Mail, Task Queue, and UrlFetch APIs - for example, by default an app can make 7,000 Mail API calls, but with billing enabled this limit jumps to 1,700,000 calls
  • has a higher per minute CPU limit, which I find particularly useful when uploading a mass of records to the Datastore

So in summary, you can enable billing to extend your free quotas without any risk of paying.


Posted 06 Oct 2010 in IR

I made my own version of this technique to extract article summaries.
Source code can be found here.

The idea is simple - extract the biggest text block - but it performs well.
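
A minimal sketch of the approach (my own simplification, not the linked source code) is to split the HTML into blocks at block-level tags, strip the markup from each block, and keep whichever block contains the most text:

import re
import urllib

def summarize(url):
    # download the webpage
    html = urllib.urlopen(url).read()
    # treat the content between block level tags as candidate blocks
    blocks = re.split(r'</?(?:p|div|td)[^>]*>', html)
    # strip any remaining tags and whitespace from each block
    texts = [re.sub(r'<[^>]*>', ' ', block).strip() for block in blocks]
    # assume the biggest text block is the article body
    return max(texts, key=len)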
Here are some test results:

http://www.nytimes.com/2010/03/23/technology/23google.html?_r=1

The decision to shut down google.cn will have a limited financial impact on Google, which is based in Mountain View, Calif. China accounted for a small fraction of Google’s $23.6 billion in global revenue last year. Ads that once appeared on google.

http://www.theregister.co.uk/2010/09/29/novell_suse_appliance_1_1/

Being able to spin up appliance images for EC2 and spit them out onto the Amazon cloud meshes with Novell’s EC2-based SUSE Linux licensing, which was announced back in August. Novell is only selling priority-level (24x7) support contract for SUSE Linux li

http://webscraping.com/blog/Best-website-for-freelancers/

However with Elance there is a high barrier to entry: you have to pass a test, receive a phone call to confirm your identity, and pay money for each job you bid on. Often I see jobs on Elance with no bids because it requires obscure experience - people we