Posted 20 Feb 2011 in CAPTCHA

By now you would be used to entering the text for an image like this:

The idea is that this will block bots, because only a real user can interpret the image.

However, this is not an obstacle for a determined scraper, because services like deathbycaptcha will solve the CAPTCHA for you. These services use cheap labor to manually interpret the images and send the result back through an API.

CAPTCHAs are still useful because they deter most bots. However, they cannot stop a determined scraper, and they are annoying to genuine users.


Posted 06 Nov 2010 in business

An ongoing problem for my web scraping work is how much to quote for a job. I prefer fixed fee to hourly rates so I need to consider the complexity upfront. My initial strategy was simply to quote low to ensure I got business and hopefully build up some regular clients.

Through experience I found the following factors most affected the time required for a job:

  • Website size
  • Login protected
  • IP restrictions
  • HTML quality
  • JavaScript/AJAX

I developed a formula based on these factors and have now built an interface that lets potential clients clarify the costs involved with different kinds of web scraping jobs. Additionally I hope this will reduce the communication overhead by helping clients to provide the necessary information upfront.
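The formula itself is not given here, so the following is only a sketch of how the factors listed above could be combined into a fixed-fee estimate. Every name and number in it (`BASE_FEE`, `PER_THOUSAND_PAGES`, the multipliers, the `Job` fields) is a hypothetical placeholder, not the pricing I actually use.

```python
from dataclasses import dataclass

# Hypothetical weights: placeholders chosen only to make the example run.
BASE_FEE = 100.0
PER_THOUSAND_PAGES = 20.0
MULTIPLIERS = {
    "login_protected": 1.2,
    "ip_restrictions": 1.5,
    "poor_html": 1.3,
    "javascript": 1.5,
}


@dataclass
class Job:
    pages: int                      # website size
    login_protected: bool = False
    ip_restrictions: bool = False
    poor_html: bool = False         # HTML quality
    javascript: bool = False        # JavaScript/AJAX


def estimate_quote(job: Job) -> float:
    """Size-based cost, scaled up once for each complication present."""
    cost = BASE_FEE + PER_THOUSAND_PAGES * job.pages / 1000
    for factor, multiplier in MULTIPLIERS.items():
        if getattr(job, factor):
            cost *= multiplier
    return round(cost, 2)


print(estimate_quote(Job(pages=5000, login_protected=True, javascript=True)))
```

Multiplying rather than adding means each complication costs more on a large site than on a small one, which matches the intuition that a login or AJAX hurdle has to be handled on every page.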


Posted 27 Oct 2010 in gae

Google App Engine provides generous free quotas for your app and additional paid quotas.
I always enable billing for my GAE apps even though I rarely exhaust the free quotas. Enabling billing and setting paid quotas does not mean you have to pay anything and in fact increases what you get for free.

Here is a screenshot of the billing panel:

GAE lets you allocate a daily budget to the various resources, with the minimum permitted budget being USD $1. When you exhaust a free quota you will only be charged for the budget allocated to it. In the above screenshot I have allocated all my budget to emailing, but since my app does not use the Mail API I can be confident this free quota will never be exhausted and I will never pay a cent. For another app that does use Mail I have allocated all the budget to Bandwidth Out instead.

Now with billing enabled my app:

  • can access the Blobstore API to store larger amounts of data
  • enjoys much higher free limits for the Mail, Task Queue, and UrlFetch APIs - for example by default an app can make 7,000 Mail API calls but with billing enabled this limit jumps to 1,700,000 calls
  • has a higher per minute CPU limit, which I find particularly useful when uploading a mass of records to the Datastore

So in summary you can enable billing to extend your free quotas without risk of paying.


Posted 06 Oct 2010 in IR

I made my own version of this technique to extract article summaries.
Source code can be found here.

The idea is simple - extract the biggest text block - but it performs well.
Here are some test results:

http://www.nytimes.com/2010/03/23/technology/23google.html?_r=1

The decision to shut down google.cn will have a limited financial impact on Google, which is based in Mountain View, Calif. China accounted for a small fraction of Google’s $23.6 billion in global revenue last year. Ads that once appeared on google.

http://www.theregister.co.uk/2010/09/29/novell_suse_appliance_1_1/

Being able to spin up appliance images for EC2 and spit them out onto the Amazon cloud meshes with Novell’s EC2-based SUSE Linux licensing, which was announced back in August. Novell is only selling priority-level (24x7) support contract for SUSE Linux li

http://webscraping.com/blog/Best-website-for-freelancers/

However with Elance there is a high barrier to entry: you have to pass a test, receive a phone call to confirm your identity, and pay money for each job you bid on. Often I see jobs on Elance with no bids because it requires obscure experience - people we
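For readers who want the gist without opening the source, here is a minimal sketch of the biggest-text-block idea using only the Python standard library. It is not the linked implementation: the set of tags treated as block boundaries, the `summarize` name, and the `max_length` default are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: a "text block" is the text between these block-level boundaries.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6", "br"}
SKIP_TAGS = {"script", "style"}  # their contents are never article text


class BlockCollector(HTMLParser):
    """Collects the text of each block into self.blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []
        self._skip_depth = 0

    def _flush(self):
        # Join the pieces gathered so far and normalise whitespace.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def summarize(html, max_length=250):
    """Return the longest text block, truncated to max_length characters."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    return max(parser.blocks, key=len)[:max_length]
```

Inline tags such as `<b>` or `<a>` do not split a block, so a paragraph with links in it still counts as one block. Navigation menus and footers tend to be many short blocks, which is why picking the single longest one usually lands on the article body.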


Posted 14 Sep 2010 in image

I needed to store a large quantity of images so took the following measurements:

Format  Time       Size (bytes)  Sample
bmp     0.715670   1769526
gif     4.184417   501931
jpg     2.507811   22252
png     10.909442  67295
ppm     0.648540   1769488       Browser does not support PPM
tiff    1.011216   1769600       Browser does not support TIFF

GIF is the clear loser - it takes a long time to process but still looks terrible.
For space use JPEG; for speed use PPM.

Google’s new WebP format looks promising.
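A measurement like this boils down to timing each encoder and recording the size of what it writes. The sketch below shows that harness with the standard library only, so it runs without an imaging library: `encode_ppm` writes a real binary PPM, while `encode_deflate` is merely a stand-in for a compressed format (DEFLATE over the raw pixels, not a valid image file). The function names, the synthetic 64x64 gradient, and the repeat count are assumptions for illustration, not the setup behind the numbers above.

```python
import time
import zlib


def encode_ppm(pixels, width, height):
    """Binary PPM (P6): a short text header followed by raw RGB bytes."""
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + pixels


def encode_deflate(pixels, width, height):
    """Stand-in for a compressed format; NOT a real image file."""
    return zlib.compress(pixels, 9)


def benchmark(encoders, pixels, width, height, repeats=10):
    """Return {name: (total_seconds, size_in_bytes)} for each encoder."""
    results = {}
    for name, encode in encoders.items():
        start = time.perf_counter()
        for _ in range(repeats):
            data = encode(pixels, width, height)
        elapsed = time.perf_counter() - start
        results[name] = (elapsed, len(data))
    return results


# Synthetic RGB gradient used as test data (an assumption for this sketch).
width, height = 64, 64
pixels = bytes(
    channel
    for y in range(height)
    for x in range(width)
    for channel in (x * 4 % 256, y * 4 % 256, 128)
)

encoders = {"ppm": encode_ppm, "deflate": encode_deflate}
for name, (seconds, size) in benchmark(encoders, pixels, width, height).items():
    print(f"{name:8} {seconds:.6f} {size}")
```

To compare real formats, replace the entries in `encoders` with functions that call your imaging library's save routine into an in-memory buffer and return the bytes.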


Posted 06 Sep 2010 in website

I was concerned about what blind spots I might have with the way I run my business. For example I am Australian and Australians are usually very informal, even in a professional setting - was my communication with international clients too informal?

To try and address these concerns I developed a feedback survey with Google Docs, which I have been (politely) requesting my clients to complete at the end of a job. The results have been helpful, and it also seems to have impressed some clients that I wanted their feedback. Wish I had thought of this earlier!