Posted 06 Oct 2010 in IR

I made my own version of this technique to extract article summaries.
Source code can be found here.

The idea is simple - extract the biggest text block - but it performs well.
Here are some test results:

http://www.nytimes.com/2010/03/23/technology/23google.html?_r=1

The decision to shut down google.cn will have a limited financial impact on Google, which is based in Mountain View, Calif. China accounted for a small fraction of Google’s $23.6 billion in global revenue last year. Ads that once appeared on google.

http://www.theregister.co.uk/2010/09/29/novell_suse_appliance_1_1/

Being able to spin up appliance images for EC2 and spit them out onto the Amazon cloud meshes with Novell’s EC2-based SUSE Linux licensing, which was announced back in August. Novell is only selling priority-level (24x7) support contract for SUSE Linux li

http://webscraping.com/blog/Best-website-for-freelancers/

However with Elance there is a high barrier to entry: you have to pass a test, receive a phone call to confirm your identity, and pay money for each job you bid on. Often I see jobs on Elance with no bids because it requires obscure experience - people we
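
The linked source code is not reproduced here, so what follows is only a rough sketch of the "biggest text block" idea, not the actual implementation. It uses the standard library's html.parser. The names BlockCollector and summarize, the list of block tags, and the 250-character cut-off are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose text is never article content (assumption for this sketch).
SKIP_TAGS = {"script", "style"}
# Tags treated as block boundaries (assumption; real pages may need more).
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "blockquote"}


class BlockCollector(HTMLParser):
    """Collect the text found between block-level tag boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished text blocks
        self._parts = []      # text fragments of the block being built
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Join the fragments and collapse runs of whitespace.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def close(self):
        super().close()
        self._flush()


def summarize(html, max_chars=250):
    """Return the first max_chars characters of the longest text block."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    if not collector.blocks:
        return ""
    return max(collector.blocks, key=len)[:max_chars]
```

Calling summarize(page_html) on a downloaded page returns a truncated summary in the style of the results above. Inline tags such as links or bold text do not split a block, so a paragraph with markup inside it still counts as one block.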


Posted 14 Sep 2010 in image

I needed to store a large quantity of images, so I took the following measurements:

Format  Time       Size     Sample
bmp     0.715670   1769526
gif     4.184417   501931
jpg     2.507811   22252
png     10.909442  67295
ppm     0.648540   1769488  Browser does not support PPM
tiff    1.011216   1769600  Browser does not support TIFF

GIF is the clear loser - it takes a long time to process but still looks terrible.
For space use JPEG; for speed use PPM.

Google’s new WebP format looks promising.
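
The post does not say which library or test image produced the measurements above. As a rough sketch of how such a comparison could be repeated in pure Python, the code below times two encoders that are simple enough to write with the standard library alone: binary PPM and a minimal PNG. The function names are hypothetical, and the other formats in the table would need a proper imaging library.

```python
import struct
import time
import zlib


def encode_ppm(width, height, rgb):
    """Binary PPM (P6): a short text header followed by raw RGB bytes."""
    return b"P6\n%d %d\n255\n" % (width, height) + rgb


def _png_chunk(kind, data):
    # Each PNG chunk is: length, type, data, CRC32 over type + data.
    body = kind + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))


def encode_png(width, height, rgb):
    """Minimal 8-bit RGB PNG using filter type 0 (none) on every row."""
    stride = width * 3
    rows = b"".join(
        b"\x00" + rgb[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (RGB), then three zeros
    # for compression method, filter method and interlace.
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", header)
        + _png_chunk(b"IDAT", zlib.compress(rows))
        + _png_chunk(b"IEND", b"")
    )


def measure(encoders, width, height, rgb, repeat=10):
    """Return {format: (seconds for `repeat` encodes, encoded size in bytes)}."""
    results = {}
    for name, encode in encoders.items():
        start = time.perf_counter()
        for _ in range(repeat):
            data = encode(width, height, rgb)
        results[name] = (time.perf_counter() - start, len(data))
    return results
```

Given raw RGB bytes for an image, measure({"ppm": encode_ppm, "png": encode_png}, width, height, rgb) returns a time and size per format. PPM does almost no work per pixel, which fits it being the fastest format in the table, while PNG pays for compression.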


Posted 06 Sep 2010 in website

I was concerned about what blind spots I might have with the way I run my business. For example, I am Australian and Australians are usually very informal, even in a professional setting - was my communication with international clients too informal?

To address these concerns I developed a feedback survey with Google Docs, which I have been (politely) asking my clients to complete at the end of a job. The results have been helpful, and it also seems to have impressed some clients that I wanted their feedback. Wish I had thought of this earlier!


Posted 27 Aug 2010 in beautifulsoup, lxml, python, scrapy, and xpath

I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.

I am aware of these libraries and have used them in the past with good results. However, my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client's machine. This rules out installing full frameworks such as Scrapy or compiling C-based libraries such as lxml - I need a pure Python solution. This also gives me the flexibility to run the script on Google App Engine.

To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
The best-known Python HTML parser seems to be BeautifulSoup. However, I find it slow and difficult to use (compared to XPath), it often parses HTML inaccurately, and, significantly, the original author has lost interest in developing it further. So I would not recommend it - go with html5lib instead.

To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn’t find one 6 months ago when I needed it, so I developed this simple version that covers my typical use cases. I would deprecate it in future if a decent solution comes along, but for now I am happy with my pure Python infrastructure.
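
To illustrate the two stages (parse, then select) without any third-party install, here is a minimal standard-library sketch. It is not the XPath module mentioned above: it builds a tree with html.parser and then relies on the limited XPath subset that xml.etree.ElementTree supports. TreeParser and select are hypothetical names, and the sketch assumes reasonably well-nested HTML, which real pages often are not.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeParser(HTMLParser):
    """Stage 1: turn HTML into an ElementTree (well-nested input assumed)."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self._open = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") arrive as None.
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; close any unclosed children first.
        if tag in self._open:
            while self._open:
                current = self._open.pop()
                self._builder.end(current)
                if current == tag:
                    break

    def handle_data(self, data):
        # Text outside the root element is dropped.
        if self._open:
            self._builder.data(data)

    def tree(self):
        self.close()
        while self._open:
            self._builder.end(self._open.pop())
        return self._builder.close()


def select(html, path):
    """Stage 2: return the elements matching an ElementTree XPath expression."""
    parser = TreeParser()
    parser.feed(html)
    return parser.tree().findall(path)
```

For example, select(page_html, ".//div[@class='post']/h2/a") returns the link elements inside post headings, and each element's get("href") and text can then be read. ElementTree only understands a small part of XPath (no functions, no axes beyond child and descendant), so this covers typical cases rather than the full language.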


Posted 20 Aug 2010 in business, elance, and freelancing

When I started freelancing I created accounts on every freelance site I could find (oDesk, guru, scriptlance, etc) to get as much work as possible. However I found I got almost all work from just one source - Elance. How is Elance different?

With most freelancing sites you create an account and immediately start bidding on jobs. There is no cost to bidding, so people bid on many projects even if they don’t have the skill or time to complete them. This is obviously frustrating for clients, who waste a lot of time sifting through bids.

On the other hand, Elance has a high barrier to entry: you have to pass a test to show you understand their system, then receive a phone call to confirm your identity, and, once established, pay money for each job you bid on. Often I see jobs on Elance with no bids because they require obscure experience - people weren’t willing to waste their money bidding for a job they can’t do. This barrier serves to weed out less serious freelancers so that the average bid is of higher quality.

From my experience the clients are different on Elance too. On most freelancing sites clients are trying to get the job done for the smallest amount of money possible and are often willing to spend their time sifting through dozens of proposals, hoping to get lucky. Elance seems to attract clients who consider their time valuable and are willing to pay a premium for good service.
Often clients contact me directly through Elance because I am a native English speaker and they want to avoid potential communication or cultural problems. One client even asked me to double my bid because “we are not cheap!”

After a year of freelancing I now get the majority of work directly through my website, but still get a decent percentage of clients through Elance.

My advice for new freelancers - focus on building your Elance profile and don’t waste your time with the others. (Though do let me know if you have had good experiences elsewhere.)


Posted 24 Jul 2010 in website

Regarding the title of this blog, “All your data are belong to us” - I realized not everyone gets the reference. See this Wikipedia article for an explanation.