Posted 20 Oct 2012 in big picture, concurrent, and python

This week Guido van Rossum (the author of Python) put out a call for experts in asynchronous programming to collaborate on a new API.

Exciting news! From my perspective, Python’s poor asynchronous support is its main weakness. Currently, to download webpages in parallel I have to use system threads, which use a lot of memory. This limits the number of threads I can start when crawling.
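For context, the thread-based approach looks roughly like the sketch below. The function name `threaded_download` and the `fetch` callable are assumptions made for illustration, not part of any library. Every worker here is a full system thread, which is where the memory cost comes from.

```python
import threading

try:
    from queue import Queue, Empty   # Python 3
except ImportError:
    from Queue import Queue, Empty   # Python 2


def threaded_download(urls, fetch, num_threads=5):
    """Download urls in parallel and return a {url: content} dict.

    fetch is any callable that takes a URL and returns its content
    (hypothetical here, supply your own downloader).
    """
    tasks = Queue()
    for url in urls:
        tasks.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except Empty:
                return  # no URLs left, so this thread can finish
            try:
                content = fetch(url)
            except Exception:
                content = None  # record the failure instead of killing the thread
            with lock:
                results[url] = content

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Raising `num_threads` increases throughput only until memory or the remote server becomes the bottleneck, which is the limitation described above.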

To address this shortcoming there are external solutions such as Twisted and gevent. However, I found Twisted too inflexible for my use and gevent unstable.

This led me to evaluate Go and Erlang, whose strength is lightweight threads. I found these languages interesting, but few people in their communities are involved in web scraping, so I would need to build much of the infrastructure myself. For now I will stick with Python.

I really hope this move by Guido goes somewhere. When Python 3 was released in 2008 I expected it to overtake Python 2 in popularity within a year, but here we are in 2012. Good async support in Python 3 would (finally) give me incentive to switch.


Posted 14 Oct 2012 in cache, crawling, and python

If a website is offline or restricts how quickly it can be crawled, then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1996.

Here is the list of downloads available for a single webpage, amazon.com: wayback machine webpage

Or to download the webpage at a certain date:

2011: http://wayback.archive.org/web/2011/amazon.com
Jan 1st, 2011: http://web.archive.org/web/20110101/http://www.amazon.com/
Latest date: http://web.archive.org/web/2/http://www.amazon.com/
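
These addresses follow a simple pattern, so they can be generated in code. Here is a minimal sketch based only on the URL forms listed above. The helper name `wayback_url` is made up for this example and is not part of the webscraping library.

```python
import datetime


def wayback_url(url, when=None):
    """Build a Wayback Machine address for url.

    when may be None (latest copy), a datetime.date, or a string of
    digits such as '2011' or '20110101'.
    """
    if when is None:
        timestamp = '2'  # the "latest date" form listed above
    elif isinstance(when, datetime.date):
        timestamp = when.strftime('%Y%m%d')
    else:
        timestamp = str(when)
        if not timestamp.isdigit():
            raise ValueError('timestamp must contain only digits: %r' % when)
    return 'http://web.archive.org/web/%s/%s' % (timestamp, url)


# prints http://web.archive.org/web/20110101/http://www.amazon.com/
print(wayback_url('http://www.amazon.com/', datetime.date(2011, 1, 1)))
```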

The webscraping library includes a function to download webpages from the Wayback Machine. Here is an example:

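from webscraping import download, xpath

D = download.Download()
url = 'http://amazon.com'
# download the live webpage
html1 = D.get(url)
# download the copy cached by the Wayback Machine
html2 = D.archive_get(url)
for html in (html1, html2):
    # print the title of each copy
    print(xpath.get(html, '//title'))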

This example downloads the same webpage directly and via the Wayback Machine. It then parses the title from each copy to show that the same webpage has been downloaded. The output when run is:

Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more


Posted 21 Sep 2012 in example, opensource, and python

When crawling websites it can be useful to know what technology was used to develop a website. For example, with an ASP.NET website I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.

There is a useful Firefox / Chrome extension called Wappalyzer that will tell you what technology a website has been made with. However, I needed this functionality available from the command line, so I converted the extension into a Python script, now available on bitbucket.

Here is some example usage:

>>> from builtwith import builtwith
>>> builtwith('http://webscraping.com')
{'Analytics': 'Google Analytics',
 'Web server': 'Nginx',
 'JavaScript framework': 'jQuery'}
>>> builtwith('http://wordpress.com')
{'Blog': 'WordPress',
 'Analytics': 'Google Analytics',
 'CMS': 'WordPress',
 'Web server': 'Nginx',
 'JavaScript framework': 'jQuery'}
>>> builtwith('http://microsoft.com')
{'JavaScript framework': 'Modernizr',
 'Web framework': 'Microsoft ASP.NET'}
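
To give a feel for how this kind of detection can work, here is a heavily simplified sketch. It assumes the general approach of matching regular expressions against the response headers and HTML. The `RULES` list and the `detect` function are hypothetical and written for this example. They are not the rule set or code used by Wappalyzer or builtwith.

```python
import re

# Hypothetical rules: (category, technology, where to look, pattern).
RULES = [
    ('Web server', 'Nginx', 'headers', r'server:\s*nginx'),
    ('Analytics', 'Google Analytics', 'html', r'google-analytics\.com/(ga|urchin)\.js'),
    ('JavaScript framework', 'jQuery', 'html', r'jquery[\w.-]*\.js'),
    ('CMS', 'WordPress', 'html', r'/wp-content/'),
]


def detect(headers, html):
    """Return {category: technology} for every rule that matches.

    headers is a dict of HTTP response headers and html is the page source.
    """
    # flatten the headers into text so both sources can be searched the same way
    header_text = '\n'.join('%s: %s' % (key, value) for key, value in headers.items())
    sources = {'headers': header_text, 'html': html}
    found = {}
    for category, technology, where, pattern in RULES:
        if re.search(pattern, sources[where], re.IGNORECASE):
            found[category] = technology
    return found


headers = {'Server': 'nginx/1.2.3'}
html = '<script src="/js/jquery-1.8.2.min.js"></script>'
print(detect(headers, html))
```

A real rule set needs many more patterns, and one category can match several technologies, so treat this only as an outline of the idea.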


Posted 02 Sep 2012 in business and website

sitescraper.net is now webscraping.com!

When I started in this field 3 years ago I was developing the sitescraper tool, but now I use the webscraping package for most work, so the new domain name reflects this. Also, the field is commonly known as web scraping, so webscraping.com is an awesome domain to have.

The old website and email addresses will be redirected to this new domain.


Posted 25 Aug 2012 in example and learn

CSV stands for comma-separated values. It is a spreadsheet format where each column is separated by a comma and each row by a newline. Here is an example CSV file:

    Name,Age,Country
    John,48,United States
    Daniel,67,Germany
    Qi,25,China

You can download this file and view it in Excel, Google Docs, or even directly in a text editor. This same data saved in Excel format uses 4591 bytes, compared with about 70 bytes as CSV, and is supported by fewer applications.

A CSV file can be imported into a database or parsed with a programming language. This flexibility makes CSV the most common output format requested by clients for their scraped data.

Here is an example showing how to parse a CSV file with Python:

import csv

filename = 'example.csv'
with open(filename) as fp:
    reader = csv.reader(fp)
    for row in reader:
        # display the value in the last column of this row
        print(row[-1])
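
When the file has a header row, as the example above does, the standard library's `csv.DictReader` lets you access columns by name instead of position. Here is a small sketch using the same data. The `read_records` helper name is just for illustration, and the rows are supplied as a list of lines so the example runs without a file.

```python
import csv


def read_records(lines):
    """Parse CSV lines whose first line is a header into a list of dicts."""
    return list(csv.DictReader(lines))


lines = [
    'Name,Age,Country',
    'John,48,United States',
    'Daniel,67,Germany',
    'Qi,25,China',
]
for record in read_records(lines):
    # every value is read as a string, so convert numbers explicitly
    print('%s (%d) is from %s' % (record['Name'], int(record['Age']), record['Country']))
```

In a real script you would pass an open file object to `read_records` in place of the list.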


Posted 09 Jul 2012 in example and python

Recently I needed to convert a large amount of data between UK Easting / Northing coordinates and Latitude / Longitude. There are web services available that support this conversion, but they only permit a few hundred requests per hour, which means it would take weeks to process my quantity of data.

Here is the Python script I developed to perform this conversion quickly with the pyproj module:

from pyproj import Proj, transform

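# WGS84 latitude / longitude
v84 = Proj(proj="latlong", towgs84="0,0,0", ellps="WGS84")
# OSGB36 latitude / longitude (Airy ellipsoid)
v36 = Proj(proj="latlong", k=0.9996012717, ellps="airy",
        towgs84="446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894")
# British National Grid easting / northing
vgrid = Proj(init="world:bng")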


def ENtoLL84(easting, northing):
    """Returns (longitude, latitude) tuple
    """
    vlon36, vlat36 = vgrid(easting, northing, inverse=True)
    return transform(v36, v84, vlon36, vlat36)

def LL84toEN(longitude, latitude):
    """Returns (easting, northing) tuple
    """
    vlon36, vlat36 = transform(v84, v36, longitude, latitude)
    return vgrid(vlon36, vlat36)


if __name__ == '__main__':
    # outputs (-1.839032626389436, 57.558101915938444)
    print(ENtoLL84(409731, 852012))

Source code is available on bitbucket.
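
For bulk data, the conversion can be applied row by row. The sketch below is one possible way to do that and is not part of the script above. The `convert_rows` name is made up, and the converter is passed in as a parameter so that a function such as `ENtoLL84` can be plugged in.

```python
def convert_rows(rows, convert):
    """Yield (easting, northing, longitude, latitude) for each input row.

    rows is any iterable of (easting, northing) pairs, for example the rows
    produced by csv.reader on a two-column file. convert is a function taking
    (easting, northing) and returning (longitude, latitude).
    """
    for easting, northing in rows:
        # values read from a CSV file are strings, so convert them to numbers
        longitude, latitude = convert(float(easting), float(northing))
        yield easting, northing, longitude, latitude
```

For example, `csv.writer(out).writerows(convert_rows(csv.reader(fp), ENtoLL84))` would stream a whole file through the conversion, where `fp` and `out` stand for open input and output files.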