This week Guido van Rossum (the creator of Python) put out a call for experts in asynchronous programming to collaborate on a new API.
Exciting news! From my perspective, Python’s poor asynchronous support is its main weakness. Currently, to download webpages in parallel I have to use system threads, each of which consumes a lot of memory. This limits how many threads I can start when crawling.
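For reference, the thread-per-download pattern I mean looks something like this (a minimal Python 3 sketch using only the standard library; the function names are my own):

```python
# Minimal sketch of parallel downloading with one system thread per URL.
# Each thread gets its own stack (often megabytes of address space),
# which is what limits how many can run at once when crawling.
import threading
import urllib.request


def fetch(url, results):
    """Download one URL and store the body (or the error) in results."""
    try:
        results[url] = urllib.request.urlopen(url, timeout=10).read()
    except Exception as e:
        results[url] = e


def download_all(urls):
    """Download all URLs in parallel, one thread each."""
    results = {}
    threads = [threading.Thread(target=fetch, args=(u, results)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With thousands of URLs this approach quickly exhausts memory, which is why lightweight threads or a real asynchronous API are so appealing.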
This led me to evaluate Go and Erlang, whose strength is lightweight threads. I found both languages interesting, but few people in their communities are involved in web scraping, so I would need to build much of the infrastructure myself. For now I will stick with Python.
I really hope this move by Guido goes somewhere. When Python 3 was released in 2008 I expected it to overtake Python 2 in popularity within a year, but here we are in 2012. Good async support in Python 3 would (finally) give me an incentive to switch.
If a website is offline or restricts how quickly it can be crawled, then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1998.
Here is the list of snapshots available for a single webpage, amazon.com:
http://web.archive.org/web/*/http://www.amazon.com/
Or to download the webpage at a certain date:
Jan 1st, 2011: http://web.archive.org/web/20110101/http://www.amazon.com/
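These date-stamped URLs follow a simple pattern, so they can be generated directly (a small sketch; the helper name is my own):

```python
# Build a Wayback Machine URL for a webpage at (or near) a given date.
# The prefix uses a YYYYMMDD timestamp; archive.org redirects to the
# closest snapshot it actually holds.
def wayback_url(url, year, month, day):
    return "http://web.archive.org/web/%04d%02d%02d/%s" % (year, month, day, url)
```

For example, `wayback_url("http://www.amazon.com/", 2011, 1, 1)` produces the URL shown above.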
The webscraping library includes a function to download webpages from the Wayback Machine. Here is an example:
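The original snippet is not reproduced here, so below is a standalone sketch of what it does using only the standard library; the `wayback_url` helper and the regex-based title parsing are illustrative stand-ins, not the webscraping package’s actual API:

```python
import re
import urllib.request


def wayback_url(url, timestamp="2011"):
    # prefix a Wayback Machine timestamp; archive.org serves the
    # snapshot closest to it
    return "http://web.archive.org/web/%s/%s" % (timestamp, url)


def parse_title(html):
    """Crude <title> extraction, enough to compare the two downloads."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    url = "http://www.amazon.com/"
    direct_html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    cached_html = urllib.request.urlopen(wayback_url(url)).read().decode("utf-8", "ignore")
    print(parse_title(direct_html))
    print(parse_title(cached_html))
```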
This example downloads the same webpage directly and via the Wayback Machine, then parses the titles to show that the same webpage has been retrieved in both cases. The output when run is:
Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
When crawling websites it can be useful to know what technology has been used to develop a website. For example, with an ASP.NET website I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.
There is a useful Firefox / Chrome extension called Wappalyzer that will tell you what technology a website has been made with. However I needed this functionality available from the command line, so I converted the extension into a Python script, now available on bitbucket.
Here is some example usage:
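The script itself is not shown here, but the detection approach can be sketched as regex signatures matched against a page’s HTML and headers. The signatures and function below are my own illustrations, a tiny fraction of Wappalyzer’s real rule set:

```python
import re

# a tiny subset of signature rules: technology -> regex applied to the HTML
HTML_SIGNATURES = {
    "ASP.NET": r"__VIEWSTATE",
    "Blogger": r'name="generator"\s+content="blogger"',
    "WordPress": r"/wp-content/",
}

# rules applied to HTTP response headers: technology -> (header, regex)
HEADER_SIGNATURES = {
    "ASP.NET": ("x-aspnet-version", r"."),
    "PHP": ("x-powered-by", r"PHP"),
}


def detect(html, headers=None):
    """Return a sorted list of technologies whose signatures match this page."""
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    found = set()
    for tech, pattern in HTML_SIGNATURES.items():
        if re.search(pattern, html, re.IGNORECASE):
            found.add(tech)
    for tech, (header, pattern) in HEADER_SIGNATURES.items():
        if header in headers and re.search(pattern, headers[header], re.IGNORECASE):
            found.add(tech)
    return sorted(found)
```

For instance, a page containing an `__VIEWSTATE` hidden field would be flagged as ASP.NET, which tells me up front to expect POST-based navigation.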
When I started in this field 3 years ago I was developing the sitescraper tool, but now I use the webscraping package for most work, so the new domain reflects that shift. Also, the field is commonly known as web scraping, so webscraping.com is an awesome domain to have.
The old website and email addresses will be redirected to this new domain.
CSV stands for comma-separated values. It is a plain-text spreadsheet format where each column is separated by a comma and each row by a newline. Here is an example CSV file:
name,age,city
Alice,30,London
Bob,25,Sydney
A CSV file can be imported into a database or parsed with a programming language. This flexibility makes CSV the most common output format requested by clients for their scraped data.
Here is an example showing how to parse a CSV file with Python:
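Parsing with Python’s built-in csv module might look like this (a minimal sketch, with a small sample file embedded as a string so it is self-contained; the data and field names are my own):

```python
import csv
import io

# a small sample CSV file embedded as a string
SAMPLE = """name,age,city
Alice,30,London
Bob,25,Sydney
"""


def parse_csv(fileobj):
    """Yield each row of a CSV file as a list of column values."""
    for row in csv.reader(fileobj):
        yield row


rows = list(parse_csv(io.StringIO(SAMPLE)))
```

The csv module handles the awkward cases (quoted fields containing commas or newlines) that a naive `line.split(",")` would get wrong, which matters for real scraped data.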
Recently I needed to convert a large amount of data between UK Easting / Northing coordinates and Latitude / Longitude. There are web services available that support this conversion, but they only permit a few hundred requests per hour, which means it would take weeks to process my quantity of data.
Here is the Python script I developed to perform this conversion quickly with the pyproj module:
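The original script is not reproduced here, so the following is a sketch of the same conversion using pyproj’s current Transformer API (the function name is my own; EPSG:27700 is the OSGB36 national grid and EPSG:4326 is WGS84):

```python
# Bulk conversion from UK Ordnance Survey Easting/Northing (EPSG:27700)
# to WGS84 longitude/latitude (EPSG:4326) using the pyproj module.
from pyproj import Transformer

# always_xy=True fixes the axis order to (easting, northing) in and
# (longitude, latitude) out, regardless of each CRS's convention
_transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)


def en_to_lonlat(easting, northing):
    """Convert a single Easting/Northing pair to (longitude, latitude)."""
    return _transformer.transform(easting, northing)


if __name__ == "__main__":
    # roughly central London on the national grid
    lon, lat = en_to_lonlat(530000, 180000)
    print("%.5f, %.5f" % (lat, lon))
```

Because everything runs locally, millions of coordinates can be converted in minutes rather than being throttled by a web service.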
Source code is available on bitbucket.