Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.

That doesn’t sound like much fun, so I created the solution below: a script that automatically parses a CSV file, infers the type of each column, creates a MySQL table with appropriate field types, and then writes the CSV data to that table.

Here is an example spreadsheet:

Name  Age Height DOB        Active
John  29  180.3  1980-11-20 12:30:20
Sarah 25  174.5  1990-01-01 07:12:32
Peter 45  156.4  1965-05-02 23:09:33

Now run the importing script:

$ python --user=root --database=test --table=test test.csv
Importing `test.csv' into MySQL database `test.test'
Analyzing column types ...
['varchar(255)', 'integer', 'double', 'date', 'time']
Inserting rows ...
Committing rows to database ...
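The "Analyzing column types" step can be sketched roughly as follows — a hypothetical `guess_type` helper (not the actual script's code) that tries progressively stricter patterns against every value in a column and falls back to varchar when nothing else fits:

```python
import re

# Hypothetical sketch of per-column type detection (not the actual
# script's code): try each type's pattern against every non-empty value
# in the column, falling back to varchar(255) when none matches.
PATTERNS = [
    ('integer', re.compile(r'^-?\d+$')),
    ('double', re.compile(r'^-?\d+\.\d+$')),
    ('date', re.compile(r'^\d{4}-\d{2}-\d{2}$')),
    ('time', re.compile(r'^\d{2}:\d{2}:\d{2}$')),
]

def guess_type(values):
    """Return a MySQL type that fits every non-empty value in the column."""
    for type_name, pattern in PATTERNS:
        if all(pattern.match(value) for value in values if value):
            return type_name
    return 'varchar(255)'
```

For the example spreadsheet this would give `integer` for the age column, `double` for height, and `varchar(255)` for the names.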

And check the results in MySQL:

$ mysql -uroot -p test
mysql> describe test;
| Field  | Type         | Null | Key | Default | Extra          |
| id     | int(11)      | NO   | PRI | NULL    | auto_increment |
| name   | varchar(255) | YES  |     | NULL    |                |
| age    | int(11)      | YES  |     | NULL    |                |
| height | double       | YES  |     | NULL    |                |
| dob    | date         | YES  |     | NULL    |                |
| active | time         | YES  |     | NULL    |                |
6 rows in set (0.01 sec)

mysql> SELECT * FROM test;
| id | name  | age  | height | dob        | active   |
|  1 | John  |   29 |  180.3 | 1980-11-20 | 12:30:20 |
|  2 | Sarah |   25 |  174.5 | 1990-01-01 | 07:12:32 |
|  3 | Peter |   45 |  156.4 | 1965-05-02 | 23:09:33 |
3 rows in set (0.00 sec)

As you can see above, name has been stored as a varchar, age as an integer, height as a double, dob as a date, and active as a time type.
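Once the column types are known, building the table is just a matter of generating a CREATE TABLE statement. Here is a hypothetical sketch (not the actual script's code) that mirrors the schema shown in the describe output, including the auto-increment id column:

```python
def create_table_sql(table, columns, types):
    """Generate a CREATE TABLE statement with an auto-increment primary
    key, mirroring the schema above (a sketch, not the script's code)."""
    fields = ['`id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY']
    fields.extend('`%s` %s' % (name.lower(), col_type)
                  for name, col_type in zip(columns, types))
    return 'CREATE TABLE `%s` (%s);' % (table, ', '.join(fields))
```

The column names are lowercased so that a header like `Name` becomes the field `name`, as in the output above.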

The source code is available on bitbucket.

Based on feedback from a helpful client I have added support for discount coupons in the data store.

To kick it off, the coupon WSTEST will give a 25% discount on any database for the next fortnight.

This week Guido van Rossum (the creator of Python) put out a call for experts in asynchronous programming to collaborate on a new API.

Exciting news! From my perspective, Python’s poor asynchronous support is its main weakness. Currently, to download webpages in parallel I have to use system threads, which consume a lot of memory. This limits the number of threads I can start when crawling.
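To make the cost concrete, here is a minimal sketch of the thread-per-URL approach I mean, where `fetch` stands in for whatever download function is used. Each OS thread reserves its own stack, which is what limits how many can run at once:

```python
import threading

def fetch_all(urls, fetch):
    """Download each URL in its own OS thread (a sketch of the
    thread-per-URL approach; each thread reserves its own stack,
    so memory use grows linearly with the number of URLs in flight)."""
    results = {}
    def worker(url):
        results[url] = fetch(url)
    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

An async API would let a single thread wait on many downloads at once, avoiding the per-thread stack overhead.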

To address this shortcoming there are external solutions such as Twisted and gevent; however, I found Twisted inflexible for my use case and gevent unstable.

This led me to evaluate Go and Erlang, whose strength is lightweight threads. I found these languages interesting, but there are few people in their communities involved in web scraping, so I would need to build much of the infrastructure myself. For now I will stick with Python.

I really hope this move by Guido goes somewhere. When Python 3 was released in 2008 I expected it to overtake Python 2 in popularity within a year, but here we are in 2012. Good async support in Python 3 would (finally) give me incentive to switch.

If a website is offline or restricts how quickly it can be crawled, then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1998.

Here is the list of downloads available for a single webpage:

Or to download the webpage at a certain date:

Jan 1st, 2011:
Latest date:
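The Wayback Machine addresses captures with a timestamp embedded in the URL (YYYYMMDDhhmmss); a partial or empty timestamp makes the service redirect to the closest or latest capture. A small helper to build such URLs:

```python
def wayback_url(url, timestamp=''):
    """Build a Wayback Machine capture URL. timestamp is YYYYMMDDhhmmss
    (or a prefix of it); an empty timestamp requests the latest capture."""
    return 'http://web.archive.org/web/%s/%s' % (timestamp, url)
```

So passing `'20110101'` as the timestamp would request the capture closest to Jan 1st, 2011.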

The webscraping library includes a function to download webpages from the Wayback Machine. Here is an example:

from webscraping import download, xpath

D = download.Download()
url = ''
html1 = D.get(url)          # download the webpage directly
html2 = D.archive_get(url)  # download via the Wayback Machine
for html in (html1, html2):
    print xpath.get(html, '//title')  # extract the title to compare results

This example downloads the same webpage directly and via the Wayback Machine, then parses the title of each to show that the same page has been downloaded. The output when run is:

Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more

When crawling websites it can be useful to know what technology was used to develop them. For example, with a website built on certain frameworks I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.

There is a useful Firefox / Chrome extension called Wappalyzer that will tell you what technology a website was made with. However, I needed this functionality available from the command line, so I converted the extension into a Python script, now available on bitbucket.
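The detection itself is essentially fingerprint matching: Wappalyzer ships regular expressions for each technology, which are tested against a page's HTML, scripts, and headers. A simplified sketch of the idea, with hypothetical patterns rather than Wappalyzer's actual rules:

```python
import re

# Hypothetical fingerprints (not Wappalyzer's actual rules): map each
# technology name to a regex tested against the page's HTML.
FINGERPRINTS = {
    'jQuery': re.compile(r'jquery', re.IGNORECASE),
    'WordPress': re.compile(r'wp-content', re.IGNORECASE),
    'Google Analytics': re.compile(r'google-analytics\.com'),
}

def detect(html):
    """Return the technologies whose fingerprint appears in the HTML."""
    return sorted(name for name, pattern in FINGERPRINTS.items()
                  if pattern.search(html))
```

A real implementation also checks response headers, meta tags, and cookie names, which is why the script needs to download the page rather than just inspect the URL.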

Here is some example usage:

>>> import builtwith
>>> builtwith.builtwith('')
{'Analytics': 'Google Analytics',
 'Web server': 'Nginx',
 'JavaScript framework': 'jQuery'}
>>> builtwith.builtwith('')
{'Blog': 'WordPress',
 'Analytics': 'Google Analytics',
 'CMS': 'WordPress',
 'Web server': 'Nginx',
 'JavaScript framework': 'jQuery'}
>>> builtwith.builtwith('')
{'JavaScript framework': 'Modernizr',
 'Web framework': 'Microsoft ASP.NET'}

is now!

When I started in this field 3 years ago I was developing the sitescraper tool, but now I use the webscraping package for most work, so the domain name change reflects this shift. Also, the field is commonly known as web scraping, so this is an awesome domain to have.

The old website and email addresses will be redirected to this new domain.