Posted 29 Dec 2012 in business, crawling, and web2py

When scraping a website, typically the majority of time is spent waiting for the data to download. So to be efficient I work on multiple scraping projects simultaneously.

This can get overwhelming, so to track progress I built a web interface some time back, internally known as wsui. For large crawls I share a secret link with clients so they can also track progress and sample the scraped results.

The below screenshot shows output for a typical crawl:

The yellow graph line tracks downloads, the green the number of loads from cache, the blue the number of queued URLs waiting to be downloaded, and the red download errors. There are also statistics for the number of system threads, memory, and CPU used.

This service was built with web2py and is hosted on an EC2 server.


Posted 08 Dec 2012 in example, opensource, and python

Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.

That doesn't sound like much fun, so I wrote a script to handle it automatically: csv2mysql.py parses a CSV file, creates a MySQL table with appropriate field types, and then writes the CSV data to the table.
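
Before walking through an example, here is a minimal sketch of the type detection idea (my own illustration, not the actual csv2mysql.py code): guess the most specific MySQL type each value fits, and widen a column to varchar whenever rows disagree.

import csv
from datetime import datetime

def guess_type(value):
    """Return the most specific MySQL type that can hold this value."""
    try:
        int(value)
        return 'integer'
    except ValueError:
        pass
    try:
        float(value)
        return 'double'
    except ValueError:
        pass
    for pattern, mysql_type in (('%Y-%m-%d', 'date'), ('%H:%M:%S', 'time')):
        try:
            datetime.strptime(value, pattern)
            return mysql_type
        except ValueError:
            pass
    return 'varchar(255)'

def guess_column_types(filename):
    """Guess a MySQL type for each column of a CSV file."""
    types = []
    reader = csv.reader(open(filename))
    next(reader)  # skip the header row
    for row in reader:
        guesses = [guess_type(value) for value in row]
        if not types:
            types = guesses
        else:
            # if a column's guess changes between rows, fall back to varchar
            types = [a if a == b else 'varchar(255)'
                     for a, b in zip(types, guesses)]
    return types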

Here is an example spreadsheet:

Name  Age Height DOB        Active
John  29  180.3  1980-11-20 12:30:20
Sarah 25  174.5  1990-01-01 07:12:32
Peter 45  156.4  1965-05-02 23:09:33

Now run the importing script:

$ python csv2mysql.py --user=root --database=test --table=test test.csv
Importing `test.csv' into MySQL database `test.test'
Analyzing column types ...
['varchar(255)', 'integer', 'double', 'date', 'time']
Inserting rows ...
Committing rows to database ...
Done!

And check the results in MySQL:

$ mysql -uroot -p test
mysql> describe test;
+--------+--------------+------+-----+---------+----------------+
| Field  | Type         | Null | Key | Default | Extra          |
+--------+--------------+------+-----+---------+----------------+
| id     | int(11)      | NO   | PRI | NULL    | auto_increment |
| name   | varchar(255) | YES  |     | NULL    |                |
| age    | int(11)      | YES  |     | NULL    |                |
| height | double       | YES  |     | NULL    |                |
| dob    | date         | YES  |     | NULL    |                |
| active | time         | YES  |     | NULL    |                |
+--------+--------------+------+-----+---------+----------------+
6 rows in set (0.01 sec)

mysql> SELECT * FROM test;
+----+-------+------+--------+------------+----------+
| id | name  | age  | height | dob        | active   |
+----+-------+------+--------+------------+----------+
|  1 | John  |   29 |  180.3 | 1980-11-20 | 12:30:20 |
|  2 | Sarah |   25 |  174.5 | 1990-01-01 | 07:12:32 |
|  3 | Peter |   45 |  156.4 | 1965-05-02 | 23:09:33 |
+----+-------+------+--------+------------+----------+
3 rows in set (0.00 sec)

As you can see above, the name has been stored as a varchar, age as an int, height as a double, dob as a date, and active as a time type.
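
For completeness, the table creation and row insertion can be sketched with MySQLdb along these lines (simplified; the connection details are illustrative, and the real csv2mysql.py handles quoting, NULLs, and other edge cases):

import csv
import MySQLdb

def load_csv(filename, table, columns, types):
    """Create a table with the given column types and load the CSV rows."""
    db = MySQLdb.connect(host='localhost', user='root', db='test')
    cursor = db.cursor()
    # build the schema from the guessed column types
    fields = ', '.join('%s %s' % (name, col_type)
                       for name, col_type in zip(columns, types))
    cursor.execute('CREATE TABLE %s (id INT AUTO_INCREMENT PRIMARY KEY, %s)' % (table, fields))

    reader = csv.reader(open(filename))
    next(reader)  # skip the header row
    placeholders = ', '.join(['%s'] * len(columns))
    sql = 'INSERT INTO %s (%s) VALUES (%s)' % (table, ', '.join(columns), placeholders)
    for row in reader:
        cursor.execute(sql, row)
    db.commit()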

The source code is available on Bitbucket.


Posted 15 Nov 2012 in business and website

Based on feedback from a helpful client I have added support for discount coupons in the data store.

To kick it off, the coupon WSTEST will give a 25% discount on any database for the next fortnight.


Posted 20 Oct 2012 in big picture, concurrent, and python

This week Guido van Rossum (the author of Python) put out a call for experts in asynchronous programming to collaborate on a new API.

Exciting news! From my perspective, Python's poor asynchronous support is its main weakness. Currently, to download webpages in parallel I have to use system threads, which consume a lot of memory. This limits the number of threads I can start when crawling.
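
To illustrate, downloading in parallel with system threads looks roughly like this sketch (standard library only, with the thread count and error handling simplified); each thread reserves its own stack, which is where the memory goes:

import threading
import urllib2
from Queue import Queue, Empty

def threaded_download(urls, num_threads=20):
    """Download each URL using a pool of system threads."""
    queue = Queue()
    for url in urls:
        queue.put(url)
    results = {}

    def worker():
        while True:
            try:
                url = queue.get_nowait()
            except Empty:
                return  # no more URLs, let this thread exit
            try:
                results[url] = urllib2.urlopen(url).read()
            except Exception:
                results[url] = None  # record the failure and keep going
            queue.task_done()

    for _ in range(num_threads):
        thread = threading.Thread(target=worker)
        thread.daemon = True
        thread.start()
    queue.join()  # block until every queued URL has been handled
    return results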

To address this shortcoming there are external solutions such as Twisted and gevent; however, I found Twisted not flexible enough for my use and gevent unstable.

This led me to evaluate Go and Erlang, whose strength is lightweight threads. I found these languages interesting, but few people in their communities are involved in web scraping, so I would need to build much of the infrastructure myself. For now I will stick with Python.

I really hope this move by Guido goes somewhere. When Python 3 was released in 2008 I expected it to overtake Python 2 in popularity within a year, but here we are in 2012. Good async support in Python 3 would (finally) give me incentive to switch.


Posted 14 Oct 2012 in cache, crawling, and python

If a website is offline or restricts how quickly it can be crawled, then downloading from someone else's cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1998.

Here is the list of downloads available for a single webpage, amazon.com:

Or to download the webpage at a certain date:

2011: http://wayback.archive.org/web/2011/amazon.com
Jan 1st, 2011: http://web.archive.org/web/20110101/http://www.amazon.com/
Latest date: http://web.archive.org/web/2/http://www.amazon.com/
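
For example, the last pattern can be fetched with just the standard library; this quick sketch only shows the URL format in action, while the webscraping library below wraps it up more conveniently:

import urllib2

# request the most recent Wayback Machine snapshot of amazon.com
url = 'http://web.archive.org/web/2/http://www.amazon.com/'
html = urllib2.urlopen(url).read()
print len(html)  # size of the archived HTML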

The webscraping library includes a function to download webpages from the Wayback Machine. Here is an example:

from webscraping import download, xpath
D = download.Download()
url = 'http://amazon.com'
html1 = D.get(url)          # download the webpage directly
html2 = D.archive_get(url)  # download via the Wayback Machine cache
for html in (html1, html2):
    print xpath.get(html, '//title')  # both copies should have the same title

This example downloads the same webpage directly and via the Wayback Machine. Then it parses the title to show the same webpage has been downloaded. The output when run is:

Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more


Posted 21 Sep 2012 in example, opensource, and python

When crawling websites it can be useful to know what technology a website was developed with. For example, with an ASP.NET website I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.

There is a useful Firefox / Chrome extension called Wappalyzer that will tell you what technology a website has been made with. However, I needed this functionality available from the command line, so I converted the extension into a Python script, now available on Bitbucket.

Here is some example usage:

>>> from builtwith import builtwith
>>> builtwith('http://webscraping.com')
{'Analytics': 'Google Analytics',
 'Web server': 'Nginx',
 'JavaScript framework': 'jQuery'}
>>> builtwith('http://wordpress.com')
{'Blog': 'WordPress',
 'Analytics': 'Google Analytics',
 'CMS': 'WordPress',
 'Web server': 'Nginx',
 'JavaScript framework': 'jQuery'}
>>> builtwith('http://microsoft.com')
{'JavaScript framework': 'Modernizr',
 'Web framework': 'Microsoft ASP.NET'}
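
For the curious, the underlying approach is straightforward: download the page and match its HTML and headers against known signatures. The rules below are my own toy examples rather than Wappalyzer's actual database, but they show the shape of it:

import re
import urllib2

# a few illustrative signatures -- the real script uses Wappalyzer's
# much larger rule set covering hundreds of technologies
SIGNATURES = {
    'CMS': [('WordPress', re.compile('wp-content|content="WordPress', re.IGNORECASE))],
    'JavaScript framework': [('jQuery', re.compile('jquery', re.IGNORECASE))],
    'Analytics': [('Google Analytics', re.compile(r'google-analytics\.com', re.IGNORECASE))],
}

def detect(url):
    """Return a guess at the technologies used to build this website."""
    response = urllib2.urlopen(url)
    html = response.read()
    results = {}
    for category, signatures in SIGNATURES.items():
        for name, regex in signatures:
            if regex.search(html):
                results[category] = name
    server = response.info().getheader('Server')
    if server:
        results['Web server'] = server
    return results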