There is a nice website, screenshots.com, that hosts historic screenshots for many websites. This post will show how to generate our own historic screenshots with Python.
First install the dependencies:

$ sudo apt-get install python-qt4
$ sudo pip install webscraping
The source of historic webpages will be the Wayback Machine, which has been caching webpages since 1996. I found that a new beta version is being tested, available here. Unfortunately the default version does not include any caches newer than mid-2011, so this script will depend on the beta.
The Wayback Machine has a convenient URL structure based on timestamp:
http://web-beta.archive.org/web/YYYYMMDDHHMMSS/website
http://web-beta.archive.org/web/20121030031942/http://www.amazon.com/
http://web-beta.archive.org/web/20121123023424/http://www.amazon.com/
If just a partial timestamp is sent then the server will redirect to the nearest match. For example 2011 redirects to December 31, 2011:
http://web-beta.archive.org/web/2011/http://www.amazon.com/
-> http://web-beta.archive.org/web/20111231235154/http://www.amazon.com/
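Building these URLs from a Python datetime is straightforward. Here is a minimal sketch (the `wayback_url` helper name is my own; the web-beta host is from the examples above):

```python
import datetime

def wayback_url(website, when):
    """Build a Wayback Machine URL for a website at a given datetime."""
    # full 14-digit timestamp: YYYYMMDDHHMMSS
    timestamp = when.strftime('%Y%m%d%H%M%S')
    return 'http://web-beta.archive.org/web/%s/%s' % (timestamp, website)

print(wayback_url('http://www.amazon.com/', datetime.datetime(2012, 10, 30, 3, 19, 42)))
# http://web-beta.archive.org/web/20121030031942/http://www.amazon.com/
```

A shorter format string such as `'%Y'` gives the partial timestamps that the server redirects to the nearest match.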
The first step is finding how far back the history is available for a webpage. At the top of the cached page for amazon.com is the message: http://amazon.com has been crawled 10,271 times going all the way back to December 12, 1998.
The link here contains a timestamp for the earliest date, which will be parsed into a datetime object:
import datetime
from webscraping import xpath

def get_earliest_crawl(html):
    # extract the URL of the earliest crawl from the HTML
    earliest_crawl_url = xpath.get(html, '//div[@id="wbMeta"]/p/a/@href')
    # extract the timestamp segment from the URL, e.g. /web/19981212034502/...
    timestamp = earliest_crawl_url.split('/')[2]
    # parse the timestamp into a python datetime
    return datetime.datetime.strptime(timestamp, '%Y%m%d%H%M%S')
Next we need to iterate from this earliest timestamp to the present, downloading each of the cached pages in between. The snippet below uses increments of one year between each timestamp download:
def historical_screenshots(website):
    # D is a webscraping download.Download instance
    # the earliest archived time
    html = D.get('http://web-beta.archive.org/web/%s' % website)
    t0 = get_earliest_crawl(html)
    # the current time
    t1 = datetime.datetime.now()
    # get screenshots for each year
    delta = datetime.timedelta(days=365)
    while t0 <= t1:
        timestamp = t0.strftime('%Y%m%d')
        # download the cached page at this timestamp
        url = 'http://web-beta.archive.org/web/%s/%s/' % (timestamp, website)
        html = D.get(url)
        t0 += delta
Webkit is then used to generate the screenshot of each downloaded webpage. To make the results prettier the Wayback toolbar is removed before rendering:
wb = webkit.WebkitBrowser(gui=True, enable_plugins=True, load_images=True)
# remove wayback toolbar
html = re.compile('<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->', re.DOTALL).sub('', html)
html = re.compile(r'<!--\s+FILE ARCHIVED ON.*?-->', re.DOTALL).sub('', html)
html = re.sub(r'http://web\.archive\.org/web/\d+/', '', html)
wb.get(url, html)
wb.screenshot(screenshot_filename)
This script includes a feature to generate a webpage with the screenshots. Here is an example history from Yahoo (follow the image link to view the complete history):
This is interesting: from 1996 to 2004 the Yahoo homepage became progressively more complex, from 2006 to 2009 it turned drastically simpler, and from 2010 onwards it returned to the traditional portal layout. Presumably they were trying to compete with Google's minimal approach, and when that didn't work they went back to being a media company.
These are among the better results I have generated so far. Often I found the Wayback Machine crawler was blocked or returned tacky domain squatter websites. If you find a website with an interesting history let me know.
When scraping a website, typically the majority of time is spent waiting for the data to download. So to be efficient I work on multiple scraping projects simultaneously.
This can get overwhelming, so to track progress I built a web interface some time back, internally known as wsui. For large crawls I share a secret link with clients so they can also track progress and sample the scraped results.
The below screenshot shows output for a typical crawl:
The yellow graph line tracks downloads, the green line the number of loads from cache, the blue line the number of queued URLs waiting to be downloaded, and the red line download errors. There are also statistics for the number of system threads, memory, and CPU used.
This service was built with web2py and is hosted on an EC2 server.
Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.
That doesn't sound like much fun, so I created csv2mysql.py to define the schema automatically: it parses a CSV file, analyzes the columns to determine appropriate field types, creates a MySQL table with that schema, and then writes the CSV data to the table.
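The exact heuristics in csv2mysql.py are not shown here, but the column analysis can be sketched as follows: try each type from most to least specific against every sample value, and fall back to varchar. The `guess_mysql_type` helper is a simplified illustration of this idea, not the script's actual code:

```python
import datetime

def guess_mysql_type(values):
    """Guess a MySQL column type from a list of sample string values."""
    def all_match(test):
        # ignore empty cells when testing a candidate type
        return all(test(v) for v in values if v.strip())

    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    def matches_format(v, fmt):
        try:
            datetime.datetime.strptime(v, fmt)
            return True
        except ValueError:
            return False

    # test candidate types from most to least specific
    if all_match(is_int):
        return 'integer'
    if all_match(is_float):
        return 'double'
    if all_match(lambda v: matches_format(v, '%Y-%m-%d %H:%M:%S')):
        return 'datetime'
    if all_match(lambda v: matches_format(v, '%Y-%m-%d')):
        return 'date'
    if all_match(lambda v: matches_format(v, '%H:%M:%S')):
        return 'time'
    return 'varchar(255)'
```

Running this over each column of the example below would produce the type list shown in the script output.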
Here is an example spreadsheet:
Name   Age  Height  DOB         Active
John   29   180.3   1980-11-20  12:30:20
Sarah  25   174.5   1990-01-01  07:12:32
Peter  45   156.4   1965-05-02  23:09:33
Now run the importing script:
$ python csv2mysql.py --user=root --database=test --table=test test.csv
Importing `test.csv' into MySQL database `test.test'
Analyzing column types ...
['varchar(255)', 'integer', 'double', 'date', 'time']
Inserting rows ...
Committing rows to database ...
Done!
And check the results in MySQL:
$ mysql -uroot -p test
mysql> describe test;
+--------+--------------+------+-----+---------+----------------+
| Field  | Type         | Null | Key | Default | Extra          |
+--------+--------------+------+-----+---------+----------------+
| id     | int(11)      | NO   | PRI | NULL    | auto_increment |
| name   | varchar(255) | YES  |     | NULL    |                |
| age    | int(11)      | YES  |     | NULL    |                |
| height | double       | YES  |     | NULL    |                |
| dob    | date         | YES  |     | NULL    |                |
| active | time         | YES  |     | NULL    |                |
+--------+--------------+------+-----+---------+----------------+
6 rows in set (0.01 sec)

mysql> SELECT * FROM test;
+----+-------+------+--------+------------+----------+
| id | name  | age  | height | dob        | active   |
+----+-------+------+--------+------------+----------+
|  1 | John  |   29 |  180.3 | 1980-11-20 | 12:30:20 |
|  2 | Sarah |   25 |  174.5 | 1990-01-01 | 07:12:32 |
|  3 | Peter |   45 |  156.4 | 1965-05-02 | 23:09:33 |
+----+-------+------+--------+------------+----------+
3 rows in set (0.00 sec)
As you can see above the name has been stored as a varchar, age as an int, height as a double, dob as a date, and active as a time type.
The source code is available on bitbucket.
Based on feedback from a helpful client I have added support for discount coupons in the data store.
To kick it off the coupon WSTEST will give 25% discount to any database for the next fortnight.
This week Guido van Rossum (the creator of Python) put out a call for experts in asynchronous programming to collaborate on a new API.
Exciting news! From my perspective Python's poor asynchronous support is its main weakness. Currently to download webpages in parallel I have to use system threads, which use a lot of memory. This limits the number of threads I can start when crawling.
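For illustration, the thread-per-worker download pattern I rely on looks roughly like this. The `fetch` parameter is a stand-in for whatever function performs the real HTTP request; every extra worker costs a full system thread stack, which is the memory limit mentioned above:

```python
import queue
import threading

def crawl(urls, fetch, num_threads=10):
    """Download URLs with a pool of system threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        # each worker pulls URLs until the queue is drained
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return
            data = fetch(url)
            with lock:
                results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With light threads (or good async support) the same fan-out would not need one OS thread per concurrent download.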
This led me to evaluate Go and Erlang, whose strength is light threads. I found these languages interesting but there are few people in their communities involved in web scraping so I would need to build much of the infrastructure myself. For now I will stick with Python.
I really hope this move by Guido goes somewhere. When Python 3 was released in 2008 I expected it to overtake Python 2 in popularity within a year, but here we are in 2012. Good async support in Python 3 would (finally) give me incentive to switch.