When scraping a website, typically the majority of time is spent waiting for the data to download. So to be efficient I work on multiple scraping projects simultaneously.
This can get overwhelming, so to track progress I built a web interface some time back, internally known as wsui. For large crawls I share a secret link with clients so they can also track progress and sample the scraped results.
The below screenshot shows output for a typical crawl:
The yellow graph line tracks downloads, the green the number of loads from cache, the blue the number of queued URLs waiting to be downloaded, and the red download errors. There are also statistics for the number of system threads, memory, and CPU used.
This service was built with web2py and is hosted on an ec2 server.
Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.
That doesn’t sound much fun so I created the below solution to automatically define a spreadsheet schema by analyzing determine So to address csv2mysql.py automatically parses a CSV file, creates MySQL table with appropriate field types, and then writes CSV data to the table.
Here is an example spreadsheet:
Now run the importing script:
And check the results in MySQL:
As you can see above the name has been stored as a varchar, age as an int, height as a double, dob as a date, and active as a time type.
The source code is available on bitbucket.
Based on feedback from a helpful client I have added support for discount coupons in the data store.
To kick it off the coupon WSTEST will give 25% discount to any database for the next fortnight.
This week Guido Van Rossum (author of Python) put out a call for experts at asynchronous programming to collaborate on a new API.
Exciting news! From my perspective Python’s poor asynchronous support is its main weakness. Currently to download webpages in parallel I have to use system threads, which use a lot of memory. This limits the number of threads I can start when crawling.
This led me to evaluate Go and Erlang, whose strength is light threads. I found these languages interesting but there are few people in their communities involved in web scraping so I would need to build much of the infrastructure myself. For now I will stick with Python.
I really hope this move by Guido goes somewhere. When Python 3 was released in 2008 I expected it to overtake Python 2 in popularity within a year, but here we are in 2012. Good async support in Python 3 would (finally) give me incentive to switch.
If a website is offline or restricts how quickly it can be crawled then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1998.
Here are the list of downloads available for a single webpage, amazon.com:
Or to download the webpage at a certain date:
|Jan 1st, 2011:||http://web.archive.org/web/20110101/http://www.amazon.com/|
The webscraping library includes a function to download webpages from the Wayback Machine. Here is an example:
This example downloads the same webpage directly and via the Wayback Machine. Then it parses the title to show the same webpage has been downloaded. The output when run is:
Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
When crawling websites it can be useful to know what technology has been used to develop a website. For example with a ASP.net website I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.
There is a useful Firefox / Chrome extension called Wappalyzer that will tell you what technology a website has been made with. However I needed this functionality available from the command line so converted the extension into a python script, now available on bitbucket.
Here is some example usage: