Posted 01 Jun 2014 in efficiency, opensource, and python

I often use Google's geocoding API to find details about a location like this:

>>> from webscraping import download
>>> D = download.Download()
>>> D.geocode('-37.81,144.96')
{'address': "127-141 A'Beckett Street",
 'country': 'Australia',
 'country_code': 'AU',
 'full_address': "127-141 A'Beckett Street, Melbourne VIC 3000, Australia",
 'lat': -37.810035,
 'lng': 144.959875,
 'number': '127-141',
 'postcode': '3000',
 'state': 'Victoria',
 'state_code': 'VIC',
 'street': "A'Beckett Street",
 'suburb': 'Melbourne'}

The drawback of this approach is that the Google API limits each user to 2,500 requests per 24 hours. So if I want to geocode 1 million locations I would either need to rent a lot of proxies or wait over a year for the API calls to complete (1,000,000 / 2,500 = 400 days). To cover this use case I built a module to reverse geocode a latitude / longitude coordinate against a list of known locations from geonames.

Here is some example usage:

>>> import reverse_geocode
>>> coordinates = (-37.81, 144.96), (31.76, 35.21)
>>> reverse_geocode.search(coordinates)
[{'city': 'Melbourne', 'country_code': 'AU', 'country': 'Australia'},
 {'city': 'Jerusalem', 'country_code': 'IL', 'country': 'Israel'}]

Internally the module uses a k-d tree to efficiently find the nearest neighbour of each given coordinate. On my netbook building the tree takes ~2.5 seconds and then each location query takes just ~1.5 ms.
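
The module's internals are not reproduced here, but the nearest-neighbour idea is easy to sketch with scipy's cKDTree. The tiny city list and field names below are illustrative stand-ins for the geonames data the module actually loads:

# a minimal sketch of k-d tree reverse geocoding, using a hand-made
# city list in place of the geonames data the module loads
from scipy.spatial import cKDTree

cities = [
    {'city': 'Melbourne', 'country_code': 'AU', 'country': 'Australia',
     'coordinate': (-37.814, 144.963)},
    {'city': 'Jerusalem', 'country_code': 'IL', 'country': 'Israel',
     'coordinate': (31.769, 35.216)},
]

# build the k-d tree once from the known coordinates (the slow step)
tree = cKDTree([c['coordinate'] for c in cities])

def search(coordinates):
    # find the index of the nearest known location for each query point
    distances, indices = tree.query(coordinates)
    return [cities[i] for i in indices]

print(search([(-37.81, 144.96), (31.76, 35.21)]))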

The module is licensed under the LGPL and available on bitbucket: https://bitbucket.org/richardpenman/reverse_geocode


Posted 22 Oct 2013 in Business

A few friends have asked me what web services I use to run my business, so I am writing this post to point people to in future.

Email

To manage email I originally used Google Apps. When I signed up this was free for 10 users but now costs $50 per user for new accounts: http://googleenterprise.blogspot.com/2012/12/changes-to-google-apps-for-businesses.html.

I eventually found email not structured enough, so I built my own basic ticket system on Google App Engine. Then I came across Zendesk, which did everything my custom system did but with a better design, so I migrated there.

Invoicing

For invoicing I originally just sent an email with the price, but quickly discovered companies require something more official-looking. For version 2 I found a Google Docs template and modified it with the client details for each project. This was cumbersome, so I built a Zendesk extension to automatically generate invoices from the ticket details. For someone non-technical I would recommend Freshbooks.

To accept payments I use PayPal, which supports credit card transactions with a premier account. Recently I opened bank accounts in the US and UK to support direct bank transactions, however surprisingly most clients prefer PayPal.

PayPal provides poor currency conversion rates, so I have looked for alternatives. Stripe is appealing, but at the time they did not support Australia. I also tried Moneybookers, but none of my clients use them and their customer support is appalling. So I am stuck with PayPal for now.

Hosting

For crawling I originally used my home computer. When that became too slow I rented an Amazon EC2 server, but found it overly expensive because they charge per I/O operation and for bandwidth. Then I tried several dedicated hosting providers and am currently on Digital Ocean.

For this website I have used WebFaction for over 5 years now, and several times they have increased the server specs and features while keeping the price constant. My only complaint is that their changing SMTP policy caused me to miss some client emails.

For sharing files I use Google Storage. This is a similar service to S3 but I was given a year free hosting and it comes with a nice command line tool and web interface, so I have never considered migrating.

Conclusion

Basically I would recommend starting simple and investing in a better tool only when you know you need it. Your business will probably fail anyway! So don't spend your time configuring a helpdesk system before you have any clients.


Posted 09 Mar 2013 in business

For the next 6 months I am going to be working onsite at a web scraping related startup in New York. Looking forward to the experience! During this time I will keep working on existing projects but limit the new projects I take on to make room for the day job.


Posted 03 Jan 2013 in opensource, python, qt, and webkit

There is a nice website screenshots.com that hosts historic screenshots for many websites. This post will show how to generate our own historic screenshots with python.

Source code is available on bitbucket. The script depends on Python 2.5+, PyQt, and the webscraping package. For Debian-based distributions this will do it:

$ sudo apt-get install python-qt4
$ sudo pip install webscraping

The source of historic webpages will be the Wayback Machine, which has been caching webpages since 1996. I found that a new beta version is being tested, available here. Unfortunately the default version does not include any new caches since mid-2011, so this script will depend on the beta.

The Wayback Machine has a convenient URL structure based on timestamp:

http://web-beta.archive.org/web/YYYYMMDDHHMMSS/website
http://web-beta.archive.org/web/20121030031942/http://www.amazon.com/
http://web-beta.archive.org/web/20121123023424/http://www.amazon.com/

If just a partial timestamp is sent then the server will redirect to the nearest match. For example 2011 redirects to December 31, 2011:

http://web-beta.archive.org/web/2011/http://www.amazon.com/
->
http://web-beta.archive.org/web/20111231235154/http://www.amazon.com/
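
As a quick check outside the script, the redirect target can be inspected with the standard library; the final URL after the redirect contains the full timestamp of the nearest snapshot. The urllib2 call here is just for illustration, since the script itself uses the webscraping download module:

import urllib2

# request a partial timestamp and let the server redirect to the nearest snapshot
response = urllib2.urlopen('http://web-beta.archive.org/web/2011/http://www.amazon.com/')
# geturl() returns the final URL after redirects,
# e.g. http://web-beta.archive.org/web/20111231235154/http://www.amazon.com/
print(response.geturl())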

The first step is finding how far back the history is available for a webpage. At the top of the cached page for amazon.com is the message: http://amazon.com has been crawled 10,271 times going all the way back to December 12, 1998.

The link here contains a timestamp for the earliest date, which will be parsed into a datetime object:

import datetime
from webscraping import xpath

def get_earliest_crawl(html):
    # extract the URL of the earliest crawl from the HTML
    earliest_crawl_url = xpath.get(html, '//div[@id="wbMeta"]/p/a[2]/@href')
    # extract the timestamp from the URL
    timestamp = earliest_crawl_url.split('/')[2]
    # parse the timestamp into a python datetime
    return datetime.datetime.strptime(timestamp, '%Y%m%d%H%M%S')

Next we need to iterate from this earliest timestamp to the present and download each of the cached pages in between. The below snippet uses increments of a year between each download:

from webscraping import download

D = download.Download()

def historical_screenshots(website):
    # download the latest cached page and parse the earliest archived time from it
    html = D.get('http://web-beta.archive.org/web/%s' % website)
    t0 = get_earliest_crawl(html)
    # the current time
    t1 = datetime.datetime.now()
    # step a year between each screenshot
    delta = datetime.timedelta(days=365)

    while t0 <= t1:
        timestamp = t0.strftime('%Y%m%d')
        # download the cached page nearest this timestamp
        url = 'http://web-beta.archive.org/web/%s/%s/' % (timestamp, website)
        html = D.get(url)
        t0 += delta

Webkit is then used to generate the screenshot of each downloaded webpage. To make the results prettier the Wayback toolbar is removed before rendering:

import re
from webscraping import webkit

wb = webkit.WebkitBrowser(gui=True, enable_plugins=True, load_images=True)
# remove the wayback toolbar and archive comments
html = re.compile('<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->', re.DOTALL).sub('', html)
html = re.compile('<!--\s+FILE ARCHIVED ON.*?-->', re.DOTALL).sub('', html)
# rewrite archive links back to the original URLs
html = re.sub('http://web\.archive\.org/web/\d+/', '', html)
# render the cleaned HTML and save a screenshot
wb.get(url, html)
wb.screenshot(screenshot_filename)

This script includes a feature to generate a webpage with the screenshots. Here is an example history from Yahoo (follow the image link to view the complete history):

This is interesting - from 1996 to 2004 the Yahoo homepage became progressively more complex, from 2006 to 2009 it became drastically simpler, and from 2010 onwards it returned to the traditional portal style. They must have been trying to compete with Google's simple approach, but that didn't work, so they went back to being a media company.

Here are some more screenshot histories for Apple, Amazon, and IMDB.

These are among the better results I have generated so far. Often I found the Wayback Machine crawler had been blocked or had captured tacky domain squatter websites. If you find a website with an interesting history, let me know.


Posted 29 Dec 2012 in business, crawling, and web2py

When scraping a website, typically the majority of time is spent waiting for the data to download. So to be efficient I work on multiple scraping projects simultaneously.

This can get overwhelming, so to track progress I built a web interface some time back, internally known as wsui. For large crawls I share a secret link with clients so they can also track progress and sample the scraped results.

The below screenshot shows output for a typical crawl:

The yellow graph line tracks downloads, the green the number of loads from cache, the blue the number of queued URLs waiting to be downloaded, and the red download errors. There are also statistics for the number of system threads, memory, and CPU used.

This service was built with web2py and is hosted on an EC2 server.


Posted 08 Dec 2012 in example, opensource, and python

Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.

That doesn't sound like much fun, so I created the below solution to define a spreadsheet's schema automatically by analyzing its data: csv2mysql.py parses a CSV file, creates a MySQL table with appropriate field types, and then writes the CSV data to the table.

Here is an example spreadsheet:

Name  Age Height DOB        Active
John  29  180.3  1980-11-20 12:30:20
Sarah 25  174.5  1990-01-01 07:12:32
Peter 45  156.4  1965-05-02 23:09:33

Now run the importing script:

$ python csv2mysql.py --user=root --database=test --table=test test.csv
Importing `test.csv' into MySQL database `test.test'
Analyzing column types ...
['varchar(255)', 'integer', 'double', 'date', 'time']
Inserting rows ...
Committing rows to database ...
Done!

And check the results in MySQL:

$ mysql -uroot -p test
mysql> describe test;
+--------+--------------+------+-----+---------+----------------+
| Field  | Type         | Null | Key | Default | Extra          |
+--------+--------------+------+-----+---------+----------------+
| id     | int(11)      | NO   | PRI | NULL    | auto_increment |
| name   | varchar(255) | YES  |     | NULL    |                |
| age    | int(11)      | YES  |     | NULL    |                |
| height | double       | YES  |     | NULL    |                |
| dob    | date         | YES  |     | NULL    |                |
| active | time         | YES  |     | NULL    |                |
+--------+--------------+------+-----+---------+----------------+
6 rows in set (0.01 sec)

mysql> SELECT * FROM test;
+----+-------+------+--------+------------+----------+
| id | name  | age  | height | dob        | active   |
+----+-------+------+--------+------------+----------+
|  1 | John  |   29 |  180.3 | 1980-11-20 | 12:30:20 |
|  2 | Sarah |   25 |  174.5 | 1990-01-01 | 07:12:32 |
|  3 | Peter |   45 |  156.4 | 1965-05-02 | 23:09:33 |
+----+-------+------+--------+------------+----------+
3 rows in set (0.00 sec)

As you can see above, name has been stored as a varchar, age as an int, height as a double, dob as a date, and active as a time type.
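
The column type analysis is the interesting part. Here is a minimal sketch of one way to infer the types, by trying progressively stricter parses over every value in a column and falling back to varchar; the actual rules in csv2mysql.py may differ:

from datetime import datetime

def guess_type(values):
    # a column matches a type only if every non-empty value parses as that type
    def all_parse(parser):
        return all(parser(v) for v in values if v)

    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    def matches(fmt):
        def parser(v):
            try:
                datetime.strptime(v, fmt)
                return True
            except ValueError:
                return False
        return parser

    # try the strictest types first, then fall back to varchar
    if all_parse(is_int):
        return 'integer'
    if all_parse(is_float):
        return 'double'
    if all_parse(matches('%Y-%m-%d')):
        return 'date'
    if all_parse(matches('%H:%M:%S')):
        return 'time'
    return 'varchar(255)'

print(guess_type(['29', '25', '45']))            # integer
print(guess_type(['180.3', '174.5', '156.4']))   # double
print(guess_type(['1980-11-20', '1990-01-01']))  # date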

The source code is available on bitbucket.