Posted 15 May 2015 in database

A significant update to the Android Apps database is now ready; it contains over 2 million apps (2,130,732 to be exact). If you have purchased this database previously, you can log in to your account to download the updated version for free.


Posted 24 Apr 2015 in database

The latest version of the UPC database now contains over 7.5 million products, which is over a million more than the previous version. If you have purchased this database previously, you can log in to your account to download the updated version for free.


Posted 21 Apr 2015 in business

I searched my email and found that over the last few years I have received 76 messages from clients containing the text "Web Scrapping" rather than the usual spelling "Web Scraping". This is not unique to my clients: Google currently has 122,000 results for "Web Scrapping" compared to 447,000 results for "Web Scraping", so the correct spelling returns fewer than 4x as many results. In light of this common spelling mistake I registered the domain webscrapping.com and redirected it here.


Posted 15 Apr 2015 in cookies, example, and python

Sometimes when scraping a website I need my script to log in to access the data of interest. Usually reverse engineering the login form is straightforward, but some websites make this difficult, for example when logging in requires passing a CAPTCHA or when the website allows only one simultaneous login session per account. For difficult cases such as these I have an alternative solution: manually log in to the website of interest in a web browser, and then have my script load and reuse that login session.

I have now packaged this solution as an open source Python module. Here is some example usage:

>>> from webscraping import common, xpath
>>> import requests
>>> import browser_cookie
>>> cj = browser_cookie.load()
>>> r = requests.get('https://bitbucket.org/', cookies=cj)
>>> common.normalize(xpath.get(r.content, '//title'))
'richardpenman / home — Bitbucket'

If you have a Bitbucket account and are logged in with a supported browser, you should see your account name printed here. Currently Firefox (Linux/OSX/Windows) and Chrome (Linux/OSX) are supported, and I will add more platforms if I get the chance to test them.
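
To illustrate why reusing the browser's cookies carries the login session across, here is a minimal sketch using only the standard library. It is not how browser_cookie is implemented: the cookie name, value and domain are made-up placeholders, and make_session_cookie is a hypothetical helper. It only shows that a cookie jar attaches a stored session cookie to requests for the matching domain.

```python
import http.cookiejar
import urllib.request


def make_session_cookie(name, value, domain):
    """Build a cookie similar to one a browser stores after a login.

    Hypothetical helper: the values passed in below are placeholders.
    """
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=True, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = http.cookiejar.CookieJar()
jar.set_cookie(make_session_cookie("sessionid", "abc123", "example.com"))

# A request to the matching domain gets the session cookie attached.
request = urllib.request.Request("https://example.com/account")
jar.add_cookie_header(request)
cookie_header = request.get_header("Cookie")

# A request to an unrelated domain does not.
other = urllib.request.Request("https://example.org/")
jar.add_cookie_header(other)
other_header = other.get_header("Cookie")

print(cookie_header)
print(other_header)
```

In the example above, the jar returned by browser_cookie.load() plays the same role and is handed to requests through the cookies argument.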


Posted 16 Mar 2015 in database

The latest version of the Apple Apps database now contains 978,698 apps, which is 200,000 more than the previous version. If you have purchased this database previously, you can log in to your account to download the latest version for free.


Posted 01 Jun 2014 in efficiency, opensource, and python

I often use Google's geocoding API to find details about a location like this:

>>> from webscraping import download
>>> D = download.Download()
>>> D.geocode('-37.81,144.96')
{'address': "127-141 A'Beckett Street",
 'country': 'Australia',
 'country_code': 'AU',
 'full_address': "127-141 A'Beckett Street, Melbourne VIC 3000, Australia",
 'lat': -37.810035,
 'lng': 144.959875,
 'number': '127-141',
 'postcode': '3000',
 'state': 'Victoria',
 'state_code': 'VIC',
 'street': "A'Beckett Street",
 'suburb': 'Melbourne'}

The drawback of this approach is that the Google API limits each user to 2,500 requests per 24 hours. So if I want to geocode 1 million locations, I would need to rent a lot of proxies or else the API calls would take over a year to complete (1,000,000 / 2,500 = 400 days). To meet this use case I built a module that reverse geocodes a latitude / longitude coordinate using a list of known locations from geonames.

Here is some example usage:

>>> import reverse_geocode
>>> coordinates = (-37.81, 144.96), (31.76, 35.21)
>>> reverse_geocode.search(coordinates)
[{'city': 'Melbourne', 'country_code': 'AU', 'country': 'Australia'},
 {'city': 'Jerusalem', 'country_code': 'IL', 'country': 'Israel'}]

Internally the module uses a k-d tree to efficiently find the nearest neighbour of each given coordinate. On my netbook, building the tree takes ~2.5 seconds and each location query then takes just ~1.5 ms.
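
For readers unfamiliar with k-d trees, here is a rough pure-Python sketch of the nearest neighbour idea. It is not the module's actual code. The city list is a small hand-picked sample with approximate coordinates, and the distance is plain squared Euclidean distance in degrees, which ignores the curvature of the Earth and longitude wraparound.

```python
from collections import namedtuple

Node = namedtuple("Node", "point label left right")

# Hypothetical sample data: ((lat, lng), city) with approximate coordinates
CITIES = [
    ((-37.814, 144.963), "Melbourne"),
    ((31.769, 35.216), "Jerusalem"),
    ((-33.868, 151.207), "Sydney"),
    ((51.507, -0.128), "London"),
    ((32.081, 34.781), "Tel Aviv"),
]


def build(items, depth=0):
    """Recursively build a 2-d tree from (point, label) pairs."""
    if not items:
        return None
    axis = depth % 2  # alternate between latitude and longitude
    items = sorted(items, key=lambda item: item[0][axis])
    mid = len(items) // 2  # the median becomes the root of this subtree
    point, label = items[mid]
    return Node(point, label,
                build(items[:mid], depth + 1),
                build(items[mid + 1:], depth + 1))


def sq_dist(a, b):
    """Squared Euclidean distance in degrees (a simplification)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def nearest(node, target, depth=0, best=None):
    """Return the node closest to target, skipping branches that cannot win."""
    if node is None:
        return best
    if best is None or sq_dist(target, node.point) < sq_dist(target, best.point):
        best = node
    axis = depth % 2
    diff = target[axis] - node.point[axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, depth + 1, best)
    # Search the far side only if the splitting line is closer than the best match
    if diff ** 2 < sq_dist(target, best.point):
        best = nearest(far, target, depth + 1, best)
    return best


tree = build(CITIES)
for query in [(-37.81, 144.96), (31.76, 35.21)]:
    print(query, nearest(tree, query).label)
```

The tree is built once up front and each query then only visits a few branches, which matches the pattern of a slower build followed by fast lookups.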

The module is licensed under the LGPL on bitbucket: https://bitbucket.org/richardpenman/reverse_geocode