These days I am often contacted by businesses asking if I want to try a free trial of their service. A recent one was Luminati, which claimed to have access to millions of IP addresses. They weren’t willing to divulge much over email and their website had less information than it does now, so we set up a Skype call. My contact was a salesman so he wasn’t able to answer technical questions, but he gave me a good overview of what they are trying to do. Apparently they are an Israeli startup that built a peer-to-peer network called Hola, where users install a plugin to access content that is blocked in their region by downloading via other peers in the network. Now that they had millions of users they wanted to monetize this network by reselling it as a proxy service. Great idea, though this was not made clear when I signed up for a test account with Hola, so I doubt most users are aware their bandwidth is being resold.

Unfortunately I found Luminati’s costs are prohibitive for my typical usage:

A medium-scale website requires roughly 50GB of downloading, which would work out at $1000 if downloaded through Luminati, so using this service would only be practical for small websites that quickly block IPs.
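The figures above imply a rate of about $20 per GB ($1000 / 50GB). Here is a back-of-the-envelope estimate along those lines. The flat per-GB rate is an assumption derived from those two numbers only (real pricing may well be tiered), and `proxy_cost` is just an illustrative name:

```python
def proxy_cost(gigabytes, usd_per_gb=20.0):
    """Estimate the cost in USD of downloading `gigabytes` through a paid proxy.

    usd_per_gb defaults to the rate implied by the example above
    ($1000 for 50GB); this flat rate is an assumption, not a quoted price.
    """
    return gigabytes * usd_per_gb


print(proxy_cost(50))  # the medium-scale website from the example
print(proxy_cost(2))   # a small website
```

At that rate a 2GB crawl costs about $40, which is why the service only looks practical for small websites.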

A few years ago I opened a US bank account so that US clients who wanted to pay by bank transfer could avoid needing to make an international transaction. This worked well until last month when clients started reporting their transfers were being rejected. I rang the bank (Chase) and after being transferred between a few departments was told I needed to come into a US branch with my passport to discuss the problem, which couldn’t be handled over the phone. Quite inconvenient because I don’t live in the US and didn’t plan to visit in the near future.

I expect the problem is that receiving transactions from multiple client bank accounts raised some automated red flag, but I won’t know for sure until I next visit the US. So my US bank account was frozen without warning or explanation, and as you can imagine I was not a happy camper.

This experience made me more sympathetic to the Bitcoin Libertarian philosophy where there is no centralized authority to get in the way of doing business. Consequently, I researched how the protocol works and added support for Bitcoin to my data store and invoice system. I originally budgeted a week to figure this all out but found the protocol much simpler than expected, and in the end it took just a weekend, certainly easier than an earlier integration I did with the PayPal API. The only complexity is that because transactions are anonymous I needed to generate a unique bitcoin address for each client so that I know who the transaction is from. This is how it works:

  1. Find the current exchange rate between the client’s currency and Bitcoin
  2. Generate a bitcoin address to receive this transaction
  3. When the expected transaction is received at this address, mark the invoice as paid

That’s it. The service I used has a well-documented API that can handle each of these steps: for exchange rates there is the Ticker API, and for managing and monitoring addresses in steps 2 and 3 there is the Receive API.
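The three steps above can be sketched roughly as follows. This is a minimal illustration rather than my production code: `Invoice`, `InvoiceBook` and `to_btc` are hypothetical names, and the exchange rate and the address generator are passed in by the caller instead of being fetched from a real API.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_UP


@dataclass
class Invoice:
    client: str
    amount_btc: Decimal
    address: str
    paid: bool = False


def to_btc(amount, rate):
    """Step 1: convert a fiat amount to BTC at `rate` (fiat units per BTC).

    The result is rounded up to 8 decimal places, i.e. one satoshi.
    """
    btc = Decimal(str(amount)) / Decimal(str(rate))
    return btc.quantize(Decimal('0.00000001'), rounding=ROUND_UP)


class InvoiceBook:
    def __init__(self, new_address):
        # new_address is a hypothetical callable returning a fresh address
        self._new_address = new_address
        self._by_address = {}

    def create(self, client, amount, rate):
        """Step 2: give each invoice its own address so payments are attributable."""
        invoice = Invoice(client, to_btc(amount, rate), self._new_address())
        self._by_address[invoice.address] = invoice
        return invoice

    def on_payment(self, address, received_btc):
        """Step 3: mark the invoice paid once enough arrives at its address."""
        invoice = self._by_address.get(address)
        if invoice is None:
            return None  # payment to an address we never issued
        if Decimal(str(received_btc)) >= invoice.amount_btc:
            invoice.paid = True
        return invoice
```

Because each address maps to exactly one invoice, the lookup in `on_payment` is what identifies the client. The sketch deliberately ignores confirmations and partial payments, which a real integration would need to consider.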

To display a QR code for the required transaction I used the Google Chart API.
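As a rough sketch of how such a QR code link can be built: the endpoint and the `cht`, `chs` and `chl` parameters below are how I recall the Google Chart QR code interface, so treat them as assumptions and check the current documentation. The payment data uses the `bitcoin:` URI scheme, and the address shown in the usage is a made-up placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Google Chart image API
CHART_ENDPOINT = 'https://chart.googleapis.com/chart'


def bitcoin_uri(address, amount_btc):
    """Build a bitcoin: payment URI for the given address and amount."""
    return 'bitcoin:{}?amount={}'.format(address, amount_btc)


def qr_code_url(data, size=200):
    """Return an image URL that renders `data` as a square QR code."""
    params = {
        'cht': 'qr',                       # chart type: QR code
        'chs': '{0}x{0}'.format(size),     # image size in pixels
        'chl': data,                       # the text to encode
    }
    return CHART_ENDPOINT + '?' + urlencode(params)


# Placeholder address, for illustration only
uri = bitcoin_uri('1ExampleAddressOnly', '0.25')
print(qr_code_url(uri))
```

The resulting URL can be used directly as the `src` of an image on the invoice page, so wallets that scan it get both the address and the amount.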

If you experience any problems with this support for Bitcoin or have suggestions to make it more intuitive, please get in touch.

A significant update to the Android Apps database is now ready, which contains over 2 million apps (2,130,732 to be exact). If you have purchased this database previously you can log in to your account to download the updated version for free.

The latest version of the UPC database now contains over 7.5 million products, which is over a million more than the previous version. If you have purchased this database previously you can log in to your account to download the updated version for free.

I searched my email and found that over the last few years I received 76 messages from clients containing the text “Web Scrapping” rather than the usual spelling “Web Scraping”. And this is not unique to my clients: currently Google has 122,000 results for “Web Scrapping” compared to 447,000 results for “Web Scraping”, so the correct spelling returns fewer than 4x the number of results. So in light of this common spelling mistake I registered the domain and redirected it here.

Sometimes when scraping a website I need my script to log in to access the data of interest. Usually reverse engineering the login form is straightforward, however some websites make this difficult, for example if login requires passing a CAPTCHA or if the website only allows one simultaneous login session per account. For difficult cases such as these I have an alternative solution: manually log in to the website of interest in a web browser and then have my script load and reuse the login session.

I have now packaged this solution as an open source Python module. Here is some example usage:

>>> from webscraping import common, xpath
>>> import requests
>>> import browser_cookie
>>> cj = browser_cookie.load()
>>> r = requests.get('', cookies=cj)
>>> common.normalize(xpath.get(r.content, '//title'))
'richardpenman / home — Bitbucket'

If you have a Bitbucket account and are logged in with a supported browser then you should see your account name printed here. Currently Firefox (Linux/OSX/Windows) and Chrome (Linux/OSX) are supported, and I will add more platforms if I get the chance to test.
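To illustrate the general idea of reusing a browser session (this is not the module’s actual code), here is a minimal standard-library sketch. It assumes a Firefox-style `cookies.sqlite` file with a `moz_cookies` table holding `host`, `path`, `isSecure`, `expiry`, `name` and `value` columns; `load_firefox_cookies` is a hypothetical helper name, and locating the profile directory or decrypting Chrome cookies is left out.

```python
import http.cookiejar
import sqlite3


def load_firefox_cookies(sqlite_path, domain=''):
    """Build a CookieJar from a Firefox-style cookies.sqlite file.

    Only cookies whose host ends with `domain` are loaded; the default
    empty string loads everything.
    """
    jar = http.cookiejar.CookieJar()
    conn = sqlite3.connect(sqlite_path)
    try:
        rows = conn.execute(
            'SELECT host, path, isSecure, expiry, name, value '
            'FROM moz_cookies WHERE host LIKE ?', ('%' + domain,))
        for host, path, secure, expiry, name, value in rows:
            dotted = host.startswith('.')  # leading dot means subdomains match
            jar.set_cookie(http.cookiejar.Cookie(
                0, name, value,        # version, name, value
                None, False,           # port, port_specified
                host, dotted, dotted,  # domain, domain_specified, initial dot
                path, True,            # path, path_specified
                bool(secure), expiry,  # secure, expires
                False, None, None, {}  # discard, comment, comment_url, rest
            ))
    finally:
        conn.close()
    return jar
```

The returned jar can be passed to `requests.get(url, cookies=jar)` in the same way as `cj` in the example above, or attached to `urllib.request` through an `HTTPCookieProcessor`.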