Posted 20 Jun 2012 in business and website

I have made an app with web2py for listing and selling databases. Hope you like it - and let me know if you have any problems or suggestions.

Around half the databases are free and can be accessed here.


Posted 05 May 2012 in CAPTCHA, OCR, example, and python

Some websites require passing a CAPTCHA to access their content. As I have written before, these can be solved using the deathbycaptcha API; however, for large websites with many CAPTCHAs this becomes prohibitively expensive. For example, solving 1 million CAPTCHAs with this API would cost $1390.

Fortunately many CAPTCHAs are weak and can be solved by cleaning the image and using simple OCR. Here are some example CAPTCHA images from a recent website I worked with:

Helpfully, the distracting marks are lighter than the text, so the image can be thresholded to isolate the text:
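Thresholding itself is simple: any pixel darker than a cutoff becomes black, everything else white. Here is a minimal sketch of the idea on raw grayscale values; the cutoff of 128 is an assumed value you would tune per site, and with PIL the same function could be applied to an image via convert('L') and Image.point:

```python
def threshold(pixels, cutoff=128):
    """Binarize grayscale values (0-255): dark text -> 0, lighter marks -> 255."""
    return [0 if p < cutoff else 255 for p in pixels]

# dark strokes survive as black; the lighter distracting marks become background
print(threshold([20, 90, 150, 230]))  # -> [0, 0, 255, 255]
```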

Now the resulting images can be passed to an OCR program to extract the text. Here are results from 3 popular open source OCR tools:

           Captcha 1   Captcha 2   Captcha 3   Result
Actual     7rrg5       hirbZ       izi3b
Tesseract  7rrq5       hirbZ       izi3b       2 / 3
Gocr       7rr95       _i_bz       izi3b       1 / 3
Ocrad      7rrgS       hi_bL       iLi3b       0 / 3

Excellent results. Getting 100% accuracy is not necessary when solving CAPTCHAs, because real people make mistakes too, so websites will simply respond with another CAPTCHA to solve.

Tesseract only confused ‘g’ with ‘q’, and Gocr thought that ‘g’ was a ‘9’, which is understandable. Even though Ocrad did not get any correct on this small sample set, it was close every time. And this was without training on the font or fixing the text orientation.

If you are interested, the Python code used is available for download here. It depends on PIL for image processing and on each of the OCR tools.


Posted 02 Apr 2012 in business

Business web directories are a great source of data, and scraping them is a common request from clients. Below is my list of directories that I know of for each country or region. I have noticed that directories for poorer countries often disappear, so let me know if a link no longer works.

Location Business directories
Africa http://www.yellowpagesofafrica.com
Argentina http://www.paginasamarillas.com.ar
Australia http://www.hotfrog.com.au
http://www.yellowpages.com.au
Belgium http://www.pagesdor.be
Belarus http://www.b2b.by
Bolivia http://www.boliviaweb.com/business.htm
Brazil http://www.brazilbiz.com.br
http://www.telelistas.net
http://www.guiamais.com.br
Canada http://www.yellowpages.ca
http://www.ziplocal.ca
Chile http://www.chilnet.cl
http://www.amarillas.cl
China http://www.yellowpage.com.cn
Colombia http://www.quehubo.com/colombia/
Cyprus http://www.cyprusyellowpages.com
Czech Republic http://www.zlatestranky.cz
Denmark http://www.degulesider.dk
http://www.kob.dk
Estonia http://www.ee.ee
Europe http://www.europages.net
Finland http://www.keltaisetsivut.fi
http://www.yritystele.fi
France http://www.pagesjaunes.fr
Germany http://www.businessdeutschland.de
http://www.klicktel.de (same site as http://www.11880.com)
http://www.yellow.de (same site as http://www.gelbeseiten.de)
Greece http://www.xo.gr
Hungary http://www.yellowpages.hu
Iceland http://www.gulalinan.is
India http://www.indiacom.com
Indonesia http://www.yellowpages.co.id
Ireland http://www.yourlocal.ie
http://www.goldenpages.ie
Israel http://www.d.co.il
Italy http://www.paginegialle.it
Japan http://itp.ne.jp
Latin America http://www.paginasamarillas.com
http://directory.centramerica.com
Lebanon http://www.pagesjaunes.com.lb
Lithuania http://www.visalietuva.lt
Malaysia http://www.yellowpages.com.my
Mexico http://seccionamarilla.com.mx
http://www.directory.com.mx
http://www.yellow.com.mx
Middle East http://www.ameinfo.com
Myanmar http://www.myanmaryellowpages.biz
Nepal http://www.nepalhomepage.com/yellowpages/
Netherlands http://www.detelefoongids.nl
New Zealand http://yellow.co.nz
Norway http://www.gulesider.no
Peru http://www.denperu.com.pe/denexe/busqueda.asp
Philippines http://www.eyp.ph
Poland http://www.pkt.pl
Portugal http://www.pai.pt
http://www.guianet.pt
Romania http://www.paginiaurii.ro
Russia http://www.sakh.com
Singapore http://www.yellowpages.com.sg
Spain http://www.paginas-amarillas.es
http://es.qdq.com
Sweden http://gulasidorna.eniro.se
Switzerland http://www.local.ch (same site as http://www.pages-jaunes.ch)
http://www.branchenbuch.ch
Turkey http://www.turkindex.com
Ukraine http://www.ukrainet.com.ua
http://www.mercury.odessa.ua
United Kingdom http://yell.com
http://www.192.com
http://www.scoot.co.uk
United States http://www.yellowpages.com
http://www.superpages.com
http://www.allpages.com
Venezuela http://www.pac.com.ve
Vietnam http://www.vietnamonline.com/yp.html
World http://maps.google.com
http://www.infobel.com


Posted 18 Mar 2012 in business

I am often asked whether web scraping is legal, and I always respond the same way: it depends on what you do with the data.

If the data is just for private use then in practice this is fine. However, if you intend to republish the scraped data, then you need to consider what type of data it is.

The US Supreme Court case Feist Publications v. Rural Telephone Service established that scraping and republishing facts such as telephone listings is allowed. A similar case in Australia, Telstra v. Phone Directories, concluded that data cannot be copyrighted if there is no identifiable author. And in the European Union, the case ofir.dk vs home.dk decided that regularly crawling and deep linking is permissible.

So if the scraped data constitutes facts (telephone listings, business locations, etc) then it can be republished. But if the data is original (articles, discussions, etc) then you need to be more careful.

Fortunately most clients who contact me are interested in the former type of data.

Web scraping is the wild west, so laws and precedents are still being developed. And I am not a lawyer.


Posted 14 Feb 2012 in example, python, qt, and webkit

I have received some inquiries about using webkit for web scraping, so here is an example using the webscraping module:

from webscraping import webkit
w = webkit.WebkitBrowser(gui=True) 
# load webpage
w.get('http://duckduckgo.com')
# fill search textbox 
w.fill('input[id=search_form_input_homepage]', 'sitescraper')
# take screenshot of browser
w.screenshot('duckduckgo_search.jpg')
# click search button 
w.click('input[id=search_button_homepage]')
# wait on results page
w.wait(10)
# take another screenshot
w.screenshot('duckduckgo_results.jpg')

Here are the saved screenshots:

I often use webkit when working with websites that rely heavily on JavaScript.

Source code is available on bitbucket.


Posted 10 Feb 2012 in cache, python, and sqlite

When crawling websites I usually cache all HTML on disk to avoid having to re-download it later. I wrote the pdict module to automate this process. Here is an example:

import pdict
# initiate cache
cache = pdict.PersistentDict('test.db')

# compress and store the content in the database
cache[url] = html

# iterate all data in the database
for key in cache:
    print cache[key]

The bottleneck here is the insertions, so for efficiency records can be buffered and then inserted in a single transaction:

# dictionary of data to insert
data = {...}

# cache each record individually (2m49.827s)
cache = pdict.PersistentDict('test.db', max_buffer_size=0)
for k, v in data.items():
    cache[k] = v

# cache all records in a single transaction (0m0.774s)
cache = pdict.PersistentDict('test.db', max_buffer_size=5)
for k, v in data.items():
    cache[k] = v

In this example caching all records at once takes less than a second but caching each record individually takes almost 3 minutes.
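The same effect can be demonstrated with the sqlite3 module directly: committing after every insert pays the transaction overhead (including a disk sync) per record, while batching the records into one transaction amortizes it. This is a minimal sketch of the idea, not pdict's actual implementation; the table name and keys are illustrative:

```python
import sqlite3

def insert_one_by_one(conn, data):
    # commit after every record - one transaction (and fsync) each, slow
    for key, value in data.items():
        conn.execute('INSERT INTO cache VALUES (?, ?)', (key, value))
        conn.commit()

def insert_in_one_transaction(conn, data):
    # buffer all records and commit once - fast
    conn.executemany('INSERT INTO cache VALUES (?, ?)', data.items())
    conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cache (key TEXT PRIMARY KEY, value TEXT)')
insert_in_one_transaction(conn, {'url1': 'html1', 'url2': 'html2'})
insert_one_by_one(conn, {'url3': 'html3'})
print(conn.execute('SELECT COUNT(*) FROM cache').fetchone()[0])  # -> 3
```

On an in-memory database the difference is negligible, but on disk the per-commit sync dominates, which is why buffered insertion above was hundreds of times faster.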