When I started in this field 3 years ago I was developing the sitescraper tool, but now I use the webscraping package for most of my work, so the new domain name reflects this change. The field is also commonly known as web scraping, so webscraping.com is an awesome domain to have.
The old website and email addresses will be redirected to this new domain.
CSV stands for comma separated values. It is a plain text spreadsheet format where the values in each row are separated by commas and the rows are separated by newlines. Here is an example CSV file:
A CSV file can be imported into a database or parsed with a programming language. This flexibility makes CSV the most common output format requested by clients for their scraped data.
Here is an example showing how to parse a CSV file with Python:
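A minimal sketch using Python's standard `csv` module. The sample data and its column names (`name`, `city`, `phone`) are made up for illustration, and `parse_csv` is a hypothetical helper name:

```python
import csv
import io

# Hypothetical sample data: a header row followed by two records.
# The quoted field shows how a value that contains a comma is escaped.
SAMPLE_CSV = '''name,city,phone
"Smith, John",London,020 7946 0000
Jane Doe,Leeds,0113 496 0000
'''

def parse_csv(text):
    """Return a list of dicts, one per row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = parse_csv(SAMPLE_CSV)
for row in rows:
    print(row["name"], "-", row["city"])
```

To read from a file instead of a string, pass `open(path, newline="")` to `csv.DictReader` in place of the `io.StringIO` object.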
Recently I needed to convert a large amount of data between UK Easting / Northing coordinates and Latitude / Longitude. There are web services available that support this conversion but they only permit a few hundred requests / hour, which means it would take weeks to process my quantity of data.
Here is the Python script I developed to perform this conversion quickly with the pyproj module:
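A rough sketch of how such a conversion can be done locally. It assumes a recent pyproj release with the `Transformer` API, which may differ from what the original script used, and it assumes the input is on the British National Grid (EPSG:27700). The helper names `build_transformer` and `to_lat_lng` are hypothetical:

```python
def build_transformer():
    """Create a British National Grid -> WGS84 transformer (needs pyproj)."""
    # Imported here so that to_lat_lng() below has no hard dependency on pyproj.
    from pyproj import Transformer
    # EPSG:27700 is the British National Grid; EPSG:4326 is WGS84 lat/lng.
    # always_xy=True fixes the axis order: (easting, northing) in, (lng, lat) out.
    return Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)

def to_lat_lng(transformer, points):
    """Convert an iterable of (easting, northing) pairs to (lat, lng) pairs."""
    results = []
    for easting, northing in points:
        lng, lat = transformer.transform(easting, northing)
        results.append((lat, lng))
    return results
```

With pyproj installed, `to_lat_lng(build_transformer(), [(530000, 180000)])` should return a point in London. Because the conversion runs locally there is no request limit, and the transformer only needs to be built once for the whole data set.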
Source code is available on Bitbucket.
Around half the databases are free and can be accessed here.
Some websites require passing a CAPTCHA to access their content. As I have written before, these can be solved using the deathbycaptcha API. However, for large websites with many CAPTCHAs this becomes prohibitively expensive. For example, solving 1 million CAPTCHAs with this API would cost $1390.
Fortunately many CAPTCHAs are weak and can be solved by cleaning the image and using simple OCR. Here are some example CAPTCHA images from a recent website I worked with:
Helpfully the distracting marks are lighter so the image can be thresholded to isolate the text:
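The thresholding step can be sketched as follows, assuming the Pillow library is installed. The cutoff of 128 is an arbitrary starting value that would need tuning for each website, and `binarize` and `threshold_image` are hypothetical helper names:

```python
def binarize(value, cutoff=128):
    """Map a grayscale value (0-255) to black (0) or white (255)."""
    # Pixels darker than the cutoff are kept as text; lighter marks become white.
    return 0 if value < cutoff else 255

def threshold_image(path, cutoff=128):
    """Load an image and return a black and white copy (needs Pillow)."""
    from PIL import Image  # imported here so binarize() works without Pillow
    gray = Image.open(path).convert("L")  # "L" is 8-bit grayscale
    # Image.point applies the function to each possible pixel value.
    return gray.point(lambda value: binarize(value, cutoff))
```

The returned image can be saved with its `save` method and then passed to an OCR program.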
Now the resulting images can be passed to an OCR program to extract the text. Here are results from 3 popular open source OCR tools:
| OCR tool | Captcha 1 | Captcha 2 | Captcha 3 | Result |
|---|---|---|---|---|
| Tesseract | 7rrq5 | hirbZ | izi3b | 2 / 3 |
| Gocr | 7rr95 | _i_bz | izi3b | 1 / 3 |
| Ocrad | 7rrgS | hi_bL | iLi3b | 0 / 3 |
Excellent results. Getting 100% accuracy is not necessary when solving CAPTCHAs, because real people make mistakes too, so websites will just respond with another CAPTCHA to solve.
Tesseract's only mistake was confusing ‘g’ with ‘q’, and Gocr read the same ‘g’ as a ‘9’, which is understandable. Even though Ocrad did not get any correct on this small sample set, it was close every time. And this was without training on the font or fixing the text orientation.
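To put a number on how close each tool was, the results can be scored per character as well as per CAPTCHA. This is only a sketch: the correct answers in `TRUTH` are inferred from the results table and the discussion above rather than stated directly, and the function names are made up for illustration:

```python
# Correct answers, inferred from the results table (an assumption).
TRUTH = ["7rrg5", "hirbZ", "izi3b"]

# OCR output for each tool, copied from the results table.
OCR_OUTPUT = {
    "Tesseract": ["7rrq5", "hirbZ", "izi3b"],
    "Gocr": ["7rr95", "_i_bz", "izi3b"],
    "Ocrad": ["7rrgS", "hi_bL", "iLi3b"],
}

def exact_matches(guesses, truth):
    """Count the CAPTCHAs that were read completely correctly."""
    return sum(g == t for g, t in zip(guesses, truth))

def char_matches(guesses, truth):
    """Count the individual characters that were read correctly."""
    return sum(gc == tc
               for g, t in zip(guesses, truth)
               for gc, tc in zip(g, t))

total_chars = sum(len(t) for t in TRUTH)
for tool, guesses in OCR_OUTPUT.items():
    print(f"{tool}: {exact_matches(guesses, TRUTH)} / {len(TRUTH)} CAPTCHAs, "
          f"{char_matches(guesses, TRUTH)} / {total_chars} characters")
```

On this sample Tesseract matches 14 of 15 characters, while Gocr and Ocrad each match 11 of 15.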
Business web directories are a great source of data, and scraping them is a common request from clients. Below is my list of the directories I know of for each country or region. I have noticed that directories for poorer countries often disappear, so let me know if a link no longer works.