Page 3 of 8 for Blog | WebScraping.com

What is CSV?

Learn Example August 25, 2012

CSV stands for comma-separated values. It is a spreadsheet format where the columns in each row are separated by commas and each row ends with a newline. Here is an example CSV file:
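The example file itself is not included in this excerpt. As a stand-in, here is a minimal sketch using Python's standard csv module. The sample rows are invented for illustration and are not the data from the original post. It shows that a field containing a comma is wrapped in quotes so that it still parses as a single column.

```python
import csv
import io

# Hypothetical sample data, invented for illustration
rows = [
    ["name", "city"],
    ["Alice", "London"],
    ["Bob", "Portland, Oregon"],  # this field contains a comma
]

# Write the rows as CSV text: one line per row, commas between columns
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back; the quoted field is restored as a single column
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed == rows)
```

Using the csv module rather than splitting on commas by hand avoids breaking on quoted fields like the one above.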

Converting UK Easting / Northing to Latitude / Longitude

Example Python July 09, 2012

Recently I needed to convert a large amount of data between UK Easting / Northing coordinates and Latitude / Longitude. There are web services that support this conversion, but they only permit a few hundred requests per hour, which means it would take weeks to process that much data.
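One way to avoid the rate limit is to do the conversion locally. The sketch below is not necessarily the approach taken in the full post. It assumes the standard Ordnance Survey inverse Transverse Mercator series on the Airy 1830 ellipsoid, and the function names are my own. Note that the result is on the OSGB36 datum, so a further datum shift (for example a Helmert transformation) is needed to get WGS84 coordinates as used by GPS.

```python
import math

# Airy 1830 ellipsoid and National Grid projection constants
A, B = 6377563.396, 6356256.909        # semi-major / semi-minor axes (metres)
F0 = 0.9996012717                      # scale factor on the central meridian
LAT0, LON0 = math.radians(49), math.radians(-2)  # true origin
N0, E0 = -100000.0, 400000.0           # northing / easting of true origin
E2 = 1 - (B * B) / (A * A)             # eccentricity squared
N = (A - B) / (A + B)


def meridional_arc(lat):
    """Developed arc of the meridian from LAT0 to lat, in metres."""
    n2, n3 = N ** 2, N ** 3
    d, s = lat - LAT0, lat + LAT0
    ma = (1 + N + 1.25 * n2 + 1.25 * n3) * d
    mb = (3 * N + 3 * n2 + 2.625 * n3) * math.sin(d) * math.cos(s)
    mc = (1.875 * n2 + 1.875 * n3) * math.sin(2 * d) * math.cos(2 * s)
    md = (35 / 24) * n3 * math.sin(3 * d) * math.cos(3 * s)
    return B * F0 * (ma - mb + mc - md)


def grid_to_latlon(easting, northing):
    """Convert National Grid metres to OSGB36 (lat, lon) in degrees."""
    lat, m = LAT0, 0.0
    for _ in range(100):  # iterate until the northing residual is tiny
        lat += (northing - N0 - m) / (A * F0)
        m = meridional_arc(lat)
        if abs(northing - N0 - m) < 1e-5:
            break
    sin2 = math.sin(lat) ** 2
    nu = A * F0 / math.sqrt(1 - E2 * sin2)
    rho = A * F0 * (1 - E2) / (1 - E2 * sin2) ** 1.5
    eta2 = nu / rho - 1
    t, sec = math.tan(lat), 1 / math.cos(lat)
    t2, t4, t6 = t ** 2, t ** 4, t ** 6
    vii = t / (2 * rho * nu)
    viii = t / (24 * rho * nu ** 3) * (5 + 3 * t2 + eta2 - 9 * t2 * eta2)
    ix = t / (720 * rho * nu ** 5) * (61 + 90 * t2 + 45 * t4)
    x = sec / nu
    xi = sec / (6 * nu ** 3) * (nu / rho + 2 * t2)
    xii = sec / (120 * nu ** 5) * (5 + 28 * t2 + 24 * t4)
    xiia = sec / (5040 * nu ** 7) * (61 + 662 * t2 + 1320 * t4 + 720 * t6)
    de = easting - E0
    lat = lat - vii * de ** 2 + viii * de ** 4 - ix * de ** 6
    lon = LON0 + x * de - xi * de ** 3 + xii * de ** 5 - xiia * de ** 7
    return math.degrees(lat), math.degrees(lon)


print(grid_to_latlon(651409.903, 313177.270))
```

Because this is pure arithmetic with no network calls, the throughput is limited only by the CPU rather than by a web service quota.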

Solving CAPTCHA with OCR

Python Captcha Ocr Example May 05, 2012

Some websites require passing a CAPTCHA to access their content. As I have written before, these can be solved using the deathbycaptcha API; however, for large websites with many CAPTCHAs this becomes prohibitively expensive. For example, solving 1 million CAPTCHAs with this API would cost $1,390.
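To put that figure in perspective, the arithmetic can be sketched as below. The only input taken from the post is the $1,390 per million figure; the helper name is hypothetical and the per-CAPTCHA rate is simply derived from that figure, assuming the price scales linearly with volume.

```python
# Figures quoted in the post
TOTAL_COST_USD = 1390.0
TOTAL_CAPTCHAS = 1_000_000


def solving_cost(num_captchas):
    """Estimated API cost in USD, assuming linear pricing (an assumption)."""
    return num_captchas * TOTAL_COST_USD / TOTAL_CAPTCHAS


# Cost for a few crawl sizes
for count in (1_000, 50_000, 1_000_000):
    print(count, round(solving_cost(count), 2))
```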

Useful business directories

Business April 02, 2012

Business web directories are a great source of data, and scraping them is a common request from clients. Below is my list of the directories I know of for each country or region. I have noticed that directories for poorer countries often disappear, so let me know if a link no longer works.

Is Web Scraping legal?

Business March 18, 2012

I am often asked whether web scraping is legal, and I always give the same answer: it depends on what you do with the data.

Automating webkit

Python Webkit Qt Example February 14, 2012

I have received some inquiries about using webkit for web scraping, so here is an example using the webscraping module:

Caching data efficiently

Python Cache Sqlite February 10, 2012

When crawling websites I usually cache all HTML on disk to avoid having to re-download it later. I wrote the pdict module to automate this process. Here is an example:
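The pdict listing is not included in this excerpt. As a rough illustration of the idea only, here is a minimal sketch of a dictionary-style cache backed by SQLite. The class name DiskCache and its schema are hypothetical and are not pdict's actual API. It assumes the HTML is stored compressed and keyed by URL.

```python
import sqlite3
import zlib


class DiskCache:
    """Hypothetical dict-like cache of HTML keyed by URL (not pdict's API)."""

    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, html BLOB)"
        )
        self.conn.commit()

    def __setitem__(self, url, html):
        # Compress before storing to keep the database file small
        data = zlib.compress(html.encode("utf-8"))
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (url, html) VALUES (?, ?)", (url, data)
        )
        self.conn.commit()

    def __getitem__(self, url):
        row = self.conn.execute(
            "SELECT html FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            raise KeyError(url)
        return zlib.decompress(row[0]).decode("utf-8")

    def __contains__(self, url):
        row = self.conn.execute(
            "SELECT 1 FROM cache WHERE url = ?", (url,)
        ).fetchone()
        return row is not None


cache = DiskCache(":memory:")  # pass a file path to persist between runs
url = "http://example.com/"
if url not in cache:
    cache[url] = "<html>downloaded content would go here</html>"
print(cache[url])
```

A crawler would check the cache before each download and only fetch the page when the URL is missing.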

How to make python faster

Python Efficiency February 01, 2012

Python and other scripting languages are sometimes dismissed because of their inefficiency compared to compiled languages like C. For example, here are implementations of the Fibonacci sequence in C and Python:
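The original listings are not included in this excerpt. As a stand-in for the Python side only, here is a sketch of two common implementations, assuming the convention fib(0) = 0 and fib(1) = 1. The naive recursive version is the kind typically used in such benchmarks because it does an exponential amount of work, while the iterative version runs in linear time.

```python
def fib_recursive(n):
    """Naive recursion: simple, but recomputes the same values many times."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Linear-time version that keeps only the last two values."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print([fib_iterative(i) for i in range(10)])
```

Choosing a better algorithm, as in the iterative version, often matters more than the choice of language.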

Automatic web scraping

Sitescraper Opensource Big picture January 04, 2012

I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automatically adapt when the structure was updated but the content remained static.

Threading with webkit

Javascript Webkit Qt Python Example Concurrent Efficiency December 30, 2011

In a previous post I showed how to scrape a list of webpages. That approach is fine for small crawls but takes too long for larger ones. Here is an updated example that downloads the content in multiple threads.
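The webkit-based listing is not included in this excerpt. The sketch below illustrates only the general pattern of downloading in multiple threads, using plain HTTP requests from the standard library rather than webkit, so it does not execute JavaScript. The function names and the default of 4 threads are my own assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_url(url, timeout=30):
    """Download a single URL and return the raw response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch_url, num_threads=4):
    """Download every URL using a pool of worker threads.

    Returns a dict mapping each URL to whatever fetch() returned for it.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so results line up with urls
        return dict(zip(urls, pool.map(fetch, urls)))


if __name__ == "__main__":
    pages = download_all(["http://example.com/"])
    for url, body in pages.items():
        print(url, len(body))
```

Threads suit this job because each worker spends most of its time waiting on the network. The fetch parameter makes it possible to swap in a different downloader, such as one that renders pages with webkit.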