CSV stands for comma-separated values. It is a plain-text format for tabular data in which columns are separated by commas and rows by newlines. Here is an example CSV file:
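As a minimal sketch, assuming a made-up three-column file (the column names and values are invented for illustration), this is roughly how such data looks and how Python's standard csv module might parse it:

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
# The second data row quotes a field because it contains a comma.
SAMPLE_CSV = 'name,city,age\nAlice,London,30\nBob,"Paris, France",25\n'


def parse_csv(text):
    """Return the rows of a CSV string as a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = parse_csv(SAMPLE_CSV)
for row in rows:
    # Every value is read as a string; convert types yourself if needed.
    print(row["name"], row["city"], row["age"])
```

Using the csv module rather than splitting on commas by hand means quoted fields that contain commas are handled correctly.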
Blog
-
What is CSV?
-
Converting UK Easting / Northing to Latitude / Longitude
Recently I needed to convert a large amount of data between UK Easting / Northing coordinates and Latitude / Longitude. There are web services that support this conversion, but they only permit a few hundred requests per hour, which means it would take weeks to process my quantity of data.
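A conversion like this can also be run locally, which sidesteps the request limit. The sketch below is one possible approach and not necessarily the one used in the post: it applies the Ordnance Survey inverse Transverse Mercator series with the Airy 1830 ellipsoid constants. The function names are hypothetical, and the result is on the OSGB36 datum; a further datum (Helmert) transformation would be needed to get WGS84 / GPS coordinates.

```python
import math

# Airy 1830 ellipsoid and National Grid projection constants (Ordnance Survey).
A = 6377563.396              # semi-major axis (m)
B = 6356256.909              # semi-minor axis (m)
F0 = 0.9996012717            # scale factor on the central meridian
LAT0 = math.radians(49.0)    # latitude of true origin
LON0 = math.radians(-2.0)    # longitude of true origin
N0, E0 = -100000.0, 400000.0 # northing / easting of true origin (m)
E2 = 1 - (B * B) / (A * A)   # eccentricity squared
N_RATIO = (A - B) / (A + B)


def _meridional_arc(lat):
    """Developed arc of the meridian from LAT0 to lat (radians), in metres."""
    n, n2, n3 = N_RATIO, N_RATIO ** 2, N_RATIO ** 3
    d, s = lat - LAT0, lat + LAT0
    return B * F0 * (
        (1 + n + 1.25 * n2 + 1.25 * n3) * d
        - (3 * n + 3 * n2 + 2.625 * n3) * math.sin(d) * math.cos(s)
        + (1.875 * n2 + 1.875 * n3) * math.sin(2 * d) * math.cos(2 * s)
        - (35 / 24) * n3 * math.sin(3 * d) * math.cos(3 * s)
    )


def grid_to_osgb36(easting, northing):
    """Convert National Grid metres to (latitude, longitude) in degrees on OSGB36."""
    # Iterate the footpoint latitude until the arc matches the northing.
    lat, m = LAT0, 0.0
    while True:
        lat += (northing - N0 - m) / (A * F0)
        m = _meridional_arc(lat)
        if abs(northing - N0 - m) < 1e-5:
            break

    sin2 = math.sin(lat) ** 2
    nu = A * F0 / math.sqrt(1 - E2 * sin2)             # transverse radius
    rho = A * F0 * (1 - E2) / (1 - E2 * sin2) ** 1.5   # meridional radius
    eta2 = nu / rho - 1
    t = math.tan(lat)
    t2, t4, t6 = t ** 2, t ** 4, t ** 6
    sec = 1 / math.cos(lat)
    de = easting - E0

    # Series coefficients, named as in the Ordnance Survey guide.
    vii = t / (2 * rho * nu)
    viii = t / (24 * rho * nu ** 3) * (5 + 3 * t2 + eta2 - 9 * t2 * eta2)
    ix = t / (720 * rho * nu ** 5) * (61 + 90 * t2 + 45 * t4)
    x = sec / nu
    xi = sec / (6 * nu ** 3) * (nu / rho + 2 * t2)
    xii = sec / (120 * nu ** 5) * (5 + 28 * t2 + 24 * t4)
    xiia = sec / (5040 * nu ** 7) * (61 + 662 * t2 + 1320 * t4 + 720 * t6)

    lat_out = lat - vii * de ** 2 + viii * de ** 4 - ix * de ** 6
    lon_out = LON0 + x * de - xi * de ** 3 + xii * de ** 5 - xiia * de ** 7
    return math.degrees(lat_out), math.degrees(lon_out)


# Example call with an arbitrary grid reference in eastern England.
lat, lon = grid_to_osgb36(651409.903, 313177.270)
print(round(lat, 5), round(lon, 5))
```

Because the calculation is pure arithmetic, it processes large datasets without any network calls. For production use, a maintained projection library is likely a safer choice than a hand-rolled series.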
-
Solving CAPTCHA with OCR
Python Captcha Ocr Example May 05, 2012
Some websites require passing a CAPTCHA to access their content. As I have written before, these can be solved using the deathbycaptcha API; however, for large websites with many CAPTCHAs this becomes prohibitively expensive. For example, solving 1 million CAPTCHAs with this API would cost $1390.
-
Useful business directories
Business April 02, 2012
Business web directories are a great source of data, and scraping data from them is a common request from clients. Below is my list of the directories I know of for each country or region. I have noticed that directories for poorer countries often disappear, so let me know if a link no longer works.
-
Is Web Scraping legal?
Business March 18, 2012
I am often asked whether web scraping is legal, and I always respond the same way: it depends on what you do with the data.
-
Automating webkit
Python Webkit Qt Example February 14, 2012
I have received some inquiries about using webkit for web scraping, so here is an example using the webscraping module:
-
Caching data efficiently
Python Cache Sqlite February 10, 2012
When crawling websites I usually cache all HTML on disk to avoid having to re-download it later. I wrote the pdict module to automate this process. Here is an example:
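The idea of a persistent, dictionary-style cache can be sketched with the standard sqlite3 module. This is a hypothetical stand-in written for illustration and does not reproduce the pdict module's actual API; the class name, table layout, and the `fetch` callable are all assumptions.

```python
import sqlite3


class DiskCache:
    """Dict-like url -> HTML store backed by SQLite (hypothetical, not pdict's API)."""

    def __init__(self, filename=":memory:"):
        # Pass a file path to persist between runs; ":memory:" is for demos.
        self.conn = sqlite3.connect(filename)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, html TEXT)"
        )
        self.conn.commit()

    def __contains__(self, url):
        row = self.conn.execute(
            "SELECT 1 FROM cache WHERE url = ?", (url,)
        ).fetchone()
        return row is not None

    def __getitem__(self, url):
        row = self.conn.execute(
            "SELECT html FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            raise KeyError(url)
        return row[0]

    def __setitem__(self, url, html):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (url, html) VALUES (?, ?)", (url, html)
        )
        self.conn.commit()


def cached_download(url, cache, fetch):
    """Return HTML for url, calling fetch(url) only on a cache miss."""
    if url in cache:
        return cache[url]
    html = fetch(url)
    cache[url] = html
    return html
```

With this in place, a crawler would call `cached_download(url, cache, fetch)` everywhere it previously downloaded directly, so repeated runs read from disk instead of the network.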
-
How to make Python faster
Python Efficiency February 01, 2012
Python and other scripting languages are sometimes dismissed because of their inefficiency compared to compiled languages like C. For example, here are implementations of the Fibonacci sequence in C and Python:
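Only the Python side is sketched here, and it may differ from the code in the full post. The naive recursive version is the kind of function typically used for such benchmarks because it makes an exponential number of calls; the iterative version is included as an assumption-free baseline for checking results. Both function names are illustrative.

```python
import timeit


def fib_recursive(n):
    """Naive recursive Fibonacci: exponential number of calls, handy as a benchmark."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Linear-time Fibonacci using two running values."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Time a few runs of each; absolute numbers depend on the machine.
slow = timeit.timeit(lambda: fib_recursive(20), number=5)
fast = timeit.timeit(lambda: fib_iterative(20), number=5)
print("recursive: %.4fs, iterative: %.6fs" % (slow, fast))
```

Much of the gap between interpreted and compiled code in this kind of test comes from function-call overhead, so changing the algorithm is often the cheapest speed-up available.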
-
Automatic web scraping
Sitescraper Opensource Big picture January 04, 2012
I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automatically adapt when the structure was updated but the content remained static.
-
Threading with webkit
Javascript Webkit Qt Python Example Concurrent Efficiency December 30, 2011
In a previous post I showed how to scrape a list of webpages. That is fine for small crawls but will take too long otherwise. Here is an updated example that downloads the content in multiple threads.
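The general pattern of downloading in multiple threads can be sketched with the standard library alone. Note the substitution: this uses plain HTTP requests through urllib and a thread pool, not webkit, so it will not execute JavaScript as the webkit-based example would. The function names and the `num_threads` parameter are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return the raw response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetcher=fetch, num_threads=5):
    """Download urls concurrently and return a dict of url -> content.

    fetcher is injectable so the function can be exercised without network
    access. If any download raises, the exception propagates to the caller.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so zip pairs each url with its result.
        return dict(zip(urls, pool.map(fetcher, urls)))


# Example usage (performs real network requests, so it is left commented out):
# pages = download_all(["http://example.com/a", "http://example.com/b"])
```

Threads suit this workload because the time is spent waiting on the network rather than computing, so several downloads can overlap even under Python's global interpreter lock.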