Blog
-
Asynchronous support in Python
Python Concurrent Big picture October 20, 2012
This week Guido van Rossum (author of Python) put out a call for experts in asynchronous programming to collaborate on a new API.
-
Automatic web scraping
Sitescraper Opensource Big picture January 04, 2012
I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automatically adapt when the structure was updated but the content remained static.
-
How to teach yourself web scraping
Learn Python Big picture December 03, 2011
I often get asked how to learn about web scraping. Here is my advice.
First learn a popular high-level scripting language. A higher-level language will allow you to work and test ideas faster. You don’t need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And learn a popular one, so that there is already a community of other people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be a good choice.
The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book. Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
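As a rough illustration of the kind of download code the urllib2 module is used for, here is a minimal sketch. It assumes Python 3, where urllib2's functionality lives in urllib.request; the download helper, its parameters, and the "example-scraper" user agent are hypothetical names chosen for this example, not part of any library.

```python
import urllib.request


def download(url, user_agent="example-scraper", timeout=10):
    """Fetch a URL and return the response body as text.

    The function name, user agent string, and timeout are assumptions
    made for this sketch.
    """
    # Build a request with an explicit User-Agent header.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling download with the address of a page returns its HTML as a string, which can then be passed to whatever parsing step comes next. Error handling and retries are left out to keep the sketch short.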
-
Is it possible to extract data from any website?
Big picture August 10, 2011
I am often asked whether I can extract data from a particular website.
-
Typical web scraping job
Big picture Business December 30, 2009
In this post I will try to clarify what web scraping is all about by walking through a typical (though fictional) project.
-
What is web scraping?
Big picture December 20, 2009
The internet contains a huge amount of useful data but most is not easily accessible. Web scraping involves extracting this data from websites into a structured format.
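To make "extracting into a structured format" concrete, here is a small sketch that turns an HTML table into a list of dictionaries using only the Python standard library. The sample HTML, the TableParser class, and the scrape_table function are made up for this illustration; real pages are usually messier and often call for a dedicated parsing library.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a cell of an open row.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def scrape_table(html):
    """Return one dict per data row, keyed by the header row's cells."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample page, standing in for a downloaded website.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

records = scrape_table(SAMPLE_HTML)
```

Each record is a plain dictionary, so the result can be written straight to CSV, JSON, or a database.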