How to teach yourself web scraping
Posted 03 Dec 2011 in big picture, learn, and python

I often get asked how to learn about web scraping. Here is my advice.

First learn a popular high level scripting language. A higher level language will allow you to work and test ideas faster. You don't need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And learn a popular one so that there is already a community of other people working at similar problems so you can reuse their work. I use Python, but Ruby or Perl would also be a good choice.

The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book:

Make sure you learn all the details of the urllib2 module. Here are some additional good resources:

Learn about the HTTP protocol, which is how you will interact with websites.

Learn about regular expressions:

Learn about XPath:

If necessary learn about JavaScript:

These FireFox extensions can make web scraping easier:

Some libraries that can make web scraping easier:

Some other resources:

blog comments powered by Disqus