I often get asked how to learn about web scraping. Here is my advice.
First learn a popular high level scripting language. A higher level language will allow you to work and test ideas faster. You don’t need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And learn a popular one so that there is already a community of other people working at similar problems so you can reuse their work. I use Python, but Ruby or Perl would also be a good choice.
The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book:
Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
Learn about the HTTP protocol, which is how you will interact with websites.
Learn about regular expressions:
Learn about XPath:
If necessary learn about JavaScript:
These FireFox extensions can make web scraping easier:
Some libraries that can make web scraping easier:
- my webscraping package
- lxml, for processing html
- mechanize, for automating forms
- requests module, which is easier to use and more powerful than urllib2
Some other resources: