I often get asked how to learn about web scraping. Here is my advice.
First, learn a popular high-level scripting language. A higher-level language will let you develop and test ideas faster. You don’t need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And choose a popular one so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be a good choice.
The following advice will assume you want to use Python for web scraping.
If you have some programming experience, then I recommend working through the Dive Into Python book:
Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
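As a quick taste of what that module does (urllib2 was renamed urllib.request in Python 3), a minimal download function might look like this sketch; the function name and User-Agent string are my own:

```python
import urllib.request


def fetch(url, user_agent="Mozilla/5.0 (compatible; example-bot)"):
    """Download a URL and return the body as text.

    A User-Agent header is set explicitly because some websites
    reject the default Python one.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, fetch("http://example.com") returns the page HTML as a string.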
Learn about the HTTP protocol, which is how you will interact with websites.
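To make the protocol concrete, here is a rough sketch of the raw text a client sends when requesting a page (the User-Agent value is just an example):

```python
def build_get_request(host, path="/"):
    """Return the raw text of a minimal HTTP/1.1 GET request, which is
    what libraries like urllib2 send over the socket for you."""
    lines = [
        "GET %s HTTP/1.1" % path,
        "Host: %s" % host,
        "User-Agent: example-scraper/1.0",
        "Connection: close",
        "",  # a blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines)
```

The server replies with a status line (such as "HTTP/1.1 200 OK"), its own headers, a blank line, and then the HTML body you want to scrape.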
Learn about regular expressions:
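Regular expressions are fragile for parsing full HTML, but handy for quick extractions. A small sketch (the sample HTML is made up):

```python
import re

html = '<a href="/about">About</a> <a href="/contact">Contact us</a>'

# with a single capture group, findall returns just the captured URLs
links = re.findall(r'<a href="(.*?)">', html)
# links is now ['/about', '/contact']
```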
Learn about XPath:
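XPath lets you address elements by their position and attributes rather than by text patterns. lxml supports full XPath; even the standard library's ElementTree understands a useful subset, as in this sketch (the sample document is made up):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<div id='nav'><p>Menu</p></div>"
    "<div id='main'><p>Hello world</p></div>"
    "</body></html>"
)

# select the paragraph inside the div whose id attribute is 'main'
paragraph = doc.find(".//div[@id='main']/p")
```

Here paragraph.text is 'Hello world', regardless of how many other paragraphs the page contains.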
These Firefox extensions can make web scraping easier:
Some libraries that can make web scraping easier:
- my webscraping package
- lxml, for processing HTML
- mechanize, for automating forms
- requests module, which is easier to use and more powerful than urllib2
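To show how these pieces fit together, here is a sketch of a typical scrape: download a page, then pull out one field. I parse with the standard library's HTMLParser so the example is self-contained, though lxml would be the usual (and faster) choice; the class and function names are my own:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the text of the page's <title> element."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


if __name__ == "__main__":
    import requests  # pip install requests
    print(extract_title(requests.get("http://example.com").text))
```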
Some other resources:
Proxies can be necessary when web scraping because some websites restrict the number of pages each user can download. With proxies your requests appear to come from multiple users, so the chance of being blocked is reduced.
Most people seem to first collect their proxies from the various free lists, such as this one, and then get frustrated when the proxies stop working. These free lists are unreliable because so many people use them. If this is more than a hobby, it is a better use of your time to rent proxies from a provider like packetflip, USA proxies, or proxybonanza.
Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples:
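For instance, 66.12.121.140:8000 (no login) or bob:secret@66.12.121.140:8000 (with login details); the addresses here are made up. A small sketch for splitting such strings into their parts (parse_proxy is my own name):

```python
import re


def parse_proxy(proxy):
    """Split a login:password@IP:port proxy string into its parts.

    The login details and port are optional, so 66.12.121.140,
    66.12.121.140:8000 and bob:secret@66.12.121.140:8000 all parse.
    """
    match = re.match(
        r"^(?:(?P<login>[^:@]+):(?P<password>[^@]+)@)?"
        r"(?P<ip>[^:@]+)(?::(?P<port>\d+))?$",
        proxy,
    )
    if not match:
        raise ValueError("invalid proxy: %s" % proxy)
    return match.groupdict()
```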
With the webscraping library you can then use the proxies like this:
The above script will download content through a random proxy from the given list. Here is a standalone version:
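One way to write such a standalone version, sketched here with only the standard library (make_opener and download are my own names, and the proxies and User-Agent are placeholders):

```python
import random
import urllib.request


def make_opener(proxies):
    """Build an opener that routes requests through a proxy picked at
    random from the given list; return the opener and the chosen proxy."""
    proxy = random.choice(proxies)
    handler = urllib.request.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    return urllib.request.build_opener(handler), proxy


def download(url, proxies):
    """Download url through a random proxy from the list, so repeated
    calls spread the requests across all the proxies."""
    opener, _proxy = make_opener(proxies)
    request = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0 (compatible; example)"})
    return opener.open(request).read()
```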
I often find that businesses hide their contact details behind layers of navigation, I guess to cut down on support costs.
This wastes my time, so I use this snippet to automate extracting the available emails:
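The core of such a snippet is a regular expression that matches e-mail addresses in the downloaded HTML, along these lines (a sketch, not the exact snippet; the function name and sample text are made up):

```python
import re

# a pragmatic pattern: word characters, dots, plus and hyphen before
# the @, then a dotted domain after it
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(html):
    """Return the unique e-mail addresses found in html, in page order."""
    seen = []
    for email in EMAIL_RE.findall(html):
        if email not in seen:
            seen.append(email)
    return seen
```

For example, extract_emails('Mail <a href="mailto:sales@example.com">sales@example.com</a> or support@example.com') returns ['sales@example.com', 'support@example.com'].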
In a previous post I showed a tool for automatically extracting article summaries. Recently I came across a free online service from instapaper.com that does an even better job.
Here is one of my blog articles:
And here are the results when submitted to instapaper:
And here is a BBC article:
And again the results from instapaper:
Instapaper has not made this service public, so hopefully they will add it to their official API in the future.
For a recent project I needed to render screenshots of webpages. Here is my solution using WebKit:
Source code is available at my bitbucket account.
I am often asked whether I can extract data from a particular website.
And the answer is always yes: if the data is publicly available, then it can be extracted. The majority of websites are straightforward to scrape; however, some are more difficult and may not be practical to scrape if you have time or budget restrictions.
For example, if a website restricts how many pages each IP address can access, then it could take months to download the entire site. In that case I can use proxies to provide multiple IP addresses and download the data faster, but this can get expensive if many proxies are required.
Another difficulty is when a website uses CAPTCHAs or stores its data in images. Then I would need to parse the images with OCR or hire people (at cheaper hourly rates) to interpret the images manually.
In summary, I can always extract publicly available data from a website, but the time and cost required will vary.