I often get asked how to learn about web scraping. Here is my advice.

First learn a popular high-level scripting language. A higher-level language will let you develop and test ideas faster. You don't need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And choose a popular one so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be good choices.

The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book:

Make sure you learn all the details of the urllib2 module. Here are some additional good resources:

Learn about the HTTP protocol, which is how you will interact with websites.
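To make this concrete, here is a minimal sketch of what an HTTP/1.1 exchange looks like on the wire (the hostname, path, and response body below are invented for illustration):

```python
# A raw HTTP/1.1 GET request is plain text: a request line, headers,
# then a blank line (\r\n\r\n) that marks the end of the headers.
request = (
    'GET /index.html HTTP/1.1\r\n'
    'Host: example.com\r\n'
    'User-Agent: Mozilla/5.0\r\n'
    'Accept-Encoding: gzip\r\n'
    '\r\n'
)

# A response mirrors that shape: status line, headers, blank line, body.
response = ('HTTP/1.1 200 OK\r\n'
            'Content-Type: text/html\r\n'
            '\r\n'
            '<html><body>Hello</body></html>')

# splitting on the first blank line separates headers from body
head, body = response.split('\r\n\r\n', 1)
status_line = head.split('\r\n')[0]
version, code, reason = status_line.split(' ', 2)
```

Libraries like urllib2 build and parse these messages for you, but knowing the underlying format helps when debugging headers, cookies, and compression.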

Learn about regular expressions:
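As a quick taste, here is a common (if fragile, compared to a real HTML parser) scraping use of regular expressions: pulling link targets out of a page. The HTML snippet is just a made-up example:

```python
import re

html = '<a href="/blog">Blog</a> and <a href="http://example.com">Example</a>'

# findall returns the contents of the capturing group for each match,
# here the value of each href attribute
links = re.findall(r'<a\s+href="([^"]+)"', html)
```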

Learn about XPath:
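Python's standard library supports a small subset of XPath through xml.etree.ElementTree, which is enough to see the idea (serious scraping work usually uses lxml for full XPath support; the document below is an invented example):

```python
from xml.etree import ElementTree

doc = ElementTree.fromstring(
    '<html><body>'
    "<div class='item'><span>first</span></div>"
    "<div class='item'><span>second</span></div>"
    '</body></html>'
)

# .//div[@class='item']/span selects each span inside a matching div,
# regardless of where the div sits in the tree
texts = [span.text for span in doc.findall(".//div[@class='item']/span")]
```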

If necessary learn about JavaScript:

These Firefox extensions can make web scraping easier:

Some libraries that can make web scraping easier:

Some other resources:

Proxies can be necessary when web scraping because some websites restrict the number of page downloads from each user. With proxies it looks like your requests come from multiple users so the chance of being blocked is reduced.

Most people seem to first try collecting their proxies from the various free lists such as this one, and then get frustrated when the proxies stop working. These free lists are unreliable because so many people use them. If this is more than a hobby then it would be a better use of your time to rent your proxies from a provider like packetflip, USA proxies, or proxybonanza.

Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples:

  • bob:eakej34@

With the webscraping library you can then use the proxies like this:

from webscraping import download

D = download.Download(proxies=proxies, user_agent=user_agent)
html = D.get(url)

The above script will download content through a random proxy from the given list. Here is a standalone version:

import urllib
import urllib2
import gzip
import random
import StringIO

def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return the content"""
    opener = urllib2.build_opener()
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        if url.lower().startswith('https://'):
            opener.add_handler(urllib2.ProxyHandler({'https': proxy}))
        else:
            opener.add_handler(urllib2.ProxyHandler({'http': proxy}))
    # submit these headers with the request
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}
    if isinstance(data, dict):
        # need to post this data
        data = urllib.urlencode(data)
    try:
        response = opener.open(urllib2.Request(url, data, headers))
        content = response.read()
        if response.headers.get('content-encoding') == 'gzip':
            # data came back gzip-compressed so decompress it
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
    except Exception, e:
        # many kinds of errors are possible here so just catch them all
        print 'Error: %s %s' % (url, e)
        content = None
    return content

I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.

This wastes my time so I use this snippet to automate extracting the available emails:

import sys
from webscraping import common, download

def get_emails(website, max_depth):
    """Returns a list of emails found at this website

    max_depth is how deep to follow links
    """
    D = download.Download()
    return D.get_emails(website, max_depth=max_depth)

if __name__ == '__main__':
    if len(sys.argv) == 3:
        website = sys.argv[1]
        max_depth = int(sys.argv[2])
        print get_emails(website, max_depth)
    else:
        print 'Usage: %s <URL> <max depth>' % sys.argv[0]

Example use:

>>> get_emails('http://webscraping.com', 1)  

In a previous post I showed a tool for automatically extracting article summaries. Recently I came across a free online service from instapaper.com that does an even better job.

Here is one of my blog articles:

And here are the results when submitted to instapaper:

And here is a BBC article:

And again the results from instapaper:

Instapaper has not made this service public, so hopefully they will add it to their official API in the future.

For a recent project I needed to render screenshots of webpages. Here is my solution using webkit:

import sys
import time
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Screenshot(QWebView):
    def __init__(self):
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        self._loaded = False
        self.loadFinished.connect(self._loadFinished)

    def capture(self, url, output_file):
        self.load(QUrl(url))
        self.wait_load()
        # set to webpage size
        frame = self.page().mainFrame()
        self.page().setViewportSize(frame.contentsSize())
        # render image
        image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
        painter = QPainter(image)
        frame.render(painter)
        painter.end()
        print 'saving', output_file
        image.save(output_file)

    def wait_load(self, delay=0):
        # process app events until page loaded
        while not self._loaded:
            self.app.processEvents()
            time.sleep(delay)
        self._loaded = False

    def _loadFinished(self, result):
        self._loaded = True

s = Screenshot()
s.capture('http://webscraping.com', 'website.png')
s.capture('http://webscraping.com/blog', 'blog.png')

Source code is available at my bitbucket account.

I am often asked whether I can extract data from a particular website.

And the answer is always yes - if the data is publicly available then it can be extracted. The majority of websites are straightforward to scrape; however, some are more difficult and may not be practical to scrape if you have time or budget restrictions.

For example if the website restricts how many pages each IP address can access then it could take months to download the entire website. In that case I can use proxies to provide multiple IP addresses and download the data faster, though this can get expensive if many proxies are required.
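To see why this matters, here is a back-of-the-envelope estimate (the numbers are made up purely for illustration): a site with 1,000,000 pages that allows 500 requests per day per IP address.

```python
total_pages = 1000000    # pages on the site (illustrative)
requests_per_day = 500   # per-IP rate limit (illustrative)

# with a single IP address: over five years of downloading
days_single = total_pages / float(requests_per_day)

# spreading the requests across 100 proxies cuts that proportionally
num_proxies = 100
days_with_proxies = days_single / num_proxies
```

With one IP the crawl would take 2000 days; with 100 proxies it drops to 20 days, which is where the cost of renting proxies trades off against time.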

If the website uses JavaScript and AJAX to load their data then I usually use a tool like Firebug to reverse engineer how the website works, and then call the appropriate AJAX URLs directly. And if the JavaScript is obfuscated or particularly complicated I can use a browser renderer like webkit to execute the JavaScript and provide me with the final HTML.
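For instance, many AJAX-driven sites fetch their data as JSON from a paginated endpoint. Once Firebug reveals the URL pattern, you can build those URLs directly and parse the JSON, skipping the rendered HTML entirely. The endpoint and payload below are invented for illustration:

```python
import json

# the kind of URL pattern Firebug might reveal (hypothetical endpoint)
ajax_url = 'http://example.com/search.json?q=%s&page=%d' % ('laptops', 2)

# the kind of payload such an endpoint might return (hypothetical sample);
# in practice this string would come from downloading ajax_url
payload = '{"num_results": 2, "results": [{"title": "Laptop A"}, {"title": "Laptop B"}]}'
data = json.loads(payload)
titles = [r['title'] for r in data['results']]
```

Parsing JSON like this is usually far more robust than scraping the equivalent HTML, since the structure is explicit.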

Another difficulty is when the website uses CAPTCHAs or stores its data in images. Then I would need to parse the images with OCR, or hire people (with cheaper hourly costs) to interpret the images manually.

In summary I can always extract publicly available data from a website, but the time and cost required will vary.