Proxies are often necessary when web scraping because some websites restrict the number of page downloads from each IP address. With proxies your requests appear to come from multiple users, so the chance of being blocked is reduced.

Most people seem to first try collecting proxies from the various free lists available online and then get frustrated when those proxies stop working. These free lists are unreliable because so many people use them. If this is more than a hobby, it is a better use of your time to rent proxies from a provider such as packetflip, USA proxies, or proxybonanza.

Each proxy has the format login:password@IP:port
The login details and port are optional. Here are some examples:

  • bob:eakej34@66.12.121.140:8000
  • 219.66.12.12
  • 219.66.12.14:8080

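In Python the proxies are simply a list of strings in this format. Here is a minimal sketch for building that list, assuming your proxies are stored one per line in a file called proxies.txt (a hypothetical filename), along with a browser-like user agent string:

# proxies.txt is a hypothetical file with one proxy per line
proxies = [line.strip() for line in open('proxies.txt') if line.strip()]
user_agent = 'Mozilla/5.0'  # any browser-like user agent string
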
With the webscraping library you can then use the proxies like this:

from webscraping import download
D = download.Download(proxies=proxies, user_agent=user_agent)
html = D.get(url)

The above script will download content through a random proxy from the given list. Here is a standalone version:

import urllib
import urllib2
import gzip
import random
import StringIO

def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return it"""
    opener = urllib2.build_opener()
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        if url.lower().startswith('https://'):
            opener.add_handler(urllib2.ProxyHandler({'https': proxy}))
        else:
            opener.add_handler(urllib2.ProxyHandler({'http': proxy}))

    # submit these headers with the request
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}

    if isinstance(data, dict):
        # need to post this data
        data = urllib.urlencode(data)
    try:
        response = opener.open(urllib2.Request(url, data, headers))
        content = response.read()
        if response.headers.get('content-encoding') == 'gzip':
            # data came back gzip-compressed so decompress it
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
    except Exception, e:
        # so many kinds of errors are possible here so just catch them all
        print 'Error: %s %s' % (url, e)
        content = None
    return content
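
Calling it looks like this, where the proxy values and URL are just placeholders:

proxies = ['bob:eakej34@66.12.121.140:8000', '219.66.12.14:8080']
html = fetch('http://example.com', proxies=proxies)
if html:
    print html[:100]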