Proxies can be necessary when web scraping because some websites restrict the number of page downloads from each user. With proxies, your requests appear to come from multiple users, so the chance of being blocked is reduced.
Most people seem to first try collecting their proxies from the various free lists such as this one, and then get frustrated when the proxies stop working. These free lists are not reliable because so many people use them. If this is more than a hobby, it would be a better use of your time to rent your proxies from a provider like packetflip, USA proxies, or proxybonanza.
Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples:
- bob:eakej34@
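To make the format concrete, here is a minimal Python 3 sketch that splits a proxy string into its parts, treating the login details and port as optional. The `Proxy` tuple and `parse_proxy` function are hypothetical names made up for this illustration (they are not part of the webscraping library), and the addresses are placeholders from the reserved documentation range.

```python
from typing import NamedTuple, Optional


class Proxy(NamedTuple):
    """Hypothetical container for the parts of a proxy string."""
    login: Optional[str]
    password: Optional[str]
    host: str
    port: Optional[int]


def parse_proxy(proxy: str) -> Proxy:
    """Split a 'login:password@IP:port' string into its parts.

    The login details and the port are optional, so missing parts
    come back as None.
    """
    login = password = None
    if '@' in proxy:
        # credentials, if present, sit before the last '@'
        credentials, proxy = proxy.rsplit('@', 1)
        login, _, password = credentials.partition(':')
        login = login or None
        password = password or None
    # whatever remains is 'IP' or 'IP:port'
    host, _, port = proxy.partition(':')
    return Proxy(login, password, host, int(port) if port else None)


# placeholder values, not real proxies
print(parse_proxy('user:secret@192.0.2.1:8080'))
print(parse_proxy('192.0.2.1'))
```

This sketch assumes an IPv4 address or hostname; an IPv6 literal would need different handling because it contains colons itself.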
With the webscraping library you can then use the proxies like this:
from webscraping import download
D = download.Download(proxies=proxies, user_agent=user_agent)
html = D.get(url)
The above script will download content through a random proxy from the given list. Here is a standalone version:
import urllib
import urllib2
import gzip
import random
import StringIO

def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return the content
    """
    opener = urllib2.build_opener()
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        if url.lower().startswith('https://'):
            opener.add_handler(urllib2.ProxyHandler({'https' : proxy}))
        else:
            opener.add_handler(urllib2.ProxyHandler({'http' : proxy}))

    # submit these headers with the request
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}

    if isinstance(data, dict):
        # need to post this data
        data = urllib.urlencode(data)
    try:
        response = opener.open(urllib2.Request(url, data, headers))
        content = response.read()
        if response.headers.get('content-encoding') == 'gzip':
            # data came back gzip-compressed so decompress it
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
    except Exception, e:
        # so many kinds of errors are possible here so just catch them all
        print 'Error: %s %s' % (url, e)
        content = None
    return content
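The listing above targets Python 2, where urllib2 and StringIO are available. On Python 3 the same idea might look roughly like the sketch below, which uses only the standard library. The names `make_opener`, `decode_body`, and `fetch_py3` are made up for this illustration, and the sketch registers the chosen proxy for both http and https rather than picking one scheme from the URL.

```python
import gzip
import random
import urllib.parse
import urllib.request


def make_opener(proxies=None):
    """Build an opener that routes requests through one random proxy."""
    if not proxies:
        return urllib.request.build_opener()
    # download through a random proxy from the list
    proxy = random.choice(proxies)
    # register the proxy for both schemes so http and https URLs use it
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    return urllib.request.build_opener(handler)


def decode_body(content, content_encoding):
    """Decompress the body if it came back gzip-compressed."""
    if content_encoding == 'gzip':
        return gzip.decompress(content)
    return content


def fetch_py3(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return bytes, or None on error."""
    # submit these headers with the request
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}
    if isinstance(data, dict):
        # need to post this data; on Python 3 the body must be bytes
        data = urllib.parse.urlencode(data).encode('utf-8')
    try:
        request = urllib.request.Request(url, data, headers)
        with make_opener(proxies).open(request) as response:
            encoding = response.headers.get('Content-Encoding')
            return decode_body(response.read(), encoding)
    except Exception as e:
        # many kinds of errors are possible here, so catch them all
        print('Error: %s %s' % (url, e))
        return None
```

As in the original, a failed download returns None, so the caller can decide whether to try again, and a retry will usually pick a different proxy from the list.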