The bottleneck for web scraping is generally bandwidth - the time spent waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.
Here are examples of both approaches:
# sequential.py
# a list of 100 webpage URLs to download
urls = [...]
# download each webpage in turn
import urllib
for url in urls:
    urllib.urlopen(url).read()

# concurrent.py
# a list of 100 webpage URLs to download
urls = [...]
import sys
from webscraping import download
# number of threads is passed on the command line
num_threads = int(sys.argv[1])
# disable the cache so each webpage is really downloaded
download.threaded_get(urls=urls, delay=0, num_threads=num_threads,
    read_cache=False, write_cache=False)
Here are the results:
$ time python sequential.py
4m25.602s
$ time python concurrent.py 10
1m7.430s
$ time python concurrent.py 100
0m31.528s
As expected, threading the downloads makes a big difference. You may have noticed that the time saved is not linearly proportional to the number of threads. That is primarily because my web server struggles to keep up with all the requests. When crawling websites with multiple threads, be careful not to overload the web server by downloading too fast. Otherwise the website will become slower for other users and your IP risks being blacklisted.
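For reference, the same thread-pool pattern is available in the Python standard library via concurrent.futures. Below is a minimal sketch (Python 3) of downloading a batch of URLs on a fixed number of worker threads. The fetch function here only simulates a slow download with a sleep so the example is self-contained; in real use it would call urllib.request.urlopen, and unlike threaded_get it has no caching, throttling, or retry logic.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    # placeholder for urllib.request.urlopen(url).read();
    # simulate a slow 0.1 second download instead
    time.sleep(0.1)
    return url

# hypothetical list of 20 URLs for illustration
urls = ['http://example.com/page%d' % i for i in range(20)]

start = time.time()
# run up to 10 downloads at a time; map preserves input order
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.time() - start

# 20 simulated 0.1s downloads on 10 threads take ~0.2s rather than ~2s
print('%d pages downloaded in %.1fs' % (len(results), elapsed))
```

As with threaded_get, the speedup comes from overlapping the waiting time of many requests, so adding threads beyond what the server can serve quickly stops helping.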