Python and other scripting languages are sometimes dismissed because of their inefficiency compared to compiled languages like C. For example, here are implementations of the Fibonacci sequence in C and Python:
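The Python version is just the naive recursive definition (the C version implements the same algorithm); here is a sketch, with the value of n chosen only for illustration - the exact value used for the timings below may differ:

```python
# naive recursive Fibonacci - deliberately inefficient, to stress function call overhead
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(35))  # illustrative input; large enough to take a few seconds
```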
And here are the execution times:
    $ time ./fib
    3.099s
    $ time python fib.py
    16.655s
As expected, C has a much faster execution time - around 5x faster in this case.
In the context of web scraping, execution speed is less important because the bottleneck is I/O - downloading the webpages. But I use Python in other contexts too, so let’s see if we can do better.
First install psyco. On Linux this is just:
    sudo apt-get install python-psyco
Then modify the Python script to call psyco:
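In the simplest case this is just two extra lines at the top of the script, asking psyco to compile every function (assuming the naive recursive version shown earlier):

```python
import psyco
psyco.full()  # compile all functions to machine code on the fly

def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(35))
```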
And here is the updated execution time:
    $ time python fib.py
    3.190s
Just over 3 seconds - with psyco the execution time is now equivalent to the C example! Psyco achieves this by compiling the Python code to machine code on the fly rather than interpreting it line by line.
I now add the below snippet to most of my Python scripts to take advantage of psyco when installed:
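Something along these lines - try to use psyco and fall back silently to the plain interpreter when it is not installed:

```python
# use psyco to speed up execution, if it is available
try:
    import psyco
    psyco.full()
except ImportError:
    pass
```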
I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automatically adapt when the structure was updated but the content remained static.
However this approach is not helpful for me these days because most of my work involves scraping a website once-off. It is quicker to just specify the required XPaths than to collect and test training cases.
I would still like an automated approach to help me work more efficiently. Ideally I would have a solution that, when given a website URL:

- crawls the website
- organizes the webpages into groups that share the same template (a directory page will have a different HTML structure than a listing page)
- treats the group with the largest number of webpages as the listings
- compares these listing webpages to find what is static (the template) and what changes (a rough sketch of this step follows the list)
- extracts the parts that change, which represent the dynamic data such as descriptions, reviews, etc.
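To illustrate the comparison step, here is a small sketch that diffs two pages built from the same template to locate the changing regions. The function name and the use of difflib are my own choices for illustration, not part of any existing tool:

```python
import difflib

def changing_regions(html1, html2):
    """Return the substrings that differ between two pages built from the same template.

    The matching blocks are the template; the gaps between them are the dynamic data.
    """
    matcher = difflib.SequenceMatcher(None, html1, html2)
    regions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':
            regions.append((html1[i1:i2], html2[j1:j2]))
    return regions

# example: the title and price differ, the surrounding markup is the template
page1 = '<html><h1>Product A</h1><p>Price: $10</p></html>'
page2 = '<html><h1>Product B</h1><p>Price: $12</p></html>'
print(changing_regions(page1, page2))
```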
Apparently this process of scraping data automatically is known as wrapper induction in academia. Unfortunately there do not seem to be any good open source solutions yet. The most commonly referenced one is Templatemaker, which is aimed at small blocks of text and crashed on my test cases of real webpages; the author stopped development in 2007.
Some commercial groups have developed their own solutions so this certainly is technically possible:
If I do not find an open source solution I plan to attempt building my own later this year.
In a previous post I showed how to scrape a list of webpages. That is fine for small crawls but will take too long otherwise. Here is an updated example that downloads the content in multiple threads.
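The full script (linked below) handles more details; a minimal sketch of the approach, written in the Python 2 idiom of the time (urllib2, threading, Queue) with a function name of my own choosing, looks like this:

```python
import urllib2
import threading
import Queue

def threaded_download(urls, num_threads=5):
    """Download each URL in a pool of worker threads and return {url: html}."""
    queue = Queue.Queue()
    for url in urls:
        queue.put(url)
    results = {}

    def worker():
        while True:
            try:
                url = queue.get(block=False)
            except Queue.Empty:
                break  # no more URLs to download
            try:
                # dict assignment is atomic enough in CPython for this sketch
                results[url] = urllib2.urlopen(url).read()
            except urllib2.URLError:
                results[url] = None
            queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```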
Source code is available at my bitbucket account.
This is a simple solution that keeps all the HTML in memory, which is not practical for large crawls; for those you should save the results to disk. I use the pdict module for this.
I have updated the script to take a callback for processing each download immediately, rather than storing the HTML in memory.
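A sketch of that change - the worker hands each result to the callback as soon as it arrives instead of accumulating everything:

```python
import urllib2
import threading
import Queue

def threaded_download(urls, callback, num_threads=5):
    """Download each URL in a worker thread and pass (url, html) to callback as it arrives."""
    queue = Queue.Queue()
    for url in urls:
        queue.put(url)

    def worker():
        while True:
            try:
                url = queue.get(block=False)
            except Queue.Empty:
                break
            try:
                html = urllib2.urlopen(url).read()
            except urllib2.URLError:
                html = None
            callback(url, html)  # process immediately rather than storing in memory

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```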
Source code is available at my bitbucket account.
I often get asked how to learn about web scraping. Here is my advice.
First learn a popular high level scripting language. A higher level language will let you develop and test ideas faster. You don’t need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And choose a popular language so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be a good choice.
The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book:
Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
Learn about the HTTP protocol, which is how you will interact with websites.
Learn about regular expressions:
Learn about XPath:
These Firefox extensions can make web scraping easier:
Some libraries that can make web scraping easier:
- my webscraping package
- lxml, for processing html
- mechanize, for automating forms
- requests module, which is easier to use and more powerful than urllib2 (a small example follows this list)
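As a small illustration of why these libraries help, here is a sketch that fetches a page with requests and extracts its links with lxml; the URL is just a placeholder:

```python
import requests
import lxml.html

# download a page and parse it into an element tree
response = requests.get('http://example.com/')
tree = lxml.html.fromstring(response.text)

# extract all link targets with an XPath query
for href in tree.xpath('//a/@href'):
    print(href)
```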
Some other resources:
Proxies can be necessary when web scraping because some websites restrict the number of page downloads from each user. With proxies your requests appear to come from multiple users, so the chance of being blocked is reduced.
Most people seem to first try collecting their proxies from the various free lists such as this one, and then get frustrated when the proxies stop working - these free lists are not reliable because so many people use them. If this is more than a hobby then your time would be better spent renting proxies from a provider like packetflip, USA proxies, or proxybonanza.
Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples:
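    203.0.113.15:8000
    bob:secret123@203.0.113.15:8000

(these addresses are invented, purely to show the format)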
With the webscraping library you can then use the proxies like this:
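Roughly like this - the exact keyword argument for passing the proxy list may differ in the released version of the package, so check its documentation:

```python
from webscraping import download

proxies = ['bob:secret123@203.0.113.15:8000', '203.0.113.16:8000']

# Download is given the list of proxies and rotates through them for each request
# (the proxies keyword is an assumption - see the package documentation)
D = download.Download(proxies=proxies)
html = D.get('http://example.com/')
```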
The above script will download content through a random proxy from the given list. Here is a standalone version:
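And a standalone sketch using only the standard library (Python 2's urllib2), with a hypothetical download helper that picks a proxy at random for each request:

```python
import random
import urllib2

def download(url, proxies):
    """Download a URL through a proxy chosen at random from the given list."""
    proxy = random.choice(proxies)
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
    return opener.open(url).read()

proxies = ['bob:secret123@203.0.113.15:8000', '203.0.113.16:8000']
html = download('http://example.com/', proxies)
```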