Posted 06 Dec 2011 in example, javascript, python, qt, and webkit

I made an earlier post about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:

import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
  def __init__(self, urls, cb):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.urls = urls  # queue of URLs still to download
    self.cb = cb  # callback to process each downloaded page
    self.crawl()
    self.app.exec_()

  def crawl(self):
    # load the next URL, or quit when the queue is empty
    if self.urls:
      url = self.urls.pop(0)
      print 'Downloading', url
      self.mainFrame().load(QUrl(url))
    else:
      self.app.quit()

  def _loadFinished(self, result):
    # the current page has finished loading, so pass the
    # rendered HTML to the callback and move on to the next URL
    frame = self.mainFrame()
    url = str(frame.url().toString())
    html = unicode(frame.toHtml())  # convert the QString to unicode
    self.cb(url, html)
    self.crawl()


def scrape(url, html):
    pass # add scraping code here


urls = ['http://webscraping.com', 'http://webscraping.com/blog']  
r = Render(urls, cb=scrape)

The script takes a callback so each page can be processed as soon as it has downloaded, rather than keeping all of the HTML in memory, which is not practical for large crawls. For large crawls you should save the results to disk as you go; I use the pdict module for this.
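If you just want each page written to disk as it arrives, the callback makes that straightforward. Here is a minimal sketch using only the standard library, reusing the urls list from above (the save_to_disk name and the filename scheme are my own, not part of the original script, and pdict's actual interface is not shown here):

import urlparse

def save_to_disk(url, html):
    """Callback that writes each downloaded page to a local file"""
    # build a crude filename from the URL path
    path = urlparse.urlsplit(url).path.strip('/').replace('/', '_') or 'index'
    open(path + '.html', 'w').write(html.encode('utf-8'))

r = Render(urls, cb=save_to_disk)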

Source code is available at my bitbucket account.


Posted 03 Dec 2011 in big picture, learn, and python

I often get asked how to learn about web scraping. Here is my advice.

First learn a popular high-level scripting language. A higher-level language will let you work and test ideas faster. You don't need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And choose a popular language so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be a good choice.

The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book:

Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
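As a quick taste of what urllib2 involves, here is a minimal download with a custom User-Agent header (just a sketch, not one of the linked resources):

import urllib2

# download a page, sending a browser-like User-Agent header
request = urllib2.Request('http://webscraping.com',
                          headers={'User-agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()
print len(html), 'bytes downloaded'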

Learn about the HTTP protocol, which is how you will interact with websites.
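To see what is actually sent over the wire you can drop down to httplib; this sketch makes a single GET request and prints the response status and headers (the URL is just an example):

import httplib

conn = httplib.HTTPConnection('webscraping.com')
conn.request('GET', '/blog', headers={'User-Agent': 'Mozilla/5.0'})
response = conn.getresponse()
print response.status, response.reason
for header, value in response.getheaders():
    print '%s: %s' % (header, value)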

Learn about regular expressions:
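For example, here is a quick (if brittle) way to pull the title out of a page with a regular expression, using a made-up snippet of HTML:

import re

html = '<html><head><title>Web Scraping</title></head><body></body></html>'
# non-greedy match of whatever sits between the title tags
match = re.search('<title>(.*?)</title>', html)
if match:
    print match.group(1)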

Learn about XPath:
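And here is the same kind of extraction with XPath, assuming the lxml library is installed:

import lxml.html

html = '<div><a href="/blog">Blog</a> <a href="/about">About</a></div>'
tree = lxml.html.fromstring(html)
# select the href attribute of every link
for link in tree.xpath('//a/@href'):
    print link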

If necessary learn about JavaScript:

These Firefox extensions can make web scraping easier:

Some libraries that can make web scraping easier:

Some other resources:


Posted 29 Nov 2011 in example and proxies

Proxies can be necessary when web scraping because some websites restrict the number of page downloads from each user. With proxies it looks like your requests come from multiple users so the chance of being blocked is reduced.

Most people seem to first try collecting their proxies from the various free lists such as this one and then get frustrated because the proxies stop working. If this is more than a hobby then it would be a better use of your time to rent your proxies from a provider like packetflip, USA proxies, or proxybonanza. These free lists are not reliable because so many people use them.

Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples:

  • bob:eakej34@66.12.121.140:8000
  • 219.66.12.12
  • 219.66.12.14:8080

With the webscraping library you can then use the proxies like this:

from webscraping import download

proxies = ['bob:eakej34@66.12.121.140:8000', '219.66.12.12', '219.66.12.14:8080']  # the example proxies from above
D = download.Download(proxies=proxies, user_agent='Mozilla/5.0')
html = D.get('http://webscraping.com')

The above script will download content through a random proxy from the given list. Here is a standalone version:

import urllib
import urllib2
import gzip
import random
import StringIO

def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return it"""
    opener = urllib2.build_opener()
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        if url.lower().startswith('https://'):
            opener.add_handler(urllib2.ProxyHandler({'https': proxy}))
        else:
            opener.add_handler(urllib2.ProxyHandler({'http': proxy}))

    # submit these headers with the request
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}

    if isinstance(data, dict):
        # need to POST this data so URL-encode it
        data = urllib.urlencode(data)
    try:
        response = opener.open(urllib2.Request(url, data, headers))
        content = response.read()
        if response.headers.get('content-encoding') == 'gzip':
            # data came back gzip-compressed so decompress it
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
    except Exception, e:
        # so many kinds of errors are possible here so just catch them all
        print 'Error: %s %s' % (url, e)
        content = None
    return content
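For example, you could call it like this, reusing the proxy formats shown earlier (these particular proxies are just the examples from above, not working servers):

html = fetch('http://webscraping.com',
             proxies=['bob:eakej34@66.12.121.140:8000', '219.66.12.14:8080'])
if html:
    print len(html), 'bytes downloaded'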


Posted 06 Nov 2011 in IR, example, and python

I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.

This wastes my time, so I use this snippet to automatically extract whatever email addresses are available:

import sys
from webscraping import common, download

def get_emails(website, max_depth):
    """Return a list of emails found at this website

    max_depth is how deep to follow links
    """
    D = download.Download()
    return D.get_emails(website, max_depth=max_depth)

if __name__ == '__main__':
    try:
        website = sys.argv[1]
        max_depth = int(sys.argv[2])
    except (IndexError, ValueError):
        print 'Usage: %s <URL> <max depth>' % sys.argv[0]
    else:
        print get_emails(website, max_depth)

Example use:

>>> get_emails('http://webscraping.com', 1)  
['contact@webscraping.com']
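The script can also be run from the command line; assuming it is saved as get_emails.py (a filename I have made up here), the equivalent invocation would be:

python get_emails.py http://webscraping.com 1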


Posted 11 Oct 2011 in IR

In a previous post I showed a tool for automatically extracting article summaries. Recently I came across a free online service from instapaper.com that does an even better job.

Here is one of my blog articles:

And here are the results when submitted to instapaper:

And here is a BBC article:

And again the results from instapaper:

Instapaper has not made this service part of their official API, so hopefully they will add it in the future.


Posted 20 Sep 2011 in example, python, qt, screenshot, and webkit

For a recent project I needed to render screenshots of webpages. Here is my solution using webkit:

import sys
import time
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Screenshot(QWebView):
    def __init__(self):
        self.app = QApplication(sys.argv)  # the QApplication must exist before any QWidget
        QWebView.__init__(self)
        self._loaded = False
        self.loadFinished.connect(self._loadFinished)

    def capture(self, url, output_file):
        self.load(QUrl(url))
        self.wait_load()
        # set to webpage size
        frame = self.page().mainFrame()
        self.page().setViewportSize(frame.contentsSize())
        # render image
        image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
        painter = QPainter(image)
        frame.render(painter)
        painter.end()
        print 'saving', output_file
        image.save(output_file)

    def wait_load(self, delay=0):
        # process app events until page loaded
        while not self._loaded:
            self.app.processEvents()
            time.sleep(delay)
        self._loaded = False

    def _loadFinished(self, result):
        self._loaded = True

s = Screenshot()
s.capture('http://webscraping.com', 'website.png')
s.capture('http://webscraping.com/blog', 'blog.png')

Source code is available at my bitbucket account.