Posted 06 Nov 2011 in IR, example, and python

I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.

This wastes my time so I use this snippet to automate extracting the available emails:

import sys
from webscraping import download

def get_emails(website, max_depth):
    """Return a list of emails found at this website.

    max_depth is how many levels of links to follow.
    """
    D = download.Download()
    return D.get_emails(website, max_depth=max_depth)

if __name__ == '__main__':
    try:
        website = sys.argv[1]
        max_depth = int(sys.argv[2])
    except (IndexError, ValueError):
        print 'Usage: %s <URL> <max depth>' % sys.argv[0]
    else:
        print get_emails(website, max_depth)

Example use:

>>> get_emails('http://webscraping.com', 1)  
['contact@webscraping.com']


Posted 11 Oct 2011 in IR

In a previous post I showed a tool for automatically extracting article summaries. Recently I came across a free online service from instapaper.com that does an even better job.

Here is one of my blog articles:

And here are the results when submitted to instapaper:

And here is a BBC article:

And again the results from instapaper:

Instapaper has not made this service public, so hopefully they will add it to their official API in the future.


Posted 20 Sep 2011 in example, python, qt, screenshot, and webkit

For a recent project I needed to render screenshots of webpages. Here is my solution using webkit:

import sys
import time
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Screenshot(QWebView):
    def __init__(self):
        # a QApplication must exist before any QWidget is constructed
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        self._loaded = False
        self.loadFinished.connect(self._loadFinished)

    def capture(self, url, output_file):
        self.load(QUrl(url))
        self.wait_load()
        # set to webpage size
        frame = self.page().mainFrame()
        self.page().setViewportSize(frame.contentsSize())
        # render image
        image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
        painter = QPainter(image)
        frame.render(painter)
        painter.end()
        print 'saving', output_file
        image.save(output_file)

    def wait_load(self, delay=0):
        # process app events until page loaded
        while not self._loaded:
            self.app.processEvents()
            time.sleep(delay)
        self._loaded = False

    def _loadFinished(self, result):
        self._loaded = True

s = Screenshot()
s.capture('http://webscraping.com', 'website.png')
s.capture('http://webscraping.com/blog', 'blog.png')

Source code is available at my bitbucket account.


Posted 10 Aug 2011 in big picture

I am often asked whether I can extract data from a particular website.

And the answer is always yes - if the data is publicly available then it can be extracted. The majority of websites are straightforward to scrape; however, some are more difficult and may not be practical to scrape if you have time or budget restrictions.

For example if the website restricts how many pages each IP address can access then it could take months to download the entire website. In that case I can use proxies to provide me with multiple IP addresses and download the data faster, but this can get expensive if many proxies are required.
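As a rough illustration, here is a minimal sketch of cycling requests through a pool of proxies with urllib2; the proxy addresses and the fetch helper are placeholders for this example, not real servers:

import itertools
import urllib2

# placeholder proxy addresses - substitute real proxies here
proxies = itertools.cycle(['1.2.3.4:8080', '5.6.7.8:8080'])

def fetch(url):
    # route each request through the next proxy in the pool
    proxy = urllib2.ProxyHandler({'http': proxies.next()})
    opener = urllib2.build_opener(proxy)
    return opener.open(url).read()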

If the website uses JavaScript and AJAX to load its data then I usually use a tool like Firebug to reverse engineer how the website works, and then call the appropriate AJAX URLs directly. And if the JavaScript is obfuscated or particularly complicated I can use a browser renderer like webkit to execute the JavaScript and provide me with the final HTML.
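For instance, once Firebug shows which URL an AJAX call requests, that endpoint can often be fetched directly and parsed as JSON. A minimal sketch, where the endpoint and its parameters are made-up examples:

import json
import urllib2

# hypothetical AJAX endpoint discovered with Firebug
url = 'http://example.com/ajax/search?q=test&page=1'
request = urllib2.Request(url, headers={'X-Requested-With': 'XMLHttpRequest'})
data = json.load(urllib2.urlopen(request))
print data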

Another difficulty is if the website uses CAPTCHAs or stores its data in images. Then I would need to try parsing the images with OCR, or hire people (with cheaper hourly rates) to manually interpret the images.
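Here is a rough sketch of the OCR approach, assuming the Pillow and pytesseract packages are installed (neither is used elsewhere in these posts) and that the image is clean enough for Tesseract to read:

from PIL import Image
import pytesseract

# attempt to read the text embedded in an image
image = Image.open('captcha.png')
print pytesseract.image_to_string(image)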

In summary I can always extract publicly available data from a website, but the time and cost required will vary.


Posted 20 Jul 2011 in user-agent

Your web browser will send what is known as a “User Agent” for every page you access. This is a string to tell the server what kind of device you are accessing the page with. Here are some common User Agent strings:

Browser | User Agent
Firefox on Windows XP | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
Chrome on Linux | Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
Internet Explorer on Windows Vista | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
Opera on Windows Vista | Opera/9.00 (Windows NT 5.1; U; en)
Android | Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3
iPhone | Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3
BlackBerry | Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, Like Gecko) Version/6.0.0.141 Mobile Safari/534.1+
Python urllib | Python-urllib/2.1
Old Google Bot | Googlebot/2.1 ( http://www.googlebot.com/bot.html)
New Google Bot | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
MSN Bot | msnbot/1.1 (+http://search.msn.com/msnbot.htm)
Yahoo Bot | Yahoo! Slurp/Site Explorer

You can find your own current User Agent here.

Some webpages will use the User Agent to display content that is customized to your particular browser. For example if your User Agent indicates you are using an old browser then the website may return the plain HTML version without any AJAX features, which may be easier to scrape.

Some websites will automatically block certain User Agents, for example if your User Agent indicates you are accessing their server with a script rather than a regular web browser.

Fortunately it is easy to set your User Agent to whatever you like:

  • For Firefox you can use the User Agent Switcher extension.
  • For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: chromium-browser --user-agent="my custom user agent"
  • For Internet Explorer you can use the UAPick extension.
  • And for Python scripts you can set the User Agent header with:

    import urllib2
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'my custom user agent')]
    opener.open('http://www.google.com')

Using the default User Agent for your scraper is a common reason to be blocked, so don't forget to change it.


Posted 05 Jul 2011 in ajax and mobile

Sometimes a website will have multiple versions: one for regular users with a modern browser, an HTML version for browsers that don’t support JavaScript, and a simplified version for mobile users.

For example Gmail has:

  • the standard AJAX interface at gmail.com
  • a static HTML interface for browsers without JavaScript
  • a simplified mobile interface

All three of these interfaces will display the content of your emails but use different layouts and features. The main entrance at gmail.com is well known for its use of AJAX to load content dynamically without refreshing the page. This leads to a better user experience but makes web automation or scraping harder.

On the other hand the static HTML interface has fewer features and is less efficient for users, but much easier to automate or scrape because all the content is available when the page loads.

So before scraping a website, check whether it has an HTML or mobile version, which, when they exist, should be easier to scrape.

To find the HTML version try disabling JavaScript in your browser and see what happens.
To find the mobile version try adding the “m” subdomain (domain.com -> m.domain.com) or using a mobile user-agent.
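For instance, here is a minimal sketch of requesting a page with a mobile User Agent via urllib2; the iPhone string comes from the table in the earlier post and the target URL is just a placeholder:

import urllib2

# pretend to be an iPhone so the server returns its mobile version, if one exists
mobile_ua = 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3'
request = urllib2.Request('http://example.com', headers={'User-Agent': mobile_ua})
html = urllib2.urlopen(request).read()
print html[:200]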