Posted 18 Mar 2012 in business

I am often asked whether web scraping is legal, and I always give the same answer: it depends on what you do with the data.

If the data is just for private use then in practice this is fine. However, if you intend to republish the scraped data then you need to consider what type of data it is.

The US Supreme Court case Feist Publications vs Rural Telephone Service established that scraping and republishing facts like telephone listings is allowed. A similar case in Australia, Telstra vs Phone Directories, concluded that data cannot be copyrighted if there is no identifiable author. And in the European Union the case ofir.dk vs home.dk decided that regularly crawling and deep linking is permissible.

So if the scraped data constitutes facts (telephone listings, business locations, etc) then it can be republished. But if the data is original (articles, discussions, etc) then you need to be more careful.

Fortunately most clients who contact me are interested in the former type of data.

Web scraping is the wild west, so laws and precedents are still being developed. And I am not a lawyer.


Posted 14 Feb 2012 in example, python, qt, and webkit

I have received some inquiries about using webkit for web scraping, so here is an example using the webscraping module:

from webscraping import webkit
w = webkit.WebkitBrowser(gui=True) 
# load webpage
w.get('http://duckduckgo.com')
# fill search textbox 
w.fill('input[id=search_form_input_homepage]', 'sitescraper')
# take screenshot of browser
w.screenshot('duckduckgo_search.jpg')
# click search button 
w.click('input[id=search_button_homepage]')
# wait on results page
w.wait(10)
# take another screenshot
w.screenshot('duckduckgo_results.jpg')

The two screenshots saved are duckduckgo_search.jpg, showing the search box filled in on the homepage, and duckduckgo_results.jpg, showing the results page.

I often use webkit when working with websites that rely heavily on JavaScript.
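
If you also need the rendered HTML rather than just screenshots, the page can be captured after its JavaScript has run. Here is a minimal sketch that uses PyQt4 directly instead of the webscraping module; the URL is only an example:

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView

app = QApplication(sys.argv) # must exist before creating any Qt widgets

class HtmlRender(QWebView):
    def __init__(self, url):
        QWebView.__init__(self)
        self.html = None
        self.loadFinished.connect(self._loadFinished)
        self.load(QUrl(url))

    def _loadFinished(self, result):
        # the page, including any JavaScript generated content, has finished loading
        self.html = unicode(self.page().mainFrame().toHtml())
        app.quit()

r = HtmlRender('http://duckduckgo.com')
app.exec_() # run the Qt event loop until _loadFinished() calls quit()
print len(r.html)

The 30 Dec 2011 post below extends this pattern to download several webpages concurrently.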

Source code is available on bitbucket.


Posted 10 Feb 2012 in cache, python, and sqlite

When crawling websites I usually cache all HTML on disk to avoid having to re-download later. I wrote the pdict module to automate this process. Here is an example:

import pdict
# create the cache
cache = pdict.PersistentDict('test.db')

# compress and store the downloaded html in the database
cache[url] = html

# iterate over everything stored in the database
for key in cache:
    print cache[key]

The bottleneck here is the insertions, so for efficiency records can be buffered and then inserted in a single transaction:

# dictionary of data to insert
data = {...}

# cache each record individually (2m49.827s)
cache = pdict.PersistentDict('test.db', max_buffer_size=0)
for k, v in data.items():
    cache[k] = v

# cache all records in a single transaction (0m0.774s)
cache = pdict.PersistentDict('test.db', max_buffer_size=5)
for k, v in data.items():
    cache[k] = v

In this example caching all records at once takes less than a second but caching each record individually takes almost 3 minutes.
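
The difference comes from SQLite itself: in the default autocommit mode every individual INSERT is wrapped in its own transaction, so batching the writes avoids paying that cost per record. Here is a rough sketch of the same idea using the standard sqlite3 module directly (the table layout and zlib compression are only illustrative, not how pdict is implemented internally):

import sqlite3
import zlib

conn = sqlite3.connect('raw_cache.db')
conn.execute('CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')

def store_all(records):
    # a single transaction: sqlite3 commits once when the with-block exits
    with conn:
        conn.executemany(
            'INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)',
            ((url, sqlite3.Binary(zlib.compress(html))) for url, html in records.items()))

store_all({'http://webscraping.com': '<html>...</html>'})

Presumably pdict's max_buffer_size controls the same trade-off: how many writes are grouped into one commit.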


Posted 01 Feb 2012 in efficiency and python

Python and other scripting languages are sometimes dismissed because of their inefficiency compared to compiled languages like C. For example, here are naive recursive implementations of the Fibonacci sequence, first in C:

int fib(int n) {
    if (n < 2)
        return n;
    else
        return fib(n - 1) + fib(n - 2);
}

int main() {
    fib(40);
    return 0;
}

And the same in Python:

def fib(n):
    if n < 2:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

fib(40)

And here are the execution times:

$ time ./fib
3.099s
$ time python fib.py
16.655s

As expected C executes much faster: over 5x faster in this case.

In the context of web scraping, raw execution speed matters less because the bottleneck is I/O: downloading the webpages. But I use Python in other contexts too, so let's see if we can do better.

First install psyco. On Debian-based Linux distributions this is just:

sudo apt-get install python-psyco

Then modify the Python script to call psyco:

import psyco
psyco.full()

def fib(n):
    if n < 2:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

fib(40)

And here is the updated execution time:

$ time python fib.py
3.190s

Just over 3 seconds: with psyco the execution time is now roughly equivalent to the C version. Psyco achieves this by compiling the Python code to machine code on the fly instead of interpreting it line by line.
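
Note that psyco.full() compiles every function it encounters, which costs extra memory. When only one function is hot it can instead be bound selectively; as far as I recall the psyco API this looks like:

import psyco

def fib(n):
    if n < 2:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

psyco.bind(fib) # compile just fib() rather than the whole program
fib(40)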

I now add the snippet below to most of my Python scripts so they take advantage of psyco when it is installed:

try:
    import psyco
    psyco.full()
except ImportError:
    pass # psyco not installed so continue as usual


Posted 04 Jan 2012 in big picture, opensource, and sitescraper

I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automatically adapt when the structure was updated but the content remained static.

However this approach is not helpful for me these days because most of my work involves scraping a website once-off. It is quicker to just specify the required XPaths than to collect and test training cases.
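
For a one-off job that only takes a few lines; for example with lxml, where the XPath expressions are just illustrations and would be written for the specific website:

import lxml.html

html = open('listing.html').read() # a webpage downloaded earlier
tree = lxml.html.fromstring(html)
# example fields - the XPaths would be tailored to the target website
title = tree.xpath('//h1/text()')
prices = tree.xpath('//span[@class="price"]/text()')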

I would still like an automated approach to help me work more efficiently. Ideally I would have a solution that, when given a website URL, would do the following (a rough sketch of the grouping and diffing steps follows the list):

  • crawl the website
  • organize the webpages into groups that share the same template (a directory page will have a different HTML structure than a listing page)
  • take the group with the most webpages as the listings
  • compare these listing webpages to find what is static (the template) and what changes
  • treat the parts that change as the dynamic data, such as descriptions and reviews
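
I have not built this, but the grouping and diffing steps could be prototyped with little more than the standard library. Here is a rough sketch, where the tag-sequence fingerprint and the similarity threshold are simply guesses at what might work:

import re
import difflib

def tag_sequence(html):
    # reduce a page to its sequence of opening tags - a crude structural fingerprint
    return re.findall(r'<(\w+)', html.lower())

def group_by_template(pages, threshold=0.9):
    # cluster pages whose tag sequences are near identical
    groups = []
    for url, html in pages.items():
        tags = tag_sequence(html)
        for group in groups:
            if difflib.SequenceMatcher(None, group['tags'], tags).ratio() > threshold:
                group['urls'].append(url)
                break
        else:
            groups.append({'tags': tags, 'urls': [url]})
    # the group with the most webpages should be the listings
    return max(groups, key=lambda g: len(g['urls']))

def changed_blocks(html1, html2):
    # the parts that differ between two pages of the same template are candidate data fields
    words1, words2 = html1.split(), html2.split()
    matcher = difflib.SequenceMatcher(None, words1, words2)
    return [' '.join(words2[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != 'equal']

SequenceMatcher is quadratic in the worst case, so this would only be practical on a sample of pages, but it captures the idea.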

Apparently this process of scraping data automatically is known as wrapper induction in academia. Unfortunately there do not seem to be any good open source solutions yet. The most commonly referenced one is Templatemaker, which is aimed at small text blocks and crashes in my test cases of real webpages. The author stopped development in 2007.

Some commercial groups have developed their own solutions, so this is certainly technically possible.

If I do not find an open source solution, I plan to attempt building my own later this year.


Posted 30 Dec 2011 in concurrent, efficiency, example, javascript, python, qt, and webkit

In a previous post I showed how to scrape a list of webpages. That is fine for small crawls but will take too long for larger ones. Here is an updated example that downloads the webpages concurrently with multiple webkit instances.

import sys
from collections import deque # threadsafe datatype
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
NUM_THREADS = 3 # how many concurrent webkit instances to use

class Render(QWebView):
    active = deque() # track how many instances are still downloading
    data = {} # store the data

    def __init__(self, urls):
        QWebView.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.urls = urls
        self.crawl()

    def crawl(self):
        try:
            url = self.urls.pop()
            print 'downloading', url
            Render.active.append(1)
            self.load(QUrl(url))
        except IndexError:
            # no more urls to process
            if not Render.active:
                # no more downloads in progress
                print 'finished'
                self.close()
                QApplication.quit() # make sure the event loop exits even though no window was shown

    def _loadFinished(self, result):
        # process the downloaded html
        frame = self.page().mainFrame()
        url = str(frame.url().toString())
        Render.data[url] = frame.toHtml()
        Render.active.popleft()
        self.crawl() # crawl next URL in the list

app = QApplication(sys.argv) # can only instantiate this once so must move outside class 
urls = deque(['http://webscraping.com', 'http://webscraping.com/questions',
    'http://webscraping.com/blog', 'http://webscraping.com/projects'])
renders = [Render(urls) for i in range(NUM_THREADS)]
app.exec_() # will execute qt loop until class calls close event
print Render.data.keys()
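
For a larger crawl the downloaded HTML should not be kept in memory; it could instead be written straight into the pdict cache described above. A sketch, assuming pdict is importable as in that post:

import pdict # the caching module described above
cache = pdict.PersistentDict('crawl.db')

class CachingRender(Render):
    def _loadFinished(self, result):
        # store the rendered html on disk instead of in Render.data
        frame = self.page().mainFrame()
        cache[str(frame.url().toString())] = unicode(frame.toHtml())
        Render.active.popleft()
        self.crawl()

# then create the downloaders with:
# renders = [CachingRender(urls) for i in range(NUM_THREADS)]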

Source code is available at my bitbucket account.