Web scraping with regular expressions

Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let’s say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions:

import re
import time
import urllib2
from BeautifulSoup import BeautifulSoup
from lxml import html as lxmlhtml

def timeit(fn, *args):
    t1 = time.time()
    for i in range(100):
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2-t1)*1000.0)
def bs_test(html):
    soup = BeautifulSoup(html)
    return soup.html.head.title
def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()
def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)[0]
if __name__ == '__main__':
    url = 'http://webscraping.com/blog/Web-scraping-with-regular-expressions/'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)

The results are:

regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms

That means for this use case lxml takes 40x longer than regular expressions and BeautifulSoup over 1000x! This is because lxml and BeautifulSoup parse the entire document into their internal format, when only the title is required.

XPaths are very useful for most web scraping tasks, but there is still a use case for regular expressions.

blog comments powered by Disqus