Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let's say I want to extract the title of a particular webpage. Here are implementations using BeautifulSoup, lxml, and regular expressions:
```python
import re
import time
import urllib2
from BeautifulSoup import BeautifulSoup
from lxml import html as lxmlhtml

def timeit(fn, *args):
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2 - t1) * 1000.0)

def bs_test(html):
    soup = BeautifulSoup(html)
    return soup.html.head.title

def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    # xpath() returns a list of matching elements, so take the first
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)

if __name__ == '__main__':
    url = 'http://webscraping.com/blog/Web-scraping-with-regular-expressions/'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
```
The results are:
```
regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms
```
That means for this use case lxml takes over 40x longer than regular expressions, and BeautifulSoup over 1000x! This is because lxml and BeautifulSoup parse the entire document into their internal representation, when only the title is required.
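To see why, here is a minimal self-contained sketch (the HTML snippet is made up for illustration rather than downloaded) showing that the regex approach simply scans the raw string for the pattern, never building a tree:

```python
import re

# Made-up HTML document for illustration - the regex never parses this
# structure, it just scans the raw text for the <title> pattern
html = '<html><head><title>Example Page</title></head><body><p>Hello</p></body></html>'

# The non-greedy (.*?) captures only the text between the title tags
title = re.findall('<title>(.*?)</title>', html)[0]
print(title)
```

A full parser, by contrast, has to tokenize every tag in the document before the title can be queried, which is where the extra time goes.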
XPaths are very useful for most web scraping tasks, but there is still a use case for regular expressions.
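One caveat worth noting: the simple pattern used above assumes a lowercase `<title>` tag with no attributes. For pages that vary, a slightly more defensive regex helps; the sketch below (still a heuristic, not a substitute for a real parser) is one way to do it:

```python
import re

# Compiled once: [^>]* tolerates attributes on the tag, IGNORECASE
# handles mixed-case tags, and DOTALL lets the title span multiple lines
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def extract_title(html):
    # Return the stripped title text, or None if no title tag is found
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None

print(extract_title('<TITLE lang="en">\n My Page \n</TITLE>'))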