Why reinvent the wheel?

Lxml Xpath Python Scrapy Beautifulsoup August 27, 2010

I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.

I am aware of these libraries and have used them in the past with good results. However, my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client's machine. This rules out installing full frameworks such as Scrapy or compiling C-based libraries such as lxml. I need a pure Python solution, which also gives me the flexibility to run the script on Google App Engine.

To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
The best-known Python HTML parser seems to be BeautifulSoup. However, I find it slow and difficult to use (compared to XPath), it often parses HTML inaccurately, and, significantly, the original author has lost interest in developing it further. So I would not recommend using it. Instead, go with html5lib.

To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn’t find one six months ago when I needed it, so I developed this simple version that covers my typical use cases. I would deprecate it in the future if a decent solution does come along, but for now I am happy with my pure Python infrastructure.

How to use XPaths robustly

Xpath January 05, 2010

In an earlier post I referred to XPaths but did not explain how to use them.

Say we have the following HTML document:

<html>  
 <body>
  <div></div>  
  <div id="content">  
   <ul>  
    <li>First item</li>  
    <li>Second item</li>  
   </ul>  
  </div>  
 </body>  
</html>

To access the list elements we follow the HTML structure from the root tag down to the li tags:

html > body > second div > ul > every li

An XPath to represent this traversal is:

/html[1]/body[1]/div[2]/ul[1]/li
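
As a rough sketch of this traversal in code, the example below uses Python's standard xml.etree.ElementTree module as a stand-in XPath engine. That is an assumption for illustration only: it is not the library discussed on this blog, it supports just a subset of XPath, and it only works here because the sample document happens to be well-formed XML. ElementTree paths are evaluated relative to the root element, so the leading /html[1] step becomes a dot.

```python
import xml.etree.ElementTree as ET

# The sample document from above (well-formed, so an XML parser accepts it).
HTML = """<html>
 <body>
  <div></div>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>"""

root = ET.fromstring(HTML)  # root is the <html> element

# Equivalent of /html[1]/body[1]/div[2]/ul[1]/li, written relative to <html>.
# Indexes are 1-based, as in XPath.
items = root.findall("./body[1]/div[2]/ul[1]/li")
texts = [li.text for li in items]
print(texts)
```

Changing div[2] to div[1] selects the empty first div instead, which has no ul child, so the result would be an empty list.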

If a tag has no index then every tag of that type at that step will be selected:

/html/body/div/ul/li
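
To see the effect of dropping the indexes, here is a small sketch that again assumes the standard xml.etree.ElementTree module as the XPath engine (a limited subset of XPath, used purely for illustration, with the path written relative to the html root). Without an index the div step matches both divs, but only the second one contains a ul, so the final result is the same two list items.

```python
import xml.etree.ElementTree as ET

# Compact copy of the sample document.
HTML = (
    '<html><body><div></div><div id="content"><ul>'
    "<li>First item</li><li>Second item</li>"
    "</ul></div></body></html>"
)

root = ET.fromstring(HTML)

# No index on div: every div directly under body is selected.
divs = root.findall("./body/div")

# Equivalent of /html/body/div/ul/li: li tags under any of those divs.
items = root.findall("./body/div/ul/li")

print(len(divs), [li.text for li in items])
```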

XPaths can also use attributes to select nodes:

/html/body/div[@id="content"]/ul/li
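
A minimal sketch of attribute selection, assuming the standard xml.etree.ElementTree module as the XPath engine (it handles simple [@attribute='value'] predicates, though not the full XPath language, and the path is written relative to the html root):

```python
import xml.etree.ElementTree as ET

# Compact copy of the sample document.
HTML = (
    '<html><body><div></div><div id="content"><ul>'
    "<li>First item</li><li>Second item</li>"
    "</ul></div></body></html>"
)

root = ET.fromstring(HTML)

# Equivalent of /html/body/div[@id="content"]/ul/li.
# The div is picked by its id attribute rather than by its position.
items = root.findall("./body/div[@id='content']/ul/li")
texts = [li.text for li in items]
print(texts)
```

Selecting by id keeps working if another div is later inserted before the content div, whereas a positional step such as div[2] would then point at the wrong node.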

And instead of using an absolute XPath from the root, the XPath can start with a double slash, which matches the node at any depth in the document:

//div[@id="content"]/ul/li
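
The sketch below illustrates why the double-slash form tends to be more robust. It assumes the standard xml.etree.ElementTree module as the XPath engine (where // is written .// relative to the root element), and the REDESIGNED markup with its extra wrapper div is a hypothetical page change invented for this example.

```python
import xml.etree.ElementTree as ET

LIST = "<ul><li>First item</li><li>Second item</li></ul>"

# The sample document, in compact form.
ORIGINAL = (
    "<html><body><div></div>"
    '<div id="content">' + LIST + "</div>"
    "</body></html>"
)

# Hypothetical redesign: the content div is now nested inside a wrapper div.
REDESIGNED = (
    '<html><body><div class="wrapper">'
    '<div id="content">' + LIST + "</div>"
    "</div></body></html>"
)

ABSOLUTE = "./body/div[@id='content']/ul/li"  # /html/body/div[@id="content"]/ul/li
ANYWHERE = ".//div[@id='content']/ul/li"      # //div[@id="content"]/ul/li


def select_texts(markup, path):
    """Parse the markup and return the text of every node the path selects."""
    root = ET.fromstring(markup)
    return [node.text for node in root.findall(path)]


for name, markup in (("original", ORIGINAL), ("redesigned", REDESIGNED)):
    print(name, select_texts(markup, ABSOLUTE), select_texts(markup, ANYWHERE))
```

On the redesigned page the absolute path finds nothing, because the content div is no longer a direct child of body, while the double-slash path still returns both items.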