
Blog

  • Why reinvent the wheel?

    Lxml Xpath Python Scrapy Beautifulsoup

    I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.

    I am aware of these libraries and have used them in the past with good results. However, my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client's machine. This rules out installing full frameworks such as Scrapy or compiling C-based libraries such as lxml: I need a pure Python solution. This also gives me the flexibility to run the script on Google App Engine.

    To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
    The best-known Python HTML parser seems to be BeautifulSoup. However, I find it slow and difficult to use (compared to XPath), it often parses HTML inaccurately, and, significantly, the original author has lost interest in developing it further. So I would not recommend it; go with html5lib instead.

    To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn’t find one six months ago when I needed it, so I developed this simple version that covers my typical use cases. I would deprecate it in future if a decent solution does come along, but for now I am happy with my pure Python infrastructure.

  • How to use XPaths robustly

    Xpath

    In an earlier post I referred to XPaths but did not explain how to use them.

    Say we have the following HTML document:

    <html>  
     <body>
      <div></div>  
      <div id="content">  
       <ul>  
        <li>First item</li>  
        <li>Second item</li>  
       </ul>  
      </div>  
     </body>  
    </html>

    To access the list elements we follow the HTML structure from the root tag down to the li elements:

    html > body > 2nd div > ul > many li's.

    An XPath to represent this traversal is:

    /html[1]/body[1]/div[2]/ul[1]/li

    If a tag has no index then every matching tag at that step will be selected:

    /html/body/div/ul/li
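
    As a rough sketch, the effect of the index can be tried out with Python's standard-library xml.etree.ElementTree. This is only an illustration, not the XPath module mentioned in the earlier post: ElementTree supports just a subset of XPath, expects well-formed markup, and evaluates paths relative to the root element, so /html/body/... is written here as ./body/... . The names DOCUMENT and item_texts are made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample document from above (well-formed, so an XML parser accepts it).
DOCUMENT = """<html>
 <body>
  <div></div>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>"""

# fromstring() returns the root <html> element, so paths start below it.
root = ET.fromstring(DOCUMENT)


def item_texts(xpath):
    """Return the text of every node matched by xpath (hypothetical helper)."""
    return [node.text for node in root.findall(xpath)]


# With an index on li, only the first list item is selected.
first_only = item_texts("./body/div[2]/ul/li[1]")

# Without an index on li, every li under the second div is selected.
all_items = item_texts("./body/div[2]/ul/li")

# With no indexes at all, every div is searched; only one contains a list.
no_indexes = item_texts("./body/div/ul/li")

print(first_only)
print(all_items)
print(no_indexes)
```

    In this document the last two paths return the same two items, because the first div is empty.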

    XPaths can also use attributes to select nodes:

    /html/body/div[@id="content"]/ul/li 

    And instead of using an absolute XPath from the root, the XPath can begin with a double slash, which matches the node that follows at any depth in the document:

    //div[@id="content"]/ul/li
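
    To see why the attribute-based form tends to be more robust, here is a minimal sketch using the standard-library xml.etree.ElementTree (which supports only a subset of XPath, and writes the double-slash form as .// relative to the root element). The CHANGED document is a hypothetical redesign of the page, invented for this example, in which a banner div is added before the content; select_text, BY_INDEX and BY_ATTRIBUTE are likewise illustrative names.

```python
import xml.etree.ElementTree as ET

# The sample document, collapsed onto fewer lines.
ORIGINAL = (
    "<html><body>"
    "<div></div>"
    '<div id="content"><ul><li>First item</li><li>Second item</li></ul></div>'
    "</body></html>"
)

# Hypothetical redesign (an assumption for this example): an extra div
# now sits before the content div.
CHANGED = (
    "<html><body>"
    "<div></div>"
    '<div class="banner"></div>'
    '<div id="content"><ul><li>First item</li><li>Second item</li></ul></div>'
    "</body></html>"
)

BY_INDEX = "./body/div[2]/ul/li"              # depends on the div's position
BY_ATTRIBUTE = ".//div[@id='content']/ul/li"  # depends only on the id


def select_text(html, xpath):
    """Parse html and return the text of each node matched by xpath."""
    root = ET.fromstring(html)
    return [node.text for node in root.findall(xpath)]


for name, html in (("original", ORIGINAL), ("changed", CHANGED)):
    print(name, "by index:", select_text(html, BY_INDEX))
    print(name, "by attribute:", select_text(html, BY_ATTRIBUTE))
```

    On the changed document the indexed path lands on the banner div and matches nothing, while the attribute-based path still finds both list items.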