Blog

  • Why reinvent the wheel?

    Lxml Xpath Python Scrapy Beautifulsoup

    I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.

    I am aware of these libraries and have used them in the past with good results. However, my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client's machine. This rules out installing full frameworks such as Scrapy or compiling C-based libraries such as lxml - I need a pure Python solution. It also gives me the flexibility to run the script on Google App Engine.

    To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
    The best known Python HTML parser seems to be BeautifulSoup, but I find it slow, difficult to use (compared to XPath), often inaccurate at parsing HTML, and - significantly - its original author has lost interest in developing it further. So I would not recommend using it - go with html5lib instead.

    To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn't find one six months ago when I needed it, so I developed this simple version that covers my typical use cases. I would deprecate it if a decent solution came along, but for now I am happy with my pure Python infrastructure.
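    As a rough sketch of the two stages using only pure Python tools: html5lib to parse the (possibly invalid) HTML, and the limited XPath subset already built into the standard library's ElementTree for selection. This is just an illustration of the approach, not my XPath module - ElementTree's findall() is far from a full XPath implementation, and the sample HTML is made up:

        import html5lib

        # Invalid HTML: the <li> elements are missing their closing tags.
        html = """
        <html><body>
        <ul>
          <li><a href="/page1">Page 1</a>
          <li><a href="/page2">Page 2</a>
        </ul>
        </body></html>
        """

        # Stage 1: parse into a tree. html5lib repairs the markup the same
        # way a browser would. namespaceHTMLElements=False keeps tag names
        # plain ('a' rather than '{http://www.w3.org/1999/xhtml}a').
        tree = html5lib.parse(html, namespaceHTMLElements=False)

        # Stage 2: select the relevant nodes. findall() supports a limited
        # subset of XPath expressions.
        for link in tree.findall('.//ul/li/a'):
            print(link.get('href'), link.text)

    Because html5lib closes the dangling <li> tags while parsing, each list item contains exactly one link and the loop prints /page1 and /page2 with their anchor text.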

  • Parsing HTML with Python

    Lxml Python Html

    HTML is a tree structure: at the root is an <html> tag, followed by the <head> and <body> tags and then more tags before the content itself. However, when a webpage is downloaded, all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.
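    Here is a small sketch of the difference, using a deliberately valid snippet so that the standard library's XML parser can handle it (real webpages usually cannot be parsed this way, as discussed below):

        import re
        import xml.etree.ElementTree as ET

        html = ('<html><head><title>Example</title></head>'
                '<body><p>Hello <b>world</b></p></body></html>')

        # Working on the raw characters is fine for a simple pattern.
        print(re.search('<title>(.*?)</title>', html).group(1))  # Example

        # But traversing the content requires the parsed tree.
        root = ET.fromstring(html)
        print([child.tag for child in root])  # ['head', 'body']
        print(root.find('body/p/b').text)     # world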

    Unfortunately the HTML of many webpages across the internet is invalid - for example, a list may be missing its closing tags: