html | WebScraping.com

HTML is a tree structure: at the root is a <html> tag followed by the <head> and <body> tags and then more tags before the content itself. However when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.

Unfortunately the HTML of many webpages around the internet is invalid - for example a list may be missing closing tags:

Blog

Parsing HTML with Python