- is slow because have to wait for FireFox to render the entire webpage
- is somewhat buggy and has a small user/developer community, mostly at MIT
An alternative solution that addresses all these points is webkit, the open source browser engine used most famously in Apple's Safari browser. Webkit has now been ported to the Qt framework and can be used through its Python bindings.
import sys from PyQt4.QtGui import * from PyQt4.QtCore import * from PyQt4.QtWebKit import * class Render(QWebPage): def __init__(self, url): self.app = QApplication(sys.argv) QWebPage.__init__(self) self.loadFinished.connect(self._loadFinished) self.mainFrame().load(QUrl(url)) self.app.exec_() def _loadFinished(self, result): self.frame = self.mainFrame() self.app.quit() url = 'http://webscraping.com' r = Render(url) html = r.frame.toHtml()
I can then analyze this resulting HTML with my standard Python tools like the webscraping module.