import sys
from PyQt4.QtGui import QApplication  # PyQt4 assumed; PySide's equivalents also work
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    def __init__(self, urls, cb):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.urls = urls
        self.cb = cb
        self.crawl()
        self.app.exec_()

    def crawl(self):
        # load the next URL, or quit once the queue is empty
        if self.urls:
            url = self.urls.pop(0)
            print 'Downloading', url
            self.mainFrame().load(QUrl(url))
        else:
            self.app.quit()

    def _loadFinished(self, result):
        # the page has finished rendering: pass the HTML to the callback, then continue
        frame = self.mainFrame()
        url = str(frame.url().toString())
        html = frame.toHtml()
        self.cb(url, html)
        self.crawl()

def scrape(url, html):
    pass # add scraping code here

urls = ['http://webscraping.com', 'http://webscraping.com/blog']
r = Render(urls, cb=scrape)
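As a concrete example of what the callback might do (the version above only leaves a placeholder), here is a minimal sketch that pulls the page title out of the rendered HTML with a regular expression from the standard library. The unicode() call is there because frame.toHtml() hands the callback a QString rather than a plain Python string; the regex and the '(no title)' fallback are just illustrative choices.

import re

def scrape(url, html):
    # toHtml() returns a QString, so convert it before using the re module
    text = unicode(html)
    match = re.search(r'<title[^>]*>(.*?)</title>', text, re.IGNORECASE | re.DOTALL)
    print url, (match.group(1).strip() if match else '(no title)')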
This is a simple solution, but it keeps all of the HTML in memory, which is not practical for large crawls; for those you should save the results to disk as you go. I use the pdict module for this.
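pdict's interface isn't reproduced in this post, so here is a rough sketch of the same idea using the standard library's shelve module as a stand-in: a dictionary-like object backed by a file, with the callback writing each page straight to disk instead of accumulating it in memory. The filename crawl_results.db is just an example.

import shelve

store = shelve.open('crawl_results.db')  # example filename; any path works

def save_to_disk(url, html):
    # shelve keys must be plain strings; the HTML value is pickled to the file
    store[str(url)] = unicode(html)
    store.sync()

urls = ['http://webscraping.com', 'http://webscraping.com/blog']
Render(urls, cb=save_to_disk)  # blocks until every URL has been rendered
store.close()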
Update: the script now takes a callback so each download is processed immediately rather than being stored in memory.
The source code is available at my Bitbucket account.