When crawling large websites I store the HTML in a local cache, so that if I need to rescrape the site later I can load the pages quickly and avoid placing extra load on their server. This is often necessary when a client realizes they need additional features included in the scraped output.
I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in an SQLite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 are part of the Python standard library (2.5+), so there are no external dependencies.
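To illustrate the general approach, here is a minimal sketch of such a wrapper: a class that maps the dictionary protocol onto an SQLite table, compressing values with zlib on the way in and decompressing on the way out. This is an assumption-laden illustration of the technique, not the actual pdict implementation (the class name, table layout, and method set here are my own).

```python
import sqlite3
import zlib


class SqliteDict:
    """Dictionary-like cache backed by an SQLite table, with
    zlib-compressed values. An illustrative sketch, not pdict itself."""

    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')

    def __setitem__(self, key, value):
        # compress before writing to disk
        blob = zlib.compress(value.encode('utf-8'))
        self.conn.execute(
            'REPLACE INTO cache (key, value) VALUES (?, ?)', (key, blob))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            'SELECT value FROM cache WHERE key=?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        # decompress after reading
        return zlib.decompress(row[0]).decode('utf-8')

    def __contains__(self, key):
        return self.conn.execute(
            'SELECT 1 FROM cache WHERE key=?', (key,)).fetchone() is not None

    def __delitem__(self, key):
        self.conn.execute('DELETE FROM cache WHERE key=?', (key,))
        self.conn.commit()

    def keys(self):
        return [row[0] for row in self.conn.execute('SELECT key FROM cache')]
```

Storing the value as a compressed BLOB keyed by URL keeps the database compact, and `REPLACE INTO` makes assignment idempotent, so rescraping a page simply overwrites the cached copy.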
Here is some example usage of pdict:
>>> from webscraping.pdict import PersistentDict
>>> url1, url2 = 'http://example.com/1', 'http://example.com/2'
>>> html1, html2 = '<html>page 1</html>', '<html>page 2</html>'
>>> cache = PersistentDict('cache.db')  # filename for the SQLite database
>>> cache[url1] = html1
>>> cache[url2] = html2
>>> url1 in cache
True
>>> cache[url1]
'<html>page 1</html>'
>>> cache.keys()
['http://example.com/1', 'http://example.com/2']
>>> del cache[url1]
>>> url1 in cache
False