Caching crawled webpages

When crawling large websites I store the HTML in a local cache, so if I need to rescrape the website later I can load the webpages quickly and avoid putting extra load on its web server. This is often necessary when a client realizes they require additional features in the scraped output.
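The workflow above amounts to checking the cache before downloading. Here is a minimal sketch of that idea, assuming any dictionary-like cache object; the `fetch` helper and the `fake_download` stand-in are hypothetical names used for illustration, not part of any library.

```python
def fetch(url, cache, download):
    """Return the HTML for url, downloading only on a cache miss.

    Hypothetical helper: `cache` is any dictionary-like object and
    `download` is any callable that takes a URL and returns HTML.
    """
    if url in cache:
        return cache[url]
    html = download(url)
    cache[url] = html
    return html


# Stand-in downloader that records each request instead of using the network.
calls = []

def fake_download(url):
    calls.append(url)
    return "<html>page for %s</html>" % url


cache = {}  # a plain dict here; a disk-backed cache would persist between runs
first = fetch("http://example.com/", cache, fake_download)
second = fetch("http://example.com/", cache, fake_download)
# The second call is served from the cache, so only one download happened.
```

With a cache that persists on disk, a later rescrape would take the first branch for every page already crawled and never touch the remote server.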

I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in an SQLite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 are built into Python (2.5+), so there are no external dependencies.

Here is some example usage of pdict:

>>> from webscraping.pdict import PersistentDict
>>> cache = PersistentDict(CACHE_FILE)
>>> cache[url1] = html1
>>> cache[url2] = html2
>>> url1 in cache
True
>>> cache[url1]
html1
>>> cache.keys()
[url1, url2]
>>> del cache[url1]
>>> url1 in cache
False
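
To illustrate the storage scheme described above (a dictionary-like interface over SQLite with zlib compression), here is a minimal sketch. It is not the pdict source: the `CompressedDict` class, the table name and the column names are assumptions made up for this example, and it is written for Python 3.

```python
import sqlite3
import zlib


class CompressedDict:
    """Dictionary-like store backed by SQLite, with zlib-compressed values.

    Hypothetical illustration only; not the actual pdict implementation.
    """

    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)"
        )
        self.conn.commit()

    def __setitem__(self, key, value):
        # Compress the text before it is written to disk.
        blob = zlib.compress(value.encode("utf-8"))
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, blob)
        )
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        # Decompress after reading, mirroring the write path.
        return zlib.decompress(row[0]).decode("utf-8")

    def __contains__(self, key):
        row = self.conn.execute(
            "SELECT 1 FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return row is not None

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self.conn.execute("DELETE FROM cache WHERE key = ?", (key,))
        self.conn.commit()

    def keys(self):
        rows = self.conn.execute("SELECT key FROM cache ORDER BY rowid")
        return [row[0] for row in rows]
```

Passing a file path to the constructor keeps the data between runs, while `":memory:"` gives a throwaway database that is handy for testing. HTML tends to compress well, so storing the compressed bytes keeps the database file considerably smaller than the raw pages.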
