When crawling websites I usually cache all the HTML on disk to avoid having to re-download pages later. I wrote the pdict module to automate this process. Here is an example:
import pdict
# initialize the cache
cache = pdict.PersistentDict('test.db')
# compress and store the content in the database
cache[url] = html
# iterate over all keys in the database
for key in cache:
    print cache[key]
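In a crawler the usual pattern is to check the cache before downloading anything. Here is a minimal sketch of that pattern; the fetch helper and example URL are my own, and it assumes PersistentDict supports the in operator through its dict-like interface:

import urllib2
import pdict

cache = pdict.PersistentDict('test.db')

def fetch(url):
    # hypothetical helper: reuse cached HTML when available,
    # otherwise download the page and cache it for next time
    if url in cache:  # assumes PersistentDict implements __contains__
        return cache[url]
    html = urllib2.urlopen(url).read()
    cache[url] = html
    return html

html = fetch('http://example.com')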
The bottleneck here is insertions, so for efficiency records can be buffered and then written to the database in a single transaction:
# dictionary of data to insert
data = {...}
# cache each record individually (2m49.827s)
cache = pdict.PersistentDict('test.db', max_buffer_size=0)
for k, v in data.items():
    cache[k] = v
# cache all records in a single transaction (0m0.774s)
cache = pdict.PersistentDict('test.db', max_buffer_size=5)
for k, v in data.items():
    cache[k] = v
In this example, caching all the records in a single transaction takes under a second, while caching each record individually takes almost three minutes.
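The speedup comes from SQLite's commit behavior: each commit forces a sync to disk, so committing once per record pays that cost over and over. This is not pdict's actual internals, just a rough sketch of the same idea in plain sqlite3:

import sqlite3

conn = sqlite3.connect('example.db')
conn.execute('CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)')
data = {'url1': 'html1', 'url2': 'html2'}  # sample records for illustration

# slow: committing after each insert forces a disk sync every time
for k, v in data.items():
    conn.execute('INSERT OR REPLACE INTO cache VALUES (?, ?)', (k, v))
    conn.commit()

# fast: buffer the records and commit them in a single transaction
conn.executemany('INSERT OR REPLACE INTO cache VALUES (?, ?)', data.items())
conn.commit()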