Caching data efficiently

When crawling websites I usually cache all HTML on disk to avoid having to re-download later. I wrote the pdict module to automate this process. Here is an example:

import pdict
# initiate cache
cache = pdict.PersistentDict('test.db')

# compresses and store content in the database
cache[url] = html 

# iterate all data in the database
for key in cache:
    print cache[key]

The bottleneck here is insertions so for efficiency records can be buffered and then inserted in a single transaction:

# dictionary of data to insert
data = {...}

# cache each record individually (2m49.827s)
cache = pdict.PersistentDict('test.db', max_buffer_size=0)
for k, v in data.items():
    cache[k] = v

# cache all records in a single transaction (0m0.774s)
cache = pdict.PersistentDict('test.db', max_buffer_size=5)
for k, v in data.items():
    cache[k] = v

In this example caching all records at once takes less than a second but caching each record individually takes almost 3 minutes.

blog comments powered by Disqus