When crawling websites I usually cache all HTML on disk to avoid having to re-download later. I wrote the pdict module to automate this process. Here is an example:
The bottleneck here is insertions so for efficiency records can be buffered and then inserted in a single transaction:
In this example caching all records at once takes less than a second but caching each record individually takes almost 3 minutes.