
Blog

  • Using the internet archive to crawl a website

    Python Cache Crawling

    If a website is offline or restricts how quickly it can be crawled, then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1996.
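
    As a rough illustration, here is a minimal Python sketch that looks up the most recent Wayback Machine snapshot of a URL through archive.org’s availability endpoint and downloads it; the endpoint and JSON fields reflect the public API as I understand it today, not anything from the original post:

        import json
        import urllib.parse
        import urllib.request

        def wayback_download(url):
            # Ask the Wayback Machine for the closest available snapshot of this URL
            api = 'http://archive.org/wayback/available?url=' + urllib.parse.quote(url)
            with urllib.request.urlopen(api) as response:
                result = json.load(response)
            snapshot = result.get('archived_snapshots', {}).get('closest')
            if snapshot and snapshot.get('available'):
                # Download the cached copy rather than the live (possibly offline) site
                with urllib.request.urlopen(snapshot['url']) as response:
                    return response.read()
            return None

        html = wayback_download('http://example.com/')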

  • Caching data efficiently

    Python Cache Sqlite

    When crawling websites I usually cache all HTML on disk to avoid having to re-download it later. I wrote the pdict module to automate this process. Here is an example:
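
    A minimal sketch of that pattern, assuming pdict exposes a PersistentDict class that takes a database filename (the class name and import path are my assumptions; the original module targeted Python 2, but the idea is the same):

        import urllib.request
        from webscraping import pdict  # assumed import path for the pdict module

        cache = pdict.PersistentDict('cache.db')  # assumed class name; dict-like, backed by sqlite

        def get_html(url):
            # Only hit the network when the page is not already cached on disk
            if url not in cache:
                cache[url] = urllib.request.urlopen(url).read()
            return cache[url]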

  • Using Google Translate to crawl a website

    Google Crawling Cache

    I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage, so it is helpful to have backup options.

    One option is Google Translate, which lets you translate a webpage into another language. If the source language is set to something you know the page is not written in (e.g. Dutch), then no translation will take place and you will just get back the original content:
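
    The sketch below shows the general idea: request the page through Google Translate with a deliberately wrong source language, so the HTML comes back untranslated. The translate.google.com URL format and the sl/tl/u parameters reflect how the service worked around the time of this post and may have since changed:

        import urllib.parse
        import urllib.request

        def fetch_via_translate(url, fake_source_lang='nl'):
            # Claim the page is in a language it is not (e.g. Dutch) so no translation is applied
            params = urllib.parse.urlencode({'sl': fake_source_lang, 'tl': 'en', 'u': url})
            proxy_url = 'http://translate.google.com/translate?' + params
            return urllib.request.urlopen(proxy_url).read()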

  • Using Google Cache to crawl a website

    Google Cache Crawling

    Occasionally I come across a website that blocks an IP address after only a few requests. If the website contains a lot of data then downloading it quickly would require an expensive number of proxies.

    Fortunately there is an alternative - Google.

    If a website doesn’t exist in Google’s search results then for most people it doesn’t exist at all. Websites want visitors, so they will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want, and after downloading, Google makes much of the content available through its cache.
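
    As a rough sketch, a cached copy can be requested with Google’s cache: query. The webcache.googleusercontent.com endpoint shown here is an assumption about how the cache was exposed at the time rather than something from the original post, and Google throttles automated requests, so a real crawler would need to pause between downloads:

        import urllib.parse
        import urllib.request

        def google_cache(url):
            # Request Google's cached copy of the page instead of hitting the original site
            cache_url = ('http://webcache.googleusercontent.com/search?q=cache:'
                         + urllib.parse.quote(url))
            request = urllib.request.Request(cache_url, headers={'User-Agent': 'Mozilla/5.0'})
            return urllib.request.urlopen(request).read()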

  • Caching crawled webpages

    Python Cache

    When crawling large websites I store the HTML in a local cache so that if I need to rescrape the website later I can load the webpages quickly and avoid putting extra load on their server. This is often necessary when a client realizes they need additional data included in the scraped output.

    I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 are built into Python (2.5+), so there are no external dependencies.

    Here is some example usage of pdict:
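
    A short sketch of that dictionary-style usage, again assuming the PersistentDict class name; the zlib compression and the write to sqlite happen transparently on assignment and lookup:

        >>> from webscraping import pdict   # assumed import path
        >>> cache = pdict.PersistentDict('cache.db')
        >>> cache['http://example.com/'] = '<html>...</html>'   # compressed and written to sqlite
        >>> 'http://example.com/' in cache
        True
        >>> cache['http://example.com/']
        '<html>...</html>'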