I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage, so it is helpful to have backup options.
One option is using Google Translate, which lets you translate a webpage into another language. If you set the source language to one you know the page is not written in (e.g. Dutch), then no translation will take place and you will just get back the original content.
I added a function to download a URL via Google Translate and Google Cache to the webscraping library. Here is an example:
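As a rough illustration of the idea, below is a minimal standard-library sketch rather than the webscraping library's own API. The helper names (`google_cache_url`, `google_translate_url`, `extract_title`, `fetch`), the two URL templates, and the `http://example.com/` target are all assumptions made for this sketch. Google may change or retire these endpoints, and the translate endpoint may wrap the page in a frame, so treat it as a starting point only.

```python
import re
import urllib.parse
import urllib.request

# Assumed URL template; not guaranteed to be served by Google today.
GCACHE_TEMPLATE = "http://webcache.googleusercontent.com/search?q=cache:{}"


def google_cache_url(url):
    """Build a Google Cache URL for the given page (assumed format)."""
    return GCACHE_TEMPLATE.format(urllib.parse.quote(url, safe=""))


def google_translate_url(url, source="nl", target="en"):
    """Build a Google Translate URL for the given page (assumed format).

    The source language is deliberately set to one the page is not
    written in (Dutch by default) so the content comes back untranslated.
    """
    query = urllib.parse.urlencode({"sl": source, "tl": target, "u": url})
    return "http://translate.google.com/translate?" + query


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def fetch(url, timeout=10):
    """Download a URL and return the body decoded as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    page = "http://example.com/"  # hypothetical target page
    sources = (page, google_cache_url(page), google_translate_url(page))
    for source_url in sources:
        try:
            print(extract_title(fetch(source_url)))
        except OSError as error:
            # Covers URLError, HTTPError and timeouts.
            print("failed:", source_url, error)
```

Keeping the URL builders separate from `fetch` means the fallback order can be changed, or another proxy-like service added, without touching the download code.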
This example downloads the same webpage directly, via Google Cache, and via Google Translate. Then it parses the title to show the same webpage has been downloaded. The output when run is:
Frequently asked questions | webscraping
Frequently asked questions | webscraping
Frequently asked questions | webscraping
The same title was extracted from each source, which shows that the correct result was downloaded from Google Cache and Google Translate.