Using Google Translate to crawl a website
Posted 29 May 2011 in cache, crawling, and google

I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage, so it is helpful to have backup options.

One option is Google Translate, which lets you translate a webpage into another language. If you set the source language to something you know it is not (e.g. Dutch), then no translation takes place and you just get back the original content.
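To make the trick concrete, here is a minimal sketch of how such a request URL could be built. It assumes the `translate.google.com/translate` endpoint with `sl` (source language), `tl` (target language) and `u` (page URL) query parameters, which Google may change at any time. The helper name `gtrans_url` is made up for illustration and is not part of the webscraping library:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def gtrans_url(url, source='nl', target='en'):
    """Build a Google Translate proxy URL for `url` (hypothetical helper).

    Setting `source` to a language the page is not written in (Dutch here)
    is what should make Google return the content untranslated.
    """
    # Assumed endpoint and parameter names, not taken from the library.
    query = urlencode([('sl', source), ('tl', target), ('u', url)])
    return 'http://translate.google.com/translate?' + query


print(gtrans_url('http://webscraping.com/faq'))
```

The page URL is percent-encoded so that it survives as a single query parameter.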

I added a function to download a URL via Google Translate and Google Cache to the webscraping library. Here is an example:

from webscraping import download, xpath

D = download.Download()
url = 'http://webscraping.com/faq'
html1 = D.get(url)  # download directly
html2 = D.gcache_get(url)  # download via Google Cache
html3 = D.gtrans_get(url)  # download via Google Translate
# print the title extracted from each copy of the page
for html in (html1, html2, html3):
    print(xpath.get(html, '//title'))

This example downloads the same webpage directly, via Google Cache, and via Google Translate. Then it parses the title to show the same webpage has been downloaded. The output when run is:

Frequently asked questions | webscraping
Frequently asked questions | webscraping
Frequently asked questions | webscraping

The same title was extracted from each source, which shows that the correct result was downloaded from Google Cache and Google Translate.
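Since the point of these alternatives is to have a backup when one source fails, the three downloads can also be chained. The following is only a sketch of that idea: `first_success` is a hypothetical helper, not something the webscraping library provides, and it assumes that a failed download either raises an exception or returns an empty value.

```python
def first_success(url, fetchers):
    """Return the first non-empty result from a list of download callables.

    Hypothetical helper: each fetcher takes a URL and returns HTML.
    """
    for fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            # Treat any download error as a miss and try the next source.
            continue
        if html:
            return html
    return None  # every source failed
```

With the `Download` object from the example above, it could be called as `first_success(url, [D.get, D.gcache_get, D.gtrans_get])` to try the direct download first, then Google Cache, then Google Translate.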
