Blog

  • Web Scraping User Interface

    Crawling Business Web2py

    When scraping a website, the majority of the time is typically spent waiting for data to download, so to be efficient I work on multiple scraping projects simultaneously.

  • Using the internet archive to crawl a website

    Python Cache Crawling

    If a website is offline or restricts how quickly it can be crawled then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1998.
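
    Here is a minimal sketch of this approach in Python, assuming the requests library and the Wayback Machine’s availability API (archive.org/wayback/available), which reports the closest archived snapshot for a given URL:

        import requests

        def download_from_wayback(url, timestamp=None):
            """Return archived HTML for `url` from the Wayback Machine, or None if no snapshot exists."""
            params = {'url': url}
            if timestamp:
                params['timestamp'] = timestamp  # e.g. '20120101' for the snapshot nearest to 2012
            meta = requests.get('https://archive.org/wayback/available', params=params).json()
            closest = meta.get('archived_snapshots', {}).get('closest')
            if not closest or not closest.get('available'):
                return None  # the Wayback Machine has no copy of this page
            return requests.get(closest['url']).text

        html = download_from_wayback('http://example.com/')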

  • Using Google Translate to crawl a website

    Google Crawling Cache

    I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage so it is helpful to have backup options.

    One option is using Google Translate, which lets you translate a webpage into another language. If you select a source language you know the page is not written in (e.g. Dutch), then no translation takes place and you just get back the original content:
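
    As a rough sketch of what such a request might look like (the translate.google.com proxy endpoint with sl, tl and u parameters is an assumption about the interface, which may change), selecting Dutch as the source language for an English page returns the text untranslated:

        import urllib.parse
        import requests

        def fetch_via_google_translate(url, fake_source_lang='nl', target_lang='en'):
            # Pick a source language the page is *not* written in, so nothing gets translated.
            proxy_url = 'http://translate.google.com/translate?' + urllib.parse.urlencode({
                'sl': fake_source_lang,
                'tl': target_lang,
                'u': url,
            })
            return requests.get(proxy_url).text

        html = fetch_via_google_translate('http://example.com/page.html')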

  • Using Google Cache to crawl a website

    Google Cache Crawling

    Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data then downloading it quickly would require an expensive number of proxies.

    Fortunately there is an alternative - Google.

    If a website doesn’t exist in Google’s search results then for most people it doesn’t exist at all. Websites want visitors, so they will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want, and it makes much of that content available through its cache.
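
    A sketch of fetching a cached copy, assuming the webcache.googleusercontent.com endpoint and its cache: query form (Google may throttle or CAPTCHA automated requests, so treat this as illustrative):

        import urllib.parse
        import requests

        def fetch_google_cache(url):
            # Request Google's cached copy of the page rather than the page itself.
            cache_url = ('https://webcache.googleusercontent.com/search?q=cache:'
                         + urllib.parse.quote(url, safe=''))
            response = requests.get(cache_url, headers={'User-Agent': 'Mozilla/5.0'})
            response.raise_for_status()
            return response.text

        html = fetch_google_cache('http://example.com/page.html')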

  • How to crawl websites without being blocked

    User-agent Crawling Proxies

    Websites want users who will purchase their products and click on their advertising. They want to be crawled by search engines so their users can find them; however, they don’t (generally) want to be crawled by others. Ironically, Google itself is one such company.

    Some websites will actively try to stop scrapers, so here are some suggestions to help you crawl beneath their radar.
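
    A minimal sketch of a few of those suggestions, assuming the requests library; the proxy addresses and User-Agent strings below are placeholders rather than a recommended list:

        import itertools
        import random
        import time
        import requests

        USER_AGENTS = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
        ]
        PROXIES = itertools.cycle(['http://proxy1:8080', 'http://proxy2:8080'])  # placeholder proxies

        def polite_get(url):
            # Rotate proxies, send a browser-like User-Agent, and pause between downloads.
            proxy = next(PROXIES)
            response = requests.get(
                url,
                headers={'User-Agent': random.choice(USER_AGENTS)},
                proxies={'http': proxy, 'https': proxy},
            )
            time.sleep(random.uniform(1, 5))  # throttle so requests don't arrive in bursts
            return response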