These days I am often contacted by businesses asking if I want to try a free trial of their service. A recent one was Luminati, which claimed to have access to millions of IP addresses. They weren’t willing to divulge much over email and their website had less information than it does now, so we set up a Skype call. My contact was a salesman, so he wasn’t able to answer technical questions, but he gave me a good overview of what they are trying to do. Apparently they are an Israeli startup that built a peer-to-peer network called Hola, where users install a plugin to access content that is blocked in their region by downloading via other peers in the network. Now that they had millions of users, they wanted to monetize this network by reselling it as a proxy service. It is a great idea, though this was not clear when I signed up for a test account with Hola, so I doubt most users are aware that their bandwidth is being resold.
Proxies can be necessary when web scraping because some websites restrict the number of pages each user can download, typically identifying users by IP address. With proxies, your requests appear to come from multiple users, so the chance of being blocked is reduced.
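To make this concrete, here is a minimal sketch of rotating requests across a pool of proxies using only Python's standard library. The proxy addresses and the helper names (`proxy_cycle`, `build_opener_for`, `download`) are my own placeholders for illustration, not anything Luminati or Hola provides, so substitute whatever your proxy supplier gives you.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses, for illustration only.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]


def proxy_cycle(proxies):
    """Return an iterator that yields the proxies in round-robin order, forever."""
    return itertools.cycle(proxies)


def build_opener_for(proxy):
    """Return a urllib opener that routes HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def download(url, rotation, timeout=10):
    """Download `url` through the next proxy in the rotation and return the body."""
    proxy = next(rotation)
    opener = build_opener_for(proxy)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

You would create the rotation once with `rotation = proxy_cycle(PROXIES)` and then call `download(url, rotation)` for each page, so consecutive requests leave through different proxies. Round-robin is just the simplest policy; choosing a proxy at random or retiring proxies that fail would be reasonable variations.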
Websites want users who will purchase their products and click on their advertising. They want to be crawled by search engines so their users can find them; however, they (generally) don’t want to be crawled by others. One such company is Google, ironically.
Some websites will actively try to stop scrapers, so here are some suggestions to help you crawl beneath their radar.