Occasionally I come across a website that blocks my IP after only a few requests. If the website contains a lot of data, then downloading it quickly would require an expensive number of proxies.
Fortunately there is an alternative - Google.
If a website doesn’t exist in Google’s search results then for most people it doesn’t exist at all. Websites want visitors, so they will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want. And after downloading, Google makes much of the content available through its cache.
So instead of downloading a URL directly, we can download it indirectly via http://www.google.com/search?&q=cache%3Ahttp%3A//webscraping.com. Then the source website cannot block you and does not even know you are crawling its content.
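As a minimal sketch, the cache URL can be built for any target page with Python's standard library. The function name `cache_url` is my own label for illustration, and the sketch assumes Google still serves cached copies at this address, which may not hold.

```python
from urllib.parse import quote

GOOGLE_SEARCH = 'http://www.google.com/search?&q='

def cache_url(url):
    """Return the Google cache lookup URL for the given page URL."""
    # quote() leaves '/' unescaped by default and encodes ':' as %3A,
    # which matches the form of the example URL in the text.
    return GOOGLE_SEARCH + quote('cache:' + url)
```

Calling `cache_url('http://webscraping.com')` gives the example URL from the text, which can then be downloaded in place of the original page.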
The bottleneck for web scraping is generally bandwidth - the time waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.
Here are examples of both approaches:
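A minimal sketch of the two approaches, rather than the exact scripts timed below: `download_sequential` fetches one URL at a time, while `download_concurrent` shares a queue of URLs between worker threads. The function names and the `fetch` parameter are assumptions made for illustration.

```python
import threading
import urllib.request
from queue import Queue, Empty

def download(url):
    """Download a URL and return the raw bytes, or None on a network error."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read()
    except OSError:
        return None

def download_sequential(urls, fetch=download):
    """Download each URL in turn and return a dict of url -> content."""
    return {url: fetch(url) for url in urls}

def download_concurrent(urls, num_threads, fetch=download):
    """Download URLs using num_threads worker threads sharing one queue."""
    queue = Queue()
    for url in urls:
        queue.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = queue.get_nowait()
            except Empty:
                return  # no URLs left, so this thread is done
            html = fetch(url)
            with lock:
                results[url] = html

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because the threads spend most of their time waiting on the network, Python's global interpreter lock is not a serious limit here.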
Here are the results:
$ time python sequential.py
4m25.602s
$ time python concurrent.py 10
1m7.430s
$ time python concurrent.py 100
0m31.528s
As expected, threading the downloads makes a big difference. You may have noticed the time saved is not linearly proportional to the number of threads. That is primarily because my web server struggles to keep up with all the requests. When crawling websites with threads, be careful not to overload their web server by downloading too fast. Otherwise the website will become slower for other users and your IP risks being blacklisted.
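One way to avoid overloading a server is to share a throttle between the download threads, so that requests are spaced out no matter how many threads are running. This is only a sketch: the `Throttle` class and the choice of delay are assumptions, not part of the scripts timed above.

```python
import threading
import time

class Throttle:
    """Enforce a minimum delay between requests across all threads."""

    def __init__(self, delay):
        self.delay = delay          # minimum seconds between requests
        self.lock = threading.Lock()
        self.next_allowed = 0.0     # monotonic time of the next free slot

    def wait(self):
        # Reserve the next slot under the lock, then sleep outside it so
        # other threads can reserve later slots in the meantime.
        with self.lock:
            now = time.monotonic()
            slot = max(now, self.next_allowed)
            self.next_allowed = slot + self.delay
        time.sleep(slot - now)
```

Each thread would call `throttle.wait()` just before downloading a URL. A single shared `Throttle(1.0)` then limits the whole crawl to roughly one request per second.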
Often the data sets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to Google Storage.
Here is an example snippet to create a bucket (folder) on GS, upload a file, and then download it:
>>> gsutil mb gs://bucket_name
>>> gsutil ls gs://bucket_name
>>> gsutil cp path/to/file.ext gs://bucket_name
>>> gsutil ls gs://bucket_name
file.ext
>>> gsutil cp gs://bucket_name/file.ext file_copy.ext
By now you would be used to entering the text for an image like this:
The idea is this will prevent bots because only a real user can interpret the image.
However this is not an obstacle for a determined scraper because of services like deathbycaptcha that will solve the CAPTCHA for you. These services use cheap labor to manually interpret the images and send the result back through an API.
CAPTCHAs are still useful because they deter most bots. However, they cannot prevent a determined scraper and are annoying to genuine users.
An ongoing problem for my web scraping work is how much to quote for a job. I prefer a fixed fee to an hourly rate, so I need to consider the complexity upfront. My initial strategy was simply to quote low to ensure I got business and hopefully build up some regular clients.
Through experience I found the following factors most affected the time required for a job:
- Website size
- Login protected
- IP restrictions
- HTML quality
I developed a formula based on these factors and have now built an interface that lets potential clients clarify the costs involved with different kinds of web scraping jobs. Additionally I hope this will reduce the communication overhead by helping clients to provide the necessary information upfront.
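To show how such factors could combine into a quote, here is a purely hypothetical sketch. It is not my actual formula: the function names, the weights, and the pages-per-hour figure are all placeholders invented for illustration.

```python
def estimate_hours(num_pages, login_required, ip_restricted, messy_html,
                   base_hours=2.0, pages_per_hour=10000.0):
    """Hypothetical estimate of job time from the four factors.

    Every number here is a placeholder for illustration only.
    """
    hours = base_hours + num_pages / pages_per_hour  # website size
    if login_required:
        hours += 1.0    # handling sessions and authentication
    if ip_restricted:
        hours *= 1.5    # slower crawl or proxy setup
    if messy_html:
        hours *= 1.25   # more fragile parsing
    return hours

def fixed_fee(hours, hourly_rate):
    """Convert the estimated hours into a fixed fee."""
    return round(hours * hourly_rate, 2)
```

The point of the structure is that site size adds time, while restrictions and poor HTML multiply it, because they slow down every page that has to be crawled or parsed.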
Google App Engine provides generous free quotas for your app and additional paid quotas.
I always enable billing for my GAE apps even though I rarely exhaust the free quotas. Enabling billing and setting paid quotas does not mean you have to pay anything and in fact increases what you get for free.
Here is a screenshot of the billing panel:
GAE lets you allocate a daily budget to the various resources, with the minimum permitted budget being USD $1. When you exhaust a free quota you will only be charged for the budget allocated to it. In the above screenshot I have allocated all my budget to emailing, but since my app does not use the Mail API I can be confident this free quota will never be exhausted and I will never pay a cent. For another app that does use Mail I have allocated all the budget to Bandwidth Out instead.
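The reasoning can be written down as a small check. This is a simplified model of the argument above, not of GAE's actual billing rules, and the function and parameter names are hypothetical.

```python
def worst_case_daily_charge(budget, expected_usage, free_quota):
    """Upper bound on the daily charge under the simplified model.

    A resource can only cost money once its free quota is exhausted,
    and then at most the budget allocated to it.
    """
    total = 0.0
    for resource, allocated in budget.items():
        used = expected_usage.get(resource, 0)
        if used > free_quota.get(resource, 0):
            total += allocated
    return total
```

With the whole USD $1 budget on a resource the app never uses, for example `worst_case_daily_charge({'Mail': 1.0}, {'Mail': 0}, {'Mail': 1700000})`, the result is `0.0`.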
Now with billing enabled my app:
- can access the Blobstore API to store larger amounts of data
- enjoys much higher free limits for the Mail, Task Queue, and UrlFetch APIs - for example, by default an app can make 7,000 Mail API calls, but with billing enabled this limit jumps to 1,700,000 calls
- has a higher per minute CPU limit, which I find particularly useful when uploading a mass of records to the Datastore
So in summary you can enable billing to extend your free quotas without risk of paying.