Page 5 of 8 for Blog | WebScraping.com

Google App Engine limitations

Google Gae June 19, 2011

Most of the discussion about Google App Engine seems to focus on how it allows you to scale your app, however I find it most useful for small client apps where we want a reliable platform while avoiding any ongoing hosting fee. For large apps paying for hosting would not be a problem.

These are some of the downsides I have found using Google App Engine:

Read More
Using Google Translate to crawl a website

Google Crawling Cache May 29, 2011

I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage so it is helpful to have backup options.

One option is using Google Translate, which let’s you translate a webpage into another language. If the source language is selected as something you know it is not (eg Dutch) then no translation will take place and you will just get back the original content:

Read More
Using Google Cache to crawl a website

Google Cache Crawling May 15, 2011

Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data then downloading it quickly would take an expensive amount of proxies.

Fortunately there is an alternative - Google.

If a website doesn’t exist in Google’s search results then for most people it doesn’t exist at all. Websites want visitors so will usually be happy for Google to crawl their content. This meansGoogle has likely already downloaded all the web pages we want. And after downloading Google makes much of the content available through their cache.

Read More
Crawling with threads

Concurrent Python April 10, 2011

The bottleneck for web scraping is generally bandwidth - the time waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.

Read More
Google Storage

Google March 30, 2011
Often the data sets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to Google Storage.
Here is an example snippet to create a folder on GS, upload a file, and then download it:
```
>>> gsutil mb gs://bucket_name  
>>> gsutil ls  
gs://bucket_name  
>>> gsutil cp path/to/file.ext gs://bucket_name  
>>> gsutil ls gs://bucket_name  
file.ext  
>>> gsutil cp gs://bucket_name/file.ext file_copy.ext
```
Read More
Automating CAPTCHA's

Captcha February 20, 2011

By now you would be used to entering the text for an image like this:

Read More
Automated quote tool

Business November 06, 2010

An ongoing problem for my web scraping work is how much to quote for a job. I prefer fixed fee to hourly rates so I need to consider the complexity upfront. My initial strategy was simply to quote low to ensure I got business and hopefully build up some regular clients.

Through experience I found the following factors most effected the time required for a job:

Read More
Increase your Google App Engine quotas for free

Gae October 27, 2010

Google App Engine provides generous free quotas for your app and additional paid quotas.
I always enable billing for my GAE apps even though I rarely exhaust the free quotas. Enabling billing and setting paid quotas does not mean you have to pay anything and in fact increases what you get for free.

Here is a screenshot of the billing panel:

Read More
Extracting article summaries

Information retrieval October 06, 2010

I made my own version of this technique to extract article summaries.
Source code can be found here.

The idea is simple - extract the biggest text block - but performs well.
Here are some test results:

http://www.nytimes.com/2010/03/23/technology/23google.html?_r=1

Read More
Image efficiencies

Image September 14, 2010

I needed to store a large quantities of images so took the following measurements:

Read More

← Newer posts 5 of 8 Older posts →