Is it possible to extract data from any website?
Posted 10 Aug 2011 in big picture

I am often asked whether I can extract data from a particular website.

The answer is always yes: if the data is publicly available then it can be extracted. The majority of websites are straightforward to scrape, but some are more difficult and may not be practical to scrape if you have time or budget restrictions.

For example, if the website restricts how many pages each IP address can access, then it could take months to download the entire website. In that case I can use proxies to provide me with multiple IP addresses and download the data faster, but this can get expensive if many proxies are required.
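To make the trade-off concrete, here is a rough sketch of the arithmetic together with a simple round-robin proxy rotation. The page counts, the per-IP limit, and the proxy addresses are all made-up numbers for illustration, not figures from a real website.

```python
import itertools
import math


def estimate_crawl_days(total_pages, pages_per_ip_per_day, num_ips=1):
    """Days needed to download total_pages when each IP is capped per day."""
    daily_capacity = pages_per_ip_per_day * num_ips
    return math.ceil(total_pages / daily_capacity)


# Hypothetical website: 1,000,000 pages, 5,000 pages allowed per IP per day.
print(estimate_crawl_days(1_000_000, 5_000))      # 200 days from a single IP
print(estimate_crawl_days(1_000_000, 5_000, 20))  # 10 days with 20 proxies

# Placeholder proxy addresses; a real crawl would use proxies you rent.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
proxy_pool = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy in round-robin order, so load is spread evenly."""
    return next(proxy_pool)
```

Each download would call `next_proxy()` to pick the address for that request. The estimate assumes the limit is enforced strictly per IP address and that every proxy works all day, which is optimistic in practice.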

If the website uses JavaScript and AJAX to load its data then I usually use a tool like Firebug to reverse engineer how the website works, and then call the appropriate AJAX URLs directly. And if the JavaScript is obfuscated or particularly complicated I can use a browser renderer like WebKit to execute the JavaScript and provide me with the final HTML.
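Calling an AJAX URL directly might look something like the sketch below. The endpoint, the `q` and `page` parameters, and the shape of the JSON response are all hypothetical here; in practice they are whatever you observe in the browser's network panel for the website in question.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, standing in for one found by watching the
# requests the page makes in the browser.
AJAX_ENDPOINT = "https://example.com/ajax/search"


def build_ajax_request(query, page):
    """Build the same request the website's own JavaScript would send."""
    params = urllib.parse.urlencode({"q": query, "page": page})
    return urllib.request.Request(
        f"{AJAX_ENDPOINT}?{params}",
        headers={
            # Some servers check this header before answering AJAX calls.
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json",
        },
    )


def parse_results(body):
    """Extract (name, price) pairs from the assumed JSON structure."""
    data = json.loads(body)
    return [(item["name"], item["price"]) for item in data["results"]]


def fetch_results(query, page):
    """Download one page of results and parse it."""
    request = build_ajax_request(query, page)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_results(response.read())


# Parsing a sample response of the assumed shape, without touching the network.
sample = b'{"results": [{"name": "Widget", "price": 9.99}]}'
print(parse_results(sample))  # [('Widget', 9.99)]
```

The advantage of this approach is that the response is usually structured data already, so there is no HTML to parse and no JavaScript to execute.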

Another difficulty is if the website uses CAPTCHAs or stores its data in images. Then I would need to try parsing the images with OCR, or hire people (at a low hourly rate) to manually interpret the images.

In summary I can always extract publicly available data from a website, but the time and cost required will vary.
