WebScraping.com Logo

Blog

  • Taking advantage of mobile interfaces

    Mobile Ajax

    Sometimes a website will have multiple versions: one for regular users with a modern browser, a HTML version for browsers that don’t support JavaScript, and a simplified version for mobile users.

    For example Gmail has:

  • Scraping Flash based websites

    Flash Ajax

    Flash is a pain. It is flaky on Linux and can not be scraped like HTML because it uses a binary format. HTML5 and Apple’s criticism of Flash are good news for me because they encourage developers to use non-Flash solutions.

    The current reality though is that many sites currently use Flash to display content that I need to access. Here are some approaches for scraping Flash that I have tried:

  • I love AJAX!

    Ajax Javascript

    AJAX is a JavaScript technique that allows a webpage to request URLs from its backend server and then make use of the returned data. For example gmail uses AJAX to load new messages. The old way to do this was reloading the webpage and then embedding the new content in the HTML, which was inefficient because it required downloading the entire webpage again rather that just the updated data.
    AJAX is good for developers because it makes more complex web applications possible. It is good for users because it gives them a faster and smoother browsing experience. And it is good for me because AJAX powered websites are often easier to scrape.

    The trouble with scraping websites is they obscure the data I am after within a layer of HTML presentation. However AJAX calls typically return just the data in an easy to parse format like JSON or XML. So effectively they provide an API to their backend database.

    These AJAX calls can be monitored through tools such as Firebug to see what URLs are called and what they return from the server. Then I can call these URLs directly myself from outside the application and change the query parameters to fetch other records.