Blog

  • Threading with webkit

    Javascript Webkit Qt Python Example Concurrent Efficiency

    In a previous post I showed how to scrape a list of webpages. That is fine for small crawls but will take too long otherwise. Here is an updated example that downloads the content in multiple threads.
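
    As a rough illustration of the idea, here is a minimal sketch of downloading a list of URLs with a fixed pool of worker threads. It is not the webkit example from the post: it uses plain `urllib` from the standard library, so it will not execute any JavaScript. The names `fetch`, `download_all`, and the `fetch_func` parameter are made up for this sketch.

```python
import queue
import threading
import urllib.request


def fetch(url, timeout=10):
    """Download a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, num_threads=4, fetch_func=fetch):
    """Download every URL using a fixed number of worker threads.

    Returns a dict mapping each URL to its content, or to the
    exception that was raised while fetching it.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no URLs left, so this thread can exit
            try:
                content = fetch_func(url)
            except Exception as error:
                content = error  # record the failure instead of crashing
            with lock:
                results[url] = content

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

    Because the time is mostly spent waiting on the network, even a handful of threads can cut the total crawl time substantially. Passing a different `fetch_func` lets the same pool drive another downloader.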

  • Scraping multiple JavaScript webpages with webkit

    Javascript Webkit Qt Python Example

    I made an earlier post about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:

  • I love AJAX!

    Ajax Javascript

    AJAX is a JavaScript technique that allows a webpage to request URLs from its backend server and then make use of the returned data. For example, Gmail uses AJAX to load new messages. The old way to do this was reloading the webpage and then embedding the new content in the HTML, which was inefficient because it required downloading the entire webpage again rather than just the updated data.
    AJAX is good for developers because it makes more complex web applications possible. It is good for users because it gives them a faster and smoother browsing experience. And it is good for me because AJAX-powered websites are often easier to scrape.

    The trouble with scraping websites is that they obscure the data I am after within a layer of HTML presentation. However, AJAX calls typically return just the data in an easy-to-parse format like JSON or XML. So effectively they provide an API to the backend database.

    These AJAX calls can be monitored through tools such as Firebug to see what URLs are called and what they return from the server. Then I can call these URLs directly myself from outside the application and change the query parameters to fetch other records.
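
    To make this concrete, here is a small sketch of calling such a URL directly and changing its query parameters. The endpoint, the parameter names, and the `records` key in the JSON response are all hypothetical; a real site will use whatever the monitoring tool shows.

```python
import json
import urllib.parse
import urllib.request


def build_url(base_url, **params):
    """Append query parameters to the URL of the AJAX endpoint."""
    return base_url + '?' + urllib.parse.urlencode(params)


def parse_records(body):
    """Parse a JSON response body and return its list of records.

    Assumes the response looks like {"records": [...]}, which is a
    made-up layout for this example.
    """
    return json.loads(body)['records']


def fetch_records(base_url, **params):
    """Call the endpoint directly and return the parsed records."""
    with urllib.request.urlopen(build_url(base_url, **params)) as response:
        return parse_records(response.read().decode('utf-8'))
```

    Fetching other records is then a matter of looping over parameter values, for example calling `fetch_records('http://example.com/api/search', page=n)` for each page number `n`.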

  • Scraping JavaScript webpages with webkit

    Javascript Webkit Qt Python

    In the previous post I covered how to tackle JavaScript based websites with Chickenfoot. Chickenfoot is great but not perfect because it:

  • Scraping JavaScript based web pages with Chickenfoot

    Javascript Chickenfoot

    The data from most webpages can be scraped by simply downloading the HTML and then parsing out the desired content. However, some webpages load their content dynamically with JavaScript after the page loads, so the desired data is not found in the original HTML. This is usually done for legitimate reasons such as loading the page faster, but in some cases it is designed solely to inhibit scrapers.
    This can make scraping a little tougher, but not impossible.

    The easiest case is where the content is stored in JavaScript structures which are then inserted into the DOM at page load. This means the content is still embedded in the downloaded HTML, but it has to be scraped from the JavaScript code rather than from the HTML tags.
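
    A minimal sketch of this case, assuming the page assigns a JSON-compatible literal to a variable with `var name = ...;`. The function name `extract_js_data` is made up for this example. The regular expression is deliberately simple: it will break if a string inside the data contains `];` or `};`, and `json.loads` will reject literals that use single quotes or unquoted keys.

```python
import json
import re


def extract_js_data(html, var_name):
    """Return the value assigned to a JavaScript variable in the HTML.

    Looks for `var <var_name> = {...};` or `var <var_name> = [...];`
    and parses the literal as JSON. Returns None if no match is found.
    """
    pattern = r'var\s+%s\s*=\s*(\{.*?\}|\[.*?\])\s*;' % re.escape(var_name)
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))
```

    For example, calling `extract_js_data(html, 'products')` on a page containing `var products = [{"name": "A"}];` returns a Python list of dicts.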

    A trickier case is where websites encode their content in the HTML and then use JavaScript to decode it on page load. It is possible to convert such functions into Python and then run them over the downloaded HTML, but often an easier and quicker alternative is to execute the original JavaScript. One tool that can do this is the Firefox Chickenfoot extension. Chickenfoot consists of a Firefox panel where you can execute arbitrary JavaScript code within a webpage and across multiple webpages. It also comes with a number of high-level functions to make interaction and navigation easier.
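
    For comparison, here is what the first option (porting the decode function to Python) might look like. The scheme is hypothetical: it assumes the site hides text with ROT13 inside `<span class="obf">` elements, and the function name `decode_obfuscated` is made up. A real site's encoding has to be read from its own JavaScript.

```python
import codecs
import re


def decode_obfuscated(html):
    """Return the ROT13-decoded text of every <span class="obf"> element.

    Both the ROT13 scheme and the "obf" class name are assumptions
    made for this example.
    """
    encoded = re.findall(r'<span class="obf">(.*?)</span>', html, re.DOTALL)
    return [codecs.decode(text, 'rot_13') for text in encoded]
```

    This works well when the decoding logic is short, but it has to be rewritten whenever the site changes its scheme, which is why executing the original JavaScript is often less effort.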

    To get a feel for Chickenfoot, here is an example that crawls a website: