Blog

  • I love AJAX!

    Ajax Javascript

    AJAX is a JavaScript technique that allows a webpage to request URLs from its backend server and then make use of the returned data. For example, Gmail uses AJAX to load new messages. The old way to do this was to reload the webpage and embed the new content in the HTML, which was inefficient because it required downloading the entire webpage again rather than just the updated data.
    AJAX is good for developers because it makes more complex web applications possible. It is good for users because it gives them a faster and smoother browsing experience. And it is good for me because AJAX-powered websites are often easier to scrape.

    The trouble with scraping websites is that they obscure the data I am after within a layer of HTML presentation. However, AJAX calls typically return just the data in an easy-to-parse format like JSON or XML, so effectively they provide an API to their backend database.

    These AJAX calls can be monitored through tools such as Firebug to see what URLs are called and what they return from the server. Then I can call these URLs directly myself from outside the application and change the query parameters to fetch other records.
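    As a rough sketch of what that looks like in code (the endpoint URL and query parameters below are hypothetical placeholders, not from any real site), once the AJAX URL is known it can be fetched and parsed directly:

    # Sketch: call a discovered AJAX endpoint directly and parse the JSON reply.
    # The URL and query parameters are hypothetical placeholders.
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({"q": "web scraping", "page": 2})
    url = "http://example.com/ajax/search?" + params  # placeholder endpoint
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode("utf-8"))
    print(data)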

  • Scraping JavaScript webpages with webkit

    Javascript Webkit Qt Python

    In the previous post I covered how to tackle JavaScript-based websites with Chickenfoot. Chickenfoot is great but not perfect because it:

  • Scraping JavaScript based web pages with Chickenfoot

    Javascript Chickenfoot

    The data from most webpages can be scraped by simply downloading the HTML and then parsing out the desired content. However, some webpages load their content dynamically with JavaScript after the page loads, so the desired data is not found in the original HTML. This is usually done for legitimate reasons such as making the page load faster, but in some cases it is designed solely to inhibit scrapers.
    This can make scraping a little tougher, but not impossible.

    The easiest case is where the content is stored in JavaScript structures which are then inserted into the DOM at page load. This means the content is still embedded in the HTML, but it needs to be scraped from the JavaScript code rather than from the HTML tags.
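    For this case, a minimal sketch (the variable name "records" and the HTML snippet are my own invention for illustration) is to pull the JavaScript structure out of the downloaded HTML and parse it as JSON:

    # Sketch: extract a JSON-like structure assigned to a JavaScript variable
    # inside the downloaded HTML. The variable name "records" is hypothetical.
    import json
    import re

    html = '<script>var records = [{"name": "First item"}, {"name": "Second item"}];</script>'
    match = re.search(r"var records\s*=\s*(\[.*?\]);", html, re.DOTALL)
    if match:
        records = json.loads(match.group(1))
        print([r["name"] for r in records])  # ['First item', 'Second item']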

    A trickier case is where websites encode their content in the HTML and then use JavaScript to decode it on page load. It is possible to convert such functions into Python and run them over the downloaded HTML, but often an easier and quicker alternative is to execute the original JavaScript. One tool for doing this is the Firefox Chickenfoot extension. Chickenfoot consists of a Firefox panel where you can execute arbitrary JavaScript code within a webpage and across multiple webpages. It also comes with a number of high-level functions to make interaction and navigation easier.

    To get a feel for Chickenfoot here is an example to crawl a website:

  • How to crawl websites without being blocked

    User-agent Crawling Proxies

    Websites want users who will purchase their products and click on their advertising. They want to be crawled by search engines so their users can find them; however, they don’t (generally) want to be crawled by anyone else. One such company is Google, ironically.

    Some websites will actively try to stop scrapers, so here are some suggestions to help you crawl beneath their radar.
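    To give a concrete flavour of these suggestions, here is a hedged sketch of a politer crawler; the User-Agent string, proxy address, URLs, and delay are placeholders of my own choosing rather than values from the post:

    # Sketch of a politer crawler: send a realistic User-Agent, pause between
    # requests, and optionally route traffic through a proxy.
    # All addresses and values below are placeholders.
    import time
    import urllib.request

    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": "http://127.0.0.1:8118"})  # placeholder proxy
    )
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; ExampleBot)")]

    for page in range(1, 4):
        url = "http://example.com/page/%d" % page  # placeholder URLs
        html = opener.open(url).read()
        time.sleep(5)  # crawl slowly to stay beneath the radar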

  • How to protect your data

    Ip Ocr Captcha Google

    You spent time and money collecting the data on your website, so you want to prevent someone else from downloading and reusing it. However you still want Google to index your website so that people can find you. This is a common problem. Below I will outline some strategies to protect your data.

  • Why Python

    Python

    Sometimes people ask why I use Python instead of something faster like C/C++. For me the speed of a language is a low priority because in my work the overwhelming majority of execution time is spent waiting for data to download rather than waiting for instructions to execute. So it makes sense to use whichever language I can write good code in fastest, which is currently Python because of its high-level syntax and excellent ecosystem of libraries. ESR wrote an article on why he likes Python that I expect resonates with many.

  • The sitescraper module

    Sitescraper Opensource

    As a student I was fortunate to have the opportunity to learn about web scraping under the guidance of Professor Timothy Baldwin. Frustration with a previous project led me to build a tool to make scraping web pages easier.

  • Web scraping with regular expressions

    Regex Python

    Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let’s say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions:
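    A minimal sketch of such a comparison, assuming the requests, beautifulsoup4, and lxml packages are installed (the target URL is just a placeholder):

    # Three ways to pull the <title> out of a downloaded page.
    # Assumes the requests, beautifulsoup4, and lxml packages are installed.
    import re

    import lxml.html
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com").text  # placeholder URL

    # BeautifulSoup: parse the document and read the <title> node
    soup_title = BeautifulSoup(html, "html.parser").title.string

    # lxml: parse and query with an XPath expression
    lxml_title = lxml.html.fromstring(html).xpath("//title")[0].text_content()

    # Regular expression: quick and dirty, fine for a one-off scrape
    match = re.search(r"<title.*?>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    regex_title = match.group(1).strip() if match else None

    print(soup_title, lxml_title, regex_title)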

  • How to use XPaths robustly

    Xpath

    In an earlier post I referred to XPaths but did not explain how to use them.

    Say we have the following HTML document:

    <html>  
     <body>
      <div></div>  
      <div id="content">  
       <ul>  
        <li>First item</li>  
        <li>Second item</li>  
       </ul>  
      </div>  
     </body>  
    </html>

    To access the list elements we follow the HTML structure from the root tag down to the li’s:

    html > body > 2nd div > ul > many li's.

    An XPath to represent this traversal is:

    /html[1]/body[1]/div[2]/ul[1]/li

    If a tag has no index then every tag of that type will be selected:

    /html/body/div/ul/li

    XPaths can also use attributes to select nodes:

    /html/body/div[@id="content"]/ul/li 

    And instead of specifying an absolute path from the root, a double slash lets the XPath match the node wherever it appears in the document:

    //div[@id="content"]/ul/li
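    To show these expressions in action, here is a short sketch using lxml (the choice of lxml here is my own assumption; any XPath-capable library would do):

    # Applying the XPaths above to the example document with lxml.
    # (lxml is my choice here; the post only discusses the expressions.)
    import lxml.html

    doc = lxml.html.fromstring("""
    <html>
     <body>
      <div></div>
      <div id="content">
       <ul>
        <li>First item</li>
        <li>Second item</li>
       </ul>
      </div>
     </body>
    </html>
    """)

    # Absolute XPath with an index to pick the second div
    print([li.text for li in doc.xpath("/html/body/div[2]/ul/li")])

    # Relative XPath selecting the div by its id attribute
    print([li.text for li in doc.xpath('//div[@id="content"]/ul/li')])
    # both print ['First item', 'Second item']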

  • Parsing HTML with Python

    Lxml Python Html

    HTML has a tree structure: at the root is the <html> tag, followed by the <head> and <body> tags, and then more tags before the content itself. However, when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.

    Unfortunately the HTML of many webpages around the internet is invalid - for example, a list may be missing its closing tags:
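    As an illustration (the snippet and the code here are my own, though lxml is the parser this post covers), such broken markup can still be parsed and repaired:

    # lxml.html tolerates broken markup, e.g. <li> tags with no closing tags.
    # The snippet below is an illustrative example, not the post's own.
    import lxml.html

    broken = "<ul><li>First item<li>Second item</ul>"
    tree = lxml.html.fromstring(broken)
    print(lxml.html.tostring(tree))
    # the parser fills in the missing </li> tags:
    # b'<ul><li>First item</li><li>Second item</li></ul>'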