Posted 10 Aug 2011 in big picture

I am often asked whether I can extract data from a particular website.

And the answer is always yes - if the data is publicly available then it can be extracted. The majority of websites are straightforward to scrape; however, some are more difficult and may not be practical to scrape if you have time or budget restrictions.

For example if the website restricts how many pages each IP address can access then it could take months to download the entire website. In that case I can use proxies to provide me with multiple IP addresses and download the data faster, but this can get expensive if many proxies are required.
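
For example, a minimal sketch of how requests could be rotated across a set of proxies with urllib2 (the proxy addresses here are just placeholders):

    import random
    import urllib2

    # placeholder proxy addresses - these would be replaced with real proxies
    proxies = ['1.2.3.4:8000', '5.6.7.8:8000', '9.10.11.12:8000']

    def download(url):
        # pick a proxy at random so requests are spread across multiple IP addresses
        opener = urllib2.build_opener(urllib2.ProxyHandler({'http': random.choice(proxies)}))
        return opener.open(url).read()

    html = download('http://example.com/page1')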

If the website uses JavaScript and AJAX to load its data then I usually use a tool like Firebug to reverse engineer how the website works, and then call the appropriate AJAX URLs directly. And if the JavaScript is obfuscated or particularly complicated I can use a browser renderer like WebKit to execute the JavaScript and provide me with the final HTML.
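
For example, once the relevant AJAX URL has been found with Firebug it can be requested directly and the JSON response parsed, skipping the rendered page entirely. A rough sketch with a made-up endpoint and parameters:

    import json
    import urllib2

    # made-up AJAX endpoint, just for illustration
    url = 'http://example.com/ajax/search.json?q=test&page=1'
    request = urllib2.Request(url, headers={'X-Requested-With': 'XMLHttpRequest'})
    data = json.loads(urllib2.urlopen(request).read())
    for record in data.get('results', []):
        print record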

Another difficulty is if the website uses CAPTCHAs or stores its data in images. Then I would need to try parsing the images with OCR, or hire people (with cheaper hourly costs) to manually interpret the images.
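
For straightforward images (rather than CAPTCHAs designed specifically to defeat OCR), a first attempt could be running the tesseract OCR tool over each downloaded image. A minimal sketch, assuming tesseract is installed:

    import subprocess

    def ocr(image_filename):
        # tesseract writes its result to <output base>.txt
        subprocess.check_call(['tesseract', image_filename, 'result'])
        return open('result.txt').read()

    # 'price.png' is a placeholder image filename
    print ocr('price.png')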

In summary I can always extract publicly available data from a website, but the time and cost required will vary.


Posted 20 Jul 2011 in user-agent

Your web browser will send what is known as a “User Agent” for every page you access. This is a string to tell the server what kind of device you are accessing the page with. Here are some common User Agent strings:

  • Firefox on Windows XP: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
  • Chrome on Linux: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
  • Internet Explorer on Windows XP: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
  • Opera on Windows XP: Opera/9.00 (Windows NT 5.1; U; en)
  • Android: Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3
  • iPhone: Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3
  • BlackBerry: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, Like Gecko) Version/6.0.0.141 Mobile Safari/534.1+
  • Python urllib: Python-urllib/2.1
  • Old Google Bot: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
  • New Google Bot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • MSN Bot: msnbot/1.1 (+http://search.msn.com/msnbot.htm)
  • Yahoo Bot: Yahoo! Slurp/Site Explorer

You can find your own current User Agent by searching online for “what is my user agent”.

Some webpages will use the User Agent to display content that is customized to your particular browser. For example if your User Agent indicates you are using an old browser then the website may return the plain HTML version without any AJAX features, which may be easier to scrape.

Some websites will automatically block certain User Agents, for example if your User Agent indicates you are accessing their server with a script rather than a regular web browser.

Fortunately it is easy to set your User Agent to whatever you like:

  • For Firefox you can use the User Agent Switcher extension.
  • For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: chromium-browser --user-agent="my custom user agent"
  • For Internet Explorer you can use the UAPick extension.
  • And for Python scripts you can set the User-Agent header with:

    import urllib2

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'my custom user agent')]
    opener.open('http://www.google.com')

Using the default User Agent for your scraper is a common reason to be blocked, so don’t forget to change it.


Posted 05 Jul 2011 in ajax and mobile

Sometimes a website will have multiple versions: one for regular users with a modern browser, an HTML version for browsers that don’t support JavaScript, and a simplified version for mobile users.

For example Gmail has:

  • the standard AJAX interface at gmail.com
  • a basic HTML interface for browsers that don’t support JavaScript
  • a simplified mobile interface

All three of these interfaces will display the content of your emails but use different layouts and features. The main entrance at gmail.com is well known for its use of AJAX to load content dynamically without refreshing the page. This leads to a better user experience but makes web automation or scraping harder.

On the other hand the static HTML interface has fewer features and is less efficient for users, but much easier to automate or scrape because all the content is available when the page loads.

So before scraping a website, check whether it has an HTML or mobile version, which, when they exist, are usually easier to scrape.

To find the HTML version try disabling JavaScript in your browser and see what happens.
To find the mobile version try adding the “m” subdomain (domain.com -> m.domain.com) or using a mobile user-agent.
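
Both checks can also be scripted. A rough sketch with urllib2 (the domain is a placeholder):

    import urllib2

    domain = 'example.com'  # placeholder domain

    # check whether an "m" subdomain exists
    try:
        print urllib2.urlopen('http://m.' + domain).geturl()
    except urllib2.URLError, e:
        print 'No mobile subdomain:', e

    # request the main site with an iPhone User Agent to see if a mobile version is served
    request = urllib2.Request('http://' + domain, headers={
        'User-Agent': 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3'})
    print len(urllib2.urlopen(request).read()), 'bytes returned'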


Posted 30 Jun 2011 in flash

Google has released a tool called Swiffy for converting Flash files into HTML5. This is relevant to web scraping because content embedded in Flash is a pain to extract, as I wrote about earlier.

I tried some test files and found the results no more useful for parsing text content than the output produced by swf2html (Linux version). Some neat example conversions are available on the Swiffy site. Currently Swiffy supports ActionScript 2.0 and works best with Flash 5, which was released back in 2000, so there is still a lot of work to do.


Posted 19 Jun 2011 in gae and google

Most of the discussion about Google App Engine seems to focus on how it allows you to scale your app; however, I find it most useful for small client apps, where we want a reliable platform while avoiding any ongoing hosting fee. For large apps paying for hosting would not be a problem.

These are some of the downsides I have found using Google App Engine:

  • Slow - if your app has not been accessed recently (within the last minute or so) then it can take up to 10 seconds to load for the user
  • Pure Python/Java code only - this prevents using a lot of good libraries, most importantly for me lxml
  • CPU quota easily gets exhausted when uploading data
  • Proxies not supported, which makes apps that rely on external websites risky. For example the Twitter API has a per IP quota which you would be sharing with all other GAE apps.
  • Blocked in some countries, such as Turkey
  • Indexes - the free quota is 1 GB but often over half of this is taken up by indexes
  • Maximum 1000 records per query
  • 20 second request limit, so longer jobs often need the overhead of using Task Queues (see the sketch after this list)
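
For example, a minimal sketch of pushing slow work onto a Task Queue (the worker URL and parameters are made up for illustration):

    from google.appengine.api import taskqueue

    # pages that would take too long to download within a single request
    urls = ['http://example.com/page%d' % i for i in range(100)]

    # queue one task per page; each task is handled by a separate /worker
    # request with its own deadline
    for url in urls:
        taskqueue.add(url='/worker', params={'url': url})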

Despite these problems I still find Google App Engine a fantastic platform and a pleasure to develop on.


Posted 29 May 2011 in cache, crawling, and google

I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage so it is helpful to have backup options.

One option is using Google Translate, which lets you translate a webpage into another language. If the source language is selected as something you know it is not (e.g. Dutch) then no translation will take place and you will just get back the original content.

I added a function to download a URL via Google Translate and Google Cache to the webscraping library. Here is an example:

from webscraping import download, xpath  
  
D = download.Download()  
url = 'http://webscraping.com/faq'  
html1 = D.get(url) # download directly  
html2 = D.gcache_get(url) # download via Google Cache  
html3 = D.gtrans_get(url) # download via Google Translate  
for html in (html1, html2, html3):  
    print xpath.get(html, '//title')

This example downloads the same webpage directly, via Google Cache, and via Google Translate. Then it parses the title to show the same webpage has been downloaded. The output when run is:

Frequently asked questions | webscraping
Frequently asked questions | webscraping
Frequently asked questions | webscraping

The same title was extracted from each source, which shows that the correct result was downloaded from Google Cache and Google Translate.