Blog
-
Scraping multiple JavaScript webpages with webkit
Javascript Webkit Qt Python Example December 06, 2011
I made an earlier post about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:
-
How to teach yourself web scraping
Learn Python Big picture December 03, 2011
I often get asked how to learn about web scraping. Here is my advice.
First, learn a popular high-level scripting language. A higher-level language will allow you to work and test ideas faster. You don’t need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And learn a popular one so that there is already a community of other people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be good choices.
The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book. Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
-
How to use proxies
Proxies Example November 29, 2011
Proxies can be necessary when web scraping because some websites restrict the number of page downloads from each user. With proxies it looks like your requests come from multiple users so the chance of being blocked is reduced.
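As a rough sketch of the idea, here is one way to rotate requests through a list of proxies. It assumes Python 3's urllib.request (the successor to urllib2); the proxy addresses and the helper names build_proxy_opener, rotating_openers and download are made up for illustration.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses for illustration; replace with proxies you control.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def rotating_openers(proxies):
    """Yield (proxy, opener) pairs forever, cycling through the proxy list."""
    for proxy in itertools.cycle(proxies):
        yield proxy, build_proxy_opener(proxy)


def download(url, opener, timeout=30):
    """Fetch url through the given opener and return the raw response bytes."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to next() on rotating_openers(PROXIES) hands back the next proxy in the list, so consecutive downloads appear to come from different addresses.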
-
How to automatically find contact details
Information retrieval Python Example November 06, 2011
I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.
This wastes my time so I use this snippet to automate extracting the available emails:
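A minimal sketch of this kind of extraction, not necessarily the original snippet: it assumes the page HTML has already been downloaded into a string and uses a deliberately simple regular expression, so it will miss obfuscated addresses. The names EMAIL_RE and extract_emails are illustrative.

```python
import re

# Simple pattern: a local part, "@", then a domain with at least one dot.
# This is an approximation and does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(html):
    """Return the unique email addresses in html, in order of first appearance."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(html):
        email = match.lower()  # treat Info@ and info@ as the same address
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails
```

To cover a whole site you would run extract_emails over each downloaded page, starting with the likely candidates such as the contact and about pages.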
-
Free service to extract article from webpage
Information retrieval October 11, 2011
In a previous post I showed a tool for automatically extracting article summaries. Recently I came across a free online service from instapaper.com that does an even better job.
Here is one of my blog articles:
-
Webpage screenshots with webkit
Webkit Qt Python Screenshot Example September 20, 2011
For a recent project I needed to render screenshots of webpages. Here is my solution using webkit:
-
Is it possible to extract data from any website?
Big picture August 10, 2011
I am often asked whether I can extract data from a particular website.
-
User agents
User-agent July 20, 2011
Your web browser will send what is known as a “User Agent” for every page you access. This is a string to tell the server what kind of device you are accessing the page with. Here are some common User Agent strings:
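As a rough sketch of how a scraper controls this header, assuming Python 3's urllib.request: by default urllib identifies itself as Python-urllib/x.y, and the header can be overridden per request. The agent string and the helper names build_request and download are placeholders for illustration.

```python
import urllib.request

# Placeholder agent string for illustration, not a real browser's value.
USER_AGENT = "ExampleScraper/1.0 (+http://example.com/bot)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request for url that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, user_agent=USER_AGENT, timeout=30):
    """Fetch url with the given User-Agent and return the raw response bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Passing a different user_agent to download lets you check how a server responds to different kinds of clients.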
-
Taking advantage of mobile interfaces
Sometimes a website will have multiple versions: one for regular users with a modern browser, an HTML version for browsers that don’t support JavaScript, and a simplified version for mobile users.
For example Gmail has:
-
Parsing Flash with Swiffy
Flash June 30, 2011
Google has released a tool called Swiffy for parsing Flash files into HTML5. This is relevant to web scraping because content embedded in Flash is a pain to extract, as I wrote about earlier.
I tried some test files and found the results no more useful for parsing text content than the output produced by swf2html (Linux version). Some neat example conversions are available here. Currently Swiffy supports ActionScript 2.0 and works best with Flash 5, which was released back in 2000, so there is still a lot of work to do.