In a previous post I mentioned that web2py is my weapon of choice for building web applications. Before web2py I had learnt a variety of approaches to building dynamic websites (raw PHP, Python CGI, Turbogears, Symfony, Rails, Django), but I find myself most productive with web2py.
This is because web2py:
- uses a pure Python templating system without restrictions - “we’re all consenting adults here”
- supports database migrations
- has automatic form generation and validation with SQLFORM
- runs on Google App Engine without modification
- has a highly active and friendly user forum
- is developed rapidly - feature requests are often written and committed to trunk within the day
- supports multiple apps for a single install
- lets me develop apps through the browser admin
- commits to backward compatibility
- has no configuration files or dependencies - works out of the box
- has sensible defaults for view templates, imported modules, etc
However, web2py also has some drawbacks:
- it is highly dependent on Massimo (the project leader)
- the name web2py is unattractive compared to rails, pylons, web.py, etc
- it has few designers, so the example applications look crude
- the documentation is inconsistent and scattered
[online book now available here!]
In the previous post I covered three alternative approaches to regularly scraping a website for a client, the most common being a web application. However, hosting the web application on either my own or the client's server has problems.
My solution is to host the application on a neutral third party platform - Google App Engine (GAE). Here is my overview of deploying on GAE:
- provides a stable and consistent platform that I can use for multiple applications
- both the customer and I can login and manage it, so we do not need to expose our servers
- has generous free quotas, which I rarely exhaust
However, GAE also has drawbacks:
- only supports pure Python (or Java), so libraries that rely on C such as lxml are not supported (yet)
- imposes limits on maximum job time and on interacting with the database
- means trusting Google with storing our scraped data
Often deploying on GAE works well for both the client and me, but it is not always practical/possible. I am still looking for a silver bullet!
Usually my clients ask for a website to be scraped into a standard format like CSV, which they can then integrate with their existing applications. However, sometimes a client needs a website scraped periodically because its data is continually updated. An example of the first use case is census statistics, and of the second stock prices.
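As a rough sketch of what delivering scraped data as CSV can look like, here is a minimal example using Python's standard csv module. The field names and values are made-up placeholders, not real scraped data, and `write_csv` is a hypothetical helper name.

```python
import csv
import io


def write_csv(rows, fieldnames, output):
    """Write a list of dicts to a file-like object as CSV with a header row."""
    writer = csv.DictWriter(output, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder records standing in for whatever the scraper extracted.
rows = [
    {"region": "North", "population": 1200},
    {"region": "South", "population": 3400},
]

# An in-memory buffer keeps the example self-contained;
# in practice this would be open("output.csv", "w", newline="").
buffer = io.StringIO()
write_csv(rows, ["region", "population"], buffer)
print(buffer.getvalue())
```

Using `DictWriter` keeps the column order fixed even if the scraper builds each record in a different order.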
I have three solutions for periodically scraping a website:
- I provide the client with my web scraping code, which they can then execute regularly
- Client pays me a small fee in future whenever they want the data rescraped
- I build a web application that scrapes regularly and provides the data in a useful form
The first option is not always practical if the client does not have a technical background. Additionally my solutions are developed and tested on Linux and may not work on Windows.
The second option is generally not attractive to the client because it puts them in a weak position where they are dependent on me being contactable and cooperative in future.
Also it involves ongoing costs for them.
So usually I end up building a basic web application that consists of a CRON job to do the scraping, an interface to the scraped data, and some administration settings. If the scraping jobs are not too big I am happy to host the application on my own server, however most clients prefer the security of hosting it on their own server in case the app breaks down.
Unfortunately I find hosting on their server does not work well because the client will often have different versions of libraries or use a platform I am not familiar with. Additionally I prefer to build my web applications in Python (using web2py), and though Python is great for development it cannot compare to PHP for ease of deployment. I can usually figure this all out but it takes time and also trust from the client to give me root privilege on their server. And given that these web applications are generally low cost (~ $1000) the ease of deployment is important.
All this is far from ideal. The solution? - see the next post.
Flash is a pain. It is flaky on Linux and cannot be scraped like HTML because it uses a binary format. HTML5 and Apple’s criticism of Flash are good news for me because they encourage developers to use non-Flash solutions.
The reality, though, is that many sites still use Flash to display content that I need to access. Here are some approaches for scraping Flash that I have tried:
- Check for AJAX requests that may carry the data I am after between the Flash app and server
- Extract text with the Macromedia Flash Search Engine SDK
- Use OCR to extract the text directly
Many Flash apps are self-contained and do not use AJAX requests to load their data, which means I cannot rely on the first approach. And I have had poor results with the other two.
Still no silver bullet…
AJAX is good for developers because it makes more complex web applications possible. It is good for users because it gives them a faster and smoother browsing experience. And it is good for me because AJAX powered websites are often easier to scrape.
The trouble with scraping websites is that they obscure the data I am after within a layer of HTML presentation. However, AJAX calls typically return just the data in an easy-to-parse format like JSON or XML. So effectively they provide an API to the backend database.
These AJAX calls can be monitored through tools such as Firebug to see what URLs are called and what they return from the server. Then I can call these URLs directly myself from outside the application and change the query parameters to fetch other records.
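Here is a minimal sketch of that idea using only the standard library. The URL, the `page` parameter, and the `"results"` key are all assumptions made up for illustration; a real site will have its own URL pattern and response layout, which is what Firebug reveals.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def with_params(url, **changes):
    """Return url with the given query parameters added or overwritten."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update({key: str(value) for key, value in changes.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


def parse_records(body):
    """Decode a JSON response body and return its list of records."""
    data = json.loads(body)
    # Assumption: records sit under a "results" key; adjust for the real site.
    return data.get("results", [])


def fetch_records(url):
    """Download one AJAX URL and return the decoded records."""
    with urlopen(url) as response:
        return parse_records(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical URL as it might be observed in the browser's network panel.
    seen = "http://example.com/ajax/search?category=books&page=1"
    for page in range(1, 4):
        # Only prints the URLs; call fetch_records() on them against a real site.
        print(with_params(seen, page=page))
```

Keeping the URL rewriting separate from the download makes it easy to check the generated URLs before sending any requests.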
- is slow because I have to wait for Firefox to render the entire webpage
- is somewhat buggy and has a small user/developer community, mostly at MIT
An alternative solution that addresses all these points is WebKit, the open source browser engine used most famously in Apple’s Safari browser. WebKit has now been ported to the Qt framework and can be used through its Python bindings.
I can then analyze this resulting HTML with my standard Python tools like the webscraping module.
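To illustrate that last step, here is a small sketch that pulls table rows out of an HTML string. It uses the standard library's `html.parser` as a stand-in rather than the webscraping module, and the sample markup is invented; `CellCollector` and `extract_rows` are hypothetical names.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None outside a row
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Append stripped text to the current cell; text outside cells is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented markup standing in for HTML produced by the rendered page.
sample = "<table><tr><th>Name</th><th>Price</th></tr><tr><td>ACME</td><td>12.5</td></tr></table>"
print(extract_rows(sample))
```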