When crawling large websites I store the HTML in a local cache, so that if I need to rescrape the website later I can load the pages quickly and avoid placing extra load on their server. This is often necessary when a client realizes they require additional features included in the scraped output.
I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 are built in to Python (2.5+), so there are no external dependencies.
Here is some example usage of pdict:
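The following is a minimal sketch of the idea - a dictionary-like wrapper around sqlite that pickles and zlib-compresses values on the way in and reverses the process on the way out. The class and method details here are illustrative, not the actual pdict API:

```python
# Sketch of a dictionary-like cache backed by sqlite with zlib
# compression, in the spirit of pdict (the real API may differ).
import pickle
import sqlite3
import zlib

class PersistentDict:
    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')

    def __setitem__(self, key, value):
        # compress the pickled value before writing to disk
        data = zlib.compress(pickle.dumps(value))
        self.conn.execute('REPLACE INTO cache (key, value) VALUES (?, ?)',
                          (key, sqlite3.Binary(data)))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            'SELECT value FROM cache WHERE key=?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        # decompress after reading
        return pickle.loads(zlib.decompress(row[0]))

    def __contains__(self, key):
        return self.conn.execute(
            'SELECT 1 FROM cache WHERE key=?', (key,)).fetchone() is not None

# usage: cache downloaded HTML between scraping runs
cache = PersistentDict(':memory:')  # or a file path such as 'cache.db'
cache['http://example.com'] = '<html>...</html>'
print(cache['http://example.com'])
```

Because the compression happens transparently inside the dictionary interface, the scraping code itself does not need to care whether a page came from the network or from the cache.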
I prefer to quote per project rather than per hour for my web scraping work because it:
- gives me incentive to increase my efficiency (by improving my infrastructure)
- gives the client security about the total cost
- avoids distrust about the number of hours actually worked
- makes me look more competitive compared to the hourly rates available in Asia and Eastern Europe
- avoids the difficulty of tracking time fairly when working on two or more projects simultaneously
- makes estimation feasible, since the complexity of a scraping job is easy to judge from past experience (at least compared to building websites)
- involves less administration
For most scraping jobs I use the same general approach of crawling, selecting the appropriate nodes, and then saving the results. Consequently I reuse a lot of code across projects, which I have now combined into a library. Most of this infrastructure is open sourced and available on Google Code.
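The crawl / select / save workflow can be sketched in a few small functions. This is not the actual library code - the function names and the regex-based selection are purely illustrative (a real scraper would use a proper HTML parser and add caching and retries):

```python
# A minimal sketch of the crawl -> select -> save workflow.
import csv
import re
import urllib.request

def download(url):
    """Fetch a page's HTML (a real crawler would cache and retry)."""
    return urllib.request.urlopen(url).read().decode('utf-8', 'replace')

def select(html):
    """Extract the nodes of interest - here, link href/text pairs."""
    return re.findall(r'<a\s+href="([^"]+)"[^>]*>([^<]+)</a>', html)

def save(rows, filename):
    """Write the scraped results to CSV for the client."""
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'text'])
        writer.writerows(rows)

# demonstrate on a local HTML fragment rather than a live site
html = '<a href="/page1">First</a> <a href="/page2">Second</a>'
save(select(html), 'results.csv')
```

Most projects only differ in the `select` step, which is why so much of the surrounding code can be shared.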
The code in that repository is licensed under the LGPL, which means you are free to use it in your own applications (including commercial ones) but are obliged to release any changes you make to the library itself. This is different from the more popular GPL license, which would make the library unusable in most commercial projects. And it also differs from BSD and WTFPL style licenses, which let people do whatever they want with the library, including making changes and not releasing them.
I think the LGPL is a good balance for libraries because it lets anyone use the code while everyone can benefit from improvements made by individual users.
In a previous post I mentioned that web2py is my weapon of choice for building web applications. Before web2py I had learnt a variety of approaches to building dynamic websites (raw PHP, Python CGI, Turbogears, Symfony, Rails, Django), but I find myself most productive with web2py.
This is because web2py:
- uses a pure Python templating system without restrictions - “we’re all consenting adults here”
- supports database migrations
- has automatic form generation and validation with SQLFORM
- runs on Google App Engine without modification
- has a highly active and friendly user forum
- develops rapidly - feature requests are often implemented and committed to trunk within the day
- supports multiple apps for a single install
- lets you develop apps through the browser admin interface
- commits to backward compatibility
- has no configuration files or dependencies - works out of the box
- has sensible defaults for view templates, imported modules, etc
However web2py also has some downsides:
- it is highly dependent on Massimo (the project leader)
- the name web2py is unattractive compared to rails, pylons, web.py, etc
- few designers, so the example applications look crude
- inconsistent, scattered documentation
[online book now available here!]
In the previous post I covered three alternative approaches to regularly scraping a website for a client, the most common being a web application. However hosting the web application on either my own or the client's server has problems.
My solution is to host the application on a neutral third party platform - Google App Engine (GAE). Here is my overview of deploying on GAE:
Pros:
- provides a stable and consistent platform that I can use for multiple applications
- both the customer and I can log in and manage it, so neither of us needs to expose our own server
- has generous free quotas, which I rarely exhaust

Cons:
- only supports pure Python (or Java), so libraries that rely on C, such as lxml, are not supported (yet)
- imposes limitations on maximum job time and on interacting with the database
- requires trusting Google with storing our scraped data
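Deployment itself is driven by a small app.yaml file uploaded alongside the code. A minimal sketch for a pure Python app of that era (the application name and script path are illustrative):

```yaml
# app.yaml sketch for a pure Python App Engine app (values illustrative)
application: scraper-demo
version: 1
runtime: python
api_version: 1

handlers:
- url: /.*
  script: main.py
```

One app.yaml per application keeps each client's deployment isolated under its own application ID.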
Often deploying on GAE works well for both the client and me, but it is not always practical/possible. I am still looking for a silver bullet!
Usually my clients ask for a website to be scraped into a standard format like CSV, which they can then integrate with their existing applications. However sometimes a client needs a website scraped periodically because its data is continually updated. An example of the first use case is census statistics, and of the second, stock prices.
I have three solutions for periodically scraping a website:
- I provide the client with my web scraping code, which they can then execute regularly
- the client pays me a small fee whenever they want the data rescraped in future
- I build a web application that scrapes regularly and provides the data in a useful form
The first option is not always practical if the client does not have a technical background. Additionally my solutions are developed and tested on Linux and may not work on Windows.
The second option is generally not attractive to the client because it puts them in a weak position, dependent on me being contactable and cooperative in future. It also involves ongoing costs for them.
So usually I end up building a basic web application that consists of a CRON job to do the scraping, an interface to the scraped data, and some administration settings. If the scraping jobs are not too big I am happy to host the application on my own server; however most clients prefer the security of hosting it on their own server in case the application breaks down.
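The CRON side of such an application is just a scheduled entry that invokes the scraping script. A sketch of a crontab line (the paths are illustrative):

```
# run the scraper every day at 2am and append output to a log
0 2 * * * /usr/bin/python /home/scraper/app/scrape.py >> /var/log/scrape.log 2>&1
```

The web application then only needs to read whatever the scheduled script last wrote.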
Unfortunately I find hosting on their server does not work well, because the client will often have different versions of libraries or use a platform I am not familiar with. Additionally I prefer to build my web applications in Python (using web2py), and though Python is great for development it cannot compare to PHP for ease of deployment. I can usually figure this all out, but it takes time, and it also requires the client to trust me with root privileges on their server. And given that these web applications are generally low cost (~$1000), ease of deployment is important.
All this is far from ideal. The solution? - see the next post.