Usually my clients request for a website to be scraped into a standard format like CSV, which they can then integrate with their existing applications. However sometimes a client need a website scraped periodically because its data is continually updated. An example of the first use case is census statistics, and of the second stock prices.
I have three solutions for periodically scraping a website:
- I provide the client with my web scraping code, which they can then execute regularly
- Client pays me a small fee in future whenever they want the data rescraped
- I build a web application that scrapes regularly and provides the data in a useful form
The first option is not always practical if the client does not have a technical background. Additionally my solutions are developed and tested on Linux and may not work on Windows.
The second option is generally not attractive to the client because it puts them in a weak position where they are dependent on me being contactable and cooperative in future.
Also it involves ongoing costs for them.
So usually I end up building a basic web application that consists of a CRON job to do the scraping, an interface to the scraped data, and some administration settings. If the scraping jobs are not too big I am happy to host the application on my own server, however most clients prefer the security of hosting it on their own server in case the app breaks down.
Unfortunately I find hosting on their server does not work well because the client will often have different versions of libraries or use a platform I am not familiar with. Additionally I prefer to build my web applications in Python (using web2py), and though Python is great for development it cannot compare to PHP for ease of deployment. I can usually figure this all out but it takes time and also trust from the client to give me root privilege on their server. And given that these web applications are generally low cost (~ $1000) the ease of deployment is important.
All this is far from ideal. The solution? - see the next post.