Open sourced web scraping code
Posted 12 Jun 2010 in opensource

For most scraping jobs I use the same general approach: crawl the site, select the appropriate nodes, and save the results. Consequently I reuse a lot of code across projects, which I have now combined into a library. Most of this infrastructure is now available as open source on Google Code.
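To give a rough idea of that crawl, select, and save pattern, here is a minimal sketch using only the Python standard library. It is not the library's actual API: the names `LinkAndTitleParser`, `fetch`, `crawl`, and `save` are made up for illustration, and the choice to collect page titles and stay on one host is an assumption.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkAndTitleParser(HTMLParser):
    """Select the nodes of interest: the <title> text and every <a href>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Hypothetical default downloader: return the page body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, download=fetch, max_pages=10):
    """Breadth-first crawl of one host, returning a list of (url, title)."""
    host = urlparse(start_url).netloc
    seen, queue, rows = {start_url}, [start_url], []
    while queue and len(rows) < max_pages:
        url = queue.pop(0)
        parser = LinkAndTitleParser()
        parser.feed(download(url))
        rows.append((url, parser.title.strip()))
        for href in parser.links:
            link = urljoin(url, href)
            # Assumption: only follow links that stay on the starting host.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return rows


def save(rows, out):
    """Save the results as CSV to any writable text file object."""
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    writer.writerows(rows)
```

Passing the downloader in as an argument keeps the three stages separate, so the crawl can be exercised against canned pages without touching the network. A real crawler would also want delays between requests, error handling, and caching.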

The code in that repository is licensed under the LGPL, which means you are free to use it in your own applications (including commercial ones) but are obliged to release any changes you make to the library itself when you distribute it. This is different from the more popular GPL, which would require any application built on the library to be released under the GPL as well, making it unusable in most commercial projects. It is also different from BSD- and WTFPL-style licenses, which let people do whatever they want with the library, including making changes and not releasing them.

I think the LGPL is a good balance for libraries because it lets anyone use the code while everyone can benefit from improvements made by individual users.
