Blog

  • Offline reverse geocode

    Python Opensource Efficiency

I often use Google’s geocoding API to find details about a location.
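The offline alternative can be sketched as a nearest-neighbour lookup over a local place list. This is only an illustration of the idea, not the post's code: the sample cities below are my own, and a real offline reverse geocoder would load a much larger dataset such as the free GeoNames cities dump.

```python
import math

# Hypothetical sample data: (latitude, longitude, place name).
CITIES = [
    (51.5074, -0.1278, "London"),
    (40.7128, -74.0060, "New York"),
    (-37.8136, 144.9631, "Melbourne"),
    (35.6762, 139.6503, "Tokyo"),
]

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def reverse_geocode(lat, lon):
    """Return the name of the nearest known place."""
    return min(CITIES, key=lambda c: haversine(lat, lon, c[0], c[1]))[2]

print(reverse_geocode(-37.7, 144.9))  # a point near Melbourne
```

A linear scan is fine for a few thousand places; a spatial index (such as a k-d tree) would be the natural upgrade for a full city dataset.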

  • Generating a website screenshot history

    Webkit Python Qt Opensource

There is a nice website, screenshots.com, that hosts historic screenshots of many websites. This post will show how to generate our own historic screenshots with Python.

  • Automatically import a CSV file into MySQL

    Python Opensource Example

    Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.
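The middle path is to infer each column's type automatically from the data. The sketch below is my own illustration of that idea (not the post's script): try the narrowest type first, and fall back to a sufficiently wide VARCHAR.

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest MySQL type that fits every value in a column."""
    for cast, sql_type in ((int, "INT"), (float, "FLOAT")):
        try:
            for v in values:
                cast(v)
            return sql_type
        except ValueError:
            continue
    # Fall back to a VARCHAR wide enough for the longest value.
    return "VARCHAR(%d)" % max(len(v) for v in values)

def create_table_sql(table, csv_text):
    """Build a CREATE TABLE statement from a CSV's header and data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ["`%s` %s" % (name, infer_type([r[i] for r in data]))
               for i, name in enumerate(header)]
    return "CREATE TABLE `%s` (%s);" % (table, ", ".join(columns))

sample = "name,age,height\nAlice,30,1.65\nBob,25,1.80\n"
print(create_table_sql("people", sample))
# CREATE TABLE `people` (`name` VARCHAR(5), `age` INT, `height` FLOAT);
```

With the schema generated, the rows themselves can be loaded with `LOAD DATA INFILE` or batched `INSERT` statements.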

  • How to find what technology a website uses

    Python Opensource Example

When crawling websites it can be useful to know what technology has been used to develop a website. For example, with an ASP.NET website I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.
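One common approach is fingerprinting: matching markup that each platform characteristically leaves behind. The rules below are my own examples (real detectors use far larger rule sets), but the `__VIEWSTATE` hidden field really is an ASP.NET signature:

```python
import re

# Hypothetical fingerprints keyed on markup each platform tends to emit.
FINGERPRINTS = {
    "ASP.NET": re.compile(r'name="__VIEWSTATE"'),
    "Blogger": re.compile(r'content=["\']blogger["\']', re.I),
    "WordPress": re.compile(r"/wp-content/"),
}

def detect_technologies(html):
    """Return the (sorted) names of all platforms whose fingerprint matches."""
    return sorted(name for name, pattern in FINGERPRINTS.items()
                  if pattern.search(html))

page = '<form><input type="hidden" name="__VIEWSTATE" value="..."/></form>'
print(detect_technologies(page))  # ['ASP.NET']
```

HTTP response headers (`Server`, `X-Powered-By`) and cookie names are equally useful signals and can be checked the same way.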

  • Automatic web scraping

    Sitescraper Opensource Big picture

    I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automatically adapt when the structure was updated but the content remained static.

  • Open sourced web scraping code

    Opensource

For most scraping jobs I use the same general approach of crawling, selecting the appropriate nodes, and then saving the results. Consequently I reuse a lot of code across projects, which I have now combined into a library. Most of this infrastructure is now open source on Google Code.

The code in that repository is licensed under the LGPL, which means you are free to use it in your own applications (including commercial ones) but are obliged to release any changes you make to the library. This differs from the more popular GPL license, which would make the library unusable in most commercial projects. It also differs from BSD- and WTFPL-style licenses, which would let people do whatever they want with the library, including making changes and not releasing them.

    I think the LGPL is a good balance for libraries because it lets anyone use the code while everyone can benefit from improvements made by individual users.

  • The sitescraper module

    Sitescraper Opensource

As a student I was fortunate to have the opportunity to learn about web scraping, guided by Professor Timothy Baldwin. Frustration with a previous project led me to build a tool to make scraping web pages easier.