Blog

  • Asynchronous support in Python

    Python Concurrent Big picture

    This week Guido van Rossum (the author of Python) put out a call for experts in asynchronous programming to collaborate on a new API.

  • Automatic web scraping

    Sitescraper Opensource Big picture

    I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automatically adapt when the structure was updated but the content remained static.

  • How to teach yourself web scraping

    Learn Python Big picture

    I often get asked how to learn about web scraping. Here is my advice.

    First, learn a popular high-level scripting language. A higher-level language will let you write and test ideas faster. You don’t need a more efficient compiled language like C, because the bottleneck when web scraping is bandwidth rather than code execution. Pick a popular one so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be good choices.

    The following advice will assume you want to use Python for web scraping.
    If you have some programming experience then I recommend working through the Dive Into Python book:

    Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
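
    As a starting point for the urllib2 advice above, here is a minimal sketch of downloading a page. It assumes Python 3, where the functionality of urllib2 lives in urllib.request. The fetch helper, the USER_AGENT value, and the ten-second timeout are illustrative choices, not something taken from the original post.

```python
import urllib.request

# Hypothetical identifier for this sketch; use one that describes your own scraper.
USER_AGENT = "example-scraper/0.1"


def fetch(url, timeout=10):
    """Download a URL and return its body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (performs a network request):
#     html = fetch("https://example.com/")
```

    Keeping the download in one small function makes it easy to add retries, caching, or delays between requests later on.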

  • Is it possible to extract data from any website?

    Big picture

    I am often asked whether I can extract data from a particular website.

  • Typical web scraping job

    Big picture Business

    In this post I will try to clarify what web scraping is all about by walking through a typical (though fictional) project.

  • What is web scraping?

    Big picture

    The internet contains a huge amount of useful data, but most of it is not easily accessible. Web scraping involves extracting this data from websites into a structured format.
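
    To make "extracting data into a structured format" concrete, here is a small sketch that turns an HTML table into a list of dictionaries using only the Python standard library. The TableParser class, the table_to_records helper, and the sample markup are all made up for illustration; real pages are usually messier and often call for a dedicated parsing library.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample page fragment.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
```

    Each record can then be written to CSV, JSON, or a database, which is the "structured format" part of the job.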