Blog

  • Client feedback

    Website

    I was concerned about what blind spots I might have in the way I run my business. For example, I am Australian and Australians are usually very informal, even in a professional setting - was my communication with international clients too informal?

    To try and address these concerns I developed a feedback survey with Google Docs, which I have been (politely) requesting my clients to complete at the end of a job. The results have been helpful, and it also seems to have impressed some clients that I wanted their feedback. Wish I had thought of this earlier!

  • Why reinvent the wheel?

    Lxml Xpath Python Scrapy Beautifulsoup

    I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.

    I am aware of these libraries and have used them in the past with good results. However my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client’s machine. This rules out installing full frameworks such as Scrapy or compiling C-based libraries such as lxml - I need a pure Python solution. This also gives me the flexibility to run the script on Google App Engine.

    To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
    The most well known Python HTML parser seems to be BeautifulSoup, but I find it slow, difficult to use (compared to XPath), often inaccurate at parsing HTML, and - significantly - the original author has lost interest in developing it further. So I would not recommend using it - go with html5lib instead.

    To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn’t find one 6 months ago when I needed it, so I developed this simple version that covers my typical use cases. I will deprecate it in future if a decent solution does come along, but for now I am happy with my pure Python infrastructure.
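
    Here is a minimal sketch of those two stages, using html5lib for parsing and the limited path syntax built into ElementTree for selection (rather than my XPath module) - the sample HTML is just for illustration:

        import html5lib  # pure Python HTML parser

        html = """<html><body>
          <div class="item"><a href="/page1">Page 1</a></div>
          <div class="item"><a href="/page2">Page 2</a></div>
        </body></html>"""

        # stage 1: parse the HTML into an ElementTree
        tree = html5lib.parse(html, namespaceHTMLElements=False)

        # stage 2: select the relevant nodes with ElementTree's limited path syntax
        for link in tree.findall(".//div/a"):
            print(link.get("href"), link.text)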

  • Best website for finding freelance work

    Business Elance Freelancing

    When I started freelancing I created accounts on every freelance site I could find (oDesk, guru, scriptlance, etc) to get as much work as possible. However I found I got almost all work from just one source - Elance. How is Elance different?

    With most freelancing sites you create an account and immediately start bidding on jobs. There is no cost to bidding, so people bid on many projects even if they don’t have the skill or time to complete them. This is obviously frustrating for clients, who waste a lot of time sifting through bids.

    On the other hand Elance has a high barrier to entry: you have to pass a test to show you understand their system, then receive a phone call to confirm your identity, and once established you pay money for each job you bid on. Often I see jobs on Elance with no bids because they require obscure experience - people aren’t willing to waste their money bidding on a job they can’t do. This barrier weeds out less serious freelancers, so the average bid is of higher quality.

    From my experience the clients on Elance are different too. On most freelancing sites the client is trying to get the job done for the smallest amount of money possible and is often willing to spend their time sifting through dozens of proposals, hoping to get lucky. Elance seems to attract clients who consider their time valuable and are willing to pay a premium for good service.
    Often clients contact me directly through Elance because I am a native English speaker and they want to avoid potential communication or cultural problems. One client even asked me to double my bid because “we are not cheap!”

    After a year of freelancing I now get the majority of work directly through my website, but still get a decent percentage of clients through Elance.

  • Caching crawled webpages

    Python Cache

    When crawling large websites I store the HTML in a local cache, so if I need to rescrape the website later I can load the webpages quickly and avoid putting extra load on their server. This is often necessary when a client realizes they need additional fields included in the scraped output.

    I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 come built in with Python (2.5+), so there are no external dependencies.
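
    To give an idea of how this works, here is a stripped down sketch of a dictionary-like store over sqlite3 and zlib - just the general approach, not pdict’s actual code:

        import sqlite3
        import zlib

        class DiskCache:
            """Dictionary-like cache backed by sqlite, with zlib compression.
            A sketch of the approach only - not pdict's actual implementation."""

            def __init__(self, filename="cache.db"):
                self.conn = sqlite3.connect(filename)
                self.conn.execute(
                    "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)")

            def __setitem__(self, key, value):
                # compress the HTML before writing it to disk
                data = sqlite3.Binary(zlib.compress(value.encode("utf-8")))
                self.conn.execute(
                    "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, data))
                self.conn.commit()

            def __getitem__(self, key):
                row = self.conn.execute(
                    "SELECT value FROM cache WHERE key=?", (key,)).fetchone()
                if row is None:
                    raise KeyError(key)
                # decompress after reading
                return zlib.decompress(row[0]).decode("utf-8")

            def __contains__(self, key):
                return self.conn.execute(
                    "SELECT 1 FROM cache WHERE key=?", (key,)).fetchone() is not None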

    Here is some example usage of pdict:
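
    (The class and function names below are assumed from the dictionary-like interface described above, so the real API may differ slightly.)

        import urllib.request

        from pdict import PersistentDict  # assumed name - the actual class may differ

        cache = PersistentDict("cache.db")  # everything is stored in this sqlite file

        url = "http://example.com/page1"
        if url in cache:
            html = cache[url]  # served from the local cache, no extra request
        else:
            html = urllib.request.urlopen(url).read().decode("utf-8")
            cache[url] = html  # compressed with zlib and written to disk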

  • Fixed fee or hourly?

    Business

    I prefer to quote per project rather than per hour for my web scraping work because it:

  • Open sourced web scraping code

    Opensource

    For most scraping jobs I use the same general approach of crawling, selecting the appropriate nodes, and then saving the results. Consequently I reuse a lot of code across projects, which I have now combined into a library. Most of this infrastructure is now open sourced on Google Code.

    The code in that repository is licensed under the LGPL, which means you are free to use it in your own applications (including commercial ones) but are obliged to release any changes you make to the library. This is different from the more popular GPL license, which would make the library unusable in most commercial projects. It is also different from BSD and WTFPL style licenses, which would let people do whatever they want with the library, including making changes and not releasing them.

    I think the LGPL is a good balance for libraries because it lets anyone use the code while everyone can benefit from improvements made by individual users.

  • Why web2py?

    Web2py

    In a previous post I mentioned that web2py is my weapon of choice for building web applications. Before web2py I had learnt a variety of approaches to building dynamic websites (raw PHP, Python CGI, Turbogears, Symfony, Rails, Django), but I find myself most productive with web2py.

    This is because web2py:

  • Why Google App Engine?

    Gae

    In the previous post I covered three alternative approaches to regularly scraping a website for a client, the most common being a web application. However hosting the web application on either my own or the client’s server has problems.

    My solution is to host the application on a neutral third party platform - Google App Engine (GAE). Here is my overview of deploying on GAE:

    Pros:

  • Scraping dynamic data

    Linux

    Usually my clients ask for a website to be scraped into a standard format like CSV, which they can then integrate with their existing applications. However sometimes a client needs a website scraped periodically because its data is continually updated. An example of the first use case is census statistics, and of the second, stock prices.

    I have three solutions for periodically scraping a website:

  • Scraping Flash based websites

    Flash Ajax

    Flash is a pain. It is flaky on Linux and cannot be scraped like HTML because it uses a binary format. HTML5 and Apple’s criticism of Flash are good news for me because they encourage developers to use non-Flash solutions.

    The reality, though, is that many sites still use Flash to display content that I need to access. Here are some approaches for scraping Flash that I have tried: