Page 2 of 8 for Blog | WebScraping.com

Loading web browser cookies

Python Cookies Example April 15, 2015

Sometimes when scraping a website I need my script to log in to access the data of interest. Usually reverse engineering the login form is straightforward, but some websites make this difficult, for example when login requires passing a CAPTCHA or when the website allows only one simultaneous login session per account. For difficult cases such as these I have an alternative solution: manually log in to the website of interest in a web browser and then have my script load and reuse the login session.
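As a rough sketch of the idea, the snippet below reads cookies from a Firefox-style `cookies.sqlite` file into a standard `CookieJar`. The table name `moz_cookies`, its column names, and expiry values stored in seconds are assumptions about the browser's storage format, which differs between browsers and versions. `load_firefox_cookies` and `domain_filter` are hypothetical names used only for illustration.

```python
import sqlite3
from http.cookiejar import Cookie, CookieJar


def load_firefox_cookies(db_path, domain_filter=""):
    """Build a CookieJar from a Firefox-style cookies.sqlite file.

    Assumes a moz_cookies table with the columns queried below and
    an expiry column holding seconds since the epoch.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT host, path, isSecure, expiry, name, value FROM moz_cookies"
        ).fetchall()
    finally:
        conn.close()

    jar = CookieJar()
    for host, path, is_secure, expiry, name, value in rows:
        # Keep only cookies whose host contains the requested domain.
        if domain_filter not in host:
            continue
        jar.set_cookie(Cookie(
            version=0, name=name, value=value,
            port=None, port_specified=False,
            domain=host, domain_specified=True,
            domain_initial_dot=host.startswith("."),
            path=path, path_specified=True,
            secure=bool(is_secure), expires=expiry, discard=False,
            comment=None, comment_url=None, rest={},
        ))
    return jar
```

The resulting jar can be passed to `urllib.request.HTTPCookieProcessor` and `urllib.request.build_opener` so that later requests reuse the browser session. A running browser may hold a lock on the cookie file, so reading from a copy is probably safer.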

Offline reverse geocode

Python Opensource Efficiency June 01, 2014

I often use Google’s geocoding API to find details about a location like this:

The web services I use

Business October 22, 2013

A few friends asked me what web services I use to run my business, so I am writing this post to point people to in future.

Working onsite in NYC

Business March 09, 2013

For the next year I am going to be working onsite at a web scraping focussed startup in New York. Looking forward to the experience! I found that working in the US is straightforward for Australians because of the fantastic E-3 visa. I just took my job offer letter to the US consulate along with some documentation, paid a few hundred dollars, and within a fortnight I had a 2-year work visa that can be extended indefinitely.

Generating a website screenshot history

Webkit Python Qt Opensource January 03, 2013

There is a nice website, screenshots.com, that hosts historic screenshots for many websites. This post will show how to generate our own historic screenshots with Python.

Web Scraping User Interface

Crawling Business Web2py December 29, 2012

When scraping a website, typically the majority of time is spent waiting for the data to download. So to be efficient I work on multiple scraping projects simultaneously.

Automatically import a CSV file into MySQL

Python Opensource Example December 08, 2012

Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.
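A middle path is to infer each column's type from the data. The sketch below is only an illustration of that idea, not the method from the original post: it tries integers, then floats, and falls back to a varchar sized to the longest value. The names `infer_type` and `create_table_sql` are hypothetical, header names are assumed to be safe to use as column names, and the whole file is read into memory for simplicity.

```python
import csv
import io


def infer_type(values):
    """Return a MySQL column type that fits every non-empty value."""
    values = [v for v in values if v != ""]
    if not values:
        return "VARCHAR(255)"  # no data to inspect, so pick a generic default
    try:
        numbers = [int(v) for v in values]
        if all(-2**31 <= n < 2**31 for n in numbers):
            return "INT"
        if all(-2**63 <= n < 2**63 for n in numbers):
            return "BIGINT"
        # Integers too large for BIGINT fall through to VARCHAR below.
    except ValueError:
        try:
            for v in values:
                float(v)
            return "DOUBLE"
        except ValueError:
            pass
    return "VARCHAR(%d)" % max(len(v) for v in values)


def create_table_sql(csv_file, table_name):
    """Generate a CREATE TABLE statement from a CSV file with a header row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    rows = list(reader)
    fields = []
    for i, name in enumerate(header):
        column = [row[i] for row in rows if i < len(row)]
        fields.append("`%s` %s" % (name, infer_type(column)))
    return "CREATE TABLE `%s` (%s);" % (table_name, ", ".join(fields))


sample = io.StringIO("name,age,price\nalice,30,9.5\nbob,41,12\n")
print(create_table_sql(sample, "people"))
# CREATE TABLE `people` (`name` VARCHAR(5), `age` INT, `price` DOUBLE);
```

Sizing the varchar to the longest value seen is a guess that suits a one-off import. A table that will receive more rows later would likely need some headroom.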

Asynchronous support in Python

Python Concurrent Big picture October 20, 2012

This week Guido van Rossum (author of Python) put out a call for experts in asynchronous programming to collaborate on a new API.

Using the internet archive to crawl a website

Python Cache Crawling October 14, 2012

If a website is offline or restricts how quickly it can be crawled, then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1996.
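As a minimal sketch, an archived copy can be requested by prefixing the original URL with the Wayback Machine address and a timestamp. The helper names below are hypothetical. The sketch assumes that a partial timestamp such as a year or a date resolves to the nearest available snapshot, which is how the service appears to behave but is worth verifying.

```python
import urllib.request

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url, timestamp):
    """Return the Wayback Machine address for url near timestamp.

    timestamp is a string in YYYYMMDDhhmmss form or a prefix of it.
    """
    return "%s/%s/%s" % (WAYBACK_PREFIX, timestamp, url)


def download_from_wayback(url, timestamp, timeout=30):
    """Fetch the archived page body as bytes (makes a network request)."""
    with urllib.request.urlopen(wayback_url(url, timestamp), timeout=timeout) as response:
        return response.read()


print(wayback_url("http://example.com/page", "20121014"))
# https://web.archive.org/web/20121014/http://example.com/page
```

The returned HTML is the archive's rendering of the page, so links and embedded resources may have been rewritten to point at archive.org.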

How to find what technology a website uses

Python Opensource Example September 21, 2012

When crawling websites it can be useful to know what technology has been used to develop a website. For example, with an ASP.NET website I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.
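One simple way to approach this is to look for known signatures in the response headers and HTML. The sketch below illustrates the idea with a tiny, hand-picked signature table. The `SIGNATURES` table and the `detect_technologies` name are hypothetical, and the patterns are assumptions about typical markup rather than a reliable fingerprint database.

```python
import re

# Hypothetical signature table: technology -> header and HTML regexes.
SIGNATURES = {
    "ASP.NET": {
        "headers": {"x-powered-by": r"ASP\.NET", "x-aspnet-version": r"."},
        "html": [r'name="__VIEWSTATE"'],
    },
    "Blogger": {
        "headers": {},
        "html": [r"content=['\"]blogger['\"]"],
    },
    "WordPress": {
        "headers": {},
        "html": [r"/wp-content/", r"content=['\"]WordPress"],
    },
}


def detect_technologies(html, headers=None):
    """Return a sorted list of technologies whose signatures match."""
    # Header names are case-insensitive, so normalise them first.
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    found = []
    for tech, sig in SIGNATURES.items():
        header_hit = any(
            name in headers and re.search(pattern, headers[name], re.I)
            for name, pattern in sig["headers"].items()
        )
        html_hit = any(re.search(pattern, html, re.I) for pattern in sig["html"])
        if header_hit or html_hit:
            found.append(tech)
    return sorted(found)


page = '<form><input type="hidden" name="__VIEWSTATE" value="abc" /></form>'
print(detect_technologies(page))
# ['ASP.NET']
```

A match is only a hint: sites can strip headers or reuse markup from other platforms, so a crawler should treat the result as a starting assumption.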