Blog
-
Loading web browser cookies
Python Cookies Example April 15, 2015
Sometimes when scraping a website I need my script to log in before it can access the data of interest. Usually reverse engineering the login form is straightforward, but some websites make this difficult, for example by requiring a CAPTCHA or by allowing only one login session per account at a time. For cases such as these I have an alternative solution: manually log in to the website in a web browser, then have my script load and reuse that login session.
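A minimal sketch of the idea, assuming Firefox and the third-party browser_cookie3 package (the package choice and the member URL here are placeholders, not the only way to do it):

```python
import browser_cookie3  # pip install browser-cookie3
import requests

# load the cookies Firefox saved when I logged in manually
cookiejar = browser_cookie3.firefox()

# reuse that login session from the script
response = requests.get('https://example.com/members', cookies=cookiejar)
print(response.status_code)
```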
-
Offline reverse geocode
Python Opensource Efficiency June 01, 2014
I often use Google’s geocoding API to find details about a location.
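A sketch of such a call (the coordinates and API key below are placeholder values):

```python
import requests

# reverse geocode a latitude/longitude pair with Google's geocoding API
url = 'https://maps.googleapis.com/maps/api/geocode/json'
params = {'latlng': '40.714224,-73.961452', 'key': 'YOUR_API_KEY'}
results = requests.get(url, params=params).json().get('results', [])
if results:
    print(results[0]['formatted_address'])
```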
-
The web services I use
Business October 22, 2013
A few friends have asked what web services I use to run my business, so I am writing this post to point people to in future.
-
Working onsite in NYC
Business March 09, 2013
For the next year I am going to be working onsite at a web-scraping-focussed startup in New York. Looking forward to the experience! I found that working in the US is straightforward for Australians because of the fantastic E3 visa. I just took my job offer letter to the US consulate along with some documentation, paid a few hundred dollars, and within a fortnight I had a two-year work visa that can be extended indefinitely.
-
Generating a website screenshot history
Webkit Python Qt Opensource January 03, 2013
There is a nice website, screenshots.com, that hosts historic screenshots of many websites. This post will show how to generate our own historic screenshots with Python.
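A minimal sketch of rendering a page to an image with PyQt4's QtWebKit bindings, which the post's tags suggest (the URL and output filename are placeholders):

```python
import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication, QImage, QPainter
from PyQt4.QtWebKit import QWebView

app = QApplication(sys.argv)
view = QWebView()

def save_screenshot():
    # size the viewport to the full page, then render it into an image
    frame = view.page().mainFrame()
    view.page().setViewportSize(frame.contentsSize())
    image = QImage(view.page().viewportSize(), QImage.Format_ARGB32)
    painter = QPainter(image)
    frame.render(painter)
    painter.end()
    image.save('screenshot.png')
    app.quit()

view.loadFinished.connect(save_screenshot)
view.load(QUrl('http://example.com'))
app.exec_()
```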
-
Web Scraping User Interface
Crawling Business Web2py December 29, 2012
When scraping a website, the majority of the time is typically spent waiting for data to download, so to be efficient I work on multiple scraping projects simultaneously.
-
Automatically import a CSV file into MySQL
Python Opensource Example December 08, 2012
Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.
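A sketch of automating that decision by sampling rows to infer each column's type (the file name, sample size, and VARCHAR width are arbitrary choices, and well-formed CSV input is assumed):

```python
import csv

def infer_type(values):
    # guess the narrowest MySQL type that fits every sampled value
    for cast, sql_type in ((int, 'INT'), (float, 'FLOAT')):
        try:
            for value in values:
                cast(value)
            return sql_type
        except ValueError:
            continue
    return 'VARCHAR(255)'

def create_table_sql(csv_path, table):
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(1000), reader)]  # sample rows
    columns = ', '.join(
        '`%s` %s' % (name, infer_type([row[i] for row in rows]))
        for i, name in enumerate(header))
    return 'CREATE TABLE `%s` (%s);' % (table, columns)

print(create_table_sql('data.csv', 'imported'))
```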
-
Asynchronous support in Python
Python Concurrent Big picture October 20, 2012
This week Guido van Rossum (the creator of Python) put out a call for experts in asynchronous programming to collaborate on a new API.
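That effort eventually produced the asyncio module (PEP 3156); a minimal sketch of the resulting API, with sleeps standing in for downloads:

```python
import asyncio

async def fetch(name, delay):
    # a sleep stands in for a non-blocking download
    await asyncio.sleep(delay)
    return '%s done' % name

async def main():
    # run the "downloads" concurrently on a single thread
    results = await asyncio.gather(fetch('a', 1), fetch('b', 1))
    print(results)

asyncio.run(main())
```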
-
Using the internet archive to crawl a website
Python Cache Crawling October 14, 2012
If a website is offline or restricts how quickly it can be crawled, then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1996.
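A sketch of locating a cached copy through the Wayback Machine's availability API (the example URL and target timestamp are placeholders):

```python
import requests

def wayback_url(url, timestamp='2012'):
    # find the Wayback Machine snapshot closest to the given timestamp
    api = 'https://archive.org/wayback/available'
    data = requests.get(api, params={'url': url, 'timestamp': timestamp}).json()
    snapshot = data.get('archived_snapshots', {}).get('closest')
    return snapshot['url'] if snapshot else None

cached = wayback_url('example.com')
if cached:
    # download from the cache instead of the live site
    html = requests.get(cached).text
```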
-
How to find what technology a website uses
Python Opensource Example September 21, 2012
When crawling websites it can be useful to know what technology a website was developed with. For example, with an ASP.NET website I can expect navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.
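A simple sketch of one such heuristic, checking the HTTP response headers and cookies that common platforms leak (the header names cover only a few technologies, and the URL is a placeholder):

```python
import requests

# server-side technology often leaks through HTTP response headers
response = requests.get('http://example.com')
for header in ('Server', 'X-Powered-By', 'X-AspNet-Version', 'X-Generator'):
    if header in response.headers:
        print('%s: %s' % (header, response.headers[header]))

# ASP.NET sites also tend to set a distinctive session cookie
if 'ASP.NET_SessionId' in [cookie.name for cookie in response.cookies]:
    print('probably ASP.NET')
```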