Blog
-
Loading web browser cookies
Python Cookies Example April 15, 2015
Sometimes when scraping a website I need my script to log in in order to access the data of interest. Usually reverse engineering the login form is straightforward; however, some websites make this difficult, for example when logging in requires passing a CAPTCHA, or when the website allows only one login session per account at a time. For difficult cases such as these I have an alternative solution: manually log in to the website of interest in a web browser, then have my script load and reuse that login session.
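One way to do this is to read the cookies straight out of the browser's own store. The sketch below assumes Firefox's cookies.sqlite layout (a moz_cookies table with host, path, isSecure, expiry, name and value columns); the exact path and column layout can differ between browser versions, so treat this as a starting point:

```python
import sqlite3
import http.cookiejar

def load_firefox_cookies(db_path):
    """Read cookies from a Firefox cookies.sqlite file into a CookieJar."""
    jar = http.cookiejar.CookieJar()
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            'SELECT host, path, isSecure, expiry, name, value FROM moz_cookies')
        for host, path, is_secure, expiry, name, value in rows:
            cookie = http.cookiejar.Cookie(
                version=0, name=name, value=value,
                port=None, port_specified=False,
                domain=host, domain_specified=True,
                domain_initial_dot=host.startswith('.'),
                path=path, path_specified=True,
                secure=bool(is_secure), expires=expiry,
                discard=False, comment=None, comment_url=None, rest={})
            jar.set_cookie(cookie)
    finally:
        conn.close()
    return jar
```

The resulting jar can then be attached to your HTTP client, for example with `urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))`.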
-
Automatically import a CSV file into MySQL
Python Opensource Example December 08, 2012
Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.
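The middle ground is to infer each column's type automatically. A sketch of that step, assuming values arrive as strings from the csv module (the function and table names here are hypothetical):

```python
def infer_type(values):
    """Return the narrowest MySQL column type that fits all values."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    values = [v for v in values if v != '']  # ignore blanks when inferring
    if values and all(is_int(v) for v in values):
        return 'INT'
    if values and all(is_float(v) for v in values):
        return 'DOUBLE'
    length = max((len(v) for v in values), default=1)
    return 'VARCHAR(%d)' % length

def infer_schema(rows, header):
    """Build a CREATE TABLE statement from parsed CSV rows."""
    columns = []
    for i, name in enumerate(header):
        col_type = infer_type([row[i] for row in rows])
        columns.append('`%s` %s' % (name, col_type))
    return 'CREATE TABLE imported (%s)' % ', '.join(columns)
```

With the schema created this way, numeric columns keep numeric ordering while free-text columns fall back to VARCHAR sized to the longest value seen.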
-
How to find what technology a website uses
Python Opensource Example September 21, 2012
When crawling websites it can be useful to know what technology a site was built with. For example, with an ASP.net website I can expect navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.
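The basic idea can be sketched as matching well-known fingerprints in the response headers and HTML; the handful of fingerprints below is a tiny illustrative sample, not an exhaustive list:

```python
def detect_technology(headers, html):
    """Guess technologies behind a page from header and HTML fingerprints."""
    found = []
    powered_by = headers.get('X-Powered-By', '')
    # ASP.net pages carry the __VIEWSTATE hidden field and often a header
    if 'ASP.NET' in powered_by or '__VIEWSTATE' in html:
        found.append('ASP.net')
    if 'PHP' in powered_by:
        found.append('PHP')
    # Blogspot pages declare their generator in a meta tag
    if 'content="blogger"' in html:
        found.append('Blogspot')
    if 'wp-content' in html:
        found.append('WordPress')
    return found
```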
-
What is CSV?
CSV stands for comma separated values. It is a spreadsheet format where each column is separated by a comma and each row by a newline. Here is an example CSV file:
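```csv
name,age,city
"Smith, John",35,London
Jones,28,Paris
```

(The data above is made up for illustration. Note the quotes around a value that itself contains a comma.)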
-
Converting UK Easting / Northing to Latitude / Longitude
Recently I needed to convert a large amount of data between UK Easting / Northing coordinates and Latitude / Longitude. There are web services available that support this conversion, but they only permit a few hundred requests per hour, which means it would take weeks to process my quantity of data.
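One way to do the conversion locally is the pyproj library, which knows the British National Grid as EPSG:27700 and WGS84 latitude / longitude as EPSG:4326. A minimal sketch (the example coordinates in the usage note are made up):

```python
from pyproj import Transformer

# EPSG:27700 is the British National Grid (easting / northing);
# EPSG:4326 is WGS84 latitude / longitude.
# always_xy=True makes the axis order (easting, northing) -> (lon, lat).
transformer = Transformer.from_crs('EPSG:27700', 'EPSG:4326', always_xy=True)

def to_latlon(easting, northing):
    """Convert a single easting / northing pair to (latitude, longitude)."""
    lon, lat = transformer.transform(easting, northing)
    return lat, lon
```

`transform` also accepts sequences, so a large dataset can be converted in one call, for example `transformer.transform(eastings, northings)`, rather than looping per point.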
-
Solving CAPTCHA with OCR
Python Captcha Ocr Example May 05, 2012
Some websites require passing a CAPTCHA to access their content. As I have written before, these can be solved using the deathbycaptcha API; however, for large websites with many CAPTCHAs this becomes prohibitively expensive. For example, solving 1 million CAPTCHAs with this API would cost $1390.
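Before feeding a CAPTCHA image to OCR, it usually pays to clean it up; a common first step is thresholding to strip background noise. A sketch using Pillow (the threshold value is an assumption you would tune per CAPTCHA style):

```python
from PIL import Image

def binarize(image, threshold=128):
    """Convert an image to pure black and white to remove background noise.

    Pixels brighter than the threshold become white, the rest black,
    which typically makes OCR far more reliable.
    """
    grey = image.convert('L')  # greyscale first
    return grey.point(lambda p: 255 if p > threshold else 0)
```

The cleaned image can then be passed to an OCR engine such as tesseract.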
-
Automating webkit
Python Webkit Qt Example February 14, 2012
I have received some inquiries about using webkit for web scraping, so here is an example using the webscraping module:
-
Threading with webkit
Javascript Webkit Qt Python Example Concurrent Efficiency December 30, 2011
In a previous post I showed how to scrape a list of webpages. That is fine for small crawls but too slow for large ones. Here is an updated example that downloads the content in multiple threads.
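The general pattern can be sketched with the standard library's thread pool; `fetch` below is a stand-in for whatever download function you actually use:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for a real download function (e.g. urllib.request.urlopen)."""
    return 'content of %s' % url

def download_all(urls, num_threads=4):
    """Download all URLs concurrently, preserving input order in the results."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(fetch, urls))
```

Because downloading is I/O bound, threads overlap the network waits even under the GIL, so the speedup scales with the number of workers until bandwidth or the server's rate limits become the bottleneck.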
-
Scraping multiple JavaScript webpages with webkit
Javascript Webkit Qt Python Example December 06, 2011
I made an earlier post about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:
-
How to use proxies
Proxies Example November 29, 2011
Proxies can be necessary when web scraping because some websites restrict the number of page downloads from each user. With proxies, your requests appear to come from multiple users, so the chance of being blocked is reduced.
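A simple way to spread requests across proxies is to rotate through a list of them; the addresses in the usage note are placeholders:

```python
import itertools

class ProxyRotator:
    """Cycle through a proxy list so successive requests use different proxies."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next proxy URL, wrapping around at the end of the list."""
        return next(self._cycle)
```

For example, `ProxyRotator(['http://p1:8080', 'http://p2:8080'])` hands back the two addresses alternately; pass each one to your HTTP client per request (e.g. the `proxies` argument in the requests library).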
-
How to automatically find contact details
Information retrieval Python Example November 06, 2011
I often find businesses hide their contact details behind layers of navigation, presumably to cut down on support costs.
This wastes my time, so I use this snippet to automate extracting the available emails:
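A version of that idea in a few lines; the regex is a pragmatic approximation that covers common address forms, not a full RFC 5322 parser:

```python
import re

# Matches word characters, dots, plus and hyphen before the @,
# then a dotted domain after it.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def extract_emails(html):
    """Return the unique email addresses found in a page, in document order."""
    seen = []
    for email in EMAIL_RE.findall(html):
        if email not in seen:
            seen.append(email)
    return seen
```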
-
Webpage screenshots with webkit
Webkit Qt Python Screenshot Example September 20, 2011
For a recent project I needed to render screenshots of webpages. Here is my solution using webkit: