
Blog

  • How to automate Android apps with Python

    Android Mobile apps Python

    In a previous post I covered a way to monitor network activity in order to scrape the data from an Android application. Sometimes this approach will not work, for example if the data of interest is embedded within the app or the network traffic is encrypted. For these cases I use UIautomator, a Python wrapper for the Android testing framework.
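    A minimal sketch of driving an app this way, assuming the uiautomator Python package and a device connected over adb (the app name and widget labels below are placeholders):

        from uiautomator import Device

        d = Device()  # connect to the first device found over adb
        d.screen.on()
        d.press.home()
        # tap the app icon, then interact with widgets by their visible text
        d(text='Example App').click()
        d(text='Login').click()
        d(text='Username').set_text('myuser')
        print d.dump()  # dump the current UI hierarchy as XML for inspection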

  • Loading web browser cookies

    Python Cookies Example

    Sometimes when scraping a website I need my script to log in to access the data of interest. Usually reverse engineering the login form is straightforward, however some websites make this difficult: for example, if logging in requires passing a CAPTCHA, or if the website only allows one login session per account at a time. For difficult cases such as these I have an alternative solution - manually log in to the website of interest in a web browser and then have my script load and reuse that login session.
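    A rough sketch of that idea, assuming Firefox and its cookies.sqlite store (the profile path and column names are assumptions that vary between browser versions):

        import sqlite3, cookielib, urllib2

        def firefox_cookie_jar(sqlite_path):
            """Load cookies from a Firefox cookies.sqlite file into a CookieJar."""
            cj = cookielib.CookieJar()
            conn = sqlite3.connect(sqlite_path)
            rows = conn.execute('SELECT host, path, name, value, expiry, isSecure FROM moz_cookies')
            for host, path, name, value, expiry, is_secure in rows:
                cj.set_cookie(cookielib.Cookie(
                    0, name, value, None, False, host, True, host.startswith('.'),
                    path, True, bool(is_secure), expiry, False, None, None, {}))
            conn.close()
            return cj

        cj = firefox_cookie_jar('/path/to/firefox/profile/cookies.sqlite')
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        html = opener.open('http://example.webscraping.com/protected').read()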

  • Offline reverse geocode

    Python Opensource Efficiency

    I often use Google’s geocoding API to find details about a location like this:
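    Roughly like the sketch below, which queries Google’s geocoding web service for a place name (the api_key parameter is optional here, though current usage terms may require one):

        import json, urllib, urllib2

        def geocode(address, api_key=None):
            """Look up details about a location with Google's geocoding API."""
            params = {'address': address, 'sensor': 'false'}
            if api_key:
                params['key'] = api_key
            url = 'https://maps.googleapis.com/maps/api/geocode/json?' + urllib.urlencode(params)
            result = json.load(urllib2.urlopen(url))
            if result['results']:
                first = result['results'][0]
                return first['formatted_address'], first['geometry']['location']

        print geocode('Trafalgar Square, London')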

  • Generating a website screenshot history

    Webkit Python Qt Opensource

    There is a nice website, screenshots.com, that hosts historic screenshots for many websites. This post will show how to generate our own historic screenshots with Python.

  • Automatically import a CSV file into MySQL

    Python Opensource Example

    Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.
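    A sketch of the middle ground - inferring each column type from a sample of the data, then generating the CREATE TABLE statement (the file and table names are just examples):

        import csv, itertools

        def field_type(values):
            """Pick the narrowest SQL type that fits every sampled value."""
            for cast, sql_type in ((int, 'INTEGER'), (float, 'FLOAT')):
                try:
                    for value in values:
                        if value != '':
                            cast(value)
                    return sql_type
                except ValueError:
                    pass
            return 'VARCHAR(255)'

        def csv_schema(filename, table):
            reader = csv.reader(open(filename))
            header = reader.next()
            rows = list(itertools.islice(reader, 1000))  # sample up to 1000 rows
            columns = ['`%s` %s' % (name, field_type([row[i] for row in rows if i < len(row)]))
                       for i, name in enumerate(header)]
            return 'CREATE TABLE `%s` (%s);' % (table, ', '.join(columns))

        print csv_schema('data.csv', 'data')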

  • Asynchronous support in Python

    Python Concurrent Big picture

    This week Guido van Rossum (the author of Python) put out a call for experts in asynchronous programming to collaborate on a new API.

  • Using the internet archive to crawl a website

    Python Cache Crawling

    If a website is offline or restricts how quickly it can be crawled then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1998.
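    A small sketch of the idea using the Wayback Machine’s availability API, which returns the closest cached snapshot for a URL (the response fields may change over time):

        import json, urllib, urllib2

        def wayback_download(url):
            """Download the closest cached copy of a webpage from archive.org."""
            query = 'http://archive.org/wayback/available?' + urllib.urlencode({'url': url})
            info = json.load(urllib2.urlopen(query))
            snapshot = info.get('archived_snapshots', {}).get('closest')
            if snapshot and snapshot.get('available'):
                return urllib2.urlopen(snapshot['url']).read()

        html = wayback_download('http://example.webscraping.com')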

  • How to find what technology a website uses

    Python Opensource Example

    When crawling websites it can be useful to know what technology was used to develop them. For example, with an ASP.net website I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.
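    A rough sketch of some simple checks - HTTP headers and common fingerprints in the HTML (the patterns below are examples rather than an exhaustive list):

        import re, urllib2

        def detect_technology(url):
            """Guess what a website is built with from headers and HTML fingerprints."""
            response = urllib2.urlopen(url)
            headers, html = response.info(), response.read()
            clues = []
            for header in ('Server', 'X-Powered-By'):
                if headers.get(header):
                    clues.append('%s: %s' % (header, headers.get(header)))
            match = re.search(r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)', html, re.I)
            if match:
                clues.append('generator: ' + match.group(1))
            if '__VIEWSTATE' in html:
                clues.append('ASP.net style viewstate form')
            return clues

        print detect_technology('http://example.webscraping.com')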

  • Converting UK Easting / Northing to Latitude / Longitude

    Example Python

    Recently I needed to convert a large amount of data between UK Easting / Northing coordinates and Latitude / Longitude. There are web services available that support this conversion but they only permit a few hundred requests / hour, which means it would take weeks to process my quantity of data.
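    For reference, a local conversion sketch assuming the pyproj library, where EPSG:27700 is the British National Grid and EPSG:4326 is WGS84 latitude / longitude (newer pyproj releases prefer the Transformer class over this older interface):

        from pyproj import Proj, transform

        bng = Proj(init='epsg:27700')    # British National Grid easting / northing
        wgs84 = Proj(init='epsg:4326')   # WGS84 latitude / longitude

        def en_to_latlon(easting, northing):
            lon, lat = transform(bng, wgs84, easting, northing)
            return lat, lon

        print en_to_latlon(530000, 180000)  # roughly central London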

  • Solving CAPTCHA with OCR

    Python Captcha Ocr Example

    Some websites require passing a CAPTCHA to access their content. As I have written before these can be solved using the deathbycaptcha API, however for large websites with many CAPTCHAs this becomes prohibitively expensive. For example solving 1 million CAPTCHAs with this API would cost $1390.
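    For simple CAPTCHAs, OCR can be an alternative. A minimal sketch assuming the PIL and pytesseract packages (the threshold value is a guess that would need tuning for each CAPTCHA style):

        from PIL import Image
        import pytesseract

        def solve_captcha(filename, threshold=150):
            """Threshold the image to black and white, then run OCR over it."""
            img = Image.open(filename).convert('L')                 # greyscale
            img = img.point(lambda p: 255 if p > threshold else 0)  # strip background noise
            return pytesseract.image_to_string(img).strip()

        print solve_captcha('captcha.png')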

  • Automating webkit

    Python Webkit Qt Example

    I have received some inquiries about using webkit for web scraping, so here is an example using the webscraping module:
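    Rather than guess the module’s exact interface here, this is a bare-bones sketch of the PyQt4 webkit approach it builds on:

        import sys
        from PyQt4.QtCore import QUrl
        from PyQt4.QtGui import QApplication
        from PyQt4.QtWebKit import QWebView

        class Render(QWebView):
            """Load a URL, execute its JavaScript, and keep the rendered HTML."""
            def __init__(self, url):
                self.app = QApplication(sys.argv)
                QWebView.__init__(self)
                self.loadFinished.connect(self._load_finished)
                self.load(QUrl(url))
                self.app.exec_()  # blocks until the page has finished loading

            def _load_finished(self, ok):
                self.html = unicode(self.page().mainFrame().toHtml())
                self.app.quit()

        html = Render('http://example.webscraping.com').html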

  • Caching data efficiently

    Python Cache Sqlite

    When crawling websites I usually cache all HTML on disk to avoid having to re-download later. I wrote the pdict module to automate this process. Here is an example:

  • How to make python faster

    Python Efficiency

    Python and other scripting languages are sometimes dismissed because of their inefficiency compared to compiled languages like C. For example, here are implementations of the Fibonacci sequence in C and Python:
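    The Python side of such a comparison is typically the naive recursion - a sketch (shown here in Python only; the C version is structurally identical):

        def fib(n):
            """Naive recursive Fibonacci - deliberately slow, used only as a benchmark."""
            if n < 2:
                return n
            return fib(n - 1) + fib(n - 2)

        if __name__ == '__main__':
            import time
            start = time.time()
            fib(30)
            print 'fib(30) took %.2f seconds' % (time.time() - start)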

  • Threading with webkit

    Javascript Webkit Qt Python Example Concurrent Efficiency

    In a previous post I showed how to scrape a list of webpages. That is fine for small crawls but will take too long otherwise. Here is an updated example that downloads the content in multiple threads.

  • Scraping multiple JavaScript webpages with webkit

    Javascript Webkit Qt Python Example

    I made an earlier post about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:
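    One way to do it is to reuse a single QApplication and load each URL in turn from the loadFinished handler - a sketch (the URLs are placeholders):

        import sys
        from PyQt4.QtCore import QUrl
        from PyQt4.QtGui import QApplication
        from PyQt4.QtWebKit import QWebView

        class Render(QWebView):
            """Render a list of webpages with a single QApplication instance."""
            def __init__(self, urls):
                self.app = QApplication(sys.argv)
                QWebView.__init__(self)
                self.loadFinished.connect(self._load_finished)
                self.urls = list(urls)
                self.results = {}
                self.crawl()
                self.app.exec_()

            def crawl(self):
                if self.urls:
                    self.current = self.urls.pop(0)
                    self.load(QUrl(self.current))
                else:
                    self.app.quit()

            def _load_finished(self, ok):
                self.results[self.current] = unicode(self.page().mainFrame().toHtml())
                self.crawl()

        results = Render(['http://example.webscraping.com/page1',
                          'http://example.webscraping.com/page2']).results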

  • How to teach yourself web scraping

    Learn Python Big picture

    I often get asked how to learn about web scraping. Here is my advice.

    First learn a popular high level scripting language. A higher level language will allow you to work and test ideas faster. You don’t need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And learn a popular one so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be a good choice.

    The following advice will assume you want to use Python for web scraping.
    If you have some programming experience then I recommend working through the Dive Into Python book:

    Make sure you learn all the details of the urllib2 module. Here are some additional good resources:

  • How to automatically find contact details

    Information retrieval Python Example

    I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.

    This wastes my time so I use this snippet to automate extracting the available emails:
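    The idea boils down to fetching the page and running an email regular expression over it, roughly like this sketch (the simplified regex misses some valid address forms):

        import re, urllib2

        EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

        def extract_emails(url):
            html = urllib2.urlopen(url).read()
            return sorted(set(EMAIL_RE.findall(html)))

        print extract_emails('http://example.webscraping.com/contact')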

  • Webpage screenshots with webkit

    Webkit Qt Python Screenshot Example

    For a recent project I needed to render screenshots of webpages. Here is my solution using webkit:
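    The heart of such a script looks something like this PyQt4 sketch - render the loaded QWebPage into a QImage and save it to disk:

        import sys
        from PyQt4.QtCore import QUrl
        from PyQt4.QtGui import QApplication, QImage, QPainter
        from PyQt4.QtWebKit import QWebPage

        class Screenshot(QWebPage):
            def __init__(self, url, filename):
                self.app = QApplication(sys.argv)
                QWebPage.__init__(self)
                self.filename = filename
                self.loadFinished.connect(self._load_finished)
                self.mainFrame().load(QUrl(url))
                self.app.exec_()

            def _load_finished(self, ok):
                # size the viewport to the full page, paint it into an image, save to disk
                self.setViewportSize(self.mainFrame().contentsSize())
                image = QImage(self.viewportSize(), QImage.Format_ARGB32)
                painter = QPainter(image)
                self.mainFrame().render(painter)
                painter.end()
                image.save(self.filename)
                self.app.quit()

        Screenshot('http://example.webscraping.com', 'screenshot.png')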

  • Crawling with threads

    Concurrent Python

    The bottleneck for web scraping is generally bandwidth - the time waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.
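    A compact sketch of that pattern using only the standard library - a shared queue of URLs serviced by worker threads (the thread count is an arbitrary choice):

        import threading, urllib2, Queue

        def threaded_download(urls, num_threads=5):
            queue = Queue.Queue()
            for url in urls:
                queue.put(url)
            results = {}

            def worker():
                while True:
                    try:
                        url = queue.get_nowait()
                    except Queue.Empty:
                        return  # no work left
                    try:
                        results[url] = urllib2.urlopen(url).read()
                    except urllib2.URLError:
                        results[url] = None

            threads = [threading.Thread(target=worker) for _ in range(num_threads)]
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
            return results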

  • Why reinvent the wheel?

    Lxml Xpath Python Scrapy Beautifulsoup

    I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.

    I am aware of these libraries and have used them in the past with good results. However my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client’s machine. This rules out installing full frameworks such as Scrapy or compiling C based libraries such as lxml - I need a pure Python solution. This also gives me the flexibility to run the script on Google App Engine.

    To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
    The best known Python HTML parser seems to be BeautifulSoup, however I find it slow, difficult to use (compared to XPath), and often inaccurate at parsing HTML; significantly, the original author has lost interest in developing it further. So I would not recommend using it - instead go with html5lib.

    To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn’t find one 6 months ago when I needed it, so I developed this simple version that covers my typical use cases. I would deprecate it in future if a decent solution does come along, but for now I am happy with my pure Python infrastructure.
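    To make the two stages concrete, here is a small parsing sketch with html5lib and the ElementTree API (my own XPath wrapper is left out since its interface is specific to my code):

        import html5lib

        def get_links(html):
            """Parse possibly broken HTML and pull out the anchor href values."""
            tree = html5lib.parse(html, namespaceHTMLElements=False)
            return [a.get('href') for a in tree.findall('.//a') if a.get('href')]

        print get_links('<div><a href="/blog">Blog<a href="/faq">FAQ</div>')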

  • Caching crawled webpages

    Python Cache

    When crawling large websites I store the HTML in a local cache so if I need to rescrape the website later I can load the webpages quickly and avoid extra load on their website server. This is often necessary when a client realizes they require additional features included in the scraped output.

    I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 come built in with Python (2.5+) so there are no external dependencies.

    Here is some example usage of pdict:
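    Based on that dictionary interface, usage looks roughly like this (I am assuming the class is exposed as PersistentDict - check the module for the exact name):

        import urllib2
        from webscraping import pdict

        cache = pdict.PersistentDict('cache.db')  # sqlite file on disk, values compressed with zlib

        url = 'http://example.webscraping.com'
        if url in cache:
            html = cache[url]  # served from the local cache
        else:
            html = urllib2.urlopen(url).read()
            cache[url] = html  # compressed and written to the sqlite database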

  • Scraping JavaScript webpages with webkit

    Javascript Webkit Qt Python

    In the previous post I covered how to tackle JavaScript based websites with Chickenfoot. Chickenfoot is great but not perfect because it:

  • Why Python

    Python

    Sometimes people ask why I use Python instead of something faster like C/C++. For me the speed of a language is a low priority because in my work the overwhelming majority of execution time is spent waiting for data to download rather than for instructions to execute. So it makes sense to use whatever language I can write good code in fastest, which is currently Python because of its high level syntax and excellent ecosystem of libraries. ESR wrote an article on why he likes Python that I expect resonates with many.

  • Web scraping with regular expressions

    Regex Python

    Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let’s say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions:
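    Sketches of the three versions side by side (the BeautifulSoup import below is the old standalone package; with bs4 it would be from bs4 import BeautifulSoup):

        import re
        import urllib2
        import lxml.html
        from BeautifulSoup import BeautifulSoup

        html = urllib2.urlopen('http://example.webscraping.com').read()

        # regular expression
        match = re.search('<title>(.*?)</title>', html, re.DOTALL | re.IGNORECASE)
        title_re = match.group(1) if match else None

        # BeautifulSoup
        title_bs = BeautifulSoup(html).find('title').string

        # lxml
        title_lxml = lxml.html.fromstring(html).findtext('.//title')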

  • Parsing HTML with Python

    Lxml Python Html

    HTML is a tree structure: at the root is an <html> tag, beneath which are the <head> and <body> tags, and then more tags before the content itself. However when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.

    Unfortunately the HTML of many webpages around the internet is invalid - for example a list may be missing closing tags:
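    Something like this, which a lenient parser such as lxml will repair by closing the open tags (a small sketch; the sample markup is made up):

        import lxml.html

        broken = '<ul><li>Python<li>Ruby<li>Perl</ul>'  # closing </li> tags missing
        tree = lxml.html.fromstring(broken)
        print [li.text for li in tree.findall('.//li')]
        # ['Python', 'Ruby', 'Perl']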