Blog
-
How to automate Android apps with Python
Android Mobile apps Python December 01, 2017
In a previous post I covered a way to monitor network activity in order to scrape the data from an Android application. Sometimes this approach will not work, for example if the data of interest is embedded within the app or the network traffic is encrypted. For these cases I use uiautomator, a Python wrapper around the Android UI testing framework.
-
Loading web browser cookies
Python Cookies Example April 15, 2015
Sometimes when scraping a website I need my script to log in in order to access the data of interest. Usually reverse engineering the login form is straightforward, however some websites make this difficult: for example, if logging in requires passing a CAPTCHA, or if the website only allows one simultaneous login session per account. For difficult cases such as these I have an alternative solution - manually log in to the website of interest in a web browser, then have my script load and reuse that login session.
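Loading the session can be sketched with the standard library's http.cookiejar, assuming the browser cookies have been exported in Netscape cookies.txt format (various browser extensions can do this). The file contents and path below are illustrative, written by the demo itself so it is self-contained:

```python
# Sketch: reuse a browser login session by loading exported cookies.
# The cookie file here is a synthetic sample in Netscape cookies.txt
# format; in practice you would point load_cookies() at a real export.
import http.cookiejar
import os, tempfile, time

def load_cookies(path):
    """Load a Netscape-format cookie file into a CookieJar."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar

# Self-contained demo: write a sample cookie file, then load it back.
expiry = int(time.time()) + 3600
sample = ("# Netscape HTTP Cookie File\n"
          ".example.com\tTRUE\t/\tFALSE\t%d\tsessionid\tabc123\n" % expiry)
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
with open(path, "w") as f:
    f.write(sample)

jar = load_cookies(path)
names = {c.name: c.value for c in jar}
print(names["sessionid"])  # abc123
```

The loaded jar can then be attached to an opener via urllib.request.HTTPCookieProcessor so subsequent requests reuse the session.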
-
Offline reverse geocode
Python Opensource Efficiency June 01, 2014
I often use Google’s geocoding API to find details about a location like this:
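The offline alternative can be sketched as a nearest-neighbour lookup over a local list of places. This is not the post's own code; the place list is a tiny illustrative sample, where a real version would load a full gazetteer such as the GeoNames dump:

```python
# Minimal offline reverse geocode: find the nearest known place to a
# coordinate using great-circle distance. PLACES is an illustrative sample.
import math

PLACES = [
    (-33.87, 151.21, "Sydney"),
    (-37.81, 144.96, "Melbourne"),
    (-27.47, 153.03, "Brisbane"),
]

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def reverse_geocode(lat, lon):
    """Return the name of the nearest place in the local list."""
    return min(PLACES, key=lambda p: haversine(lat, lon, p[0], p[1]))[2]

print(reverse_geocode(-33.9, 151.2))  # Sydney
```

With the full gazetteer loaded into a spatial index this runs entirely locally, with no request quota.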
-
Generating a website screenshot history
Webkit Python Qt Opensource January 03, 2013
There is a nice website screenshots.com that hosts historic screenshots for many websites. This post will show how to generate our own historic screenshots with Python.
-
Automatically import a CSV file into MySQL
Python Opensource Example December 08, 2012
Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.
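The automated middle path is to sample each column and infer the narrowest type that fits every value. A minimal sketch, not the post's own code, with illustrative table and column names:

```python
# Infer a CREATE TABLE statement from CSV data: try INTEGER, then FLOAT,
# and fall back to VARCHAR sized to the widest value seen.
import csv, io

def infer_type(values):
    """Pick the narrowest of INTEGER, FLOAT, VARCHAR for a column."""
    def all_match(cast):
        for v in values:
            try:
                cast(v)
            except ValueError:
                return False
        return True
    if all_match(int):
        return "INTEGER"
    if all_match(float):
        return "FLOAT"
    return "VARCHAR(%d)" % max(len(v) for v in values)

def csv_schema(text, table):
    """Build a CREATE TABLE statement from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    cols = ["`%s` %s" % (name, infer_type([r[i] for r in data]))
            for i, name in enumerate(header)]
    return "CREATE TABLE `%s` (%s);" % (table, ", ".join(cols))

sample = "name,age,height\nalice,30,1.65\nbob,25,1.8\n"
print(csv_schema(sample, "people"))
```

A production version would also sample only the first N rows of a huge file and handle NULLs, dates, and quoting, but the inference idea is the same.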
-
Asynchronous support in Python
Python Concurrent Big picture October 20, 2012
This week Guido van Rossum (the creator of Python) put out a call for experts in asynchronous programming to collaborate on a new API.
-
Using the internet archive to crawl a website
Python Cache Crawling October 14, 2012
If a website is offline or restricts how quickly it can be crawled then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1998.
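Wayback Machine snapshots follow a predictable URL pattern, https://web.archive.org/web/&lt;timestamp&gt;/&lt;url&gt;, so building the cached URL for a page is trivial. A small helper (no download is performed here):

```python
# Build the Wayback Machine URL for a snapshot of a page at a given time.
def wayback_url(url, timestamp):
    """timestamp is YYYYMMDDhhmmss, or a prefix such as YYYYMMDD."""
    return "https://web.archive.org/web/%s/%s" % (timestamp, url)

print(wayback_url("http://example.com/", "20120101"))
# https://web.archive.org/web/20120101/http://example.com/
```

If no snapshot exists for the exact timestamp, archive.org redirects to the nearest one it holds.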
-
How to find what technology a website uses
Python Opensource Example September 21, 2012
When crawling websites it can be useful to know what technology has been used to develop a website. For example with an ASP.NET website I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.
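Detection usually comes down to fingerprinting markers in the HTML and response headers. A minimal sketch with a small illustrative signature list (real tools maintain far larger ones):

```python
# Fingerprint a website's technology from telltale markers. The
# signature list is a tiny illustrative sample.
SIGNATURES = [
    ("ASP.NET", lambda html, headers: "__VIEWSTATE" in html
        or headers.get("X-Powered-By", "").startswith("ASP.NET")),
    ("Blogger", lambda html, headers: "blogspot.com" in html),
    ("WordPress", lambda html, headers: "/wp-content/" in html),
]

def detect(html, headers=None):
    """Return the names of all matching technologies."""
    headers = headers or {}
    return [name for name, test in SIGNATURES if test(html, headers)]

html = '<form><input type="hidden" name="__VIEWSTATE" value="..."/></form>'
print(detect(html, {"X-Powered-By": "ASP.NET"}))  # ['ASP.NET']
```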
-
Converting UK Easting / Northing to Latitude / Longitude
Recently I needed to convert a large amount of data between UK Easting / Northing coordinates and Latitude / Longitude. There are web services available that support this conversion but they only permit a few hundred requests / hour, which means it would take weeks to process my quantity of data.
-
Solving CAPTCHA with OCR
Python Captcha Ocr Example May 05, 2012
Some websites require passing a CAPTCHA to access their content. As I have written before these can be solved using the deathbycaptcha API, however for large websites with many CAPTCHAs this becomes prohibitively expensive. For example solving 1 million CAPTCHAs with this API would cost $1390.
-
Automating webkit
Python Webkit Qt Example February 14, 2012
I have received some inquiries about using webkit for web scraping, so here is an example using the webscraping module:
-
Caching data efficiently
Python Cache Sqlite February 10, 2012
When crawling websites I usually cache all HTML on disk to avoid having to re-download later. I wrote the pdict module to automate this process. Here is an example:
-
How to make python faster
Python Efficiency February 01, 2012
Python and other scripting languages are sometimes dismissed because of their inefficiency compared to compiled languages like C. For example here are implementations of the Fibonacci sequence in C and Python:
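Not the post's own benchmark code, but the usual naive recursive Fibonacci in Python looks like the first function below, with a memoized variant added for contrast, since an algorithmic change often dwarfs the language-level speed difference:

```python
# Naive exponential-time Fibonacci vs the same recursion with caching.
import functools

def fib(n):
    """Naive recursion: exponential time."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

@functools.lru_cache(maxsize=None)
def fib_fast(n):
    """Same recursion, memoized: linear time."""
    if n < 2:
        return n
    return fib_fast(n - 1) + fib_fast(n - 2)

print(fib(20), fib_fast(20))  # 6765 6765
```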
-
Threading with webkit
Javascript Webkit Qt Python Example Concurrent Efficiency December 30, 2011
In a previous post I showed how to scrape a list of webpages. That is fine for small crawls but will take too long otherwise. Here is an updated example that downloads the content in multiple threads.
-
Scraping multiple JavaScript webpages with webkit
Javascript Webkit Qt Python Example December 06, 2011
I made an earlier post about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:
-
How to teach yourself web scraping
Learn Python Big picture December 03, 2011
I often get asked how to learn about web scraping. Here is my advice.
First learn a popular high level scripting language. A higher level language will allow you to work and test ideas faster. You don't need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And learn a popular one so that there is already a community of other people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be a good choice.
The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book. Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
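As an aside for current readers, urllib2 became urllib.request in Python 3, but the basic pattern of building a request with custom headers is unchanged. A minimal sketch (no download is actually performed; the URL and user agent are illustrative):

```python
# Build a request with a custom User-Agent header.
import urllib.request

req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "my-scraper/1.0"},
)
# Note: Request stores header names capitalized ("User-agent").
print(req.get_header("User-agent"))  # my-scraper/1.0
# To download: urllib.request.urlopen(req).read()
```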
-
How to automatically find contact details
Information retrieval Python Example November 06, 2011
I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.
This wastes my time so I use this snippet to automate extracting the available emails:
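A sketch of such a snippet, using a simple regular expression (the pattern is illustrative and deliberately loose, since real pages obfuscate addresses in many ways):

```python
# Extract distinct email-shaped strings from a page, in first-seen order.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def extract_emails(html):
    """Return the distinct email addresses found in the text."""
    seen = []
    for email in EMAIL_RE.findall(html):
        if email not in seen:
            seen.append(email)
    return seen

page = '<p>Support: <a href="mailto:help@example.com">help@example.com</a></p>'
print(extract_emails(page))  # ['help@example.com']
```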
-
Webpage screenshots with webkit
Webkit Qt Python Screenshot Example September 20, 2011
For a recent project I needed to render screenshots of webpages. Here is my solution using webkit:
-
Crawling with threads
Concurrent Python April 10, 2011
The bottleneck for web scraping is generally bandwidth - the time waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.
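The standard pattern is a pool of worker threads pulling URLs from a shared queue. A minimal sketch; the download function is injected so this demo runs without touching the network:

```python
# Crawl a list of URLs with a pool of worker threads sharing a queue.
import queue, threading

def threaded_crawl(urls, download, num_threads=4):
    """Apply download() to every URL using num_threads worker threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            result = download(url)
            with lock:
                results[url] = result

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Stand-in for a real downloader such as urllib.request.urlopen
fake_download = lambda url: "<html>%s</html>" % url
print(threaded_crawl(["http://a.com", "http://b.com"], fake_download))
```

Because each thread spends most of its time blocked on I/O, the GIL is not a bottleneck here.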
-
Why reinvent the wheel?
Lxml Xpath Python Scrapy Beautifulsoup August 27, 2010
I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.
I am aware of these libraries and have used them in the past with good results. However my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client's machine. This rules out installing full frameworks such as Scrapy or compiling C-based libraries such as lxml - I need a pure Python solution. This also gives me the flexibility to run the script on Google App Engine.
To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
The most well known Python HTML parser seems to be BeautifulSoup, however I find it slow, difficult to use (compared to XPath), often inaccurate at parsing HTML, and significantly, the original author has lost interest in further developing it. So I would not recommend using it - instead go with html5lib.

To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn't find one 6 months ago when I needed it so developed this simple version that covers my typical use cases. I would deprecate this in future if a decent solution does come along, but for now I am happy with my pure Python infrastructure.
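As a toy illustration that stdlib-only selection is workable (this is not the author's xpath module), here is a selector that collects the text of every element with a given tag name:

```python
# Collect the text content of every element with a given tag name,
# using only the standard library's tolerant html.parser.
from html.parser import HTMLParser

class TagText(HTMLParser):
    def __init__(self, tag):
        super().__init__()
        self.tag, self.depth, self.results = tag, 0, []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            # nested identical tags merge into the last entry; fine for a toy
            self.results[-1] += data

def select_text(html, tag):
    parser = TagText(tag)
    parser.feed(html)
    return parser.results

print(select_text("<ul><li>one</li><li>two</li></ul>", "li"))  # ['one', 'two']
```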
-
Caching crawled webpages
When crawling large websites I store the HTML in a local cache so if I need to rescrape the website later I can load the webpages quickly and avoid extra load on their website server. This is often necessary when a client realizes they require additional features included in the scraped output.
I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in a SQLite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 come built in with Python (2.5+) so there are no external dependencies.
Here is some example usage of pdict:
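This is not pdict itself, but its core idea - a dict-like class backed by SQLite with zlib-compressed values - can be sketched with only the standard library:

```python
# A minimal dict-like cache backed by SQLite, with values compressed
# via zlib. Illustrative sketch of the pdict idea, not pdict's own code.
import sqlite3, zlib

class DiskDict:
    def __init__(self, path=":memory:"):
        # pass a filename instead of :memory: to persist between runs
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)")

    def __setitem__(self, key, value):
        blob = zlib.compress(value.encode("utf-8"))
        self.db.execute("REPLACE INTO cache (key, value) VALUES (?, ?)",
                        (key, blob))
        self.db.commit()

    def __getitem__(self, key):
        row = self.db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return zlib.decompress(row[0]).decode("utf-8")

cache = DiskDict()
cache["http://example.com/"] = "<html>...</html>"
print(cache["http://example.com/"])  # <html>...</html>
```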
-
Scraping JavaScript webpages with webkit
Javascript Webkit Qt Python March 12, 2010
In the previous post I covered how to tackle JavaScript based websites with Chickenfoot. Chickenfoot is great but not perfect because it:
-
Why Python
Python February 02, 2010
Sometimes people ask why I use Python instead of something faster like C/C++. For me the speed of a language is a low priority because in my work the overwhelming amount of execution time is spent waiting for data to download rather than for instructions to execute. So it makes sense to use whatever language I can write good code in fastest, which is currently Python because of its high level syntax and excellent ecosystem of libraries. ESR wrote an article on why he likes Python that I expect resonates with many.
-
Web scraping with regular expressions
Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let’s say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions:
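The regular-expression variant needs only the standard library (the BeautifulSoup and lxml versions require third-party installs). A minimal sketch:

```python
# Extract a page title with a regular expression.
import re

def get_title(html):
    """Return the <title> text, or None if no title tag is found."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

print(get_title("<html><head><title>Example Domain</title></head></html>"))
# Example Domain
```

For a one-off scrape of a known page this is perfectly adequate; the tradeoffs appear when the markup varies.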
-
Parsing HTML with Python
Lxml Python Html January 02, 2010
HTML is a tree structure: at the root is a <html> tag followed by the <head> and <body> tags and then more tags before the content itself. However when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.
Unfortunately the HTML of many webpages around the internet is invalid - for example a list may be missing closing tags:
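Assuming markup along the following illustrative lines, the list items are never closed, yet the standard library's tolerant html.parser still reports their start tags, which is why forgiving parsers are needed for real-world HTML:

```python
# Parse invalid HTML (unclosed <li> tags) with the tolerant html.parser.
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

parser = TagCollector()
parser.feed("<ul><li>one<li>two<li>three</ul>")
print(parser.start_tags)  # ['ul', 'li', 'li', 'li']
```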