
Blog

  • How to automate Android apps with Python

    Android Mobile apps Python

    In a previous post I covered a way to monitor network activity in order to scrape the data from an Android application. Sometimes this approach will not work, for example if the data of interest is embedded within the app or the network traffic is encrypted. For these cases I use UIautomator, a Python wrapper for the Android testing framework.
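    A minimal sketch of driving an app this way, assuming the uiautomator Python package and a device connected over adb (the app name and widget labels below are placeholders):

        from uiautomator import Device

        d = Device()  # connect to the first device found over adb
        d.screen.on()
        d.press.home()
        # tap the app icon, then interact with widgets by their visible text
        d(text='Example App').click()
        d(text='Login').click()
        d(text='Username').set_text('myuser')
        print d.dump()  # dump the current UI hierarchy as XML for inspection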

  • Loading web browser cookies

    Python Cookies Example

    Sometimes when scraping a website I need my script to log in to access the data of interest. Usually reverse engineering the login form is straightforward, however some websites make this difficult: for example, if logging in requires passing a CAPTCHA, or if the website only allows one login session per account at a time. For difficult cases such as these I have an alternative solution - manually log in to the website of interest in a web browser and then have my script load and reuse that login session.
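    A rough sketch of that idea, assuming Firefox and its cookies.sqlite store (the profile path and column names are assumptions that vary between browser versions):

        import sqlite3, cookielib, urllib2

        def firefox_cookie_jar(sqlite_path):
            """Load cookies from a Firefox cookies.sqlite file into a CookieJar."""
            cj = cookielib.CookieJar()
            conn = sqlite3.connect(sqlite_path)
            rows = conn.execute('SELECT host, path, name, value, expiry, isSecure FROM moz_cookies')
            for host, path, name, value, expiry, is_secure in rows:
                cj.set_cookie(cookielib.Cookie(
                    0, name, value, None, False, host, True, host.startswith('.'),
                    path, True, bool(is_secure), expiry, False, None, None, {}))
            conn.close()
            return cj

        cj = firefox_cookie_jar('/path/to/firefox/profile/cookies.sqlite')
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        html = opener.open('http://example.webscraping.com/protected').read()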

  • Offline reverse geocode

    Python Opensource Efficiency

    I often use Google’s geocoding API to find details about a location like this:
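    Roughly like the sketch below, which queries Google’s geocoding web service for a place name (the api_key parameter is optional here, though current usage terms may require one):

        import json, urllib, urllib2

        def geocode(address, api_key=None):
            """Look up details about a location with Google's geocoding API."""
            params = {'address': address, 'sensor': 'false'}
            if api_key:
                params['key'] = api_key
            url = 'https://maps.googleapis.com/maps/api/geocode/json?' + urllib.urlencode(params)
            result = json.load(urllib2.urlopen(url))
            if result['results']:
                first = result['results'][0]
                return first['formatted_address'], first['geometry']['location']

        print geocode('Trafalgar Square, London')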

  • Generating a website screenshot history

    Webkit Python Qt Opensource

    There is a nice website, screenshots.com, that hosts historic screenshots for many websites. This post will show how to generate our own historic screenshots with Python.

  • Automatically import a CSV file into MySQL

    Python Opensource Example

    Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema.
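    A sketch of the middle ground - inferring each column type from a sample of the data, then generating the CREATE TABLE statement (the file and table names are just examples):

        import csv, itertools

        def field_type(values):
            """Pick the narrowest SQL type that fits every sampled value."""
            for cast, sql_type in ((int, 'INTEGER'), (float, 'FLOAT')):
                try:
                    for value in values:
                        if value != '':
                            cast(value)
                    return sql_type
                except ValueError:
                    pass
            return 'VARCHAR(255)'

        def csv_schema(filename, table):
            reader = csv.reader(open(filename))
            header = reader.next()
            rows = list(itertools.islice(reader, 1000))  # sample up to 1000 rows
            columns = ['`%s` %s' % (name, field_type([row[i] for row in rows if i < len(row)]))
                       for i, name in enumerate(header)]
            return 'CREATE TABLE `%s` (%s);' % (table, ', '.join(columns))

        print csv_schema('data.csv', 'data')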

  • Asynchronous support in Python

    Python Concurrent Big picture

    This week Guido van Rossum (the author of Python) put out a call for experts in asynchronous programming to collaborate on a new API.

  • Using the internet archive to crawl a website

    Python Cache Crawling

    If a website is offline or restricts how quickly it can be crawled then downloading from someone else’s cache can be necessary. In previous posts I discussed using Google Translate and Google Cache to help crawl a website. Another useful source is the Wayback Machine at archive.org, which has been crawling and caching webpages since 1998.
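    A small sketch of the idea using the Wayback Machine’s availability API, which returns the closest cached snapshot for a URL (the response fields may change over time):

        import json, urllib, urllib2

        def wayback_download(url):
            """Download the closest cached copy of a webpage from archive.org."""
            query = 'http://archive.org/wayback/available?' + urllib.urlencode({'url': url})
            info = json.load(urllib2.urlopen(query))
            snapshot = info.get('archived_snapshots', {}).get('closest')
            if snapshot and snapshot.get('available'):
                return urllib2.urlopen(snapshot['url']).read()

        html = wayback_download('http://example.webscraping.com')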

  • How to find what technology a website uses

    Python Opensource Example

    When crawling websites it can be useful to know what technology was used to develop them. For example, with an ASP.net website I can expect the navigation to rely on POSTed data and sessions, which makes crawling more difficult. And for Blogspot websites I can expect the archive list to be in a certain location.
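    A rough sketch of some simple checks - HTTP headers and common fingerprints in the HTML (the patterns below are examples rather than an exhaustive list):

        import re, urllib2

        def detect_technology(url):
            """Guess what a website is built with from headers and HTML fingerprints."""
            response = urllib2.urlopen(url)
            headers, html = response.info(), response.read()
            clues = []
            for header in ('Server', 'X-Powered-By'):
                if headers.get(header):
                    clues.append('%s: %s' % (header, headers.get(header)))
            match = re.search(r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)', html, re.I)
            if match:
                clues.append('generator: ' + match.group(1))
            if '__VIEWSTATE' in html:
                clues.append('ASP.net style viewstate form')
            return clues

        print detect_technology('http://example.webscraping.com')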

  • Converting UK Easting / Northing to Latitude / Longitude

    Example Python

    Recently I needed to convert a large amount of data between UK Easting / Northing coordinates and Latitude / Longitude. There are web services available that support this conversion but they only permit a few hundred requests / hour, which means it would take weeks to process my quantity of data.
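    For reference, a local conversion sketch assuming the pyproj library, where EPSG:27700 is the British National Grid and EPSG:4326 is WGS84 latitude / longitude (newer pyproj releases prefer the Transformer class over this older interface):

        from pyproj import Proj, transform

        bng = Proj(init='epsg:27700')    # British National Grid easting / northing
        wgs84 = Proj(init='epsg:4326')   # WGS84 latitude / longitude

        def en_to_latlon(easting, northing):
            lon, lat = transform(bng, wgs84, easting, northing)
            return lat, lon

        print en_to_latlon(530000, 180000)  # roughly central London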

  • Solving CAPTCHA with OCR

    Python Captcha Ocr Example

    Some websites require passing a CAPTCHA to access their content. As I have written before these can be solved using the deathbycaptcha API, however for large websites with many CAPTCHAs this becomes prohibitively expensive. For example solving 1 million CAPTCHAs with this API would cost $1390.
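    For simple CAPTCHAs, OCR can be an alternative. A minimal sketch assuming the PIL and pytesseract packages (the threshold value is a guess that would need tuning for each CAPTCHA style):

        from PIL import Image
        import pytesseract

        def solve_captcha(filename, threshold=150):
            """Threshold the image to black and white, then run OCR over it."""
            img = Image.open(filename).convert('L')                 # greyscale
            img = img.point(lambda p: 255 if p > threshold else 0)  # strip background noise
            return pytesseract.image_to_string(img).strip()

        print solve_captcha('captcha.png')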

  • Automating webkit

    Python Webkit Qt Example

    I have received some inquiries about using webkit for web scraping, so here is an example using the webscraping module:
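    Rather than guess the module’s exact interface here, this is a bare-bones sketch of the PyQt4 webkit approach it builds on:

        import sys
        from PyQt4.QtCore import QUrl
        from PyQt4.QtGui import QApplication
        from PyQt4.QtWebKit import QWebView

        class Render(QWebView):
            """Load a URL, execute its JavaScript, and keep the rendered HTML."""
            def __init__(self, url):
                self.app = QApplication(sys.argv)
                QWebView.__init__(self)
                self.loadFinished.connect(self._load_finished)
                self.load(QUrl(url))
                self.app.exec_()  # blocks until the page has finished loading

            def _load_finished(self, ok):
                self.html = unicode(self.page().mainFrame().toHtml())
                self.app.quit()

        html = Render('http://example.webscraping.com').html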

  • Caching data efficiently

    Python Cache Sqlite

    When crawling websites I usually cache all HTML on disk to avoid having to re-download later. I wrote the pdict module to automate this process. Here is an example:

  • How to make python faster

    Python Efficiency

    Python and other scripting languages are sometimes dismissed because of their inefficiency compared to compiled languages like C. For example, here are implementations of the Fibonacci sequence in C and Python:
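    The Python side of such a comparison is typically the naive recursion - a sketch (shown here in Python only; the C version is structurally identical):

        def fib(n):
            """Naive recursive Fibonacci - deliberately slow, used only as a benchmark."""
            if n < 2:
                return n
            return fib(n - 1) + fib(n - 2)

        if __name__ == '__main__':
            import time
            start = time.time()
            fib(30)
            print 'fib(30) took %.2f seconds' % (time.time() - start)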

  • Threading with webkit

    Javascript Webkit Qt Python Example Concurrent Efficiency

    In a previous post I showed how to scrape a list of webpages. That is fine for small crawls but will take too long otherwise. Here is an updated example that downloads the content in multiple threads.

  • Scraping multiple JavaScript webpages with webkit

    Javascript Webkit Qt Python Example

    I made an earlier post about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:
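    One way to do it is to reuse a single QApplication and load each URL in turn from the loadFinished handler - a sketch (the URLs are placeholders):

        import sys
        from PyQt4.QtCore import QUrl
        from PyQt4.QtGui import QApplication
        from PyQt4.QtWebKit import QWebView

        class Render(QWebView):
            """Render a list of webpages with a single QApplication instance."""
            def __init__(self, urls):
                self.app = QApplication(sys.argv)
                QWebView.__init__(self)
                self.loadFinished.connect(self._load_finished)
                self.urls = list(urls)
                self.results = {}
                self.crawl()
                self.app.exec_()

            def crawl(self):
                if self.urls:
                    self.current = self.urls.pop(0)
                    self.load(QUrl(self.current))
                else:
                    self.app.quit()

            def _load_finished(self, ok):
                self.results[self.current] = unicode(self.page().mainFrame().toHtml())
                self.crawl()

        results = Render(['http://example.webscraping.com/page1',
                          'http://example.webscraping.com/page2']).results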

  • How to teach yourself web scraping

    Learn Python Big picture

    I often get asked how to learn about web scraping. Here is my advice.

    First learn a popular high level scripting language. A higher level language will allow you to work and test ideas faster. You don’t need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And learn a popular one so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be a good choice.

    The following advice will assume you want to use Python for web scraping.
    If you have some programming experience then I recommend working through the Dive Into Python book:

    Make sure you learn all the details of the urllib2 module. Here are some additional good resources:

  • How to automatically find contact details

    Information retrieval Python Example

    I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.

    This wastes my time so I use this snippet to automate extracting the available emails:
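    The idea boils down to fetching the page and running an email regular expression over it, roughly like this sketch (the simplified regex misses some valid address forms):

        import re, urllib2

        EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

        def extract_emails(url):
            html = urllib2.urlopen(url).read()
            return sorted(set(EMAIL_RE.findall(html)))

        print extract_emails('http://example.webscraping.com/contact')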

  • Webpage screenshots with webkit

    Webkit Qt Python Screenshot Example

    For a recent project I needed to render screenshots of webpages. Here is my solution using webkit:
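    The heart of such a script looks something like this PyQt4 sketch - render the loaded QWebPage into a QImage and save it to disk:

        import sys
        from PyQt4.QtCore import QUrl
        from PyQt4.QtGui import QApplication, QImage, QPainter
        from PyQt4.QtWebKit import QWebPage

        class Screenshot(QWebPage):
            def __init__(self, url, filename):
                self.app = QApplication(sys.argv)
                QWebPage.__init__(self)
                self.filename = filename
                self.loadFinished.connect(self._load_finished)
                self.mainFrame().load(QUrl(url))
                self.app.exec_()

            def _load_finished(self, ok):
                # size the viewport to the full page, paint it into an image, save to disk
                self.setViewportSize(self.mainFrame().contentsSize())
                image = QImage(self.viewportSize(), QImage.Format_ARGB32)
                painter = QPainter(image)
                self.mainFrame().render(painter)
                painter.end()
                image.save(self.filename)
                self.app.quit()

        Screenshot('http://example.webscraping.com', 'screenshot.png')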

  • Crawling with threads

    Concurrent Python

    The bottleneck for web scraping is generally bandwidth - the time waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.
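    A compact sketch of that pattern using only the standard library - a shared queue of URLs serviced by worker threads (the thread count is an arbitrary choice):

        import threading, urllib2, Queue

        def threaded_download(urls, num_threads=5):
            queue = Queue.Queue()
            for url in urls:
                queue.put(url)
            results = {}

            def worker():
                while True:
                    try:
                        url = queue.get_nowait()
                    except Queue.Empty:
                        return  # no work left
                    try:
                        results[url] = urllib2.urlopen(url).read()
                    except urllib2.URLError:
                        results[url] = None

            threads = [threading.Thread(target=worker) for _ in range(num_threads)]
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
            return results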

  • Why reinvent the wheel?

    Lxml Xpath Python Scrapy Beautifulsoup

    I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.

    I am aware of these libraries and have used them in the past with good results. However my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client’s machine. This rules out installing full frameworks such as Scrapy or compiling C based libraries such as lxml - I need a pure Python solution. This also gives me the flexibility to run the script on Google App Engine.

    To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
    The best known Python HTML parser seems to be BeautifulSoup, however I find it slow, difficult to use (compared to XPath), and often inaccurate at parsing HTML; significantly, the original author has lost interest in developing it further. So I would not recommend using it - instead go with html5lib.

    To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn’t find one 6 months ago when I needed it, so I developed this simple version that covers my typical use cases. I would deprecate it in future if a decent solution does come along, but for now I am happy with my pure Python infrastructure.
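    To make the two stages concrete, here is a small parsing sketch with html5lib and the ElementTree API (my own XPath wrapper is left out since its interface is specific to my code):

        import html5lib

        def get_links(html):
            """Parse possibly broken HTML and pull out the anchor href values."""
            tree = html5lib.parse(html, namespaceHTMLElements=False)
            return [a.get('href') for a in tree.findall('.//a') if a.get('href')]

        print get_links('<div><a href="/blog">Blog<a href="/faq">FAQ</div>')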

  • Caching crawled webpages

    Python Cache

    When crawling large websites I store the HTML in a local cache so if I need to rescrape the website later I can load the webpages quickly and avoid extra load on their website server. This is often necessary when a client realizes they require additional features included in the scraped output.

    I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 come built in with Python (2.5+) so there are no external dependencies.

    Here is some example usage of pdict:
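    Based on that dictionary interface, usage looks roughly like this (I am assuming the class is exposed as PersistentDict - check the module for the exact name):

        import urllib2
        from webscraping import pdict

        cache = pdict.PersistentDict('cache.db')  # sqlite file on disk, values compressed with zlib

        url = 'http://example.webscraping.com'
        if url in cache:
            html = cache[url]  # served from the local cache
        else:
            html = urllib2.urlopen(url).read()
            cache[url] = html  # compressed and written to the sqlite database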

  • Scraping JavaScript webpages with webkit

    Javascript Webkit Qt Python

    In the previous post I covered how to tackle JavaScript based websites with Chickenfoot. Chickenfoot is great but not perfect because it:

  • Why Python

    Python

    Sometimes people ask why I use Python instead of something faster like C/C++. For me the speed of a language is a low priority because in my work the overwhelming majority of execution time is spent waiting for data to download rather than for instructions to execute. So it makes sense to use whatever language I can write good code in fastest, which is currently Python because of its high level syntax and excellent ecosystem of libraries. ESR wrote an article on why he likes Python that I expect resonates with many.

  • Web scraping with regular expressions

    Regex Python

    Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let’s say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions:
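    Sketches of the three versions side by side (the BeautifulSoup import below is the old standalone package; with bs4 it would be from bs4 import BeautifulSoup):

        import re
        import urllib2
        import lxml.html
        from BeautifulSoup import BeautifulSoup

        html = urllib2.urlopen('http://example.webscraping.com').read()

        # regular expression
        match = re.search('<title>(.*?)</title>', html, re.DOTALL | re.IGNORECASE)
        title_re = match.group(1) if match else None

        # BeautifulSoup
        title_bs = BeautifulSoup(html).find('title').string

        # lxml
        title_lxml = lxml.html.fromstring(html).findtext('.//title')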

  • Parsing HTML with Python

    Lxml Python Html

    HTML is a tree structure: at the root is an <html> tag, beneath which are the <head> and <body> tags, and then more tags before the content itself. However when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.

    Unfortunately the HTML of many webpages around the internet is invalid - for example a list may be missing closing tags:
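    Something like this, which a lenient parser such as lxml will repair by closing the open tags (a small sketch; the sample markup is made up):

        import lxml.html

        broken = '<ul><li>Python<li>Ruby<li>Perl</ul>'  # closing </li> tags missing
        tree = lxml.html.fromstring(broken)
        print [li.text for li in tree.findall('.//li')]
        # ['Python', 'Ruby', 'Perl']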