Your web browser sends a string known as the “User Agent” with every page request. This string tells the server what kind of browser and device you are accessing the page with. Here are some common User Agent strings:
Browser | User Agent |
---|---|
Firefox on Windows XP | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 |
Chrome on Linux | Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3 |
Internet Explorer on Windows XP | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1) |
Opera on Windows XP | Opera/9.00 (Windows NT 5.1; U; en) |
Android | Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3 |
iPhone | Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 |
BlackBerry | Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, Like Gecko) Version/6.0.0.141 Mobile Safari/534.1+ |
Python urllib | Python-urllib/2.1 |
Old Google Bot | Googlebot/2.1 (+http://www.googlebot.com/bot.html) |
New Google Bot | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
MSN Bot | msnbot/1.1 (+http://search.msn.com/msnbot.htm) |
Yahoo Bot | Yahoo! Slurp/Site Explorer |
You can find your own current User Agent here.
Some webpages use the User Agent to display content customized to your particular browser. For example, if your User Agent indicates you are using an old browser, the website may return a plain HTML version without any AJAX features, which may be easier to scrape.
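As a rough illustration of how a server might make that choice, here is a minimal sketch that picks a page variant from the User Agent string. The function name `choose_variant` and the list of "old browser" markers are hypothetical and not taken from any real website.

```python
# Hypothetical markers for browsers treated as "old" (an assumption for illustration).
OLD_BROWSER_MARKERS = ("MSIE 6.0", "MSIE 7.0", "Firefox/2.")


def choose_variant(user_agent):
    """Return 'plain' for old browsers and 'ajax' for everything else."""
    if any(marker in user_agent for marker in OLD_BROWSER_MARKERS):
        return "plain"
    return "ajax"


ie7 = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
chrome = ("Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 "
          "(KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3")
print(choose_variant(ie7))     # plain
print(choose_variant(chrome))  # ajax
```

A scraper can take advantage of this kind of logic by sending an older browser's User Agent to get the simpler page.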
Some websites automatically block certain User Agents, for example when the User Agent indicates you are accessing their server with a script rather than a regular web browser.
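Such a block can be as simple as a substring check on the server. The sketch below shows the idea. The marker list and the `is_blocked` function are assumptions for illustration, and real sites may use quite different rules.

```python
# Hypothetical substrings a site might treat as script traffic (an assumption).
SCRIPT_MARKERS = ("python-urllib", "curl", "wget")


def is_blocked(user_agent):
    """Return True if the User Agent looks like a script rather than a browser."""
    ua = user_agent.lower()
    return any(marker in ua for marker in SCRIPT_MARKERS)


print(is_blocked("Python-urllib/2.1"))                   # True
print(is_blocked("Opera/9.00 (Windows NT 5.1; U; en)"))  # False
```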
Fortunately it is easy to set your User Agent to whatever you like:
- For Firefox you can use the User Agent Switcher extension.
- For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: chromium-browser --user-agent="my custom user agent"
- For Internet Explorer you can use the UAPick extension.
- And for Python scripts you can set the User-Agent header with:
import urllib2
# build an opener and replace its default User-Agent header
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'my custom user agent')]
opener.open('http://www.google.com')
Using the default User Agent for your scraper is a common reason to be blocked, so don’t forget to set your own.
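In Python 3, `urllib2` was merged into `urllib.request`, so setting the header there might look like the sketch below. The `build_request` helper, the User Agent string, and the example URL are all placeholders chosen for illustration. The request is only built and inspected here, not sent.

```python
import urllib.request

# Hypothetical User Agent string and URL, used only for illustration.
USER_AGENT = "my custom user agent"
URL = "http://www.example.com"


def build_request(url, user_agent):
    """Build a request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request(URL, USER_AGENT)
# urllib stores header names capitalized as 'User-agent'.
print(request.get_header("User-agent"))
# To actually send it: urllib.request.urlopen(request)
```

Checking the header before sending is a cheap way to confirm the scraper is not still announcing itself with the default `Python-urllib` User Agent.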