Overview

What is web scraping?

The internet contains a huge amount of data but most is not in a useful format. Web scraping is the process of extracting this data from websites into a structured format such as a CSV spreadsheet so it can be reused.

Is it possible to extract data from any website?

Yes - if the data is publicly available then it can be extracted, though it may not be practical for some websites. For example, if a website heavily restricts the number of requests per IP address, then scraping its data would require renting a lot of proxies, which may make the project too expensive.

Is web scraping legal?

Scraping data from public websites is very common and many businesses like Google depend on it. I find in practice that scraping the data is not a problem. Any potential problem depends on how you reuse the data. If the data is for private use and the web crawling does not impact other users, then it should be fine. I expand on this in this blog post.

How can I learn about web scraping?

I have collected many useful web scraping resources on my blog, which is a good place to start. I also published a book on web scraping with Packt, and my Bitbucket account hosts some open source web scraping scripts.

Who are you?

My name is Richard Penman and I am the founder of webscraping.com. I am originally from Australia but often travel alongside this business - I have worked from over 50 countries so far. I have a B.E. from Melbourne University and an MSc in Computer Science from Oxford University. I have also trained a few others in web scraping, and I collaborate with them on larger projects.

How did you get involved in this field?

I first encountered the field of web scraping in 2006 while studying at Melbourne University. After graduation I worked for a few years in a research lab and continued some web scraping projects in my spare time as a hobby. I found there was significant demand in this field, so I eventually left my job and began working on web scraping projects full time. Since then I have scraped data from thousands of websites, including sites that required parsing JavaScript/AJAX, using proxies, or solving CAPTCHAs, and sites that contained millions of records.

What technologies do you use?

  • Our web scraping infrastructure is developed in Python, and much of it is open sourced as the webscraping library
  • For processing JavaScript I use WebKit (through PyQt) or Selenium
  • For running the crawls I rent servers on Amazon EC2, DigitalOcean, or similar

Ordering a custom website scrape

How much will it cost to scrape a website?

These are the main factors that make a job more difficult, and therefore more expensive:

  • Restrictions on the number of page views per user, which means I need to use multiple IP addresses
  • Badly or inconsistently structured data
  • Obfuscated data, which needs to be decoded
  • Data dynamically loaded with JavaScript
  • Data embedded in Flash or images
  • The website contains a huge quantity of data

If the website is relatively small, well structured, and the data is embedded cleanly in the HTML then I would expect to quote ~$150 USD. Prices are discounted when ordering multiple website scrapes. Complete the automatic quote form to get an idea of cost.

I am not the cheapest because I am not the worst.

How long does it take to scrape a website?

A simple website can be scraped within a few hours, while a larger one can take several weeks to download all the required data. Once I have received your project details I will give you an estimate of the time required.

If I hire you for a custom scrape will you resell the data?

No - that data will be just for you.

How can I hire you?

Just fill in the automatic quote form and I will look it over and get back to you within 1 business day.

These are the typical stages in each web scraping project:

  • Discuss with client what data they need
  • Crawl the website to download the relevant webpages
  • Extract the required features (e.g. name, address) from each webpage using XPath or regular expressions
  • Write these features to an output file or database (e.g. CSV, MySQL)
  • Check with client whether output is as expected and prepare updates if necessary
  • Finalize payment
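
The extraction and output stages above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library: the sample HTML, the class names, and the function names are assumptions made up for the example, and a real project would download the pages from the website and might use XPath rather than a regular expression.

```python
import csv
import io
import re

# Hypothetical page content; a real crawl would download this from the website.
SAMPLE_HTML = """
<div class="listing"><span class="name">Alice's Cafe</span><span class="address">1 Main St</span></div>
<div class="listing"><span class="name">Bob's Books</span><span class="address">22 High St</span></div>
"""

# One match per record; the "name" and "address" class names are assumptions.
RECORD_RE = re.compile(
    r'<span class="name">(?P<name>.*?)</span>'
    r'<span class="address">(?P<address>.*?)</span>',
    re.DOTALL,
)


def extract_records(html):
    """Return a list of (name, address) tuples found in the HTML."""
    return [
        (m.group("name").strip(), m.group("address").strip())
        for m in RECORD_RE.finditer(html)
    ]


def write_csv(records, out):
    """Write the records to a file-like object as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["name", "address"])
    writer.writerows(records)


records = extract_records(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for an output file opened with newline=""
write_csv(records, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch self-contained; passing a file opened with open("output.csv", "w", newline="") to write_csv would produce the CSV file instead.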

I have a big project - can you handle it?

Maybe not alone, so I have trained some other people in web scraping and we collaborate on the bigger projects.

Can I get regular updates of the data?

Certainly - I can provide a quote for regular updates of a scraped dataset and handle any required maintenance.

Do I get the source code?

Yes. I use Python for most projects and can provide the code used. Note that if the script is used without proxies for a large website then it may be blocked.
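
To give an idea of what using proxies means in a Python script, here is a minimal sketch built on the standard urllib module. It is not taken from any delivered project: the proxy addresses and the next_opener helper are hypothetical, and you would substitute proxies you actually rent.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses; replace with proxies you actually rent.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

# Cycle through the proxies so consecutive requests use different IP addresses.
proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return (proxy, opener), where the opener routes requests via that proxy."""
    proxy = next(proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)
```

Each download would then call next_opener() and fetch the page with the returned opener's open() method, so the requests are spread across the available proxies rather than all coming from one address.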

How does payment work?

For a custom website scrape I will quote a fixed fee for the job. If you are a new client then I will request a deposit of half upfront, and this deposit will be refunded if I cannot finish the project. Larger projects can be split into a number of milestones.

The invoice has payment options for PayPal, Credit Card, and Bank transfer.

If you are not comfortable with paying part up front to a random guy over the internet (which is understandable) then we can use Upwork, which supports an Escrow system to hold payments until job completion. (Note that to cover Upwork's fee this will cost 8.75% extra.)

Can I get a refund?

  • If I cannot complete a project then of course a full refund will be made.
  • If you want to cancel a project and I have not yet started then a refund can be made.
  • If your requirements change after work has begun we can negotiate a new quote.

Can you extract content from Chinese / Hebrew / etc websites?

Yes - this is still text and can be extracted just like English. I also use Google Translate to help me understand how the website works.
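
As a small illustration of why the language makes little difference, the sketch below decodes a page served in a non-UTF-8 encoding and writes the text out as UTF-8. The GBK encoding and the sample strings are assumptions chosen for the example; in practice the encoding would be read from the website's response headers or HTML.

```python
import csv
import io

# Hypothetical raw response body: a Chinese word encoded as GBK,
# as an older Chinese website might serve it.
raw_bytes = "茶叶".encode("gbk")

# Decode using the page's declared encoding (assumed to be known here).
text = raw_bytes.decode("gbk")

# Write one CSV row mixing Chinese and Hebrew, then encode it as UTF-8
# so both scripts survive in the same output file.
out = io.StringIO()
csv.writer(out).writerow([text, "שלום"])
utf8_bytes = out.getvalue().encode("utf-8")
```

Once the bytes are decoded to text, the same XPath or regular expression extraction applies regardless of the language.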

Will you scrape this adult website?

No.