General

How did you get involved in this field?

I first encountered web scraping in 2006 while studying computer science at Melbourne University. After graduating I worked for a few years in a research lab and continued my web scraping projects in my spare time as a hobby. I found there was a big demand in this field, so I eventually left my job and began scraping data full time. Since then I have scraped websites that contain millions of records and that require processing JavaScript/AJAX, using proxies, and solving CAPTCHAs.

What technologies do you use?

  • My web scraping infrastructure has been developed using the Python language
  • For large crawls I use Amazon's EC2 cloud service
  • For processing JavaScript I use WebKit through PyQt

Where do you live?

I am from Melbourne in Australia but often travel alongside my web scraping work.

How can I learn about web scraping?

I have collected some useful web scraping resources on my blog, which is a good start.

What languages do you speak?

I speak native English, basic Korean, and fluent Esperanto!

Is web scraping legal?

Scraping data from public websites is very common and many businesses like Google depend on it. I find in practice that scraping the data is not a problem. Any potential problem depends on how you reuse the data. If the data is for private use then there is no problem.

Buying a Database

How can I purchase a database?

  • Browse to the database you are interested in and click the "Buy Now" button
  • You will be redirected to a payment page with PayPal and credit card options
  • After completing the payment form you will be redirected back to the database page and be able to download the database
  • You will also receive an email with the database details so you can download again later

You don't have the database I am after. Can you get it?

I hope so! If the database is of general interest then I will scrape it and upload it here for you to purchase. Please contact me to discuss details.

Can you provide the data in a different format?

I provide CSV format by default because it is straightforward to parse and widely supported. But if an alternative format (such as MySQL or JSON) would be more convenient I can add it.
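As a rough illustration of why CSV is easy to work with, the sketch below parses a small CSV sample with Python's standard library and converts it to JSON. It is written for Python 3, and the column names and rows are made-up placeholders, not fields from any database sold here.

```python
import csv
import io
import json

# Hypothetical sample data; real databases will have different columns.
SAMPLE_CSV = """name,address,phone
Acme Cafe,"1 Example St, Melbourne",555-0100
Widget Shop,2 Sample Rd,555-0101
"""

def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def records_to_json(records):
    """Serialize the parsed rows as a JSON array."""
    return json.dumps(records, indent=2)

records = csv_to_records(SAMPLE_CSV)
print(records_to_json(records))
```

The same list of records could just as easily be inserted into a database table, which is why CSV works well as a default.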

Can you include more fields in the database?

If the fields you are after are publicly available then yes they can be included in the database. Let me know what additional fields would be useful.

I have purchased a database. How long will it be available to download?

As long as this website is running (3 years already, and no plans to stop). Also you will get free access to all future updates of that data set.

How often do you update the databases?

It depends on how popular the database is. For a popular database like Android applications I update the data every few months. If you need regularly updated data then reach out and we can work something out.

Ordering a Custom Website Scrape

How much will it cost to scrape a website?

These are the main factors that make a job more difficult, and therefore more expensive:

  • Restrictions on the number of page views per user, which means I need to use multiple IP addresses
  • Badly or inconsistently structured data
  • Obfuscated data, which needs to be decoded
  • Data dynamically loaded with JavaScript
  • Data embedded in Flash or images
  • The website contains a huge quantity of data

If the website is relatively small, well structured, and the data is embedded cleanly in the HTML then I should be able to do it for $150 USD. Prices are discounted when ordering multiple website scrapes. Complete the automatic quote form to get an idea of cost.

I am not the cheapest because I am not the worst. However, I am a one-man band, so my overhead is lower than most.

Can you extract data from any website?

Yes - if the data is publicly available then it can be extracted, though it may not be practical for some websites. For example, if the website heavily restricts the number of requests per IP address then it might take months to download the entire website.

How long does it take to scrape a website?

A simple website can be scraped within a few hours, while a larger one can take several weeks to download all the required data. Once I have received your project details I will give you an estimate of the time required.

If I hire you for a custom scrape will you resell the data here?

No - that data will be just for you.

How can I hire you?

Just fill in the automatic quote form and I will look it over and get back to you within 1 business day.

These are the stages in each web scraping project:

  • Discuss with client what data they need
  • Crawl the website to download the relevant webpages
  • Extract the required features (e.g. name, address) from each webpage using XPath or regular expressions
  • Write these features to an output file (e.g. CSV, MySQL database)
  • Check with client whether output is as expected and prepare updates if necessary
  • Finalize payment
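
The extraction and output stages can be sketched roughly as follows. This is a minimal illustration and not my production code: the crawl stage is replaced by an inline HTML string, the class names and records are invented, and only the regular expression approach is shown (XPath would need an HTML parser such as lxml). It is written for Python 3.

```python
import csv
import io
import re

# Hypothetical page content standing in for a downloaded webpage.
SAMPLE_HTML = """
<div class="listing"><h2>Acme Cafe</h2><span class="address">1 Example St</span></div>
<div class="listing"><h2>Widget Shop</h2><span class="address">2 Sample Rd</span></div>
"""

# Regular expression matching one listing; the markup it expects is made up.
LISTING_RE = re.compile(
    r'<div class="listing"><h2>(?P<name>.*?)</h2>'
    r'<span class="address">(?P<address>.*?)</span></div>',
    re.DOTALL,
)

def extract_features(html):
    """Return one dict per listing with the name and address fields."""
    return [match.groupdict() for match in LISTING_RE.finditer(html)]

def write_csv(rows, out):
    """Write the extracted rows to a file-like object as CSV with a header."""
    writer = csv.DictWriter(out, fieldnames=["name", "address"])
    writer.writeheader()
    writer.writerows(rows)

rows = extract_features(SAMPLE_HTML)
buffer = io.StringIO()  # a real project would open an output file instead
write_csv(rows, buffer)
print(buffer.getvalue())
```

In a real project the pattern has to be adapted to each website's markup, which is why badly or inconsistently structured pages take longer to scrape.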

I have a big project - can you handle it?

Maybe not alone, so I have trained some other people in web scraping and we collaborate on the bigger projects.

Do I get the source code?

Certainly. We use Python 2.7 for most projects. Some websites require downloading gigabytes of data and are difficult to scrape without proxies, so we can also rescrape the data in future for a fee.

How does payment work?

For a custom website scrape I will quote a fixed fee for the job. If you are a new client then I will request a deposit of half upfront, which will be refunded if I cannot finish the project. Larger projects can be split into a number of milestones.

For most projects I receive payment via PayPal, but we can discuss other options like bank transfer if more convenient.

If you are not comfortable with paying part upfront to a random guy over the internet (which is understandable) then we can use Elance, which supports an escrow system to hold payments until the job is complete. (Note that to cover Elance's fee this will cost 8.75% extra.)

Can I get a refund?

  • If I cannot complete a project then of course a full refund will be made.
  • If you want to cancel a project and I have not yet started then a refund can be made.
  • If your requirements change after work has begun we can negotiate a new quote.

Can you extract content from Chinese / Hebrew / etc websites?

Yes - this is still text and can be extracted just like English. I also use Google Translate to help me understand how the website works.

Will you scrape this adult website?

No