Overview
What is web scraping?
The internet contains a huge amount of data but most is not in a useful format. Web scraping is the process of extracting this data from websites into a structured format such as a CSV spreadsheet so it can be reused.
Is it possible to extract data from any website?
Yes – if the data is publicly available then it can be extracted, though it may not be practical for some websites. For example, if a website heavily restricts IP addresses then scraping its data would require renting a lot of proxies, which may make the project too expensive.
Is web scraping legal?
Scraping data from public websites is very common and many businesses like Google depend on it. I find in practice that scraping the data is not a problem. Any potential problem depends on how you reuse the data. If the data is for private use then there is no problem. I expand on this in this blog post.
How can I learn about web scraping?
I have collected many useful web scraping resources on my blog, which is a good place to start. My Bitbucket account also hosts some open source web scraping scripts.
Who are you?
My name is Richard Penman and I am originally from Melbourne in Australia, but I often travel alongside my web scraping work and have worked from over 50 countries so far. I have a B.E. from Melbourne University and an MSc in Computer Science from Oxford University.
How did you get involved in this field?
I first encountered the field of web scraping in 2006 while studying at Melbourne University. After graduation I worked for a few years in a research lab and continued my web scraping projects in my spare time as a hobby. I found there was significant demand in this field so eventually left my job and began working on web scraping projects full time. Since then I have scraped data from thousands of websites, including ones that require parsing JavaScript/AJAX, using proxies, or solving CAPTCHAs, and ones that contain millions of records. Initially my website was hosted at sitescraper.net and then later I was able to purchase the awesome domain webscraping.com.
What technologies do you use?
- Our web scraping infrastructure is written in Python, and much of it is open sourced as the webscraping library
- For processing JavaScript I use WebKit (through PyQt) or Selenium
- And for running the crawls I rent servers on Amazon EC2, DigitalOcean, or similar
What languages do you speak?
I speak native English, intermediate Mandarin, basic Korean, and fluent Esperanto!
Ordering a custom website scrape
How much will it cost to scrape a website?
These are the main factors that make a job more difficult, and therefore more expensive:
- Restrictions on the number of page views per user, which means I need to use multiple IP addresses
- Badly or inconsistently structured data
- Obfuscated data, which needs to be decoded
- Data dynamically loaded with JavaScript
- Data embedded in Flash or images
- The website contains a huge quantity of data
If the website is relatively small, well structured, and the data is embedded cleanly in the HTML then I would expect to quote ~$150 USD. Prices are discounted when ordering multiple website scrapes. Complete the automatic quote form to get an idea of cost.
I am not the cheapest because I am not the worst.
How long does it take to scrape a website?
A simple website can be scraped within a few hours, while a larger one can take several weeks to download all the required data. Once we have received your project details we will give you an estimate of the time required.
If I hire you for a custom scrape will you resell the data here?
No – that data will be just for you.
How can I hire you?
Just fill in the automatic quote form and I will look it over and get back to you within 1 business day.
These are the typical stages in each web scraping project:
- Discuss with client what data they need
- Crawl the website to download the relevant webpages
- Extract the required features (e.g. name, address) from each webpage using XPath or regular expressions
- Write these features to an output file (e.g. CSV, MySQL database)
- Check with client whether output is as expected and prepare updates if necessary
- Finalize payment
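As a rough illustration of the extract and write stages above, here is a minimal Python sketch using only the standard library. The HTML snippet, the `name` and `address` class names, and the helper functions `extract_records` and `write_csv` are all made up for this example, and the crawl stage is replaced by an inline string so the sketch runs without network access. It is written for Python 3 and is not taken from the webscraping library.

```python
import csv
import io
import re

# Hypothetical page content; a real crawl would download this over HTTP.
SAMPLE_HTML = """
<div class="listing"><span class="name">Alice's Cafe</span><span class="address">1 Main St</span></div>
<div class="listing"><span class="name">Bob's Books</span><span class="address">22 High Rd</span></div>
"""

# This pattern only fits the made-up markup above; every website needs its own.
LISTING_RE = re.compile(
    r'<span class="name">(.*?)</span><span class="address">(.*?)</span>'
)


def extract_records(html):
    """Return a list of (name, address) tuples found in the HTML."""
    return LISTING_RE.findall(html)


def write_csv(records, fileobj):
    """Write a header row followed by one row per record."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "address"])
    writer.writerows(records)


records = extract_records(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for an output file on disk
write_csv(records, buffer)
print(buffer.getvalue())
```

For messier markup an HTML parser with XPath support is usually more robust than a regular expression; the regular expression is used here only to keep the sketch dependency free.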
I have a big project – can you handle it?
Maybe not alone, so I have trained some other people in web scraping and we collaborate on the bigger projects.
Do I get the source code?
Certainly. We use Python 2.7 for most projects. Some websites require downloading gigabytes of data and are difficult to scrape without proxies, so we can also rescrape the data in future for a fee.
How does payment work?
For a custom website scrape I will quote a fixed fee for the job and if you are a new client then I will request a deposit of half upfront. This deposit will be refunded if I cannot finish the project. Larger projects can be split into a number of milestones.
The invoice has payment options for PayPal, credit card, bank transfer, and now also Bitcoin.
If you are not comfortable with paying part up front to a random guy over the internet (which is understandable) then we can use Elance, which supports an Escrow system to hold payments until job completion. (Note that to cover Elance's fee this will cost 8.75% extra.)
Can I get a refund?
- If I cannot complete a project then of course a full refund will be made.
- If you want to cancel a project and I have not yet started then a refund can be made.
- If your requirements change after work has begun we can negotiate a new quote.
Can you extract content from Chinese / Hebrew / etc websites?
Yes – this is still text and can be extracted just like English. I also use Google Translate to help me understand how the website works.
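A small Python 3 sketch of why the language of the page makes little difference: once the downloaded bytes are decoded with the right character encoding, the same extraction pattern works on Chinese or Hebrew text as on English. The page content, the `title` class name, and the `extract_titles` helper are invented for this example, and UTF-8 is assumed; a real page may declare a different encoding.

```python
import re

# Hypothetical page bytes as they might arrive from a server (UTF-8 assumed).
PAGE_BYTES = (
    '<h1 class="title">北京烤鸭店</h1>\n'
    '<h1 class="title">מסעדת השף</h1>'
).encode("utf-8")


def extract_titles(page_bytes, encoding="utf-8"):
    """Decode the raw bytes, then extract the text inside each title tag."""
    html = page_bytes.decode(encoding)
    return re.findall(r'<h1 class="title">(.*?)</h1>', html)


titles = extract_titles(PAGE_BYTES)
print(titles)
```

The main practical risk is decoding with the wrong encoding, which produces garbled characters rather than an extraction failure.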
Will you scrape this adult website?
No.
Purchasing a database
How can I purchase a database?
- Browse to the database you are interested in and choose the preferred method of payment.
- If paying with PayPal or credit card then after completing payment you will be redirected back to the database page and be able to download it. An email will also be sent with the database details.
- If paying with Bitcoin then once the transaction is confirmed an email will be sent with details.
- Note that to download a database you must be logged into your account.
You don't have the database I am after. Can you get it?
I hope so! If the database is of general interest then I will scrape it and upload here for you to purchase. Please contact me to discuss details.
Can you provide the data in a different format?
I provide CSV format by default because it is straightforward to parse and widely supported. But if an alternative format (such as MySQL or JSON) would be more convenient I can add it.
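If you would rather convert the CSV yourself, a few lines of Python are usually enough. The sketch below assumes a small made-up file with `name` and `address` columns (real databases have their own column names) and uses only the standard library under Python 3; `csv_to_json` is a hypothetical helper name.

```python
import csv
import io
import json

# Made-up CSV content standing in for a downloaded database file.
CSV_TEXT = "name,address\nAlice's Cafe,1 Main St\nBob's Books,22 High Rd\n"


def csv_to_json(csv_text):
    """Parse CSV text with a header row and return it as a JSON array of objects."""
    rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
    return json.dumps(rows, indent=2)


json_text = csv_to_json(CSV_TEXT)
print(json_text)
```

For a file on disk, pass the text read from `open(path, newline="", encoding="utf-8")` instead of the inline string.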
Can you include more fields in the database?
If the fields you are after are publicly available then yes they can be included in the database. Let me know what additional fields would be useful.
I have purchased a database. How long will it be available to download?
As long as this website is running (since 2009, and no plans to stop). Also you will get free access to all future updates of that data set.
How often do you update the databases?
It depends on how popular the database is. For a popular database like Android applications I update the data every few months. If you need regularly updated data then reach out and we can work something out.