When we first set out to start Reviewshake, we quickly realized that our most vital ingredient would be review data. Similar to how a car runs on power (electricity or gas), a review management platform runs on online reviews. While there are multiple providers who offer review data today (including ourselves through our Review Scraper API), these did not exist when we started Reviewshake several years ago.
So we set out to solve our own problem, and have learned a great deal along the way. We started off consuming this review data in our own tool, and quickly realized that we could commercialize this technology for others to benefit from an easy API instead of scraping reviews themselves. Since then, thousands of developers at leading organizations like PwC, BCG, Deloitte, Duke and Princeton Universities and more have used our technology.
This was an inflection point in our ability to double down on this product, not just for ourselves but for paying customers. What follows are some of the learnings we picked up along the way.
API vs. scraping
In an ideal world, review data would be available over API, but unfortunately this is not the case. We use APIs where possible, though the majority of the 85+ review sites we source review data from do not have APIs, which means that we need to resort to web scraping. In some cases we also have partnerships with individual review sites.
Choose your scraping library
First things first – what programming language are you most comfortable with? This will determine which scraping library you use. For example, Python has Scrapy, Ruby has Nokogiri and there are many other options to choose from.
There are several considerations at play here, for example: How robust is the library you have chosen? How easy is it to hire talented developers who have experience with that library? How scalable is it?
We built our system in Ruby, mainly because that was my strongest language at the time. This affected several decisions which will be touched upon below – such as using Sidekiq for background processing, ActiveAdmin for the admin panel and more.
Write your scraper
At this point you can already start writing your scraper – I won’t go into too much detail on how this is done, as there is already a lot of content out there on this topic. In a nutshell, you need to request the review results page and then parse out the fields you want to store.
At Datashake, our scrapers typically follow this format:
- Determine how many pages of reviews are available for pagination
- Identify the markup that contains the reviews
- Iterate over each review and store its content
In some cases it helps to use a network sniffer, because some sites load their data from internal APIs, which are easier (and more maintainable) to consume than parsed HTML. Another factor to keep in mind is whether the page renders its content asynchronously with JavaScript, in which case you will need a headless browser rather than a simple HTTP request.
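To make the three steps above concrete, here is a minimal sketch for the case where a site serves its reviews from a JSON endpoint. It is not our production code: the `Review` fields, the `total_pages` and `reviews` keys and the `fetch_page` callable are all assumptions made up for illustration, and the stubbed pages stand in for real HTTP responses.

```ruby
require "json"

# Hypothetical review record; the field names are assumptions.
Review = Struct.new(:id, :author, :rating, :text, keyword_init: true)

# fetch_page is any callable that takes a page number and returns the raw
# JSON body for that page (in real use, a wrapper around an HTTP client).
def scrape_reviews(fetch_page)
  first = JSON.parse(fetch_page.call(1))
  total_pages = first.fetch("total_pages") # step 1: how many pages exist

  (1..total_pages).flat_map do |page|
    payload = page == 1 ? first : JSON.parse(fetch_page.call(page))
    payload.fetch("reviews").map do |raw|  # step 2: locate the reviews
      Review.new(                          # step 3: keep the fields we want
        id: raw.fetch("id"),
        author: raw["author"],
        rating: raw["rating"],
        text: raw["text"].to_s.strip
      )
    end
  end
end

# Stubbed responses so the sketch runs without touching the network.
PAGES = {
  1 => { "total_pages" => 2,
         "reviews" => [{ "id" => "a1", "author" => "Ann", "rating" => 5, "text" => " Great " }] },
  2 => { "total_pages" => 2,
         "reviews" => [{ "id" => "b2", "author" => "Bob", "rating" => 3, "text" => "Okay" }] }
}.transform_values(&:to_json)

reviews = scrape_reviews(->(page) { PAGES.fetch(page) })
```

Passing the fetcher in as a callable keeps the parsing logic separate from the transport, which makes it easy to swap a plain HTTP request for a headless browser or a proxy later on.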
Depending on the scale at which you will be scraping reviews, it’s worth considering how you will handle concurrency once you’ve written the scraper. We decided to use Sidekiq to process our jobs in the background, which makes it incredibly easy for us to manage different queues and scale up and down as necessary. We also use sidekiq-throttled to ensure that we aren’t hammering the review sites and our vendors with too many requests.
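If you are not ready to bring in a job framework yet, the core idea of bounded concurrency can be sketched with nothing but Ruby's standard library. To be clear, this is plain threads and queues, not Sidekiq, and `run_with_pool` is a hypothetical helper written for this post; the cap on workers acts as a crude limit on how many requests are in flight at once.

```ruby
# Run each job through a fixed-size pool of worker threads and return a
# hash of job => result. The block does the actual work for one job.
def run_with_pool(jobs, concurrency:, &work)
  pending = Queue.new
  jobs.each { |job| pending << job }
  pending.close # pop returns nil once the closed queue is empty

  finished = Queue.new
  workers = Array.new(concurrency) do
    Thread.new do
      while (job = pending.pop)
        finished << [job, work.call(job)]
      end
    end
  end
  workers.each(&:join) # re-raises if a worker died with an exception

  Array.new(finished.size) { finished.pop }.to_h
end

# Example: at most two profiles are processed at any one time.
lengths = run_with_pool(%w[site-a site-b site-c], concurrency: 2) do |profile|
  profile.length # stand-in for "scrape this profile"
end
```

A background job system gives you retries, persistence and per-queue controls on top of this, which is why we would not run a simple thread pool in production.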
As we grew our operation, we started having database concurrency issues and so made a lot of tweaks to the database to optimize for our workload.
Once you start scaling up your scraping, you will no doubt run into blocking mechanisms from the review site(s) in question. There are varying solutions to this problem:
- Full-service scraping providers where you request a URL and they handle the blocking mechanisms on their side.
- Proxy providers that offer data center, residential and mobile IP addresses.
- Captcha-solving services that automate captcha solving at scale.
- Headless browser services that make managing headless browsers easier at scale.
For some sites, you will need to make requests with specific headers and/or cookies, along with a host of other methods to circumvent blocking mechanisms.
In our experience, it always helps to have fallbacks for the services you use, so that you don’t have a single point of failure that could take your entire operation down. At Datashake we have automated fallbacks and tests (more on that later) to keep our operations as close to 100% reliability as possible.
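A fallback chain can be as simple as trying providers in order until one succeeds. The sketch below assumes each provider is a callable that takes a URL and either returns the page body or raises; the provider names and `fetch_with_fallback` are hypothetical and only illustrate the pattern.

```ruby
# Raised when every provider in the chain has failed.
class AllProvidersFailed < StandardError; end

# providers is an ordered list of [name, callable] pairs.
# Returns [name_of_provider_that_worked, body].
def fetch_with_fallback(url, providers)
  errors = []
  providers.each do |name, fetcher|
    begin
      return [name, fetcher.call(url)]
    rescue StandardError => e
      errors << "#{name}: #{e.message}" # remember why this provider failed
    end
  end
  raise AllProvidersFailed, errors.join("; ")
end

# Stub providers: the first one is blocked, the second one works.
PROVIDERS = [
  ["primary_proxy", ->(_url) { raise "blocked (HTTP 403)" }],
  ["backup_proxy",  ->(url)  { "<html>reviews for #{url}</html>" }]
].freeze

used, body = fetch_with_fallback("https://example.com/profile", PROVIDERS)
```

Returning the name of the provider that succeeded is a small design choice that pays off later, because it lets you track how often each fallback is actually used.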
Once you’ve started scraping reviews at scale, you will want to optimize your scraping so that you’re not wasting compute and other resources. After you’ve fetched all the reviews from a given review profile, you will likely want to keep fetching the latest reviews as they come in.
To that end, you will need to build algorithms which determine which reviews you already have, and which reviews are new. This is much trickier than it might sound at first, because review sites differ in formatting, pagination, ordering and more. If the review profile has 100 pages, your goal is to stop scraping as soon as you have all the latest reviews, so you’re not going through every page each time you check for updates.
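The simplest version of this idea looks something like the sketch below. It rests on two assumptions that do not hold for every site: reviews are listed newest first, and each review has a stable ID. `fetch_new_reviews` and the shape of the review hashes are hypothetical.

```ruby
require "set"

# fetch_page: callable returning an array of review hashes for a page
# number (newest first), or an empty array past the last page.
# known_ids: IDs of the reviews we have already stored.
def fetch_new_reviews(fetch_page, known_ids, max_pages: 100)
  fresh = []
  (1..max_pages).each do |page|
    reviews = fetch_page.call(page)
    break if reviews.empty?

    unseen = reviews.reject { |review| known_ids.include?(review[:id]) }
    fresh.concat(unseen)
    # A page that contains any already-stored review means we have caught up.
    break if unseen.size < reviews.size
  end
  fresh
end

# Stub profile with three pages; we already hold reviews 1 to 3.
STORED_PAGES = {
  1 => [{ id: 5 }, { id: 4 }],
  2 => [{ id: 3 }, { id: 2 }],
  3 => [{ id: 1 }]
}.freeze

latest = fetch_new_reviews(->(page) { STORED_PAGES.fetch(page, []) }, Set[1, 2, 3])
```

In this example the loop stops after page 2 and never requests page 3. Sites that pin featured reviews to the top, or that lack stable IDs, need extra handling, for instance matching on a hash of author, date and text.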
At Datashake, we expose several parameters which abstract this complexity from our users:
- diff: This parameter lets you set a previous job ID for your given profile, so we only return the latest reviews.
- from_date: Only scrape reviews from a specific date.
- blocks: Number of blocks to return from results, in blocks of 10.
Data cleaning is an essential part of web scraping, as you always need to ensure that the data you ingest follows a standard format. Firstly, we suggest using the utf8mb4 character set for your database (for example with the utf8mb4_bin collation in MySQL), as this supports text in many languages, emoji and other characters that you will no doubt run into.
Formatting of dates is particularly tricky, especially once you start scraping from multiple sites. This is because there is no standard format for dates: American sites typically write mm/dd/yyyy, for example, while most other countries write dd/mm/yyyy. To make matters worse, we’ve even seen cases where the same review site uses differing formats.
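One way to tame this is to keep a list of known formats per site and normalize everything to ISO 8601 before storing it. The following is a minimal sketch; the site keys and their formats are made up for illustration, and `normalize_date` is a hypothetical helper.

```ruby
require "date"

# Hypothetical per-site format table, tried in order.
SITE_DATE_FORMATS = {
  "us_site" => ["%m/%d/%Y", "%B %d, %Y"],
  "eu_site" => ["%d/%m/%Y", "%d.%m.%Y"]
}.freeze

# Returns the date as an ISO 8601 string (yyyy-mm-dd), or nil when none of
# the site's known formats match, so the caller can flag it for review.
def normalize_date(site, raw)
  SITE_DATE_FORMATS.fetch(site).each do |format|
    begin
      return Date.strptime(raw.strip, format).iso8601
    rescue ArgumentError # Date::Error is a subclass of ArgumentError
      next
    end
  end
  nil
end

us = normalize_date("us_site", "03/04/2021") # read as March 4
eu = normalize_date("eu_site", "03/04/2021") # read as April 3
```

The same raw string yields two different dates depending on the site, which is exactly why the format has to be tied to the source rather than guessed from the value.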
Beyond this, some sites have reviews with headers, questions and other metadata which need to be dealt with.
We take monitoring very seriously at Datashake, as we want to know about issues before our customers know about them. The worst case scenario is that we receive an email from our customer notifying us about an issue, and this is where our monitoring system comes in. We monitor everything:
- The status of every single job that comes through our system
- Wait and process times per job, with averages across sites
- The performance of our various providers
- Frequent tests of each review site, comparing expected results with real-time results
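The last item in the list above, comparing expected results with real-time results, can be sketched as a small check that runs against a profile whose content you already know. This is an illustration rather than our actual test suite: `check_scrape`, the baseline structure and the required fields are all assumptions.

```ruby
# Compare a freshly scraped sample against a stored baseline for one profile.
# Returns a list of human-readable problems; an empty list means it passed.
def check_scrape(expected, actual, required_fields: %i[id rating text date])
  problems = []

  if actual.size < expected[:min_reviews]
    problems << "expected at least #{expected[:min_reviews]} reviews, got #{actual.size}"
  end

  actual.each_with_index do |review, index|
    missing = required_fields.select { |f| review[f].nil? || review[f].to_s.strip.empty? }
    problems << "review #{index} missing #{missing.join(', ')}" unless missing.empty?
  end

  missing_ids = expected[:known_ids] - actual.map { |review| review[:id] }
  problems << "known reviews not found: #{missing_ids.join(', ')}" unless missing_ids.empty?

  problems
end

# Hypothetical baseline for a profile we scrape on a schedule.
BASELINE = { min_reviews: 2, known_ids: ["a1"] }.freeze

sample = [
  { id: "a1", rating: 5, text: "Great", date: "2021-03-04" },
  { id: "b2", rating: 4, text: "Fine",  date: "2021-03-05" }
]
problems = check_scrape(BASELINE, sample)
```

A non-empty result is the signal to raise an alert, since it usually means the site changed its markup or started blocking requests.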
We have a heavily customized ActiveAdmin dashboard which helps us monitor and take action where necessary. We also use Rollbar for real-time tracking, and have Asana automations to help with the management of any issues.
When you talk about web scraping, the question of whether it is legal comes up sooner or later. We can write at length about this, and will do so in another blog post to give it the attention it deserves. In the meantime, read more about our legal position.
As you can see from the above, running a high quality web scraping operation at scale is quite a complex endeavour. Luckily, we have made our technology available via API, so all you need to do is call two API endpoints instead of spending valuable engineering resources reinventing the wheel.
Sign up for an account here: https://app.datashake.com