Sunsetting ScrapeShark

It's time to set the shark loose.


The sun has set for you

After nearly two years, it’s time to call it a day for ScrapeShark. It started as a passion project back in 2016, a scraper for Dutch real estate listings, and grew into a commercial product offering in early 2020.

Launching a product into a space already crowded with stiff competition offering very similar products is rough, to say the least. At the time of launch, ScrapingBee was offering an API very similar to mine, with browser-as-a-service Browserless and superproxy Luminati (now Bright Data) operating on the verges.

We launched with a fairly rudimentary but performant feature set. At launch, ScrapeShark could circumvent 96.7% of all anti-scraping measures with sub-2-second loading times. At the time, that was faster than most of the competition.

ScrapeShark was positioned as a cost-effective alternative to its direct competitors, and while it saw a limited amount of success, notably with smaller marketing-automation-focused clients, the way it was built made it quite expensive to service even a handful of customers.

Under the hood, ScrapeShark worked by spawning multiple isolated browser processes and controlling them over WebDriver. At first it used its own implementation; later on I handed browser control off to Microsoft Playwright, which did a far better job managing browsers and was more resource-efficient as well.
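For the curious: driving a browser over WebDriver boils down to JSON-over-HTTP commands sent to a driver endpoint. Here's a rough sketch of building a W3C WebDriver "New Session" request body; the function name and option choices are illustrative, not ScrapeShark's actual code.

```python
import json

def new_session_payload(headless: bool = True) -> str:
    """Build the JSON body of a W3C WebDriver 'New Session' request."""
    args = ["--headless"] if headless else []
    body = {
        "capabilities": {
            "alwaysMatch": {
                "browserName": "chrome",
                # Vendor-prefixed options carry browser-specific launch flags.
                "goog:chromeOptions": {"args": args},
            }
        }
    }
    return json.dumps(body)
```

POSTing a body like this to a running chromedriver's `/session` endpoint starts a browser session; every subsequent action (navigate, read the DOM, close) is another command against that session.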

At its core, though, every request was serviced by your very own private instance of a Chromium or Firefox browser. Chrome is notoriously RAM-hungry, so we quickly realised that dedicating an entire browser process to each request wasn’t a great idea. We introduced BrowserContext as an isolated container that could serve multiple requests from a pool of browser processes while still guaranteeing per-request security. This helped tame the resource hunger of the engine behind ScrapeShark, but it still required some fairly beefy hardware.
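A minimal, self-contained sketch of that pooling idea: a small fixed pool of long-lived browser processes, with each incoming request served by a fresh, isolated context drawn from whichever browser comes up next. The class and method names here are illustrative, not ScrapeShark's actual API.

```python
import itertools

class Browser:
    """Stands in for a long-lived Chromium/Firefox process."""
    def __init__(self, pid: int):
        self.pid = pid

    def new_context(self) -> "BrowserContext":
        # A fresh context gets its own cookies/storage, so requests that
        # share a browser process still can't see each other's state.
        return BrowserContext(self)

class BrowserContext:
    """An isolated, per-request container inside a shared browser."""
    def __init__(self, browser: Browser):
        self.browser = browser
        self.cookies: dict = {}  # per-request state, never shared

class BrowserPool:
    """Round-robins incoming requests over a fixed pool of browsers."""
    def __init__(self, size: int):
        self._browsers = itertools.cycle(Browser(pid) for pid in range(size))

    def context_for_request(self) -> BrowserContext:
        return next(self._browsers).new_context()

# Four requests are spread over just two browser processes, but each
# request gets its own isolated context.
pool = BrowserPool(size=2)
contexts = [pool.context_for_request() for _ in range(4)]
print([ctx.browser.pid for ctx in contexts])  # → [0, 1, 0, 1]
```

The trade-off is exactly the one described above: contexts are cheap, browser processes are not, so amortising a few heavy processes over many light contexts keeps RAM in check without sacrificing isolation.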

Containers and Functions-as-a-Service couldn’t give us the performance we needed to serve requests under the 3-second upper limit we had set for ourselves, so we had to resort to dedicated resources to run Monocle, the engine behind ScrapeShark. Sadly, dedicated resources still don’t come cheap.

This meant that the cost of operating ScrapeShark was fairly steep from the start. Over time, it grew to cover its own costs, but it wouldn’t buy you more than a few lunches a month.

Another big reason is that, by offering an anti-anti-scraping service, you’re effectively playing a cat-and-mouse game with giants such as Cloudflare. Keeping our browsers under the radar, constantly tweaking user agents, maintaining our own proxy pool, writing custom fingerprint-evasion scripts (thanks, CreepJS!), and so on, became very time-consuming, and sometimes weeks of tinkering would be defeated the next day.

That’s been the status quo for the better part of the past year. So why call it quits now?

Because honestly, I think that if your business relies on scraping in any serious way, you’re far better off using a purpose-built scraping solution than a generic one such as ScrapeShark. You know your use cases, and you can tune your browsers to suit them better than I ever could.

This is the end of the commercial offering of ScrapeShark. However, I will be open-sourcing the engine that powered it so you can use it for your own purposes, and I will be providing support for it on GitHub.

Existing customers can migrate to the open source version with relatively small code changes. If you need help setting it up, feel free to reach out and I’d be more than happy to help.

Thanks everyone, it has been a hoot.