Sunsetting ScrapeShark
It's time to set the shark loose.
After nearly four years, it's time to call it a day for ScrapeShark. What started out as a side project back in 2016 to scrape real estate listings grew into a standalone service in 2020, but it never really took off. This post outlines some of the lessons learned from those years of creating and operating an extremely niche service.
ScrapeShark was a service that allowed pretty much anyone who could call an API to extract the contents of virtually any web page. It also had novel support for converting the content of websites on the fly into a structured format such as JSON. Web scrapers like these are used for many different reasons: building databases from data across the web, monitoring competitors, and so on.
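To give a sense of what that looked like in practice, here is a minimal sketch of the kind of call a customer would make. The endpoint, parameters, and response shape are purely illustrative, not ScrapeShark's actual API.

```python
import requests

# Hypothetical endpoint and parameters, for illustration only --
# not ScrapeShark's actual API.
API_URL = "https://api.example.com/v1/scrape"
API_KEY = "your-api-key"

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/listings",
        # Ask the service for structured fields instead of raw HTML.
        "format": "json",
        "fields": {
            "title": "h1",      # CSS selector for the listing title
            "price": ".price",  # CSS selector for the price
        },
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())
```

The appeal of a service like this is that the caller never has to think about headless browsers, proxies, or fingerprints; they send a URL and get structured data back.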
The first problem ScrapeShark ran into is that it was an extremely niche product, which made it very difficult to reach new customers. The customers that did trickle in came mostly from organic search; even targeted advertising was too broad to effectively reel in an audience this niche.
ScrapeShark was a prosumer product, and many of the customers interested in it were already running a scraping operation of some sort, highly tailored to their specific use case. The best I could offer was a blanket solution that worked for most use cases, but definitely not all. As a result, ScrapeShark never fully replaced its customers' existing setups; it extended them with functionality their homebrew solutions lacked - browser cycling, fingerprint avoidance, proxies - but it never became a core component. For a product like ScrapeShark to become successful, it has to become deeply embedded in its customers' stacks.
Services like ScrapeShark are what I'd like to call peripheral services: they aren't part of any company's core business; they merely exist to help a business run its core business. This makes them inevitably prone to being replaced by alternatives over time, because they are technology businesses, and technology only ever improves. The only way to counteract this is by becoming irreplaceable: as a source of truth, as an indispensable piece of infrastructure, or both. ScrapeShark didn't have any "grow-in" the way a storage provider does, where moving away means putting in an unknown amount of work to migrate your data. You could move from ScrapeShark to any competitor offering a similar service at a moment's notice, and there was no shortage of them.
The fact that it was so easy to replace made the whole thing a losing proposition. Its key distinguishing features, like automatically converting HTML into JSON, were barely used, and onboarding new customers by helping them write scrapers for their use case effectively meant doing free client work: as soon as they were up and running, they could simply switch to a competitor. Customer loyalty isn't a great foundation for an online business, let alone for one that provides a niche service to mostly highly technical, highly pragmatic developers.
From a technical perspective, I'm still proud of what ScrapeShark was. At its peak, it served tens of thousands of requests per day without any issues and provided millions of unique, untraceable browser instances, all from commodity hardware, and it ran its own network of rotating proxies. Building and operating it has been a great learning experience, both from a technical and from a business perspective.
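For the curious, the core idea behind proxy rotation is simple: keep a pool of upstream proxies, send each request through a different one, and skip proxies that stop working. Here's a minimal sketch in Python of what such a rotation layer might look like; the proxy addresses and retry logic are illustrative, not ScrapeShark's actual implementation.

```python
import itertools
import requests

# Illustrative only -- not ScrapeShark's actual proxy layer.
# A pool of upstream proxies, cycled round-robin so consecutive
# requests leave from different addresses.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

_rotation = itertools.cycle(PROXY_POOL)

def fetch_via_rotating_proxy(url: str, retries: int = 3) -> requests.Response:
    """Fetch a URL, rotating to the next proxy on each attempt."""
    last_error = None
    for _ in range(retries):
        proxy = next(_rotation)
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
            )
        except requests.RequestException as exc:
            # This proxy failed; move on to the next one in the pool.
            last_error = exc
    raise last_error

# Example: each call goes out through a different proxy.
# print(fetch_via_rotating_proxy("https://example.com").status_code)
```

A production version would layer health checks, per-proxy rate limits, and fingerprint management on top, but the rotation loop is the heart of it.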
I'll be open-sourcing the engine that powered the scraping side of ScrapeShark at some point in the future, once the code is ready for it. It is without a doubt the most useful component of ScrapeShark, and perhaps it can help a person or two out there.
Thanks to everyone who supported ScrapeShark over the years. It's been great.