On Wednesday, Cloudera and Hortonworks announced a “merger of equals,” in which Cloudera is acquiring Hortonworks with stock so that Cloudera shareholders end up with 60 percent of the combined company. The deal signals that the Hadoop market can no longer sustain two big competitors. Hadoop has been synonymous with big data for years, but the market, and customer needs, have moved on. Several megatrends are driving this change:
The public cloud tide is rising
The first megatrend is the shift to public cloud. Companies of all sizes are increasing their adoption of AWS, Azure, and Google Cloud services at the expense of on-premises infrastructure and software. Enterprise server revenues reported by IDC and Gartner continue to decline. The top three cloud providers (90 percent of the market) offer their own managed Hadoop/Spark services, such as Amazon’s Elastic MapReduce (EMR). These are fully integrated offerings that have a lower cost of acquisition and are cheaper to scale. If you’re making the shift to cloud, it makes sense to look at alternative Hadoop offerings as part of that – it’s a natural decision point. Ironically, there has been no Cloud Era for Cloudera.
Crushing storage costs
The second megatrend? Cloud storage economics are crushing Hadoop storage costs. At introduction in 2005, the Hadoop Distributed File System (HDFS) was revolutionary: It took servers with ordinary hard drives and turned them into a distributed storage system capable of parallel I/O consumable by Java apps. There was nothing like it, and it was a crucial component that allowed large-scale data sets that didn’t fit onto a single machine to be processed in parallel. But that was 13 years ago. Today, there is a plethora of much cheaper alternatives, primarily object storage services like AWS S3, Azure Blob Storage, and Google Cloud Storage. A terabyte of cloud object storage costs about $20 a month, compared to about $100 a month for HDFS (not including the cost to operate it). That is why Google’s HDFS service, for example, is merely a shim that translates HDFS operations into object storage operations – because that’s 5x cheaper.
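The 5x figure follows directly from the per-terabyte prices quoted above. As a rough sketch in Python, using the article's round numbers and a hypothetical 500 TB data set chosen purely for illustration (the function and constant names are likewise made up here):

```python
# Approximate per-terabyte monthly prices as quoted in the article.
OBJECT_STORAGE_USD_PER_TB_MONTH = 20.0
HDFS_USD_PER_TB_MONTH = 100.0  # excludes the cost of operating the cluster


def monthly_storage_cost(terabytes: float, usd_per_tb_month: float) -> float:
    """Return the monthly storage bill for a data set of the given size."""
    return terabytes * usd_per_tb_month


def cost_ratio(expensive_usd: float, cheap_usd: float) -> float:
    """Return how many times more the expensive option costs."""
    return expensive_usd / cheap_usd


if __name__ == "__main__":
    data_tb = 500  # hypothetical data set size, not a figure from the article
    hdfs = monthly_storage_cost(data_tb, HDFS_USD_PER_TB_MONTH)
    objects = monthly_storage_cost(data_tb, OBJECT_STORAGE_USD_PER_TB_MONTH)
    print(f"HDFS:           ${hdfs:,.0f} per month")
    print(f"Object storage: ${objects:,.0f} per month")
    print(f"Ratio:          {cost_ratio(hdfs, objects):.0f}x")
```

The ratio is independent of the data set size; the absolute gap is what grows as the data does.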
Faster, better, and cheaper cloud databases
Hadoop’s problems don’t end there, because it’s not just about direct competition from cloud-vendor Hadoop/Spark services and cheaper storage. The third megatrend is the advent of “serverless” cloud services that completely eliminate the need to run Hadoop or Spark at all. A common use case for Spark is to handle ad-hoc distributed SQL queries for users. Google was first to market with a revolutionary service called BigQuery in 2011 that solves the same problem in a completely different way. It lets you run ad-hoc queries on any amount of data stored in its object storage service (you don’t have to load it into special storage like HDFS). You just pay for the compute time: If you need 1,000 cores for 3.5 seconds to run your query, that’s all you pay for. There is no need to provision servers, install the OS, install software, configure everything, scale the cluster to 1,000 nodes, and feed and care for the cluster as you would with Hadoop/Spark. Google does all that, hence the moniker “serverless.” There are banks running 2,000-node Hadoop/Spark clusters operated and maintained by scores of IT people that can’t match BigQuery’s flexibility, speed, and scale. And they have to pay for all the hardware, software, and people to run and maintain Hadoop.
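To make the “pay for the compute time” point concrete, the sketch below compares the core-seconds consumed by the example query with the core-seconds an always-on cluster of the same size makes available in a month. The 30-day month and all names are assumptions for illustration only; this is not how any particular provider calculates a bill.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def query_core_seconds(cores: int, seconds: float) -> float:
    """Core-seconds consumed by one query that holds `cores` for `seconds`."""
    return cores * seconds


def cluster_core_seconds(cores: int, days: int = 30) -> int:
    """Core-seconds an always-on cluster provides over `days` (assumed month)."""
    return cores * days * SECONDS_PER_DAY


def fraction_used(used: float, available: float) -> float:
    """Share of the available capacity that was actually consumed."""
    return used / available


if __name__ == "__main__":
    used = query_core_seconds(cores=1000, seconds=3.5)  # the article's example
    available = cluster_core_seconds(cores=1000)
    print(f"Query consumed:   {used:,.0f} core-seconds")
    print(f"Cluster provided: {available:,} core-seconds per month")
    print(f"Fraction used:    {fraction_used(used, available):.2e}")
```

A dedicated cluster is paid for whether or not queries are running, so the economics depend on how much of that monthly capacity the workload actually fills.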
BigQuery is just one example. Other cloud database services are similarly massive-scale, highly flexible, globally distributed “pay for what you use” databases. Examples include start-up Snowflake, Google Cloud Bigtable, Amazon Aurora, and Microsoft Azure Cosmos DB. They’re all much easier to use than a Hadoop/Spark install, and you can be up and running in 5 minutes for tens of dollars – no $500k purchase order and weeks of installation, configuration, and training required.
Python and R data science running on containers and Kubernetes
The fourth megatrend is containers and Kubernetes. Hadoop/Spark is not just a storage environment but also a compute environment. Again, back in 2005, this was revolutionary – Hadoop’s MapReduce approach provided a framework for parallel computation in Java applications. But the Java-centric nature (Scala-centric for Spark) of Cloudera and Hortonworks infrastructure is at odds with today’s data scientists doing machine learning in Python and R. The need to constantly iterate and improve machine learning models and to have them learn on production data means native deployment of Python and R models is a necessity, not a “nice to have.”
As recently as this week, the big Hadoop vendors’ advice has been “translate Python/R code into Scala/Java,” which sounds like King Hadoop commanding the Python/R machine learning tide to go back out again. Containers and Kubernetes work just as well with Python and R as they do with Java and Scala, and provide a far more flexible and powerful framework for distributed computation. And it’s where software development teams are heading anyway – they’re not looking to distribute new microservice applications on top of Hadoop/Spark. Too complicated and limiting.
A shift in data gravity
The upshot is that after a good 10 years of Cloudera and Hortonworks being the center of the Big Data universe, the center of gravity has moved elsewhere. The leading cloud companies don’t run large Hadoop/Spark clusters from Cloudera and Hortonworks – they run distributed cloud-scale databases and applications on top of container infrastructure. They do their machine learning in Python, R, and other languages that are not Java. Increasingly, enterprises are shifting to similar approaches because they want to reap the same speed and scale benefits. It’s time for the Hadoop and Spark world to move with the times.