IBM Backs Apache Spark For Big Data Analytics

IBM today announced support for the open source Apache Spark project, giving another boost to this increasingly popular in-memory data processing framework. Spark both complements and, in some cases, competes with the big data power of the better-known Apache Hadoop. As businesses continue to look for data-based approaches to their work, technology vendors are rushing to support promising open source projects, and to differentiate themselves from competitors. Can IBM gain credibility through association with Spark, and can the open source project benefit from Big Blue’s experience, customer base, and bank balance?

Spark began life as a project at UC Berkeley in California, quickly delivering in-memory performance as much as 100 times that of the MapReduce framework that originally underpinned Apache Hadoop. Hadoop has moved on since then, adopting other, faster and more flexible ways of working. Spark has also progressed, promoting increasingly capable disk-based performance to complement its in-memory strengths, and establishing itself as a strong contender for machine learning tasks in particular.

Spark moved to the Apache Software Foundation in 2013, becoming a top-level project in 2014. In 2013, members of the original Berkeley team established the company now known as Databricks to build a business around Spark. The company launched with almost $14 million from Andreessen Horowitz and others, and secured a further $33 million a year ago.

And yet Spark is not without competitors of its own. Flink, which is also a top-level project of the Apache Software Foundation, has recently begun to attract many of the same admiring comments that were directed Spark’s way 12-18 months ago. Despite sound technical credentials, ongoing development, big investments, and today’s high-profile endorsement from IBM, it would be unwise (and implausible) to crown Spark as the winner just yet.

IBM announced a number of initiatives today, aligning with what the company PR machine calls ‘potentially the most significant open source project of the next decade.’

These include:

deepening the integration between Apache Spark and existing IBM products like the Watson Health Cloud;
open sourcing IBM’s existing SystemML machine learning technology;
tasking 3,500 IBM engineers to work on Spark-related projects, including those at a new Spark Technology Center in San Francisco;
offering Spark as a Service, hosted on IBM Bluemix;
partnering with AMPLab and others to ‘educate more than 1 million data scientists and data engineers on Spark.’

As so often with these press releases, it’s effectively impossible to work out what’s really new here. It seems implausible, for example, that IBM was not already deepening product integration with Spark.

In the enterprise market, where IBM remains a powerful force, Spark is almost unheard of. As Gartner’s Nick Heudecker told VentureBeat, ‘In the enterprise, I’m seeing almost no Spark adoption.’

There, Flink is also effectively invisible. Hadoop has much of the mindshare, whether it’s the right tool for the job or not. Startups like Cloudera, Hortonworks and MapR make money supporting those enterprise adoptions, as do the big data operations of established vendors like HP, EMC and IBM.

IBM’s very public backing for Spark will open enterprise doors. And, if startups like Databricks are smart, it opens doors for them almost as much as it does for IBM.

Andreessen Horowitz’s millions got Silicon Valley’s cool kids to sit up and pay attention. IBM’s posturing might do just the same in the uncool boardrooms of the Fortune 1000.

(Forbes)