
The future of big data federation may have just landed - Matt Asay

Big data is no longer a war between batch and streaming data processing. It's not either/or, but rather a matter of "and."
Enterprises increasingly need to integrate batch and streaming data processing within a common framework, from runtime to analytics. And yet, with the rise of streaming analytics, integrating streaming data into batch-based systems like Hadoop is non-trivial. Getting data from disparate data stores and running analytics on it in real time is a huge technological challenge.

Cracking this data federation problem has become a Holy Grail of sorts. There are two primary approaches to it today, but a third has emerged that just might offer the most promise.

The state of the data federation market

One of the ongoing barriers to greater big data adoption is the complexity of the associated software. The industry really needs to tackle this, offering end users the ability to query whatever data they want, wherever it is, no matter what format, and all without going through IT.

Which is mostly impossible today.

There are two general approaches used for data federation, each with its strengths and weaknesses. The first is a database-centric approach, used by relational database management system (RDBMS) vendors like Teradata (QueryGrid) and IBM (FluidQuery), or by specialty technologies like the former Composite Software.

One of the biggest problems with such database-centric tools is that they're geared for DBA-type users, not business users and analysts. Further, these tools generally do not cover all types of big data. Most were designed for data that fits into tables and columns, but search, streams, and semi-structured or unstructured data (for which NoSQL databases are well-suited) do not necessarily fit as well.

In addition, performance can sometimes be an issue when attempting speed-of-thought analytics on a traditionally federated source.

The second approach is a query tool-centric approach, used by Tableau, Qlik, and others.

These technologies do allow end users to mash up multiple sources, but they may not scale to big data volumes, because the data is often mashed up on the user's desktop computer or in a web browser rather than in a scalable big data backend like Apache Spark.

And again, they were not really designed for the variety of big data sources, or for anything beyond fairly trivial, low-cardinality mashups.

The future of data federation

There are, however, glimmers of hope.
For example, Zoomdata just announced its Fusion product, with an early access program to give companies a taste. Zoomdata claims Fusion can make multiple data sources appear as one source without moving or transforming data.

If it works as advertised, this would allow a business user to define a fused data source without waiting for a data architect to set it up ahead of time. Rather than a command line, Fusion exposes a simple drag-and-drop user interface that hides the underlying Spark-based infrastructure, combining datasets in ways hitherto impossible.

While interesting in itself, the real power comes from Zoomdata's ability to push as much of the processing as possible down to each underlying data platform, based on the capabilities and performance profile of those systems, and to use Spark to do the rest of the work that can't or shouldn't be pushed down.
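Conceptually, the pushdown pattern looks something like the following PySpark sketch. This is only an illustration of the general technique, not Zoomdata's implementation; the connection details, table names, and file paths are hypothetical.

    # Illustrative only: push filtering/aggregation into each source,
    # then let Spark join the much smaller intermediate results.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("federation-sketch").getOrCreate()

    # Push the heavy aggregation down to the RDBMS by handing it a subquery,
    # so only summarized rows travel over the network. (Hypothetical source.)
    daily_orders = spark.read.jdbc(
        url="jdbc:postgresql://warehouse.example.com/sales",
        table="(SELECT order_date, SUM(amount) AS revenue "
              "FROM orders GROUP BY order_date) AS daily_orders",
        properties={"user": "analyst", "password": "REPLACE_ME"},
    )

    # A semi-structured source the RDBMS can't federate well. (Hypothetical path.)
    clicks = spark.read.json("s3a://example-bucket/clickstream/")
    daily_clicks = (
        clicks.groupBy("event_date").count()
              .withColumnRenamed("event_date", "order_date")
    )

    # Spark does only the work that can't be pushed down: the cross-source join.
    fused = daily_orders.join(daily_clicks, on="order_date")
    fused.show()

The point of the pattern is that each system does what it is good at: the warehouse aggregates its own rows, the object store serves the raw events, and Spark handles only the final cross-source combination.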

This emphasis on ease of use opens up big data to far more business users. The traditional federated query approach required an enterprise architect to think through all the data sources to join, then do elaborate coding, establish rules, and tune parameters before users could even ask questions.
That's largely why the technology never worked well before: no one knows exactly the right questions to ask ahead of time, and federated queries were often painfully slow to run.

The Zoomdata approach is the exact opposite. It allows users to hook up their own data and run queries with fast results. That ability to truly iterate on big data, historical and real-time, enterprise and cloud, can be transformative for a company.

Zoomdata may be the first to take this approach, but it won't be the last. There's just too much at stake. With the big data market set to grow by 50% over the next few years, according to Ovum, there are big incentives for anyone who can lower the barriers to big data adoption.
