
Analytics Divided Will Fail: A Holistic Approach To Big Data Discovery - Dan Woods

Companies have been buying a wide variety of tools, each specializing in a niche facet of the data discovery process. That made sense when everything was new and being tried for the first time, but now that approach is a hindrance that leaves companies with too many technologies that require too many experts and need to be coordinated and integrated. Now that big data analytics has a bit of maturity, it is time for organizations, particularly those in the early days of their foray into analytics, to re-evaluate that approach.

It is time for big data to learn from biology. Right now, we see tremendous excitement all over the world about powerful parts of the big data system. It is time to think holistically rather than focusing on one part at a time.

The Memory of Big Data Analytics

When it comes to big data, it all started with Hadoop, which changed the economics of data processing. It was always possible to process big data, but it was so expensive that only the highest value data was analyzed. When you had to spend millions just to answer a few questions, those answers had better be worth a lot.

Hadoop created an affordable and powerful storage area, or long-term memory, for big data. Affordable is the most important word. Hadoop's power was first expressed through MapReduce programs and is now exposed through the YARN APIs, but increasingly it runs as an embedded system, which is its future.
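To make the MapReduce idea concrete, here is a minimal sketch of the classic word-count pattern in plain Python. The corpus and the three phases (map, shuffle, reduce) are illustrative stand-ins for what Hadoop performs at scale across a cluster; none of this is Hadoop's actual API.

```python
from collections import defaultdict

# A tiny hypothetical corpus standing in for files stored in HDFS.
documents = [
    "big data needs big storage",
    "hadoop made big data storage affordable",
]

# Map phase: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in grouped.items()}

print(counts["big"])  # "big" appears three times across the corpus
```

The point of the pattern is that each phase is independently parallelizable, which is what let Hadoop spread the work across cheap commodity machines.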
The next important part of the big data ecosystem is the bloodstream. This allows needed data to be delivered where it can make a difference. This bloodstream is complex and in the case of big data has many different parts.

The first part is sifting through the data and finding nuggets that have meaning. This sifting is an activity of much broader importance in the world of big data than in the world of data warehouses and traditional business intelligence. In the world of traditional BI, searching for important nuggets wasn’t a key activity because almost all data stored was master data or transactional data of known value. You didn’t have to look through lots of data to find something important.

Big data, however, is far different. Much of the data is micro-transactional, presenting evidence of every click, every call, every payment, every movement, every fluctuation. In this sort of data, it is not clear where the key information is. You have to search for it using machine learning, statistics and any other means you have at your disposal.
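As one small illustration of this kind of statistical sifting, the sketch below flags outliers in a stream of payment amounts using a simple two-standard-deviation rule. The data and the threshold are assumptions for the example; real nugget-finding would use far richer models.

```python
import statistics

# Hypothetical micro-transactional stream: payment amounts per event.
payments = [12.0, 11.5, 12.3, 11.8, 12.1, 11.9, 250.0, 12.2, 11.7]

mean = statistics.mean(payments)
stdev = statistics.stdev(payments)

# Flag anything more than two standard deviations from the mean
# as a potential "nugget" worth a closer look.
nuggets = [p for p in payments if abs(p - mean) > 2 * stdev]

print(nuggets)  # the 250.0 payment stands out from the rest
```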

Once you have the important bits, an ETL process combines them with other data to present a bigger picture that is more relevant to the business.
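In miniature, that enrichment step looks like joining the extracted nuggets against master data from another system. The records below are hypothetical, chosen only to show the shape of the transform:

```python
# Hypothetical "nuggets" extracted from raw event data.
nuggets = [
    {"customer_id": 1, "anomalous_payment": 250.0},
]

# Master data from another system supplies the business context.
customers = {
    1: {"name": "Acme Corp", "segment": "enterprise"},
}

# Transform step: enrich each nugget with its customer record.
enriched = [{**n, **customers[n["customer_id"]]} for n in nuggets]

print(enriched[0]["segment"])  # the nugget now carries business context
```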

Other parts of the bloodstream for moving, distributing, replicating and synchronizing data around then come into play. The nuggets, enriched with other data, are assembled into a model for a particular purpose. This model is crucial. Its structure determines the kinds of questions that can be answered quickly and easily. Usually the idea is to create an organized collection of data that is intended to support a particular application or analysis.

Then, the brain and the eyes come into play. We often think of vision as something that happens in our eyes, but this is really not true. We don’t know what the visual field means without some amazing processing.

In the case of big data, this means asking questions and scanning the model, or sometimes the whole repository, for the data needed to get the answers. This is why SQL is so vital: It is a masterful means of declaring compactly just the data you want out of a complex collection. While it is no longer the only game in town (graph databases are also a powerful way of connecting and extracting data), SQL is still the most important method of querying data. And you don't have to know SQL to leverage its power: having a visual way of generating SQL or SQL-like syntax expands the sphere of knowledge by making data more accessible to the masses, allowing everyone to see what big data holds for them.

Holistic Big Data Discovery

The power of all these body parts is multiplied when they all work together. The goal is to create a working organism that can sense what is happening in the world, make sense of it, and then support the right action.

Systems that take a holistic approach have more potential than those that narrowly focus on one part, especially when the company that is using the technology is not used to doing its own integration. Remember, many of the big data victories have come from companies that are rich in engineering resources. For them, combining many parts into a working organism is a core competence.

Big data vendors have realized this and are attempting to productize the integration. If you look at a system like Platfora, which is built to handle the entire body of big data, from scouring raw big data to assembling the model and visualizing it, you have a way for one person to control the entire process. What this does is unlock the power of your super users to work without intermediation, unleashing their creativity.

In Platfora, and a few other big data technologies such as SiSense or 1010data or Qlik, a super user can sift through raw big data, find nuggets, record them in a catalog, mash them up, and then create models to support visualization. As part of this process, the data lineage is preserved so you know what data you are dealing with.

In a world in which we get excited about so many different big data technologies, I think it is now time to start thinking holistically. When is the power of a single piece of technology worth the work of having to build everything around it, with multiple experts and handoffs? How does it compare with an integrated process, controlled by a single super user?

The key question, one that is hard to answer, is when it makes sense to take a holistic approach, perhaps one with one or two weaker parts, and when it makes sense to surround a powerful piece of technology with everything and everyone else you need. This is a fascinating topic, one that those who seek to gain business value from data should pay close attention to.
