Chasing Big Data and the Data Scientist Unicorn - Theo Priestley

Organizations are collecting more and more data every single minute of every day. It’s no secret that Facebook, for example, processes 2.5 billion pieces of content and over 500 terabytes of data each day, pulling in 2.7 billion Likes and 300 million photos per diem. Facebook also scans a whopping 105 terabytes of data each half hour.

Of course, Big Data is nothing new. It was relatively enormous as far back as the 1960s, when NASA took a shot at the moon. So it shouldn’t have been a surprise when Google gave everyone the proverbial heart attack by revealing that it churns through almost the entire Internet every couple of days to meet our search needs.

Despite the hype, Big Data didn’t happen all at once…data just got bigger.
So we know that companies manage, manipulate and extrapolate information from ever larger amounts of data. But what we don’t know is whether they are asking anything bigger from it all. They should be.

Limitless information, limited imagination
Analyst firm IDC says that only 0.5% of the world’s data is being analyzed. That’s a ridiculously small percentage.

A customer-focused business with Big Data in its grasp now has an unparalleled pool of knowledge drawn from an increasing number of sources: mobile data, social data, transactional data, locational data, financial data, family data, medical data, carbon footprint and consumption data.

We even have data about data in the form of log data, as Tesla showed us in rebutting the NY Times article a while ago.

What’s more, a growing share of that information is being collected in real time, with lots of integration challenges. (But it is often stored very traditionally and processed in batch. What use is that for real-time operational decisions?)
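To make the batch versus real-time contrast concrete, here is a minimal sketch. It assumes a stream of numeric readings (say, transaction amounts); the names RunningAverage and batch_average are hypothetical and exist only for illustration. The batch version has to wait for the full dataset, while the running version has an up-to-date answer after every single event.

```python
class RunningAverage:
    """Keeps an average that is current after every incoming event."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: adjust the old mean by the new value's share.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def batch_average(values):
    """Batch style: needs the whole dataset before it can answer."""
    return sum(values) / len(values)


# Hypothetical stream of readings arriving one at a time.
events = [10, 20, 30, 40]

live = RunningAverage()
live_answers = [live.update(v) for v in events]  # one answer per event

overnight_answer = batch_average(events)  # one answer, after the fact
```

Both approaches agree on the final figure; the difference is that the running version could have informed a decision after the first event rather than after the last.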

But with all this information at hand, there’s a worrying trend of organizations still asking the same questions of it and receiving the same answers as before, just with a little more data behind them.

If you always do what you’ve always done, you’ll always get what you’ve always got. – Henry Ford

When you combine social + mobile + medical + financial + family + real-time + location you get quite a bit more than just demographic segmentation. Too often, that’s the apparent limit of current thinking, in customer service and marketing for example.

Limitless information, limited processing
There is another angle to work on here. This amount of data requires a modicum of processing power. Whilst you can move it all to the Cloud, leverage in-memory computing or leave it to another provider to crunch the numbers for you, there’s nothing to stop you from applying a little distributed magic and using the idle processors sitting on every desk in an organization. In fact, what if AT&T worked out how to use idle processor time on every one of its customers’ smartphones to process its own customer data?

SETI@home famously did this by allowing 3 million users to volunteer their PCs and PlayStations to process data from radio telescopes. It’s not such a far-fetched notion for a business such as a mobile provider or a bank to do exactly that via an app.
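The idea of farming work out to idle machines can be sketched in a few lines. This is an illustration only, not how SETI@home or any real provider does it: the function names are hypothetical, the "analysis" is a simple average, and a local thread pool stands in for the desktops or phones that would each pick up a work unit.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_work_units(samples, unit_size):
    """Cut a large dataset into small, independent work units."""
    return [samples[i:i + unit_size] for i in range(0, len(samples), unit_size)]


def process_work_unit(unit):
    """Stand-in for the job an idle device would run on one unit.

    Returns a partial result (sum and count) small enough to send back.
    """
    return sum(unit), len(unit)


def combine(partials):
    """Merge the partial results returned by all devices."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def distributed_mean(samples, unit_size=1000, workers=4):
    units = split_into_work_units(samples, unit_size)
    # Assumption: each worker thread plays the role of one idle device.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_work_unit, units))
    return combine(partials)


result = distributed_mean(list(range(10_000)))
```

The point of the design is that each work unit is independent and each partial result is tiny, so it does not matter which device processes which unit or in what order the answers come back.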

Limitless information, unlimited possibilities
Big Data actually demands big questions of us. It demands that we think bigger than we currently do. We should be asking questions we’ve always been told were impossible to answer. But the starting point is not the data you hold: start with the important questions and work backward.

And this is where the elusive Data Scientist unicorn comes in

According to a recent survey by NewVantage, 70% of organizations surveyed plan to hire Data Scientists, and 100% of them said it’s “somewhat challenging” to hire a competent one. McKinsey echoed this recently, saying that “by 2018, the U.S. alone may face a 50 percent to 60 percent gap between supply and requisite demand of deep analytic talent.”

But just what is a Data Scientist anyway?

Searching the internet tells you that they are supposed to have a distinctive set of skills, aptitudes and attitudes which distinguish them from their lowly analyst counterparts. The Guardian claims they are “the highly educated experts who operate at the frontier of analytics, where data sets are so large and the data so messy that less-skilled analysts using traditional tools cannot make sense of them.”

Looking deeper, there’s an interesting Gartner blog post by analyst Svetlana Sicular in which she reports hearing a couple of definitions:

…a data scientist is 1) a data analyst in California or 2) a statistician under 35

It’s a matter of balance
But more importantly, Sicular makes a killer point. “Organisations already have people who know their own data better than mystical Data Scientists…learning Hadoop is easier than learning the company’s business.”

In other words, if you already employ a Data Analyst then your search may well be over. Organizations need to look internally first and invest in their existing analysts, training them to stand tall on the same pedestal we seem to have placed the scientists on. As with any business, understanding the capabilities that exist on the inside could well be a more cost- and time-effective approach than searching on the outside.
It’s a matter of balance:

know what the right questions to ask are and,
know how to get the right answers

For that you need the right mix of analysts (to ask the right business-driven questions) and scientists (to mine for the right data-driven answers in context) who operate on the same level.

It may not be as simple as employing one over the other, or paying over-inflated salaries to fulfil a prophecy created by a data scientist mining recruitment trend data for an analyst firm…

(Forbes)