
9 big data pain points - Andrew C. Oliver

Do enough Hadoop and NoSQL deployments, and the same problems crop up again and again. It's time for the industry to nail them sooner rather than later

Sometimes, there's a big hole in the side of the ship, and the industry decides to wait until the ship starts sinking in the hope of selling lifeboats.

At other times, less severe flaws resemble the door in my downstairs bathroom, which opens only if you turn the handle one direction, not the other. I’ll fix it one day, although I've said that for 12 years or so.

I can count nine issues confronting the big data business that fall at either extreme ... or somewhere in between.

Big data pain point No. 1: General-use GPU programming

CPUs are still kind of expensive, at least compared to GPUs. If better standards and fewer obscure drivers were developed for GPUs, a whole marketplace would open up. For now, the fact that GPUs cost a lot less is outweighed by the fact that they are much harder to program -- and virtually impossible to program without tying yourself to a very specific model.

This is the kind of situation where someone does the hard work of writing something that looks like ODBC or JDBC and convinces AMD or Nvidia that the market is bigger than graphics cards alone. Suppose you had a general binding for Spark that you didn’t have to think real hard about; suddenly, people would start building “GPGPU” clusters with reckless abandon.
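
To make the pain concrete, here is roughly what offloading even a trivial computation to a GPU looks like today -- a minimal sketch assuming Numba's CUDA support, which ties you to Nvidia hardware and drivers from the first line:

    # A sketch of GPU offload as it stands, assuming Numba's CUDA backend.
    # Everything here -- the kernel, the grid math, the explicit transfers --
    # is vendor-specific detail a general binding would hide.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(arr, factor):
        i = cuda.grid(1)              # global thread index
        if i < arr.shape[0]:
            arr[i] = arr[i] * factor

    data = np.arange(1_000_000, dtype=np.float64)
    d_data = cuda.to_device(data)     # explicit host-to-device copy
    threads = 256
    blocks = (data.size + threads - 1) // threads
    scale[blocks, threads](d_data, 2.0)   # vendor-specific launch syntax
    result = d_data.copy_to_host()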

People are already working on this. But to get the marketplace going, you need at least two ruthless competitors -- AMD and Nvidia, plus maybe Intel -- to cooperate, at least one of whom thinks secrecy is the path to competitive success. Gosh, I want one!

Big data pain point No. 2: Multiple workload scaling

You have Docker. You have YARN. You have Spark, Tez, MapReduce, and whatever comes next. You also have different pools with different priorities and workloads that come up unexpectedly. You can “autoscale” on a PaaS if you’re deploying, say, a Java WAR file, but if you were hoping to do the same with Hadoop workloads, that’s still a special, hand-built affair.

Plus, what about the interaction between storage and processing? Sometimes you need to temporarily expand and distribute storage. I should be able to run my “end of month” batch and have Docker images autodeploy all over the place. Then, when I stop doing so much of that, the system should undeploy those images and deploy whatever else needs the resources. The application or workload should put no effort whatsoever into this.

This is not where we are today. I hope you like writing Chef recipes and scripting.
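
Here is the flavor of scripting I mean -- a minimal sketch using the Docker SDK for Python, with a hypothetical batch image and a crude month-end trigger standing in for real autoscaling logic:

    # The glue you end up writing by hand. The image name and the "end of
    # month" trigger are hypothetical; the point is that this burden falls
    # on the operator rather than the platform.
    from datetime import date
    import docker

    client = docker.from_env()
    BATCH_IMAGE = "example/month-end-batch:latest"   # hypothetical image

    def scale_batch_workers(count):
        running = client.containers.list(filters={"ancestor": BATCH_IMAGE})
        if len(running) < count:
            for _ in range(count - len(running)):
                client.containers.run(BATCH_IMAGE, detach=True)
        else:
            for container in running[count:]:
                container.stop()

    # Crude "autoscaling": burst for the month-end batch, then shrink back.
    scale_batch_workers(20 if date.today().day >= 28 else 2)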

Big data pain point No. 3: Even worse, NoSQL deployment

Why can I image some Linux boxes with ssh and sudo, point Ambari at them, and install something as complex as Hadoop, but I still have to put actual effort into this for MongoDB and most other databases? Sure, I can write Chef recipes, but why should I have to?
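
For the record, the “actual effort” looks something like this -- a sketch using Fabric, with placeholder hostnames and the MongoDB package repository setup left out:

    # Hand-rolled deployment: SSH to every box and install the database
    # yourself. Hostnames are placeholders; the apt repository setup and
    # mongod.conf templating are omitted.
    from fabric import Connection

    HOSTS = ["mongo-1.example.com", "mongo-2.example.com", "mongo-3.example.com"]

    for host in HOSTS:
        conn = Connection(host)
        conn.sudo("apt-get update")
        conn.sudo("apt-get install -y mongodb-org")
        conn.sudo("systemctl enable --now mongod")
        # ...and you still have to configure and initiate the replica set.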

Big data pain point No. 4: Query analyzer/fixer

When I worked at JBoss, I did a lot of Hibernate and, later, JPA/EJB3 tuning. This mostly consisted of looking at the logs, finding places where n+1-style queries were being issued, turning those into joins, and removing the stupid cache configuration that made everything worse.
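
For anyone who hasn't had the pleasure, the n+1 pattern and its fix look like this -- a minimal sketch with a made-up schema on an in-memory SQLite database:

    import sqlite3

    conn = sqlite3.connect(":memory:")   # stand-in database; schema is made up
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    """)

    # The n+1 pattern: one query for the orders, then one more per order.
    orders = cur.execute("SELECT id, customer_id FROM orders").fetchall()
    for order_id, customer_id in orders:
        cur.execute("SELECT name FROM customers WHERE id = ?", (customer_id,))

    # The fix: a single join returns the same data in one round trip.
    cur.execute("""
        SELECT o.id, c.name
        FROM orders o JOIN customers c ON c.id = o.customer_id
    """)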

Other times, it was the opposite: You joined every damned table in the system and it took forever to return. Sometimes, on more complicated systems, I’d look at Oracle Enterprise Manager and its analysis, which turned out reports written in a bizarre, near-gibberish language that often hinted at these problems. However, I was capable of seeing that two tables were always used together and identifying the pattern myself. I even considered coding it.

Now, when I tune NoSQL systems, I see variations of this same problem: too many round trips versus overly complex queries, or an index that doesn’t match the where clause (range merges). In short, we’ve done a lot to optimize how bad or complex queries are run, but we’ve never questioned the queries themselves beyond developer training. It seems like you could build this in and have it say: “Hi, you sent these queries; I think they should look like this …”
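
A rough version of the check such a tool could automate is already possible by hand; here is a sketch with PyMongo and a placeholder collection, asking the database how it ran a query and flagging full collection scans:

    # Ask MongoDB for the winning query plan and complain about COLLSCAN.
    # The connection string, collection, and filter are placeholders.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client.mydb.events

    plan = events.find({"user_id": 42, "ts": {"$gte": 0}}).explain()
    winning = plan["queryPlanner"]["winningPlan"]
    if "COLLSCAN" in str(winning):
        print("Full collection scan: the index does not match this where clause")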

Oh well, it's a living doing something that could be automated, I guess. All I can say is I’m glad I’ve moved higher up on the food chain so that I don't have to do that work anymore.

Big data pain point No. 5: Distributed code optimization

I expect I’ll start seeing the Spark version of No. 4 soon: the uber-function, or too many little functions, or something along those lines. In compilers, you can write optimizers that detect things like nondependent operations inside loops and automatically pull them out and parallelize them. I’ve yet to see anything significant here in distributed computing. The “data scientist” writes crummy Python that doesn’t really distribute the problem well and needlessly wastes memory. Then someone smart has to come along behind, understand what they were trying to do, and hand-optimize it.
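
The before-and-after usually looks something like this -- a PySpark sketch in which the pairs RDD stands in for whatever the notebook actually loads:

    # What often gets written versus what someone rewrites it to.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sketch").getOrCreate()
    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)] * 100000)

    # Ships every value for a key across the network, then adds them up.
    totals_slow = pairs.groupByKey().mapValues(sum).collect()

    # Combines on each partition before the shuffle -- same answer, far less memory.
    totals_fast = pairs.reduceByKey(lambda a, b: a + b).collect()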

The thing is, these problems taste and smell a lot like the ones the techniques in your favorite compiler theory book were built to solve. I guess the next step is for, say, Zeppelin or maybe Spark itself to go fix your crummy code and make it play nicely with the cluster.

Big data pain point No. 6: De-distributor

I admit, my first introduction to Hadoop was typing select count(*) from somesmalltable in Hive. I thought, “Gosh, this sucks.” You can look at some problems and know they won’t distribute well, while for others you need barely any additional data (such as a row count) to see there’s no point in distributing them. Frequently, these are parts of larger jobs (such as lookup tables), but whether it’s Hive or Spark or HDFS or YARN, the entire assumption is that all problems are distributed. Some need to be as undistributed as possible because they’re inherently faster that way. I’m talking about dumb things like select * from thousandrowtable kicking off a MapReduce job.
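
Today the workaround is manual: you tell the engine not to distribute the small piece. Here is a PySpark sketch with placeholder table and column names:

    # Ship the thousand-row lookup table whole to every executor instead of
    # shuffling it, or skip the cluster entirely and pull it to the driver.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("sketch").getOrCreate()
    facts = spark.table("big_fact_table")
    lookup = spark.table("thousandrowtable")

    joined = facts.join(broadcast(lookup), "country_code")

    # For the truly tiny case, don't distribute at all.
    codes = {row["country_code"]: row["country_name"] for row in lookup.collect()}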

Big data pain point No. 7: Machine learning mapping

There are lots of instances where I can tell you, “Oh, that's a clustering problem” or “That’s projection” or whatever. But no one seems to have done the hard work of mapping out the common parts of a business, describing the problems, and mapping that to a description of the algorithms you should use.

Outside of finance, maybe 10 to 30 percent of any business is actually unique to that industry -- which means I could map the rest of it, the sales, marketing, inventory, labor, and so on, to a general model, then describe the algorithms to use. This work would not only change how we do business, but would dramatically expand the market. Think of it as design patterns for big data, only with a bigger emphasis on the business side.
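
Even a toy version of that mapping makes the idea clear; the entries below are illustrative, not a finished catalog:

    # A sketch of the missing catalog: common business questions on one side,
    # algorithm families on the other. Entries are illustrative only.
    BUSINESS_TO_ALGORITHM = {
        "segment customers for marketing": "clustering (e.g., k-means)",
        "forecast next quarter's demand": "time-series regression",
        "flag suspicious transactions": "anomaly detection",
        "recommend the next product": "collaborative filtering",
        "predict which customers will churn": "classification",
    }

    def suggest(problem):
        return BUSINESS_TO_ALGORITHM.get(problem, "no pattern cataloged yet")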

Big data pain point No. 8: Security

First off, why oh why is Kerberos the only way to get single sign-on? There's no Kerberos in the cloudy Web. (OK, people do that too, but there's also a place on Reddit for abacus enthusiasts.)

Secondly, weird vendor competition distorts Hadoop in ways that are bad for everyone. When it comes to basic authentication and authorization, why do I need two completely different stacks, each of which incompletely supports a different slice of Hadoop? Fine, compete on encryption (smaller, faster, stronger), but whether it’s Ranger or Sentry or whatever, why can’t I have one access and authorization mechanism that covers all Hadoop projects? To be fair, this is worse in the NoSQL space; every two-bit “we love open source” vendor shows their love of open source by making the 100 lines or so of LDAP integration part of their “enterprise” proprietary edition.
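
For scale, here is roughly what that LDAP integration amounts to -- a sketch using the ldap3 library, with a placeholder server and DN pattern:

    # A basic bind against a directory to check a user's credentials.
    # The server address and DN pattern are placeholders.
    from ldap3 import ALL, Connection, Server

    def authenticate(username, password):
        server = Server("ldap://ldap.example.com", get_info=ALL)
        user_dn = f"uid={username},ou=people,dc=example,dc=com"
        conn = Connection(server, user=user_dn, password=password)
        return conn.bind()   # True if the directory accepts the credentials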

Big data pain point No. 9: Extract, transform, load

ETL is the silent budget killer of every big data project. You have things to do, but instead you’re going to spend your time writing Flume, Oozie, Pig, Sqoop, and Kettle jobs. This is also where you’ll see cost overruns, because your data is over there and it is messy. However, no one has much of a vision for how to make this more seamless. This problem is not sexy, but it is big.
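
The work itself isn’t hard so much as endless; here is a minimal PySpark sketch of one such step, with placeholder paths and column names:

    # Pull a messy CSV in, fix the obvious problems, land it somewhere queryable.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    raw = spark.read.option("header", True).csv("/landing/orders/*.csv")

    clean = (
        raw.dropDuplicates(["order_id"])
           .withColumn("order_ts", F.to_timestamp("order_ts"))
           .withColumn("amount", F.col("amount").cast("double"))
           .filter(F.col("order_id").isNotNull())
    )

    clean.write.mode("overwrite").parquet("/warehouse/orders/")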

What is your favorite “OMFSM fix it already” issue with the technologies in big data?
