For more than four decades, Stonebraker has been engaged in a battle to tame the digital data explosion and turn it into knowledge. Along the way, he has started nine companies and made an indelible mark on academic research in the fields of databases and data management, serving as a member of the computer science faculty first at UC Berkeley and, since 2001, at MIT.
Earlier this year, Stonebraker was awarded the Turing Award, the highest technical honor for computer scientists. Like the Nobel Prize, it is given not for lifetime achievement but for specific and fundamental contributions to an academic discipline or sub-discipline. The award’s citation recognized him as “the inventor of many concepts that were crucial to making databases a reality and that are used in almost all modern database systems.”
Turing Award winners deliver a lecture, typically a survey of their computer science specialty or a highly technical discussion of the innovative ideas behind their academic work. But Stonebraker wanted to do something “really outside the box,” he told me on the sidelines of the 9th Annual MIT Chief Data Officer & Information Quality Symposium (MIT CDOIQ).
“I decided this was my once in a lifetime opportunity to say what I wanted,” says Stonebraker, “without reviewers telling me ‘you can’t say that.’ I wanted to say that writing system software is really hard and under-appreciated by the rest of computer science and requires a huge amount of stamina. The only people I know that manage to do it are people with a ‘Make It Happen’ mentality. In addition, by and large everything written in computer science has this very dry, stylized feel to it. There is no personality to anything that gets written. It was important to me, at least in a small way, to personalize what I said.”
To illustrate the challenges of writing system software and the ups and downs of the life of a startup, Stonebraker wove together two stories in his Turing lecture: the alternating trials and triumphs of the commercialization of Ingres and Postgres, his early (1970s-1990s) database breakthrough ideas, and the 3,500-mile, 59-day cross-country bike trip he took with his wife in 1988. The tandem bike they used (for the first time) “is a challenge in coordination,” Stonebraker says, but “the only way my wife would go with me was if we were welded together. I ride faster than she does and she didn’t want to ride across the country all by herself.”
Coordination and teamwork have been important to his work and Stonebraker made sure to include in his Turing lecture a slide showing all the contributors to the database projects he discussed. While he has been helped by many people in all his endeavors—academics, programmers, venture capitalists, business executives—it is Stonebraker who keeps making it happen both in academia and the business world. “Make It Happen,” for a computer science professor, could mean publishing widely-cited papers. It could mean leading the development of a new database implementation and providing it as open source software so it will be incorporated in many other databases. Or it could mean starting a company to commercialize the new ideas. Stonebraker has done all three, sometimes succeeding and other times failing, again and again going against the conventional wisdom that said “it couldn’t happen.”
In his Turing lecture, Stonebraker asked “where does ‘Make It Happen’ come from?” but ducked his own question, saying the answer was “above my pay grade.” When I pressed for more, he said: “We are all shaped by events in our early childhood. At a very early age I had to stand on my own two feet and had to make it happen.”
Adversity can lead to resilience, but Stonebraker probably also shares with other entrepreneurs a certain personality feature, a drive to “make a difference” in the real world, and to prove (mostly to himself, I think) that he is right. “The ultimate arbiter of good ideas is the commercial marketplace,” he says.
The real world is where the ideas for his papers-turned-into-startups came from. “If you are going to solve real-world problems,” says Stonebraker, “you got to talk to real people to know what they are. The rubber meets the road as opposed to the rubber meets the sky.”
Once a new idea emerges from these discussions, people outside of the commercial marketplace can also help. “Bouncing ideas off of smart and critical people” has been important for Stonebraker and probably explains why he stayed in academia all these years.
This lengthy and fruitful mixing of academia and the real world has given rise to the Stonebraker Formula for Making a Difference, which he has been perfecting since the 1970s:
- Identify new solution to a data management problem;
- Lead research project to develop a prototype;
- Publish paper(s);
- Publish software code on a public website;
- Launch startup;
- Repeat.
The data explosion over the last forty years has given Stonebraker many opportunities to apply this formula, always going through all the above steps (but not necessarily in the same order), sometimes trying to tackle the same persistent problem again when his previous solutions failed. He rightly dismisses big data as “a marketing buzzword,” but his entire career has been driven by the data tsunami (to use a marketing cliché) and the increasingly diverse needs of “real people” to tame the data and make it work for them.
Until the mid-1990s, the only market requiring databases, Stonebraker observes, was “business data processing, basically operational transaction management. Then data warehouses came on the scene when the retail guys started to put historical sales data into a data warehouse.”
The marketing buzzword that started the big data avalanche, in my opinion, was “data warehouse.” Just like “big data” today, it reflected a new attitude by business executives towards computer-generated data. All of a sudden, it became imperative to keep data in storage for a longer time (instead of deleting it) and to mine the data for new business insights. The culprit was not only Moore’s Law (today’s “Cloud Computing” was the 1990s “Client/Server Computing”), reducing the cost of storing and processing data, but also competitive pressures that drove enterprises to start looking at “business data processing” as a potential input to any type of business decision (e.g., what can we offer a specific segment of our customers), not just decisions that were accounting-related (e.g., how did the business do last month).
What started in specific industries has spread, within a decade, to all sectors of the economy: “Since 2000, basically the entire world has realized that their data management problem is the same as the business data processing problem. It’s gratifying that the rest of the world has a data management problem which is roughly the same as the one we have always been working on,” says Stonebraker.
He was already working on a first example of a special-purpose database, launching Vertica in 2005 (acquired by HP in 2011) to make data warehouses work faster, taming larger quantities of data by organizing it in columns rather than in rows, as traditional databases do.
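The difference between row and column organization can be shown with a minimal Python sketch. It is an illustration of the general idea only, not Vertica's implementation; the sales records, field names, and function names are all hypothetical.

```python
# Hypothetical sales records; the field names are illustrative only.
rows = [
    {"store": "A", "item": "shovel", "units": 3},
    {"store": "B", "item": "salt", "units": 10},
    {"store": "A", "item": "salt", "units": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per field (column layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def total_units_row_store(records):
    """Row layout: every whole record is visited to read a single field."""
    return sum(record["units"] for record in records)


def total_units_column_store(columns):
    """Column layout: only the 'units' column is touched."""
    return sum(columns["units"])


columns = to_columns(rows)
```

Both functions return the same total, but the column version never looks at `store` or `item`. For warehouse-style queries that aggregate a few fields over many records, that is the kind of saving a column layout is after.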
“Today’s legacy database vendors are all the same and their software is good for nothing,” said Stonebraker in his Turing lecture, arguing, as he has done for over a decade, that “one size fits none.” When everybody started talking about “big data” in the late 2000s, he was ready to define its future with three ideas, three research projects, and three startups.
The first was a solution to what Stonebraker calls “Big Velocity” or “drinking from a firehose.” Those trying to drink are the increasing number of enterprises coping with data arriving at very high speed (e.g., Wall Street market feeds). This became his next startup, VoltDB, launched in 2009. This future of big data is all about exploiting the rapidly decreasing cost of computer memory, keeping the data there to be available at much greater speeds than when it is fetched from a disk drive.
Another problem Stonebraker and his collaborators are trying to solve is what he calls “big analytics on big volumes of data.” The growing need and desire for running complex analytics on the growing volumes of data led to an “array database” solution that supports sophisticated statistical procedures that cannot be performed efficiently with traditional table-based databases.
The resultant startup, Paradigm4, was launched in 2010. This future of big data is all about replacing business intelligence with data science and business analysts with data scientists. The example Stonebraker uses to illustrate this prediction is the retail business analyst producing a report on what sold the week before and the week after a snowstorm. The retail data scientist, in contrast, comes up with a predictive model, telling business executives what to expect under different weather conditions.
The third future of big data is data integration, or the “Big Variety” problem, what Stonebraker likes to call “the 800-pound gorilla in the corner.” The problem dates back to the emergence of data warehouses in the 1990s and the need to “clean” and integrate the data coming from a number of data sources, making sure it conforms to a global data dictionary (e.g., “salary” in one data source is a synonym for “wages” in another). The process of data integration invented then, Extract-Transform-Load (ETL), is still used today. But it doesn’t scale, argues Stonebraker, failing when you try to integrate data from thousands of data sources, increasingly a “business as usual” reality for many enterprises trying to tap the abundance of public sources now available on the Web, to say nothing of what’s to come with the emergence of the Internet of Things and the yet-to-emerge new-data-generating technology.
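To make the “global data dictionary” idea concrete, here is a minimal Python sketch of the transform step in an ETL-style pipeline. It is an assumption-laden illustration, not any vendor's tool: the dictionary contents, the global names (`compensation`, `employee`), and the sample records are all hypothetical.

```python
# Hypothetical global data dictionary: source field name -> global field name.
# Every entry has to be written by hand, upfront, for every source.
GLOBAL_DICTIONARY = {
    "salary": "compensation",
    "wages": "compensation",
    "emp_name": "employee",
    "name": "employee",
}


def transform(record):
    """Rename a record's fields to their global names.

    A field with no entry in the dictionary raises KeyError, which is
    where a hand-maintained mapping starts to hurt as sources multiply.
    """
    out = {}
    for field, value in record.items():
        if field not in GLOBAL_DICTIONARY:
            raise KeyError(f"no global mapping for field {field!r}")
        out[GLOBAL_DICTIONARY[field]] = value
    return out


# Two hypothetical sources that name the same things differently.
payroll_a = {"emp_name": "Ada", "salary": 100}
payroll_b = {"name": "Grace", "wages": 120}
integrated = [transform(payroll_a), transform(payroll_b)]
```

With two sources the mapping is trivial. With thousands, each bringing its own field names, someone has to keep extending `GLOBAL_DICTIONARY` by hand, which is one way to read the scaling complaint above.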
“The trouble with doing global upfront data models is that no one has figured out how to make them work,” says Stonebraker. “The only thing you can do is put the data together after the fact.” The solution is a mix of automated machine learning and the crowdsourcing of domain experts, and the resultant startup, Tamr, was launched in 2013.
Preserving data silos is also about the future of the business. “Agility is going to be crucial to successful enterprises,” says Stonebraker, “and I don’t see a way to do that realistically without decomposing into independent business units. The minute you do that you either anoint a Chief Data Officer to keep everybody from diverging or you say, ‘look, run as fast as you can.’ I would err on the side of agility rather than standardization.”
Tamr is his “fourth attempt at doing data integration,” Stonebraker says, “and I think we finally got it right.” Not getting it right happens when you make it happen. All of Stonebraker’s solutions involve some innovative take on known technological tradeoffs, all balanced against the cost of not just the technology but also the people and processes around it. The solution may not get one or more of the components right or will incur unforeseen costs when it is implemented in the real world. Or the timing could be off. “If you are too late, you are toast, if you are too early you are toast,” says Stonebraker. “There’s a lot of serendipity involved. You have to guess the market and lead it.”
Is there a future beyond the future defined by the three startups Stonebraker is currently involved with? “Right now I’m not interested in starting any more companies,” Stonebraker says flatly. But then he adds: “If I had more bandwidth, it would be what I’m working on at MIT right now, what we call Polystores.”
Again, this is an age-old problem, tackled before with the not-too-successful concept of “federated” databases. Today, it’s an extension and expansion (in my opinion) of the Big Variety problem, what happens after the data has been “curated” (cleaned and integrated). Following his strong convictions about the advantages of special-purpose databases and given the proliferation of not just sources of data but also data types, Stonebraker suggests that “it makes sense to load the curated data into multiple DBMSs. For example, the structured data into an RDBMS, the real-time data into a stream processing engine, the historical archive into an array engine, the text into Lucene, and the semi-structured data into a JSON system.”
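The routing described in that quote can be written down as a small Python sketch. The engine categories come from the quote; everything else (the dictionary keys used as labels, the dataset names, the `route` function) is hypothetical, and no real database client is involved.

```python
# Routing table taken from the quote above. The keys are hypothetical labels
# and the values are engine categories, not real client libraries.
ENGINE_FOR = {
    "structured": "RDBMS",
    "real-time": "stream processing engine",
    "historical archive": "array engine",
    "text": "Lucene",
    "semi-structured": "JSON system",
}


def route(datasets):
    """Group curated dataset names by the engine their kind maps to."""
    plan = {}
    for name, kind in datasets:
        plan.setdefault(ENGINE_FOR[kind], []).append(name)
    return plan


# Hypothetical curated datasets, each tagged with its kind.
plan = route([
    ("orders", "structured"),
    ("clickstream", "real-time"),
    ("support_tickets", "text"),
    ("invoices", "structured"),
])
```

Deciding where each dataset lives is the easy part. The hard part, which this sketch does not attempt, is letting one application query all of those engines as if they were a single database.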
This future is all about reducing the complexity of applications that are deployed over multiple, special-purpose database engines. “If your application is managing what you want to think of as a single database which is in fact spread over multiple engines,” says Stonebraker, “with different data models, different transaction systems, different everything, then you want a next-generation federation mechanism to make it as simple as possible to program.”
He adds: “That would be the thing I would look to commercialize if I had the energy…” Quickly correcting himself, Stonebraker says: “Excess bandwidth, I have a lot of energy.”
Indeed. Also the resilience to make it happen and the passion for making a difference.