Data now stream from daily life: from
phones and credit cards and televisions and computers; from the
infrastructure of cities; from sensor-equipped buildings, trains, buses,
planes, bridges, and factories. The data flow so fast that the total
accumulation of the past two years—a zettabyte—dwarfs the prior record
of human civilization. “There is a big data revolution,” says Weatherhead University Professor Gary King. But it is not the quantity of data that is revolutionary. “The big data revolution is that now we can do something with the data.”
The revolution lies in improved statistical and computational
methods, not in the exponential growth of storage or even computational
capacity, King explains. The doubling of computing power every 18 months
(Moore’s Law) “is nothing compared to a big algorithm”—a set of rules
that can be used to solve a problem a thousand times faster than
conventional computational methods could. One colleague, faced with a
mountain of data, figured out that he would need a $2-million computer
to analyze it. Instead, King and his graduate students came up with an
algorithm within two hours that would do the same thing in 20 minutes—on
a laptop: a simple example, but illustrative.
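King does not say what his algorithm actually did, so the sketch below is purely illustrative, with invented data and a deliberately simple task (finding duplicate records). It shows the general point behind the anecdote: replacing a quadratic pairwise scan with a single hashed pass cuts the work by orders of magnitude, a gain no amount of hardware spending matches.

```python
# Illustrative only -- not King's algorithm. The point is that the
# hashed version does in one pass what the naive version does in
# roughly n^2/2 comparisons.

def duplicates_naive(records):
    """Compare every pair of records: O(n^2) work."""
    dupes = set()
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i] == records[j]:
                dupes.add(records[i])
    return dupes

def duplicates_hashed(records):
    """One pass with a hash set: O(n) work."""
    seen, dupes = set(), set()
    for r in records:
        if r in seen:
            dupes.add(r)
        else:
            seen.add(r)
    return dupes

if __name__ == "__main__":
    data = [1, 5, 3, 5, 9, 1] * 100
    assert duplicates_naive(data) == duplicates_hashed(data)
    # At a million records, the naive scan needs ~5 x 10^11 comparisons;
    # the hashed pass needs ~10^6 operations.
```

At a million records, that is roughly the difference between a cluster-sized job and a laptop-sized one, which is the shape of King's anecdote.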
New ways of linking datasets have played a large role in generating new insights. And creative approaches to visualizing
data—humans are far better than computers at seeing patterns—frequently
prove integral to the process of creating knowledge. Many of the tools
now being developed can be used across disciplines as seemingly
disparate as astronomy and medicine. Among students, there is a huge
appetite for the new field. A Harvard course in data science last fall
attracted 400 students, from the schools of law, business, government,
design, and medicine, as well as from the College, the School of
Engineering and Applied Sciences (SEAS), and even MIT. Faculty members
have taken note: the Harvard School of Public Health (HSPH) will
introduce a new master’s program in computational biology and
quantitative genetics next year, likely a precursor to a Ph.D. program.
In SEAS, there is talk of organizing a master’s in data science.
“There is a movement of quantification rumbling across fields in
academia and science, industry and government and nonprofits,” says
King, who directs Harvard’s Institute for Quantitative Social Science
(IQSS), a hub of expertise for interdisciplinary projects aimed at
solving problems in human society. Among faculty colleagues, he reports,
“Half the members of the government department are doing some type of
data analysis, along with much of the sociology department and a good
fraction of economics, more than half of the School of Public Health,
and a lot in the Medical School.” Even law has been seized by the
movement to empirical research—“which is social science,” he says. “It
is hard to find an area that hasn’t been affected.”
The story follows a similar pattern in every field, King asserts. The
leaders are qualitative experts in their field. Then a statistical
researcher who doesn’t know the details of the field comes in and, using
modern data analysis, adds tremendous insight and value. As an example,
he describes how Kevin Quinn, formerly an assistant professor of
government at Harvard, ran a contest comparing his statistical model to
the qualitative judgments of 87 law professors to see which could best
predict the outcome of all the Supreme Court cases in a year. “The law
professors knew the jurisprudence and what each of the justices had
decided in previous cases, they knew the case law and all the
arguments,” King recalls. “Quinn and his collaborator, Andrew Martin
[then an associate professor of political science at Washington
University], collected six crude variables on a whole lot of previous
cases and did an analysis.” King pauses a moment. “I think you know how
this is going to end. It was no contest.” Whenever sufficient
information can be quantified, modern statistical methods will
outperform an individual or small group of people every time.
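The published Martin-Quinn forecasting work reportedly used classification trees over a handful of codable case attributes. The sketch below is a stand-in, not their model: the six feature names and all the training data are invented, and a scikit-learn decision tree plays the role of the statistical method.

```python
# Hedged stand-in for the kind of model described above; features and
# data are hypothetical, not the actual Quinn-Martin variables or cases.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Six crude variables per case (invented for illustration): e.g.
# circuit of origin, issue area, petitioner type, respondent type,
# lower-court direction, whether a law was challenged.
X_train = rng.integers(0, 5, size=(500, 6))
y_train = rng.integers(0, 2, size=500)   # 1 = reverse, 0 = affirm

model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Predict the coming term's cases from the same six crude variables.
new_cases = rng.integers(0, 5, size=(10, 6))
print(model.predict(new_cases))
```

The design point is the one King makes: the model sees only a few crude, quantifiable variables per case, yet applied consistently across hundreds of cases it can beat expert intuition.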
In marketing, familiar uses of big data include “recommendation
engines” like those used by companies such as Netflix and Amazon to make
purchase suggestions based on the prior interests of one customer as
compared to millions of others. Target famously (or infamously) used an
algorithm to detect when women were pregnant by tracking purchases of
items such as unscented lotions—and offered special discounts and
coupons to those valuable patrons. Credit-card companies have found
unusual associations in the course of mining data to evaluate the risk
of default: people who buy anti-scuff pads for their furniture, for
example, are highly likely to make their payments.
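As a loose sketch of the idea behind such recommendation engines (the production systems at Netflix and Amazon are vastly more elaborate, and this toy purchase matrix is invented), one can score items for a customer by how heavily customers with similar histories bought them:

```python
# Minimal collaborative-filtering sketch: recommend items favored by
# customers whose purchase histories are cosine-similar to yours.
# Toy data; real recommendation engines are far more sophisticated.
import numpy as np

# Rows = customers, columns = items; 1 = purchased.
purchases = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
])

def recommend(customer, k=2):
    """Rank the customer's unpurchased items by similar customers' buys."""
    target = purchases[customer]
    norms = np.linalg.norm(purchases, axis=1) * np.linalg.norm(target)
    sims = purchases @ target / np.where(norms == 0, 1, norms)
    sims[customer] = 0.0            # don't count the customer themselves
    scores = sims @ purchases       # weight each item by buyer similarity
    scores[target == 1] = -1.0      # exclude items already purchased
    return np.argsort(scores)[::-1][:k]

print(recommend(0))   # top suggestions for customer 0
```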
In the public realm, there are all kinds of applications: allocating
police resources by predicting where and when crimes are most likely to
occur; finding associations between air quality and health; or using
genomic analysis to speed the breeding of crops like rice for drought
resistance. In more specialized research, to take one example, creating
tools to analyze huge datasets in the biological sciences enabled
associate professor of organismic and evolutionary biology Pardis Sabeti,
studying the human genome’s billions of base pairs, to identify genes
that rose to prominence quickly in the course of human evolution,
determining traits such as the ability to digest cow’s milk, or
resistance to diseases like malaria.
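Sabeti's actual selection tests rely on haplotype structure: a newly advantageous allele drags a long, unbroken stretch of surrounding sequence to high frequency along with it. As a much cruder, purely illustrative proxy, with invented genotype data, one can scan for sites where the derived allele sits at an unusually high frequency:

```python
# Crude illustration only: Sabeti's tests use haplotype structure, not
# this simple frequency scan. Toy data with one planted "swept" site.
import numpy as np

rng = np.random.default_rng(1)
# Rows = individuals, columns = variant sites; entries count copies
# of the derived allele (0, 1, or 2).
genotypes = rng.integers(0, 3, size=(1000, 50))
genotypes[:, 7] = rng.choice([1, 2], size=1000)   # plant a selected site

freq = genotypes.mean(axis=0) / 2        # derived-allele frequency per site
z = (freq - freq.mean()) / freq.std()    # flag frequency outliers
print("candidate sites:", np.where(z > 3)[0])     # should report site 7
```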