When people talk about the US National Security Agency/Central Security Service (NSA),
the talk usually centers on privacy, with good reason. Still, it’s not
the only subject worth discussing. The volume of data collected by the
NSA and the associated costs make it the ultimate in Big Data case
studies. What can it tell us about data and business? What can it tell
us about business risk and the potential benefits and consequences of
Big Data investments?
The agency’s exact budget is a government secret, but estimates put it around $10 billion per year.
Although not all of that is devoted to surveillance, it’s reasonable to
conclude that something in the ballpark of $5 billion goes to fund NSA
data gathering each year. This may not be the clear-cut biggest Big Data
application (Google’s revenue
was $66 billion last year, for example), but it’s substantial, focused
and paid for by the public. We ought to discuss what we’re getting for
the money.
The budget is not the only cost of any Big Data program. Data
gathering and analysis have an impact on public perception and everyday
business practices. Do it wrong, and you could run into a lot of costs
you never expected. NSA programs have led to costs that the government
and public may not have anticipated: the cost of correcting functional problems, business lost by US companies, additional security costs
for US individuals and businesses seeking to protect private data, and
influence lost through damage to the credibility of the US government and
businesses.
Spies have always depended on communication surveillance to obtain
information. Stealing documents, listening in on conversations and
cracking the codes of secret messages are basics of the profession.
Electronics have been part of the mix for decades: the British used an
elaborate electronic surveillance system to listen in on captured German officers during the 1940s. What’s new is the volume and breadth of information gathered.
Communication surveillance is a major part of the NSA’s mission (paired with protecting sensitive US communications). Years before Edward Snowden leaked details of the NSA’s mass surveillance of US citizens, Devin Coldewey of TechCrunch reported “NSA to store yottabytes of surveillance data in Utah megarepository”,
though that figure was quickly challenged and a later update revised
the figure to “not so much”. While Coldewey, writing in 2009, may have
been a little off-base on the quantity, he was right on target when he
said the purpose was to store data from extensive surveillance programs.
In 2012, James Bamford of Wired placed the cost of building that data
repository at $2 billion, and quoted an unnamed NSA official stating, “Everybody’s a target; everybody with communication is a target.”
Everybody’s a target. Here’s the thing about Big Data. When you
collect heaps and heaps of data, you may expect you’ll know all about
everybody, but in practice, it may not end up that way.
When I was researching data sources for my book, Data Mining for Dummies,
I gathered data on myself from several providers. These sources offer a
lot of personal information. They can tell you, for example, that I’m
single, a fan of gardening and aerobics, a pet owner, and a regular user
of American Express and Discover cards. They can tell you my income and
what month my insurance payment is due. What a lot of detail! Think of
what you could do with information like that. But you won’t get the
results you want, because every bit of that information is wrong.
When the NSA obtains communications data, it has advantages that you
do not. It can get communication data directly from the source, and
that’s good behavioral data rather than data from self-reported or other
secondary sources, which are consistently of inferior quality. But huge volumes
of data come with huge problems. The data management burden is
stupendous, and most of that data is irrelevant to the intended purpose.
Most people, and most communications, are not involved in government
spying, terrorism or other crimes of interest to the NSA.
Because Big Data sources usually are not specific to any particular
application, they are not necessarily the best data resources for
solving any particular problem. A small volume of data, carefully
collected for relevance and quality, may offer more power.
So what do the NSA’s Big Data programs provide us in return for our money?
Senator Dianne Feinstein, in a 2013 Wall Street Journal op-ed, said we’re getting a lot.
“Working in combination, the call-records database and other NSA
programs have aided efforts by U.S. intelligence agencies to disrupt
terrorism in the U.S. approximately a dozen times in recent years,
according to the NSA. This summer, the agency disclosed that 54
terrorist events have been interrupted — including plots stopped and
arrests made for support to terrorism. Thirteen events were in the U.S.
homeland and nine involved U.S. persons or facilities overseas.
Twenty-five were in Europe, five in Africa and 11 in Asia.”
But not everyone shares that view of the results. Some claim the agency is simply overwhelmed with data.
When Senator Feinstein told us that terrorist events were stopped and
arrests made in the US, I found myself wondering why she wasn’t talking
about convictions. I wondered why all that data wasn’t adequate to
prevent the 2013 Boston Marathon bombing.
Traditional research methods and resources sometimes produce better
results than massive data sources. The successful hunt to find Osama Bin
Laden, a man with considerable resources and motivation not to be
found, was done the old-fashioned way. Trained analysts,
working with documents and other sources, thoughtfully researched the
target over a long time. It was unglamorous work, and not highly
appreciated during much of the time it went on. In early 2001, similar
techniques provided warnings of a threat months before the attacks of September 11.
We talk a lot about privacy implications of Big Data, as we should, but
we don’t talk much about the costs and the quality of the results. As a
statistician and data miner, I appreciate the value of data analysis,
but I also appreciate its limits. When we invest in data, whether in
government, business or any aspect of life, we ought to put serious
thought and discussion into what value we’re getting for our money and
effort.