Big data holds great promise: structured and unstructured data harvested, processed and analysed in near-real-time by organisations sifting for new opportunities or seeking to refine existing activities. Data is pouring in from sensors, shoppers, IoT devices, social media and more, and companies are investing in big-data projects, from data lakes and processing frameworks such as Hadoop to analytics tools and Intel hardware.
As we pump increasing amounts of data into these systems, how we source and manage that information is becoming increasingly important.
Users can take advantage of the increased analytical computing power offered by scale-out configurations of x86 processors, but not all companies have been fastidious about data quality.
That might not sound like a problem, but this new world requires reliable foundations, and inaccurate information can cause problems down the line. Say you are a financial services firm selling insurance based on customers’ combined Fitbit and shopping data: what if you deny somebody a policy, but the data you based that decision on was wrong because it became corrupted somewhere during creation, transmission, storage or analysis?
“Not much attention was paid to data,” says Melanie Mecca, reflecting on our newfound appetite for data. Mecca directs data management products and services at the CMMI Institute, a Carnegie Mellon organization that focuses on technology best practice. “It was seen as the toothpaste in the tube of features, technology and automated capabilities. The data itself was never viewed as the foundation and the life blood of the organization’s business knowledge. That is why it has been neglected.”
A standard for data assurance?
One organization looking to tackle this is the National Physical Laboratory (NPL), the UK’s national measurement standards laboratory and home to the country’s largest applied physics organisation. NPL is working on introducing a systematic approach to creating a measurable level of confidence in data.
“We’re trying to apply our way of thinking that comes from the measurement domain to thinking about how it applies to the digital domain,” says NPL Fellow Alistair Forbes.
When thinking about big data quality, NPL looks at the four Cs: collection, connection, comprehension and confidence.
Collection means verifying the source of the data and assessing its credibility and accuracy. Taking data from an unverified source with no measurement of data quality is a bad idea. Connection looks at how the data was transported and whether there was proper error correction in the event of interference.
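As a minimal sketch of what collection and connection checks might look like in practice (the source registry and checksum convention here are hypothetical, not NPL’s), an ingestion pipeline could refuse data that arrives from an unknown source or fails an integrity check:

```python
import hashlib

# Hypothetical registry of verified sources; a real system would manage
# source credibility very differently.
TRUSTED_SOURCES = {"sensor-array-7", "retail-feed-eu"}

def verify_payload(source_id: str, payload: bytes, published_sha256: str) -> bool:
    """Accept data only if the source is known (collection) and the payload
    survived transport intact (connection)."""
    if source_id not in TRUSTED_SOURCES:
        return False  # unverified source: credibility unknown
    digest = hashlib.sha256(payload).hexdigest()
    return digest == published_sha256  # detects corruption in transit

# Example: a payload whose checksum no longer matches is rejected.
data = b'{"temp_c": 21.4, "ts": "2017-09-01T10:00:00Z"}'
good = hashlib.sha256(data).hexdigest()
print(verify_payload("sensor-array-7", data, good))          # True
print(verify_payload("sensor-array-7", data + b" ", good))   # False
```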
Comprehension means understanding the data “properly,” according to Neil Stansfield, head of the digital sector at NPL. “When we’re doing analytics, using data from lots of sources, how do we ensure that uncertainty propagation through those data sources is properly understood?” he says.
Today, the best tool for modeling that propagation is the Guide to the Expression of Uncertainty in Measurement (GUM), which was developed by NPL and the international measurement community. It describes how uncertainty propagates across different sensors and data sources, and what that means for decision-making.
“If you’re trying to guide people to collect information, put it together and use it, this is how uncertainties will flow,” says Stansfield. “So when you do system level design, you can get it right first time.”
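The GUM’s core recipe, the law of propagation of uncertainty, can be shown with a short numerical sketch (the resistor-power model below is invented purely for illustration): the combined standard uncertainty of a derived quantity is the root sum of squares of each input’s uncertainty, scaled by the model’s sensitivity to that input.

```python
def combined_uncertainty(f, inputs, uncertainties, eps=1e-6):
    """GUM-style law of propagation of uncertainty for uncorrelated inputs:
    u_c(y)^2 = sum_i (df/dx_i)^2 * u(x_i)^2, with the sensitivity
    coefficients df/dx_i estimated by finite differences."""
    total = 0.0
    for i, u in enumerate(uncertainties):
        shifted = list(inputs)
        shifted[i] += eps
        sensitivity = (f(*shifted) - f(*inputs)) / eps
        total += (sensitivity * u) ** 2
    return total ** 0.5

# Invented example: power dissipated in a resistor, P = V^2 / R, derived from
# a voltage reading and a resistance value, each with its own uncertainty.
power = lambda v, r: v ** 2 / r
u_c = combined_uncertainty(power, inputs=[12.0, 4.0], uncertainties=[0.1, 0.05])
print(f"P = {power(12.0, 4.0):.2f} W +/- {u_c:.2f} W")  # 36.00 W +/- 0.75 W
```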
Forbes describes the GUM as a relatively narrow guide, and NPL is doing research to expand it beyond how uncertainty spreads, to how much there is.
“We’re going from a paradigm of uncertainty propagation to a paradigm of uncertainty quantification, which is a comprehensive assessment of where the uncertainty sources are and trying to account for them using better statistical tools,” he says.
Today, we measure the certainty of something by modeling it, which becomes more difficult as the model becomes more complex. NPL is developing a methodology to quantify the uncertainty associated with a model. It is mainly targeting the engineering domain, but also wants to address other areas from satellite imaging to life sciences.
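One common statistical tool for this kind of quantification, and not necessarily the one NPL’s methodology will settle on, is Monte Carlo simulation: rather than propagating a single standard uncertainty per input, propagate whole distributions through the model and inspect the spread of the outputs. A minimal sketch, reusing the invented resistor-power model from above:

```python
import random
import statistics

def monte_carlo_uq(model, input_dists, n=100_000):
    """Draw inputs from their assumed distributions, push each draw through
    the model, and summarise the spread of the outputs."""
    outputs = [model(*(dist() for dist in input_dists)) for _ in range(n)]
    return statistics.mean(outputs), statistics.stdev(outputs)

power = lambda v, r: v ** 2 / r
mean, spread = monte_carlo_uq(
    power,
    [lambda: random.gauss(12.0, 0.1),   # voltage reading and its uncertainty
     lambda: random.gauss(4.0, 0.05)],  # resistance value and its uncertainty
)
print(f"P ~ {mean:.2f} W, standard uncertainty ~ {spread:.2f} W")
```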
NPL is also exploring how metadata around data quality might be stored at a machine-readable level, to make this data more accessible.
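What machine-readable quality metadata might look like is still an open question; as a purely hypothetical sketch, each dataset could carry a small record describing its provenance, the checks it has passed and any known uncertainty:

```python
import json

# Purely hypothetical quality-metadata record: the field names and scores are
# illustrative, not part of any published NPL or industry schema.
quality_record = {
    "dataset": "retail-transactions-2017-09",
    "source": "point-of-sale feed, EU region",
    "collected": "2017-09-01T00:00:00Z",
    "checks": {
        "completeness": 0.98,   # share of records with no missing fields
        "consistency": 0.95,    # share passing cross-field validation rules
        "accuracy": 0.97,       # share matching a reference sample
    },
    "uncertainty_notes": "timestamps accurate to +/- 1 s; amounts exact",
}

print(json.dumps(quality_record, indent=2))
```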
While NPL focuses on confidence in source data, others are helping to ensure that it is managed properly in scientific applications. For example, Intel – a company you might normally associate with silicon engineering – has been working with a number of partners on the subject of data management.
Intel has, for example, partnered with the US Department of Energy’s National Energy Research Scientific Computing Center (NERSC) and five Intel Parallel Computing Centers (IPCCs) to create a Big Data Center (BDC). This will work on creating robust infrastructures for data management.
Tech names are also seeking sector-specific solutions: this summer Intel, carmaker Toyota and others announced the creation of the Automotive Edge Computing Consortium, a big-data group that will work on standards, best practices and architectures for emerging mobile technologies within the car sector.
Looking further up the big data stack
Intel has been working on the stack, too, partnering with Hadoop specialist Cloudera on open-source enterprise data management by tuning the data-crunching platform for Intel architectures. Hadoop is becoming an industry-standard big-data processing platform, while Intel accounts for more than 90 per cent of the global data-center market, meaning a potentially significant overlap between the two. Intel and Hortonworks are developing a joint roadmap to accelerate the performance of encryption and decryption, data compression and decompression, caching and I/O-intensive workloads.
SAS, meanwhile, is working on metrics that can help improve quality management in a big-data environment. Metrics often used in client engagements include completeness, consistency and accuracy, says Ron Agresta, the company’s director of product management for data management. “A lot of organizations will categorise what they’re looking for and the checks they’re making along those lines, so that they can aggregate those and roll them up to dashboards,” he says. However, he adds that one customer’s requirements for these metrics might differ from another’s, depending on what they’re using the data for.
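As a rough illustration of that kind of roll-up (not SAS’s implementation; the records and rules below are invented), the three metrics can be computed per check and then aggregated into a dashboard-style summary:

```python
# Invented customer records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "country": "UK", "signup_year": 2015},
    {"id": 2, "age": None, "country": "UK", "signup_year": 2016},
    {"id": 3, "age": 29, "country": "XX", "signup_year": 2026},
]

def completeness(rows):
    """Share of fields that are populated."""
    cells = [v for row in rows for v in row.values()]
    return sum(v is not None for v in cells) / len(cells)

def consistency(rows):
    """Share of rows passing a simple cross-field rule (signup not in the future)."""
    return sum(r["signup_year"] <= 2017 for r in rows) / len(rows)

def accuracy(rows):
    """Share of rows whose country code appears in a reference list."""
    reference = {"UK", "US", "DE", "FR"}
    return sum(r["country"] in reference for r in rows) / len(rows)

# Roll the individual checks up into a single summary for a dashboard.
dashboard = {m.__name__: round(m(records), 2) for m in (completeness, consistency, accuracy)}
print(dashboard)  # {'completeness': 0.92, 'consistency': 0.67, 'accuracy': 0.67}
```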
“I don’t think we’re seeing any standard way to go about data management in this current environment,” says his colleague Todd Wright, senior product marketing manager for data management at SAS. Where standards do exist for ensuring big data quality, they’re handled at a sector-specific level in heavily-regulated areas like healthcare, he adds.
“Every organization, even in the same industries, has a variety of issues that they’re trying to tackle, and especially among vendors there is no standard way of tackling these issues around big data. We’ve taken it case by case,” he says.
Frameworks for data quality and governance also exist at a higher level. Mecca’s organization publishes its Data Management Maturity (DMM) model, which focuses on the upper tier of the data management stack, looking at the techniques that people use to ensure quality and consistency in their data.
“These are practices that people have to do,” she says, arguing that the organization didn’t come at it from a technology perspective. “In terms of making decisions about the data, that’s a person process.” The DMM looks at areas such as data management (the creation of a business glossary for data and a metadata repository), data governance, and data quality.
The EDM Council also has its own framework for effectively managing big data, called the Data Management Capability Assessment Model (DCAM). “They have a circular chart with various aspects of data quality that are assessed,” says Mike Bennett, director of semantics and standards for the EDM Council.
Big data offers grand potential, for greater insight and for new business, but as more devices are connected and data is merged, the potential for mistakes grows. What is missing is an overall set of standards, or a consensus on the management of data, that could help avoid these issues.
As vendors and researchers apply more expertise to helping customers improve the quality and management of their data, there’s hope that common agreement will become a reality.
Indeed, that reality may become a necessity. If data is the new oil, something that’s critical to a new way of doing business, it will become important not just to identify any mistakes in data but also to be able to trace them back to their origin. That will be vital for the sake of visibility and accountability. Like Sarbanes-Oxley a generation before, data standards will become a matter of business necessity: a regulatory requirement, a way not just of measuring that you are compliant but of proving it.