Wednesday, 22 July 2015

NSA 'NiFi' Big Data Automation Project Out In The Open - Adrian Bridgwater

The National Security Agency (NSA) is perhaps not known for its openness and willingness to share. That’s not what it does, essentially, and of course not why it exists. That being said, it appears that the core spirit of open source collaborative software application development is very much recognized by ‘the agency’, just as it is in any commercial business. Community code evolution benefits all (including the NSA) if the principles of natural selection hold, which they do.

After all, the NSA is widely known to be a user of technologies from SAP, IBM, Oracle and (probably) every other vendor you could care to name. Why shouldn’t the organization develop technologies, contribute them to the community and then continue to update its own software with open contributions to a core code base that have been shown to be useful, secure, robust, productive and so on?

NSA Technology Transfer Program
One such NSA project was ‘Niagarafiles’, which today is known as NiFi. It exists as an automation tool that acquires and delivers data across enterprise systems in real time. NiFi was submitted to The Apache Software Foundation (ASF) in November 2014 as part of the NSA Technology Transfer Program.
People (well, software architects, developers and systems engineers) have been trying to automate the flow of data inside of (and between) computer systems for decades. Project NiFi set out to address what were understood to be ‘critical gaps’ in traditional systems where other solutions lacked:
  • sufficient security,
  • interactivity,
  • scalability, and
  • data lineage/provenance, i.e. a data lifecycle measure detailing data’s origins and why, how and where it moves over time.
How open is Niagarafiles (now NiFi) to flow (get it? waterfall flow of data from one place to another) now that it resides under the auspices of The Apache Software Foundation? Well, the ASF is an all-volunteer organization of developers, stewards and incubators that currently oversees more than 350 open source projects and initiatives, so the answer is: quite open.
Now known formally as Apache™ NiFi™, the project has recently graduated from Apache Incubator status to become a Top-Level Project (TLP), signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.
“We took a project with more than eight years of development in a closed source environment and transitioned it to a very open and collaborative space,” said Joe Witt, vice president of Apache NiFi. “How easy that transition was speaks volumes to the effectiveness of the Incubator process and the community around Apache in general.”
Go with the flow (-based programming)
Based on the concepts of Flow-Based Programming, NiFi features a user interface and fine-grained data provenance tools. The interface allows users to intuitively understand and interact with the data flow directly in the browser, promoting faster and safer iteration.
According to J Paul Morrison, in computer programming, Flow-Based Programming (FBP) is a programming paradigm that uses a ‘data factory’ metaphor for designing and building applications. “FBP defines applications as networks of ‘black box’ processes, which exchange data across predefined connections by message passing, where the connections are specified externally to the processes. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. FBP is thus naturally component-oriented,” he writes.
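To make Morrison's description concrete, here is a minimal Python sketch of the idea. It is not NiFi code and not Morrison's own implementation: the process names, the `run_network` helper and the simple linear wiring are all assumptions made for illustration. The point it shows is that the processes stay untouched while the connections, which live outside them, decide what application you get.

```python
# Hypothetical 'black box' processes: each one transforms a packet and
# knows nothing about which process comes before or after it.
def reverse(packet):
    return packet[::-1]


def exclaim(packet):
    return packet + "!"


def run_network(processes, connections, entry, packets):
    """Push each packet through a linear chain of processes.

    processes:   name -> callable (the black boxes)
    connections: upstream name -> downstream name, defined outside the processes
    entry:       name of the first process in the chain
    """
    results = []
    for packet in packets:
        name = entry
        while name is not None:
            packet = processes[name](packet)
            name = connections.get(name)  # None ends the chain
        results.append(packet)
    return results


PROCESSES = {"reverse": reverse, "exclaim": exclaim}

# Same black boxes, two different wirings, two different applications.
first = run_network(PROCESSES, {"reverse": "exclaim"}, "reverse", ["nifi"])
second = run_network(PROCESSES, {"exclaim": "reverse"}, "exclaim", ["nifi"])
print(first)   # ['ifin!']
print(second)  # ['!ifin']
```

A real flow-based system would also handle forks, joins and concurrent execution; this sketch only covers a straight chain so the rewiring idea stays visible.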
The data provenance features allow the user to see how an object flowed through the system, replay it and visualize what happened to it before and after key stages, thereby simplifying data flows that are often large, complex directed graphs involving transformations, forks, joins etc.
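As a rough illustration of what tracking and replay involve, the Python sketch below records what an object looked like before and after each stage, then re-runs the flow from a chosen stage. It does not use NiFi's API or its provenance store; `TrackedObject`, `apply_stage` and `replay_from` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedObject:
    content: str
    # One (stage name, content before, content after) entry per stage passed.
    lineage: list = field(default_factory=list)


def apply_stage(obj, name, transform):
    """Run one stage and record what the object looked like before and after."""
    before = obj.content
    obj.content = transform(before)
    obj.lineage.append((name, before, obj.content))
    return obj


def replay_from(obj, index, stages):
    """Re-run stages[index:] from the content recorded just before stage `index`."""
    fresh = TrackedObject(obj.lineage[index][1])
    for name, transform in stages[index:]:
        apply_stage(fresh, name, transform)
    return fresh


# Hypothetical two-stage flow.
STAGES = [("strip", str.strip), ("upper", str.upper)]

obj = TrackedObject("  sensor-7 ok ")
for name, transform in STAGES:
    apply_stage(obj, name, transform)

for event in obj.lineage:
    print(event)

replayed = replay_from(obj, 1, STAGES)
print(replayed.content)  # SENSOR-7 OK
```

Keeping the 'before' content for every stage is what makes replay possible here, at the cost of storing a copy per stage.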
“NiFi’s user interface, robust security features, and powerful data provenance offer a set of capabilities for solving the challenges of managing distributed systems,” said Rob Bearden, CEO of Hortonworks. “We are proud NiFi participants and congratulate the NiFi community on becoming a top-level Apache project.”

In addition, NiFi uses a component-based extension model to add capabilities to complex dataflows. Out of the box, NiFi has several extensions for dealing with file-based dataflows such as FTP, SFTP, and HTTP integration as well as integration with HDFS. Finally in the feature set here, NiFi has a web-based interface for designing, controlling and monitoring a dataflow.


Why all this matters
The notion of big data has arguably garnered more interest from the business community than it has from the technologists who implement its related technologies. As the ‘business-technical’ audience (and some of the non-technical) becomes more and more comfortable with big data, it is only natural to start going deeper and addressing issues like big data lineage/provenance. Now that you know what big data is, don’t you want to know where it comes from and how it lives in the dataflow lifecycle?
