After all, the NSA is widely known to be a user of technologies from SAP, IBM, Oracle and (probably) every other vendor you could care to name. Why shouldn’t the organization develop technologies, contribute them to the community and then continue to update its own software with open contributions to a core code base that have been shown to be useful, secure, robust, productive and so on?
NSA Technology Transfer Program
One such NSA project was ‘Niagarafiles’, known today as NiFi: an automation tool that acquires and delivers data across enterprise systems in real time. NiFi was submitted to The Apache Software Foundation (ASF) in November 2014 as part of the NSA Technology Transfer Program.
People (well, software architects, developers and systems engineers) have been trying to automate the flow of data inside of (and between) computer systems for decades. Project NiFi set out to address what were understood to be ‘critical gaps’ in traditional systems, where existing solutions lacked:
- sufficient security,
- interactivity,
- scalability, and
- data lineage/provenance, i.e. a data-lifecycle measure detailing data’s origins and why, how and where it moves over time (a sketch of such a record follows this list).
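To make the lineage/provenance idea concrete, here is a minimal, purely illustrative sketch of the kind of record such a measure might capture. The class name and fields are hypothetical and are not part of NiFi’s actual provenance API.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical lineage/provenance record: one event in a data object's
// lifecycle, noting where it came from, what happened to it, and why.
// (Illustrative only; this is not NiFi's provenance API.)
public record ProvenanceEvent(
        String dataId,           // identifier of the data object being tracked
        String eventType,        // e.g. RECEIVE, TRANSFORM, FORK, SEND
        Instant timestamp,       // when the event occurred
        String component,        // the processing step that produced the event
        List<String> parentIds,  // upstream objects this one was derived from
        String details           // free-form note on why/how the data moved
) {}
```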
Now known formally as Apache™ NiFi™, the project has recently graduated from Apache Incubator status to become a Top-Level Project (TLP), signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.
“We took a project with more than eight years of development in a closed source environment and transitioned it to a very open and collaborative space,” said Joe Witt, vice president of Apache NiFi. “How easy that transition was speaks volumes to the effectiveness of the Incubator process and the community around Apache in general.”

Go with the flow (-based programming)
Based on the concepts of Flow-Based Programming, NiFi features a user interface and fine-grained data provenance tools. The interface allows users to intuitively understand and interact with the data flow directly in the browser, promoting faster and safer iteration.
According to J. Paul Morrison, Flow-Based Programming (FBP) is a programming paradigm that uses a ‘data factory’ metaphor for designing and building applications. “FBP defines applications as networks of ‘black box’ processes, which exchange data across predefined connections by message passing, where the connections are specified externally to the processes. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. FBP is thus naturally component-oriented,” he writes.

The data provenance features allow the user to see how an object flowed through the system, replay it and visualize what happened to it before and after key stages, thereby simplifying data flows that are often large, complex directed graphs involving transformations, forks, joins etc.
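To make Morrison’s ‘black boxes wired by external connections’ idea concrete, here is a minimal, self-contained sketch in plain Java (not NiFi code): two processes that know nothing about each other are connected by queues defined entirely outside them, so the network can be rewired without touching either component.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Each "black box" process only knows its own in/out queues, never the
// component sitting on the other end of the connection.
public class FbpSketch {

    static final String EOF = "\u0000EOF";  // sentinel marking end of stream

    // Process 1: reads lines from its input queue and upper-cases them.
    static Runnable upperCaser(BlockingQueue<String> in, BlockingQueue<String> out) {
        return () -> {
            try {
                for (String msg = in.take(); !EOF.equals(msg); msg = in.take()) {
                    out.put(msg.toUpperCase());
                }
                out.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Process 2: consumes messages and prints them (a simple sink).
    static Runnable printer(BlockingQueue<String> in) {
        return () -> {
            try {
                for (String msg = in.take(); !EOF.equals(msg); msg = in.take()) {
                    System.out.println(msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // Connections are defined here, externally to the processes; rewiring
        // the network requires no change inside either component.
        BlockingQueue<String> source = new ArrayBlockingQueue<>(10);
        BlockingQueue<String> wire   = new ArrayBlockingQueue<>(10);

        Thread t1 = new Thread(upperCaser(source, wire));
        Thread t2 = new Thread(printer(wire));
        t1.start();
        t2.start();

        source.put("hello");
        source.put("flow-based programming");
        source.put(EOF);

        t1.join();
        t2.join();
    }
}
```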
“NiFi’s user interface, robust security features, and powerful data provenance offer a set of capabilities for solving the challenges of managing distributed systems,” said Rob Bearden, CEO of Hortonworks. “We are proud NiFi participants and congratulate the NiFi community on becoming a top-level Apache project.”
In addition, NiFi uses a component-based extension model to add capabilities to complex dataflows. Out of the box, NiFi ships with several extensions for dealing with file-based dataflows, such as FTP, SFTP and HTTP integration, as well as integration with HDFS. Rounding out the feature set, NiFi has a web-based interface for designing, controlling and monitoring a dataflow.
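As an illustration of that extension model, here is a minimal sketch of a custom processor written against NiFi’s public processor API. The class, attribute and relationship names are invented for the example, and exact annotations and packaging details (the processor would normally be bundled as a NAR) vary by NiFi version.

```java
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Sketch of a NiFi extension: a processor that tags each FlowFile passing
// through it and routes it onwards. Deployed, it would appear alongside the
// built-in FTP/SFTP/HTTP/HDFS processors in the web-based interface.
public class TagFlowFileProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles that were tagged successfully")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();   // pull the next FlowFile from the incoming queue
        if (flowFile == null) {
            return;                          // nothing to do on this trigger
        }
        // Attach a simple attribute; real processors might transform content instead.
        flowFile = session.putAttribute(flowFile, "tagged.by", "TagFlowFileProcessor");
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```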
Why all this matters
The notion of big data has arguably garnered more interest from the business community than from the technologists who implement its related technologies. As the ‘business-technical’ audience (and some of the non-technical) becomes more comfortable with big data, it is only natural to go deeper and address issues like big data lineage/provenance. Now you know what big data is, don’t you want to know where it comes from and how it lives in the dataflow lifecycle?