
Archived But Accessible: Retirement Planning For Your Big Data

There is a special time in the life of every piece of data when it is in its prime and at its most active. Some data might be at its peak for 20 seconds while other data might be important for a year or more.

Alas, time and changing context inevitably catch up with all data. So what is your plan for what comes next? Here are some things to consider when deciding what will happen to data that’s reached its golden years.


Archiving seems like the obvious answer, but what does that mean? Not long ago archiving was often used as a euphemism for backup. Organizations packed up their data and put it somewhere dark and out of the way. Retrieving anything was a nightmare and only happened when it was absolutely necessary. At many companies, that form of archive was where data went to die.

That archaic approach won’t work in an era when we value data so highly and expect access to it. We should be thinking not only of our data’s immediate analytical value but also looking ahead at how to preserve that value as the data ages. That requires a new way of thinking about what an archive is.

Obviously, it starts with moving data that is no longer part of your day-to-day operational queries to a more passive environment, a quieter place where it’s no longer accessed by hundreds or thousands of users in parallel. When you move valuable business data into an archive, you need to ensure regulatory compliance and adhere to your own policies around data governance.

Archived data needs to be immutable. Simply buying hardware and storing everything in an open-source environment doesn’t meet immutability requirements, because people can still change data in Hadoop.

Your archive needs to provide access to data without the ability to change it or its metadata, preserving immutability for compliance purposes. Data should remain queryable when needed in order to generate new insights.
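
Immutability is ultimately enforced by the storage layer, but a simple tamper-evidence check illustrates the idea. Below is a minimal Python sketch, assuming a hypothetical archive directory of Parquet files: it records a SHA-256 fingerprint for each file at archive time and later confirms that nothing has changed.

    import hashlib
    import json
    from pathlib import Path

    # Hypothetical locations; substitute your own archive layout.
    ARCHIVE_DIR = Path("/archive/2019/q1")
    MANIFEST = ARCHIVE_DIR / "manifest.json"

    def fingerprint(path: Path) -> str:
        """Return the SHA-256 digest of a file's contents."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def build_manifest() -> None:
        """Record a fingerprint for every data file at archive time."""
        manifest = {p.name: fingerprint(p)
                    for p in sorted(ARCHIVE_DIR.glob("*.parquet"))}
        MANIFEST.write_text(json.dumps(manifest, indent=2))

    def verify_archive() -> bool:
        """Confirm no archived file has been altered since ingestion."""
        manifest = json.loads(MANIFEST.read_text())
        return all(fingerprint(ARCHIVE_DIR / name) == digest
                   for name, digest in manifest.items())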

As with any long-term care scenario, we need to be mindful of costs. Not every company can afford to maintain data indefinitely, and most data doesn’t warrant that anyway. Archives must be cost-efficient, enabling you to take advantage of advances in data compression. Data that is five years old might be accessed only once in a great while. An optimal solution therefore strikes a balance between economical retention and the ability to still query archived data and join it with tables in a traditional data warehouse. While our data enjoys its new retirement home, we want to be able to adopt new technologies to achieve greater compression, shrink its footprint on disk, and reduce the cost to store it.
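
To make the compression point concrete, here is a small, self-contained Python comparison using two standard-library codecs on a stand-in payload. The numbers are illustrative only; a real archive would benchmark columnar formats and modern codecs against its own data.

    import gzip
    import lzma

    # Stand-in payload; in practice this would be a cold archive file.
    payload = b"2019-01-01,store_42,sku_1001,3\n" * 100_000

    gz = gzip.compress(payload, compresslevel=9)
    xz = lzma.compress(payload, preset=9)

    print(f"raw:  {len(payload):>12,} bytes")
    print(f"gzip: {len(gz):>12,} bytes ({len(gz) / len(payload):.1%} of raw)")
    print(f"lzma: {len(xz):>12,} bytes ({len(xz) / len(payload):.1%} of raw)")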

As alluded to previously, if we are going to the trouble of retaining this data, we also want the ability to access it. When the day comes that we can make use of that venerable data, we need to be able to query it. You should be able to query the data through standard SQL interfaces, without having to write complex programs to get at what you need. That means designing the archive system with easy access in mind.
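
As one illustration of that kind of access, the sketch below uses Python’s built-in sqlite3 module to join a hypothetical archived orders table against a live customer table with plain SQL; the database and table names are invented for the example.

    import sqlite3

    # Hypothetical files: a live operational store and a cold archive.
    conn = sqlite3.connect("operational.db")
    conn.execute("ATTACH DATABASE 'archive_2019.db' AS archive")

    # Standard SQL reaches across both: current customers joined
    # against archived order history, with no custom retrieval code.
    rows = conn.execute("""
        SELECT c.name, COUNT(o.order_id) AS historical_orders
        FROM customers AS c
        JOIN archive.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY historical_orders DESC
    """).fetchall()

    for name, count in rows:
        print(name, count)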

And, inevitably, we have to develop policies to auto-expire data that has truly given all it has to offer. When data reaches a defined age or usage threshold, you want an automated process that determines what is retained in the archive and what is let go for good.
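
A minimal sketch of such an expiry process, assuming a simple age-based policy over files in a hypothetical archive directory, might look like this:

    import time
    from pathlib import Path

    # Hypothetical policy: anything older than seven years leaves the archive.
    ARCHIVE_DIR = Path("/archive")
    RETENTION_SECONDS = 7 * 365 * 24 * 60 * 60

    def expire_old_files(dry_run: bool = True) -> None:
        """Delete archived files whose age exceeds the retention threshold."""
        cutoff = time.time() - RETENTION_SECONDS
        for path in ARCHIVE_DIR.rglob("*.parquet"):
            if path.stat().st_mtime < cutoff:
                print(f"expiring {path}")
                if not dry_run:
                    path.unlink()

    expire_old_files()  # rerun with dry_run=False once the output looks right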

Planning ahead for the long-term care of your data is both a responsibility and a requirement in today’s data-rich environment. If you have data in your data lake that is ready to be archived, along with old backups and tapes, there are new opportunities to give that data a more active retirement. When moving data out of a data warehouse or operational environment, keep in mind factors such as compliance, data portability, immutability, and cost management, and your data can continue to deliver returns for many years to come.

(Forbes)
