No, you shouldn’t keep all that data forever

The modern ethos is that all data is valuable and should be stored forever, and that machine learning will one day magically find the value in it. You’ve probably seen that EMC picture about how there will be 44 zettabytes of data by 2020. Remember how everyone had Fitbits and Jawbone UPs for about a minute? Now Jawbone is out of business. Have you considered that this “all data is valuable” fad might be the corporate equivalent? Maybe we shouldn’t take a data storage company’s word for it that we should store all data and never delete anything.

Back in the early days of the web it was said that the main reasons people went there were for porn, jobs, or cat pictures. If we download all of those cat pictures and run a machine learning algorithm on them, we can possibly determine the most popular colors of cats, the most popular breeds of cats, and the fact that people really like their cats. But we don’t need to do this—because we already know these things. Type any of those three things into Google and you’ll find the answer. Also, with all due respect to cat owners, this isn’t terribly important data.

Your company has a lot of proverbial cat pictures. It doesn’t matter what your policies and procedures for inventory retention were in 1999. Whatever legal reasons you had to store them back then have passed the statute of limitations. There is nothing you could conceivably glean from that old data that you could not glean from the more recent revisions.

Machine learning or AI isn’t going to tell you anything interesting about your 1999 policies and procedures for inventory retention. That material might even be a kind of “dark data,” because your search tool probably boosts everything else above it, so unless someone queries for “inventory retention procedure for 1999,” it isn’t going to come up.

You’ve got logs going back to the beginning of time. Even the Jawbone UP didn’t capture my every breath, and it certainly didn’t store my individual steps for all time. Sure, each breath or step may have slightly different characteristics, but that isn’t important. Likewise, it probably doesn’t matter how many exceptions per hour your Java EE application server used to throw in 2006. You use Node.js now anyhow. If “how many errors per hour per year” is a useful metric, you can probably just summarize that. You don’t need to keep every log for all time. It isn’t reasonable to expect them all to be useful.
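If you want to see what “just summarize that” could look like, here is a minimal Python sketch. It assumes a hypothetical log format in which every line starts with an ISO-8601 timestamp followed by a level such as ERROR. Your real logs almost certainly look different, and the function name `errors_per_hour` is made up for illustration.

```python
from collections import Counter
from datetime import datetime


def errors_per_hour(lines):
    """Count ERROR lines per hour bucket.

    Assumed (hypothetical) line format:
    "<ISO-8601 timestamp> <LEVEL> <message>"
    """
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        # Ignore anything that is not an ERROR entry.
        if len(parts) < 2 or parts[1] != "ERROR":
            continue
        try:
            timestamp = datetime.fromisoformat(parts[0])
        except ValueError:
            continue  # skip lines without a parseable timestamp
        counts[timestamp.strftime("%Y-%m-%d %H:00")] += 1
    return dict(counts)


# Made-up sample lines, only to exercise the function.
sample = [
    "2006-03-14T09:15:02 ERROR NullPointerException in OrderServlet",
    "2006-03-14T09:47:30 ERROR Timeout talking to database",
    "2006-03-14T09:50:00 INFO Request served",
    "2006-03-14T10:05:11 ERROR NullPointerException in OrderServlet",
    "not a log line",
]
summary = errors_per_hour(sample)
print(summary)
```

Once you have a table like this, a few kilobytes of counts stand in for the raw log files, and the raw files can go.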

Supposedly, we’re keeping this stuff around for the day when AI or machine learning finds something useful in it. But machine learning isn’t magical. Mostly, machine learning falls into classification, regression, and clustering. Clustering basically groups stuff that looks “similar,” but it isn’t very likely your 2006 app server logs have anything useful in them that can be found via clustering. The other two require you to think of something and “train” the machine learning. This means you first need a theory of what could be useful, then you have to find examples of it, and only then can you train the computer to find more. Don’t you have better things to do?
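To make the difference concrete, here is a toy Python sketch, not a recommendation for any real tooling. It uses a one-dimensional two-means split as a stand-in for clustering and a nearest-centroid rule as a stand-in for classification. The hourly error counts and the “quiet” and “incident” labels are invented for illustration.

```python
from statistics import mean


def two_means(values, iterations=10):
    """Split 1-D values into a low and a high group (k-means with k=2).

    No labels are needed, but the groups come back with no meaning attached.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return list(values), []  # nothing to separate
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = mean(low), mean(high)
    return low, high


def train_centroids(values, labels):
    """Average the values per label; a person has to supply the labels."""
    grouped = {}
    for value, label in zip(values, labels):
        grouped.setdefault(label, []).append(value)
    return {label: mean(group) for label, group in grouped.items()}


def classify(centroids, value):
    """Return the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


# Hypothetical hourly error counts.
counts = [2, 3, 4, 40, 42, 45]

# Clustering: two groups, but nothing says what either group means.
low, high = two_means(counts)
print(low, high)

# Classification: someone had to decide what counts as an incident first.
labels = ["quiet", "quiet", "quiet", "incident", "incident", "incident"]
centroids = train_centroids(counts, labels)
print(classify(centroids, 38))
```

The clustering half runs without any human input and tells you only that some hours had more errors than others. The classification half is useful only after someone has gone through the data and labeled it, which is exactly the work the old logs are supposed to save you.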

Storage is cheap, but organization and insight are not. Just because you got a good deal on your SAN or have been running some kind of mirrored JBOD setup with a clustered file system doesn’t mean that storing noise is actually cheap. You need to consider the human costs of organizing, maintaining, and keeping all this stuff around. Moreover, while modern search technology is good at sorting relevant stuff from irrelevant, it does cost you something to do so. So while autumn is on the wane, go ahead and burn some proverbial corporate leaves.

It really is okay if you don’t keep it.

Source: InfoWorld Big Data