Know when your big data is telling big lies

Data scientists use statistical analysis tools to find non-obvious patterns in deep data. But they know the universe is full of spurious correlations. Big data simply intensifies the problem.

Because, as the range of sources and the diversity of predictors continues to grow, the number of relationships that can potentially be modeled begins to approach infinity. As David G. Young pointed out, “predictive variables sometimes aren’t ….We’ve all seen variable interactions that change the significance, curvature, and even the sign of an important predictor.”

Thus, if you’re looking for a particular correlation in your data, you can probably find it if you’re clever enough to combine only the right data, specify only the right variables, and analyze at using only the right algorithm. Once you’ve hit on the right combination of modeling decisions, the patterns you seek may pop out like a genie from Aladdin’s lamp.

Yet the fact that you’ve supposedly discovered this correlation doesn’t mean it actually exists in the underlying real-world domain you’re investigating. It may simply be a figment of your specific approach to modeling the data you have at hand. You may have no fraudulent intent, and you may otherwise adhere to standard data-scientific methodologies, but you may choose to go no further if it appears you’ve already struck the pay dirt insight you were seeking.