How to get your mainframe's data for Hadoop analytics

Many so-called big data — really, Hadoop — projects have patterns. Many are merely enterprise integration patterns that have been refactored and rebranded. Of those, the most common is the mainframe pattern.

Because most organizations run the mainframe and its software as a giant single point of failure, the mainframe team hates everyone. Its members hate change, and they don’t want to give you access to anything. However, there is a lot of data on that mainframe and, if it can be done gently, the mainframe team is interested in people learning to use the system rather than start from the beginning. After all, the company has only begun to scratch the surface of what the mainframe and the existing system have available.

There are many great techniques that can’t be used for data integration in an environment where new software installs are highly discouraged, such as in the case of the mainframe pattern. However, rest assured that there are a lot of techniques to get around these limitations.

Sometimes the goal of mainframe-Hadoop or mainframe-Spark projects is just to look at the current state of the world. However, more frequently they want to do trend analysis and track changes in a way that the existing system doesn’t do. This requires techniques covered by change data capture (CDC).