Tame unruly big data flows with StreamSets

Internet of things (IoT) data promises to unlock unique and unprecedented business insights, but only if enterprises can successfully manage the data flowing into their organizations from IoT sources. One problem enterprises will encounter as they try to elicit value from their IoT initiatives is data drift: changes to the structure, content, and meaning of data that result from frequent and unpredictable changes to source devices and data processing infrastructure.

Whether processed in stream or batch form, data typically moves from source to final storage locations through a variety of tools. Changes anywhere along this chain — be they schema changes to source systems, shifts in the meaning of coded field values, or an upgrade or addition to the software components involved in data production — can result in incomplete, inaccurate, or inconsistent data in downstream systems.

The effects of this data drift can be especially pernicious because they often go undetected for long periods of time, polluting data stores and subsequent analyses with low-fidelity data. Until detected, the use of this problematic data can lead to false findings and poor business decisions. When the problem is finally detected, it is usually fixed through manual data cleanup and preparation by data scientists, which adds hard costs, opportunity costs, and delays to the analysis.

StreamSets Data Collector

Using StreamSets Data Collector to build and manage big data ingest pipelines will help mitigate the effects of data drift while vastly reducing the amount of time spent cleansing data. In this article, we will walk through a typical use case: real-time ingest of IoT sensor data into HDFS for analysis and visualization with Impala or Hive.

Without writing a single line of code, StreamSets Data Collector can ingest streaming and batch data from a large number of sources. StreamSets Data Collector can perform transformations and sanitize the data in-stream, then write to a large number of destinations. When the pipeline is placed in operation, you get fine-grained data flow metrics, detection of anomalous data, and alerting so that you can stay on top of pipeline performance. StreamSets Data Collector can run standalone or be deployed onto a Hadoop cluster, and it offers connectors to a variety of data source and destination types.

The following use case involves data generated in real time from shipping containers.

The first example of data drift manifests itself in the IoT sensors that the shipping company uses. Due to upgrades over time, the sensors in the field run one of three different firmware versions. Each revision adds new data fields and changes the schema. To derive value from that sensor data, the system we use to ingest the information must be able to handle this diversity.
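
To make the schema drift concrete, here is a hedged sketch of what a single reading might look like under each firmware version, written as Python dictionaries. The field names, nesting, and values are hypothetical illustrations based only on the fields discussed later in this walkthrough; they are not the actual device payloads.

    # Hypothetical sensor payloads illustrating schema drift across firmware
    # versions. Field names, nesting, and values are assumptions.
    reading_v1 = {
        "firmware_version": 1,
        "reading_date": "2016-03-01T12:00:00Z",
        "temperature": "78.4",   # Fahrenheit, sent as a string
        "humidity": "54",
    }

    # Each revision adds fields on top of the previous one.
    reading_v2 = dict(reading_v1,
        firmware_version=2,
        # New in version 2: orientation fields
        yaw=0.12, pitch=-1.4, roll=3.7,
    )

    reading_v3 = dict(reading_v2,
        firmware_version=3,
        # New in version 3: a nested location structure
        location={"latitude": 37.77, "longitude": -122.42},
    )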

Cleanse and route the data

Our pipeline reads data from a RabbitMQ system that receives MQTT messages from the sensors in the field. First we verify that the incoming messages are ones we want to work with. To do so, we use a stream selector processor to specify a data rule for the incoming messages: all data matching the rule's criteria is routed downstream, and anything that doesn't match is discarded.

[Figure 1]
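
Outside of Data Collector, the same source-and-filter step could be sketched in plain Python. The snippet below uses the pika RabbitMQ client with a simple predicate standing in for the stream selector's data rule; the broker host, queue name, and the rule itself are assumptions for illustration, not the pipeline's actual configuration.

    # Minimal sketch of the ingest-and-filter step, assuming a RabbitMQ broker
    # on localhost and a queue named "sensor-readings" (both hypothetical).
    import json
    import pika

    def matches_data_rule(reading):
        # Stand-in for the stream selector's data rule: keep only records that
        # carry the fields every firmware version is expected to send.
        return all(k in reading for k in ("firmware_version", "temperature", "humidity"))

    def route_downstream(reading):
        print("accepted:", reading)

    def on_message(channel, method, properties, body):
        reading = json.loads(body)
        if matches_data_rule(reading):
            route_downstream(reading)   # continue through the pipeline
        # non-matching records are simply discarded, as described above

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.basic_consume(queue="sensor-readings", on_message_callback=on_message, auto_ack=True)
    channel.start_consuming()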

We then use another stream selector to route data based on the firmware version of the device. All records matching firmware version 1 go to one path, those matching version 2 go to another, and so forth. We also specify a default catch-all rule that sends any outliers to an “error” path. With modern data streams, we fully expect the data to change unexpectedly, so we set up graceful error handling that shunts anomalous records to a local file, a Kafka stream, or a secondary pipeline. That way we can keep the pipeline running and reprocess the data that doesn’t fit the primary flow after the fact.

[Figure 2]
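
The version-based routing reduces to a dispatch on the firmware field. Below is a minimal Python sketch of that logic with a catch-all error path; the handler functions are placeholders, not StreamSets APIs.

    # Sketch of the stream selector's version-based routing with a default
    # "error" path. Handler functions are hypothetical placeholders.
    error_records = []

    def handle_v1(reading): print("version 1 path:", reading)
    def handle_v2(reading): print("version 2 path:", reading)
    def handle_v3(reading): print("version 3 path:", reading)

    def route_by_firmware(reading):
        handlers = {1: handle_v1, 2: handle_v2, 3: handle_v3}
        handler = handlers.get(reading.get("firmware_version"))
        if handler:
            handler(reading)
        else:
            # Default rule: outliers go to the error path so the pipeline keeps
            # running and the records can be reprocessed after the fact.
            error_records.append(reading)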

Let’s start with handling data for firmware version 3, which added latitude/longitude data. Right away we want to make sure those fields exist in the data set and that they contain valid values. Because the location field is a nested structure, we want to flatten it and eventually discard the nested data.

[Figure 3]
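
A rough Python equivalent of that validate-and-flatten step, assuming the nested location structure sketched earlier (a location field holding latitude and longitude), might look like this:

    # Sketch: verify the nested location holds valid coordinates, then flatten
    # it and drop the nested structure. Field names are assumptions.
    def validate_and_flatten_location(reading):
        # Returns the reading with location flattened, or None if the
        # coordinates are missing or invalid (the pipeline would route such
        # records to the error path instead).
        location = reading.get("location") or {}
        lat, lon = location.get("latitude"), location.get("longitude")
        if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
            return None
        reading["latitude"] = lat
        reading["longitude"] = lon
        del reading["location"]   # discard the nested structure
        return reading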

Firmware version 2, meanwhile, added new orientation fields (yaw, pitch, roll), which we verify and sanitize in a similar fashion.

Finally, all device versions contain temperature and humidity readings, so we convert the data types of these fields: temperature to a double, humidity to an integer, and the reading date to a Unix timestamp.

[Figure 4]
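
In Python terms, that field-type conversion stage amounts to the casts below; the field names and the incoming date format are assumptions carried over from the earlier hypothetical payloads.

    # Sketch of the type conversions: temperature to double (float), humidity
    # to integer, and the reading date to a Unix timestamp.
    from datetime import datetime, timezone

    def convert_types(reading):
        reading["temperature"] = float(reading["temperature"])
        reading["humidity"] = int(reading["humidity"])
        dt = datetime.strptime(reading["reading_date"], "%Y-%m-%dT%H:%M:%SZ")
        reading["reading_date"] = int(dt.replace(tzinfo=timezone.utc).timestamp())
        return reading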

We then use a scripting processor to write custom logic, such as converting Fahrenheit values to Celsius. StreamSets scripting processors support Jython, Groovy, and JavaScript.
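
Since Jython is one of the supported languages, that custom logic can be written in Python syntax. The following is a sketch in the general shape of a Jython evaluator script, with a list of incoming records and output/error writers; treat the bindings and the temperature field name as assumptions rather than the exact configuration used here.

    # Hedged sketch of a Jython scripting-processor body that converts
    # Fahrenheit to Celsius. Assumes the evaluator's usual bindings (a
    # `records` list in, `output.write()` / `error.write()` out) and a
    # root field named 'temperature'.
    for record in records:
        try:
            fahrenheit = record.value['temperature']
            record.value['temperature'] = (fahrenheit - 32.0) * 5.0 / 9.0
            output.write(record)
        except Exception as e:
            error.write(record, str(e))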

After cleansing the data (that is, routing it based on firmware version and eventual use), we send it to a couple of HDFS destinations.

Configure the destination

StreamSets natively supports a large number of data formats, such as plain text, delimited, JSON, Protobuf, and Avro. In this example we will write the data to a Snappy-compressed Avro file.

The HDFS destination is highly configurable. You can configure security as required by your enterprise policies, dynamically set the path and location of your output files, and even choose to write to multiple Cloudera CDH versions.

[Figure 5]
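
To show roughly what ends up on disk, here is a hedged Python sketch that writes Snappy-compressed Avro with the fastavro library and pushes the file to HDFS over WebHDFS using the hdfs package. The schema, paths, NameNode address, and sample record are all assumptions, not the pipeline's actual output.

    # Sketch: write Snappy-compressed Avro and upload it to HDFS via WebHDFS.
    # Schema, file paths, sample data, and the NameNode URL are hypothetical.
    import io
    from fastavro import writer
    from hdfs import InsecureClient

    schema = {
        "type": "record",
        "name": "SensorReading",
        "fields": [
            {"name": "firmware_version", "type": "int"},
            {"name": "reading_date", "type": "long"},
            {"name": "temperature", "type": "double"},
            {"name": "humidity", "type": "int"},
        ],
    }

    readings = [
        {"firmware_version": 1, "reading_date": 1456833600,
         "temperature": 25.8, "humidity": 54},
    ]

    buffer = io.BytesIO()
    writer(buffer, schema, readings, codec="snappy")   # requires python-snappy

    client = InsecureClient("http://namenode:50070", user="sdc")
    client.write("/data/sensors/readings.avro", data=buffer.getvalue(), overwrite=True)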

Once you’ve designed the pipeline, you can switch to preview mode to test and debug the data flow using a sample of the data. You can step through each processor and examine the state of the data at any stage.

[Figure 6]

For example, we see below that the data types for reading_date and temperature were converted to long and double. StreamSets will also alert you if a calculation was performed to convert the data.

[Figure 7]

You can also inject outlier or “corner case” data into the stream to see what impact it has on your flow. Preview mode gives you an easy way to debug complex pipelines without putting them into production.

Execute the pipeline

Now we’re ready to execute the pipeline and start ingesting data into our cluster. Hit the Start button and the UI will switch to execute mode.

[Figure 8]

At this point, the StreamSets Data Collector starts ingesting data, processing it in memory, and sending data into the destination. The monitoring window at the bottom of the screen displays various real-time metrics such as how many records came in and how many were written out. You can also see how much time is spent on each processor and how much memory it consumes. These metrics and a lot more are also accessible via Java Management Extensions (JMX).

As we drop data into HDFS, we can immediately start querying it with Impala and running analytics, machine learning, or visualizations.

[Figure 9]
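
As a hedged illustration of that analysis side, the snippet below queries the ingested table with Impala through the impyla client. The daemon host and port, the table, and the column names are assumptions for the sake of the example.

    # Sketch: query the ingested data with Impala via impyla. The Impala
    # daemon host/port, table, and column names are hypothetical.
    from impala.dbapi import connect

    conn = connect(host="impala-daemon.example.com", port=21050)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT firmware_version,
               AVG(temperature) AS avg_temp_c,
               AVG(humidity)    AS avg_humidity
        FROM sensor_readings
        GROUP BY firmware_version
    """)
    for row in cursor.fetchall():
        print(row)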

Today, IoT devices, sensor logs, web clickstreams, and other sources of important data are constantly changing as systems are tweaked, updated, or even replatformed by their owners. These changes to data content, structure, behavior, and meaning are unpredictable, unannounced, and never-ending, and they wreak havoc with data processing and analytics systems and operations. StreamSets Data Collector helps manage the constant changes in your data infrastructure, taming data drift and preserving the integrity of your data processing systems.

Arvind Prabhakar is co-founder and CTO at StreamSets, a data performance management platform. He is an Apache Software Foundation member and an Apache Project Management Committee member of the Flume, Sqoop, Storm, and MetaModel projects. Prior to StreamSets, he held engineering roles at Cloudera, Informatica, and Sun Microsystems.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
