Review: Microsoft takes on TensorFlow

Like Google, Microsoft has been differentiating its products by adding machine learning features. In the case of Cortana, those features are speech recognition and language parsing. In the case of Bing, speech recognition and language parsing are joined by image recognition. Google’s underlying machine learning technology is TensorFlow. Microsoft’s is the Cognitive Toolkit. 

Both TensorFlow and Cognitive Toolkit have been released to open source. Both are complex frameworks that implement many neural network and deep learning algorithms, and both present challenges to developers new to the area. Cognitive Toolkit has recently become much easier to install and deploy, thanks to an automatic installation script. Cognitive Toolkit may be a little easier to use than TensorFlow right now, but that advantage is balanced by TensorFlow’s wider applicability.

Source: InfoWorld Big Data

Beware dodgy data analysis

Data science is having its 15 minutes of fame.

Everyone from John Oliver of HBO’s “Last Week Tonight” to famed election statistician Nate Silver of 538.com is getting on a soapbox about the perils of believing data-based findings that lead to seemingly crazy conclusions.

John Oliver noted one particularly dodgy finding that a glass of wine was as healthy as an hour at the gym. Another “study” supposedly proved the benefits of a chocolate diet for pregnant moms. And other studies have found that the number of suicides by hanging, strangulation and suffocation is highly correlated with U.S. spending on science, space and technology.

As those of us working in the business/data analytics field know only too well, what each of these strange-but-unfortunately-true studies has in common is a failure to differentiate between data that shows correlations between variables — a statistician’s bread and butter — and data that establishes causality — a tested conclusion that one thing actually causes another.
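To see how easily a strong correlation can arise with no causal link at all, consider a minimal Python sketch using NumPy and purely synthetic data (the series and numbers are invented for illustration): two independent series that merely share an upward trend will score a high Pearson correlation.

    import numpy as np

    # Two synthetic, independent series that both happen to trend upward over time.
    rng = np.random.default_rng(0)
    years = np.arange(2000, 2020)
    series_a = 100 + 3.0 * (years - 2000) + rng.normal(0, 2, years.size)  # e.g., spending
    series_b = 50 + 1.5 * (years - 2000) + rng.normal(0, 1, years.size)   # e.g., an unrelated count

    r = np.corrcoef(series_a, series_b)[0, 1]
    print(f"Pearson r = {r:.2f}")  # typically above 0.9, yet neither series causes the other

A shared trend is all it takes; establishing causality requires controlled experiments or careful causal-inference methods, not a correlation coefficient.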

And while such confusion may not matter much if it leads to a pregnant mom eating an extra Hershey bar or two, it could be deadly to your company’s bottom line.

Source: InfoWorld Big Data

Tame unruly big data flows with StreamSets

Internet of things (IoT) data promises to unlock unique and unprecedented business insights, but only if enterprises can successfully manage the data flowing into their organizations from IoT sources. One problem enterprises will encounter as they try to elicit value from their IoT initiatives is data drift: changes to the structure, content, and meaning of data that result from frequent and unpredictable changes to source devices and data processing infrastructure.

Whether processed in stream or batch form, data typically moves from source to final storage locations through a variety of tools. Changes anywhere along this chain — be they schema changes to source systems, shifts in the meaning of coded field values, or an upgrade or addition to the software components involved in data production — can result in incomplete, inaccurate, or inconsistent data in downstream systems.

The effects of this data drift can be especially pernicious because they often go undetected for long periods of time, polluting data stores and subsequent analyses with low-fidelity data. Until it is detected, the use of this problematic data can lead to false findings and poor business decisions. When the problem is finally detected, it is usually fixed through manual data cleanup and preparation by data scientists, which adds hard costs, opportunity costs, and delays to the analysis.

StreamSets Data Collector

Using StreamSets Data Collector to build and manage big data ingest pipelines will help mitigate the effects of data drift while vastly reducing the amount of time spent cleansing data. In this article, we will walk through a typical use case: real-time ingest of IoT sensor data into HDFS for analysis and visualization with Impala or Hive.

Without requiring you to write a single line of code, StreamSets Data Collector can ingest streaming and batch data from a large number of sources. It can transform and sanitize the data in-stream, then write it to a large number of destinations. When the pipeline is placed in operation, you get fine-grained data flow metrics, detection of anomalous data, and alerting, so you can stay on top of pipeline performance. StreamSets Data Collector can run standalone or be deployed onto a Hadoop cluster, and it offers connectors to a variety of data source and destination types.

The following use case involves data generated in real time from shipping containers.

The first example of data drift manifests itself in the IoT sensors that the shipping company uses. Due to upgrades over time, the sensors in the field run one of three different firmware versions. Each revision adds new data fields and changes the schema. To derive value from that sensor data, the system we use to ingest the information must be able to handle this diversity.
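The article doesn’t show the raw payloads, so the Python dictionaries below are purely hypothetical illustrations of how such a schema might grow across firmware revisions; every field name and value is an assumption made for this sketch, not actual device output.

    # Hypothetical sensor payloads illustrating schema drift across firmware versions.
    firmware_v1 = {
        "device_id": "container-0042",
        "firmware_version": 1,
        "reading_date": "2016-08-01T12:00:00Z",
        "temperature": "78.3",   # Fahrenheit, delivered as a string
        "humidity": "41",
    }

    # Version 2 adds orientation fields; version 3 adds a nested location structure.
    firmware_v2 = dict(firmware_v1, firmware_version=2,
                       orientation={"yaw": 12.5, "pitch": -3.1, "roll": 0.8})
    firmware_v3 = dict(firmware_v2, firmware_version=3,
                       location={"lat": 47.6062, "lon": -122.3321})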

Cleanse and route the data

Our pipeline reads data from a RabbitMQ system that receives MQTT messages from the sensors out in the field. We first verify that the messages we are receiving are those we want to work with. To do so, we use a stream selector processor to specify a data rule for the incoming messages: all data matching the rule’s criteria is routed downstream, and anything that doesn’t match is discarded.

[Figure 1]
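In the Data Collector UI this rule is expressed through the stream selector’s configuration rather than in code; the short Python sketch below only illustrates the equivalent keep-or-discard logic, with the required field names assumed.

    # Sketch of the data rule's effect (not StreamSets code): records carrying the
    # fields we expect continue downstream; everything else is discarded.
    REQUIRED_FIELDS = {"device_id", "firmware_version", "reading_date"}

    def matches_rule(record: dict) -> bool:
        return REQUIRED_FIELDS.issubset(record)

    def select(records):
        for record in records:
            if matches_rule(record):
                yield record   # routed downstream; non-matching records are dropped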

We then use another stream selector to route data based on the firmware version of the device. All records matching firmware version 1 go to one path, those matching version 2 go to another, and so forth. We also specify a default catch-all rule to send any outliers to an “error” path. With modern data streams, we fully expect the data to change unexpectedly, so we set up graceful error handling that shunts anomalous records to a local file, a Kafka stream, or a secondary pipeline. That way we can keep the pipeline running and reprocess the data that didn’t fit the primary flow after the fact.

[Figure 2]
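Again, the routing itself is configured in the stream selector rather than written by hand; as a rough Python sketch of the logic (firmware values assumed), the second selector behaves like this:

    # Route by firmware version, with a catch-all "error" path for anything unexpected.
    def route_by_firmware(record: dict) -> str:
        version = record.get("firmware_version")
        if version == 1:
            return "v1_path"
        if version == 2:
            return "v2_path"
        if version == 3:
            return "v3_path"
        return "error_path"   # outliers go to a local file, Kafka, or a secondary pipeline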

Let’s start with handling data for firmware version 3, which added latitude/longitude data. Right away we want to make sure those fields exist in the data set and that they contain valid values. Because the location field is a nested structure, we want to flatten it and eventually discard the nested data.

[Figure 3]
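A rough Python equivalent of that verify-and-flatten step (field names assumed) looks like this:

    # Verify the nested location field, flatten it, and drop the nested original.
    def flatten_location(record: dict) -> dict:
        location = record.get("location")
        if not location or "lat" not in location or "lon" not in location:
            raise ValueError("missing or invalid location data")   # would be sent to the error path
        record["location_lat"] = float(location["lat"])
        record["location_lon"] = float(location["lon"])
        del record["location"]   # discard the nested structure after flattening
        return record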

Firmware version 2 contains new orientation fields (yaw, pitch, roll), which we can verify and sanitize in a similar fashion.

Finally, all device versions contain temperature and humidity readings. First, we convert the data types of these readings. Temperature gets converted to a double, humidity to an integer, and the date to a Unix timestamp.

[Figure 4]
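Expressed as a Python sketch (field names and date format assumed), those conversions amount to the following:

    # Convert temperature to a double, humidity to an integer, and the reading
    # date to a Unix timestamp.
    from datetime import datetime, timezone

    def convert_types(record: dict) -> dict:
        record["temperature"] = float(record["temperature"])
        record["humidity"] = int(record["humidity"])
        dt = datetime.strptime(record["reading_date"], "%Y-%m-%dT%H:%M:%SZ")
        record["reading_date"] = int(dt.replace(tzinfo=timezone.utc).timestamp())
        return record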

We then use a scripting processor to write some custom logic — such as to convert Fahrenheit values to Celsius. StreamSets scripting processors support Jython, Groovy, and JavaScript.
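As a plain-Python sketch of the kind of logic such a scripting processor would run (the evaluator’s record-binding API is not shown here), the Fahrenheit-to-Celsius conversion is simply:

    # Custom conversion logic a Jython scripting processor might apply per record.
    def fahrenheit_to_celsius(temp_f: float) -> float:
        return (temp_f - 32.0) * 5.0 / 9.0

    # e.g., applied after the type conversion above:
    # record["temperature"] = fahrenheit_to_celsius(record["temperature"])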

After cleansing the data (that is, routing it based on firmware version and eventual use), we send it to a couple of HDFS destinations.

Configure the destination

StreamSets natively supports a large number of data formats, such as plain text, delimited, JSON, Protobuf, and Avro. In this example we will write the data to a Snappy-compressed Avro file.
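StreamSets performs the Avro serialization itself; purely to illustrate what a Snappy-compressed Avro file of these records involves, here is a hedged sketch using the fastavro library (the schema, field names, and output path are assumptions, and the snappy codec requires the python-snappy package).

    # Illustrative only: writing cleansed records as Snappy-compressed Avro with fastavro.
    from fastavro import writer, parse_schema

    schema = parse_schema({
        "type": "record",
        "name": "SensorReading",
        "fields": [
            {"name": "device_id", "type": "string"},
            {"name": "firmware_version", "type": "int"},
            {"name": "reading_date", "type": "long"},
            {"name": "temperature", "type": "double"},
            {"name": "humidity", "type": "int"},
        ],
    })

    records = [{"device_id": "container-0042", "firmware_version": 3,
                "reading_date": 1470052800, "temperature": 25.7, "humidity": 41}]

    with open("sensor_readings.avro", "wb") as out:
        writer(out, schema, records, codec="snappy")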

The HDFS destination is highly configurable. You can configure security as required by your enterprise policies, dynamically configure the path and location of your output files, and even choose which of several Cloudera CDH versions to write to.

[Figure 5]

Once you’ve designed the pipeline, you can switch to preview mode to test and debug the data flow using a sample of the data. You can step through each processor and examine the state of the data at any stage.

[Figure 6]

For example, we see below that the data types for reading_date and temperature were converted to long and double. StreamSets will also alert you if a calculation was performed to convert the data.

[Figure 7]

You can also inject outlier or “corner case” data into the stream to see what impact it has on your flow. Preview mode gives you an easy way to debug complex pipelines without putting them into production.

Execute the pipeline

Now we’re ready to execute the pipeline and start ingesting data into our cluster. Hit the Start button and the UI will switch to execute mode.

[Figure 8]

At this point, the StreamSets Data Collector starts ingesting data, processing it in memory, and sending data into the destination. The monitoring window at the bottom of the screen displays various real-time metrics such as how many records came in and how many were written out. You can also see how much time is spent on each processor and how much memory it consumes. These metrics and a lot more are also accessible via Java Management Extensions (JMX).

As we drop data into HDFS, we can immediately start querying it with Impala and running analytics, machine learning, or visualizations.

[Figure 9]
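As a hedged example of what such a first query might look like from Python, using the impyla client (the host, table, and column names are assumptions):

    # Querying the newly ingested data through Impala with the impyla client.
    from impala.dbapi import connect

    conn = connect(host="impala-daemon.example.com", port=21050)
    cur = conn.cursor()
    cur.execute("""
        SELECT device_id, AVG(temperature) AS avg_temp_c
        FROM sensor_readings
        GROUP BY device_id
        ORDER BY avg_temp_c DESC
        LIMIT 10
    """)
    for device_id, avg_temp_c in cur.fetchall():
        print(device_id, avg_temp_c)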

Today, IoT devices, sensor logs, web clickstreams, and other sources of important data are constantly changing as systems are tweaked, updated, or even replatformed by their owners. These changes to data content, structure, behavior, and meaning are unpredictable, unannounced, and never-ending, and they wreak havoc with data processing and analytics systems and operations. StreamSets Data Collector helps manage the constant changes in your data infrastructure, taming data drift and preserving the integrity of your data processing systems.

Arvind Prabhakar is co-founder and CTO at StreamSets, a data performance management platform. He is an Apache Software Foundation member and an Apache Project Management Committee member of the Flume, Sqoop, Storm, and MetaModel projects. Prior to StreamSets, he held engineering roles at Cloudera, Informatica, and Sun Microsystems.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Source: InfoWorld Big Data

Apache Mesos users focus on big data, containers

Mesosphere, the main commercial outfit behind the Apache Mesos datacenter and container orchestration project, has taken a good look at its user base and found that users gravitate toward a few fundamental use cases.

Survey data released recently by Mesosphere in the “Apache Mesos 2016 Survey Report” indicates that Mesos users focus on running containers at scale, using Mesos to deploy big data frameworks, and relying heavily on the core tool set that Mesos and DC/OS provide rather than using substitutes.

We got this contained

Created in 2009, Mesos was built to run workloads of all types and sizes across clusters of systems. DC/OS, released by Mesosphere in 2015, automates the deployment, provisioning, and scaling of applications with Mesos as the underlying technology. In that sense, it turns Mesos into a commodity in much the way Docker made long-standing containerization techniques easy to work with.

The Mesosphere survey doesn’t cover a very large sample of users — fewer than 500, with 63 percent of those surveyed running Mesos for less than a year. Deployments are also modest — the overwhelming majority are fewer than 100 nodes — and by and large favor generic software/IT industry settings. Retail, e-commerce, telecom, and finance made up about 19 percent of the total combined.

Among the workloads deployed in Mesos, the largest slice (85 percent) covers containers and microservices, with 62 percent of all users deploying containers in production. Containers have long been a major part of Mesos’ and DC/OS’s focus, but Mesos sets itself apart from other container projects by providing a robust solution to container management, including native support for GPU-powered applications.

Do it yourself

The second biggest slice of the pie is data-centric applications. No prizes for guessing the top entry in that category: Apache Spark (43 percent of users), followed by other major big data infrastructure components like the Kafka messaging system (32 percent), the Elasticsearch search engine (26 percent), and the Cassandra NoSQL database (24 percent). Hadoop is in the mix as well, but only at 11 percent.

If there’s a takeaway to be found, it’s that specific solutions like Spark demonstrate more immediate payoffs than general solutions like Hadoop, especially when projects like DC/OS make them easier to deploy.

The survey also makes clear that Mesos users have historically put together projects themselves, but they like having the option not to. Of those who use Mesos, few currently do so with DC/OS’s automated deployment. Only 26 percent of those surveyed are running it in a production context, with another 12 percent “piloting for broader deployment.” That implies that many existing Mesos-powered deployments are hand-built.

However, newly minted Mesos users are going straight to DC/OS to get their Mesos fix. Eighty-seven percent of users who started with Mesos in the past six months did so through DC/OS. Thus, it’s safe to assume that as DC/OS becomes more widely used and Mesos continues to evolve (it recently hit a 1.0 release), DC/OS will become the predominant way to deploy both Mesos and the apps that run on it.

It’s important to think about Mesos and DC/OS as complementary technologies to the rest of the container world, not total replacements for it. Kubernetes, for instance, can run on Mesos (and 8 percent of the respondents do use Kubernetes somewhere, according to the survey). Rather than eclipsing such arrangements outright, DC/OS and Mesos are more likely to provide a more convenient way to build with them.

Source: InfoWorld Big Data

Redis module speeds Spark-powered machine learning

In-memory data store Redis recently acquired a module architecture to expand its functionality. The latest module is a machine learning add-on that accelerates the delivery of results from trained models rather than the training itself.

Redis-ML, or the Redis Module for Machine Learning, comes courtesy of the commercial outfit that drives Redis development, Redis Labs. It speeds the execution of machine learning models while still allowing those models to be trained in familiar ways. Redis works as an in-memory cache backed by disk storage, and its creators claim machine learning models can be executed orders of magnitude more quickly with it.

The module works in conjunction with Apache Spark, another in-memory data-processing tool with machine learning components. Spark handles the data-gathering and training phase, and Redis plugs into the Spark cluster through the pre-existing Redis Spark-ML module. The model generated by Spark’s training is then saved to Redis, rather than to an Apache Parquet or HDFS data store. To execute the models, you run queries against the Redis-ML module, not Spark itself.
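The training half of that workflow is ordinary Spark MLlib code; the sketch below trains a random forest with PySpark under assumed paths and column names, and the final hand-off to Redis is indicated only in a comment because the connector’s API isn’t covered here.

    # Sketch of the Spark side: train a model as usual with MLlib.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("train-forest").getOrCreate()
    training = spark.read.parquet("hdfs:///data/training")   # assumed columns: features, label

    model = RandomForestClassifier(labelCol="label", featuresCol="features",
                                   numTrees=50).fit(training)

    # In the Redis-ML workflow, the trained model is loaded into Redis through the
    # Spark connector instead of being saved to Parquet or HDFS, and predictions
    # are then served by the Redis-ML module rather than by Spark.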

In the big picture, Redis-ML offers speed: faster responses to individual queries, smaller penalties for large numbers of users making requests, and the ability to provide high availability of the results via a scale-out Redis setup. Redis Labs claims the prediction process shows “5x to 10x latency improvement over the standard Spark solution in real-time classifications.”

Another boon is specifically for developers, as Redis-ML interoperates with Scala, JavaScript (via Node.js), Python, and the .Net languages. Models “are no longer restricted to the language they were developed in,” states Redis Labs, but “can be accessed by applications written in different languages concurrently using the simple [Redis-ML] API.” Redis Labs also claims the resulting trained model is easier to deploy, since it can be accessed through said APIs without custom code or infrastructure.

Accelerating Spark with other technologies isn’t a new idea. Previously, the idea was to speed up the storage back ends that Spark talks to. In fact, Redis’ engineers herald it as one such solution. Another project, Apache Arrow, speeds Spark execution (and other big data projects) by transforming data into a columnar format that can be processed more efficiently.

Redis Labs is pushing Redis as a broad solution to these problems, since its architecture (what its creators call a “structure store”) permits more free-form storage than competing database solutions. Redis VP of Product Management Cihan Biyikoglu noted in a phone interview that other databases attempt to adapt data types to the problems at hand; Redis, by contrast, instead of “shackling [you] to one data model, type, or representation,” allows “an abstraction that can house any type of data.”

If Redis Labs has a long-term plan, it’s to inch Redis toward becoming an all-in-one solution for machine learning — to provide a data-gathering and data-querying mechanism along with the machine learning libraries under one roof. To wit: Another Redis module, for Google’s TensorFlow framework, not only allows Redis to serve as backing for TensorFlow, but allows training TensorFlow models directly inside Redis.

Source: InfoWorld Big Data