How IBM's Watson will change cybersecurity

IBM captured our imaginations when it unveiled Watson, the artificial intelligence computer capable of playing—and winning—the “Jeopardy” game show. Since then, Big Blue has been introducing Watson’s analytics and learning capabilities across various industries, including health care and information security.

Cognitive security technology such as Watson for Cybersecurity can change how information security professionals defend against attacks by helping them digest vast amounts of data. IBM Security is currently in the middle of a year-long research project working with eight universities to help train Watson to tackle cybercrime. Watson has to learn the “language of cybersecurity” to understand what a threat is, what it does, and what indicators are related.

“Generally we learn by examples,” says Nasir Memon, professor of computer science and engineering at NYU Tandon School of Engineering. “We get an algorithm and examples, and we learn when we are able to look at a problem and recognize it as similar to other incidents.”

Information security is no stranger to machine learning. Many next-generation security defenses already incorporate machine learning, big data, and natural language processing. What’s different with cognitive computing is the fact that it can blend human-generated security knowledge with more traditional security data. Consider how much security knowledge passes through the human brain and comes out in the form of research documents, industry publications, analyst reports, and blogs.

“Someone saw or read something and thought it was important enough to write a blog post or a paper about it,” says Jeb Linton, the chief security architect of IBM Watson. Cognitive systems can recognize the rich contextual significance of that piece of knowledge and apply traditional machine-generated data to help analysts get a better understanding of what they are seeing.

“It’s about learning how to take human expertise [in the form of blog posts, articles] mostly in the form of language, and to use it as training data for machine learning algorithms,” Linton says.

Technology innovation has to actually address the challenges security professionals are currently facing, or it remains on the fringes as a cool but not practical option. Cognitive security has the potential to reduce incident response times, optimize accuracy of alerts, and stay current with threat research.

“We need to make sure these technologies are actually solving the problems that security professionals are facing, both today and in the future,” wrote Diana Kelley on IBM’s Security Intelligence blog.

According to recent statistics from the IBM Institute for Business Value, 40 percent of security professionals believe cognitive security will improve detection and incident response decision-making capabilities, and 37 percent believe cognitive security solutions will significantly improve incident response time. Another 36 percent of respondents think cognitive security will provide increased confidence to discriminate between innocuous events and true incidents. If security analysts were able to stay current on threats and increase accuracy of alerts, they could also reduce response time.

More than half (57 percent) of security leaders believe that cognitive security solutions can significantly slow the efforts of cybercriminals.

These are high expectations for Watson for Cybersecurity, and IBM is working with eight different universities to feed up to 15,000 new documents into Watson every month, including threat intelligence reports, cybercrime strategies, threat databases, and materials from its own X-Force research library. In the video below, IBM’s Linton and NYU’s Memon talk about how machines learn and what the future of cognitive security technology looks like.

It’s easy to dismiss cognitive technology and its promises of dramatically changing how information security professionals defend themselves from attackers as more buzzwords. But interest from other fields is growing: Cognitive computing is slated to become a $47 billion industry by 2020, according to recent figures from IDC. While cognitive security is still in early stages, information security professionals see how the technology will help analysts make better and faster decisions using vast amounts of data.

Source: InfoWorld Big Data

Deep learning is already altering your reality

We now experience life through an algorithmic lens. Whether we realize it or not, machine learning algorithms shape how we behave, engage, interact, and transact with each other and with the world around us.

Deep learning is the next advance in machine learning. While machine learning has traditionally been applied to textual data, deep learning goes beyond that to find meaningful patterns within streaming media and other complex content types, including video, voice, music, images, and sensor data.

Deep learning enables your smartphone’s voice-activated virtual assistant to understand spoken intentions. It drives the computer vision, face recognition, voice recognition, and natural language processing features that we now take for granted on many mobile, cloud, and other online apps. And it enables computers—such as the growing legions of robots, drones, and self-driving vehicles—to recognize and respond intelligently and contextually to the environment patterns that any sentient creature instinctively adapts to from the moment it’s born.

But those analytic applications only scratch the surface of deep learning’s world-altering potential. The technology is far more than analytics that see deeply into environmental patterns. Increasingly, it’s also being used to mint, make, and design fresh patterns from scratch. As I discussed in this recent post, deep learning is driving the application logic being used to create new video, audio, image, text, and other objects. Check out this recent Medium article for a nice visual narrative of how deep learning is radically refabricating every aspect of human experience.

These are what I’ve referred to as the “constructive” applications of the technology, which involve using it to craft new patterns in new artifacts rather than simply introspecting historical data for pre-existing patterns. It’s also being used to revise, restore, and annotate found content and even physical objects so that they can be more useful for downstream uses.

You can’t help but be amazed by all this until you stop to think how it’s fundamentally altering the notion of “authenticity.” The purpose of deep learning’s analytic side is to identify the authentic patterns in real data. But if its constructive applications can fabricate experiences, cultural artifacts, the historical record, and even our bodies with astonishing verisimilitude, what is the practical difference between reality and illusion? At what point are we at risk of losing our awareness of the pre-algorithmic sources that should serve as the bedrock of all experience?

This is not a metaphysical meditation. Deep learning has already advanced to this point.

Clearly, the power to construct is also the power to reconstruct, and that’s tantamount to having the power to fabricate and misdirect. Though we needn’t sensationalize this, deep learning’s reconstructive potential can prove problematic in cognitive applications, given the potential for algorithmic biases to cloud decision support. If those algorithmic reconstructions skew environmental data too far from bedrock reality, the risks may be considerable for deep learning applications such as self-driving cars and prosthetic limbs upon which people’s very lives depend.

Though there’s no stopping the advance of deep learning into every aspect of our lives, we can in fact bring greater transparency into how those algorithms achieve their practical magic. As I discussed in this post, we should be instrumenting deep learning applications to facilitate identification of the specific algorithmic path (such as the end-to-end graph of source information, transformations, statistical models, metadata, and so on) that was used to construct a specific artifact or take a particular action in a particular circumstance.
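One lightweight way to picture that instrumentation is a provenance record attached to every generated artifact. This is a sketch, not any particular framework's API; all names here are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Provenance:
    """Records the algorithmic path that produced an artifact."""
    sources: List[str]                                   # original inputs
    transformations: List[str] = field(default_factory=list)
    model: str = ""                                      # generating model, if any

    def annotate(self, step: str) -> None:
        """Append one transformation to the end-to-end lineage."""
        self.transformations.append(step)

    def is_generated(self) -> bool:
        # Anything produced by a model (not merely copied) gets flagged.
        return bool(self.model)

# A synthesized image carries its full lineage alongside the pixels.
prov = Provenance(sources=["camera_raw_0042.dng"], model="gan-v1")
prov.annotate("denoise")
prov.annotate("super-resolution")

print(prov.is_generated())                    # True: flag it to the viewer
print(" -> ".join(prov.transformations))
```

The point of the sketch is the shape of the record, not the fields themselves: any viewer that receives the artifact can check `is_generated()` and surface the flag described in the next paragraph.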

Just as important, every seemingly realistic but algorithmically generated artifact that we encounter should have that fact flagged in some salient way so that we can take that into account as we’re interacting with it. Just as some people wish to know if they’re consuming genetically modified organisms, many might take interest in whether they’re engaging with algorithmically modified objects.

If we’re living in an algorithmic bubble, we should at the very least know how it’s bending and coloring whatever rays of light we’re able to glimpse through it.

Get started with TensorFlow

Machine learning couldn’t be hotter, with several heavy hitters offering platforms aimed at seasoned data scientists and newcomers interested in working with neural networks. Among the more popular options is TensorFlow, a machine learning library that Google open-sourced a year ago.

In my recent review of TensorFlow, I described the library and discussed its advantages, but only had about 300 words to devote to how to begin using Google’s “secret sauce” for machine learning. That isn’t enough to get you started.

In this article, I’ll give you a very quick gloss on machine learning, introduce you to the basics of TensorFlow, and walk you through a few TensorFlow models in the area of image classification. Then I’ll point you to additional resources for learning and using TensorFlow.
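TensorFlow's core idea, at least in its original graph-based API, is that you first describe a computation as a dataflow graph and then execute it with concrete inputs. A toy pure-Python evaluator (all names invented; this is not TensorFlow code) conveys the shape of that two-phase model:

```python
class Node:
    """A node in a toy dataflow graph: an op name plus its input nodes."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self, feed):
        """Recursively evaluate the graph, reading placeholders from `feed`."""
        if self.op == "placeholder":      # value supplied at run time
            return feed[self]
        args = [n.eval(feed) for n in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError("unknown op: " + self.op)

# Phase 1: build the graph y = (a * b) + a. Nothing is computed yet.
a = Node("placeholder")
b = Node("placeholder")
y = Node("add", Node("mul", a, b), a)

# Phase 2: execute it with concrete inputs, like running a TF session.
print(y.eval({a: 3.0, b: 4.0}))   # 15.0
```

Real TensorFlow adds tensors, automatic differentiation, and hardware placement on top of this construct-then-execute pattern, but the separation of graph building from graph running is the habit of mind the tutorials below assume.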

The best brains: AI systems that predicted Trump's win

The shock of Donald Trump’s upset victory has begun to wear off. Now the search for answers begins. In particular: How in this age of big data collection and data-crunching analytics could so many polls, economic election models, and surveys–even those by top Republican pollsters—have been so wrong going into election day?

Some got it right—Geda, the mystic monkey from China, and Felix, a Russian polar bear, for starters. A survey of Halloween presidential candidate masks also predicted a Trump presidency, as did “The Simpsons” back in 2000. And there are a lot of Democratic strategists wishing they’d given more credence this past summer to Michael Moore’s analysis of the political landscape, especially in the Rust Belt.

Looking for signs of intelligence

For those who like their predictions brewed with a dash more data, an artificial intelligence system developed by Indian startup Genic.ai successfully predicted not only the Democratic and Republican primaries, but each presidential election since 2004. To come up with its predictions, the MogIA system uses 20 million data points from online platforms such as Google, YouTube, and Twitter to gauge voter engagement.

MogIA found that Trump was topping Barack Obama’s online engagement numbers during the 2008 election by a margin of 25 percent—impressive even after factoring in the greater participation in social media today.

Sanjiv Rai, founder of Genic.ai, admits there are limitations to the data—MogIA can’t always analyze whether a post is positive or negative. Nonetheless, it has been right in predicting that the candidate with the most engagement online wins.

“If you look at the primaries, in the primaries, there were immense amounts of negative conversations that happen with regard to Trump. However, when these conversations started picking up pace, in the final days, it meant a huge game opening for Trump and he won the primaries with a good margin,” Rai told CNBC.

Artificial intelligence has advantages over more traditional data analysis programs. “While most algorithms suffer from programmers’/developers’ biases, MogIA aims at learning from her environment, developing her own rules at the policy layer, and developing expert systems without discarding any data,” Rai said. His system could also be improved by more granular data, he told CNBC—for instance, if Google gave MogIA access to the unique internet addresses assigned to each digital device.

“If someone was searching for a YouTube video on how to vote, then looked for a video on how to vote for Trump, this could give the AI a good idea of the voter’s intention,” CNBC wrote. Given the amount of data available online, using social media to predict election results is likely to become increasingly popular.

Still not convinced and wanting to blame James Comey for Clinton’s loss? MogIA predicted a Trump victory before the FBI announced it was examining new Clinton emails.

Answer me this

There are also less data-intensive ways of making accurate predictions. American University professor Allan Lichtman doesn’t rely on social media, poll results, or demographics to predict elections, but he has an even better track record than MogIA: Lichtman has correctly predicted every presidential election since 1984.

Using earthquake prediction methods that gauge stability vs. upheaval, Lichtman says he developed a set of 13 true/false statements that predict elections based on the performance of the party currently in the White House.

“There’s a real theory behind this. And the theory is presidential elections don’t work the way we think they do,” Lichtman told CBS News. “They’re not decided by the turns of the campaigns, the speeches, the debates, the fundraising. Rather, presidential elections are fundamentally referenda on the performance of the party holding the White House. If that performance is good enough, they get four more years. If it’s not, they’re turned out and the challenging party wins.”

Lichtman says his 13 keys (explained in more depth by the Washington Post) are a historically based system founded on the study of every presidential election from 1860 to 1980. His keys are simply ways of “mathematically and specifically” measuring the incumbent party’s performance based on the following factors:

  1. Party mandate
  2. Contest
  3. Incumbency
  4. Third party
  5. Short-term economy
  6. Long-term economy
  7. Policy change
  8. Social unrest
  9. Scandal
  10. Foreign/military success
  11. Foreign/military failure
  12. Incumbent charisma
  13. Challenger charisma

If six or more of his statements are false, Lichtman says, the incumbent party loses the presidency.
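The rule reduces to a simple count, which makes it easy to encode. A hypothetical Python sketch, with key names taken from the list above:

```python
# The 13 keys, each answered True (favors the incumbent party) or False.
KEYS = ["party mandate", "contest", "incumbency", "third party",
        "short-term economy", "long-term economy", "policy change",
        "social unrest", "scandal", "foreign/military success",
        "foreign/military failure", "incumbent charisma",
        "challenger charisma"]

def predict(answers):
    """Incumbent party loses when six or more keys are False."""
    false_count = sum(1 for key in KEYS if not answers[key])
    return "challenger wins" if false_count >= 6 else "incumbent party wins"

# Example: eight keys true, five false, so the incumbent party holds on.
answers = {key: i < 8 for i, key in enumerate(KEYS)}
print(predict(answers))   # incumbent party wins
```

The hard part, of course, is not the arithmetic but answering the 13 true/false questions honestly, which is where Lichtman's historical judgment comes in.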

“Donald Trump’s severe and unprecedented problems bragging about sexual assault and then having 10 or more women coming out and saying, ‘Yes, that’s exactly what you did’—this is without precedent,” Lichtman pointed out in an interview with the Washington Post. “But it didn’t change a key. By the narrowest of possible margins, the keys still point to a Trump victory.”

Here’s predicting that MogIA and Lichtman will be closely watched in the next election—in addition to Geda and Felix, of course.

Could Google or Facebook decide an election?

At this writing, it’s Wednesday morning after the U.S. election. None of my friends is sober, probably including my editor.

I originally had a different article scheduled, which assumed that I’d been wrong all along, because that’s what everyone said. The first article in which I mentioned President Trump posted on Sept. 10, 2015, and covered data analytics in the marijuana industry. Shockingly, both Trump and marijuana won big.

I thought I was being funny. Part of the reason I was sure “President Trump” was a joke was that Facebook kept nagging me to go vote. First, it wanted me to vote early; eventually it wanted me to vote on Election Day. It wasn’t only Facebook—my Android phone kept nagging me to vote. (You’d think it would have noticed that I’d already voted or at least hung out at one of the polling places it offered to find for me, but whatever.)

This made me think. With the ubiquity of Google and Facebook, could they eventually decide elections? Politics are regional. In my state, North Carolina, if you turn out votes in the center of the state it goes Democratic. If you turn out votes in the east and west, it goes Republican. Political operatives have geographically targeted voters in this manner for years, but they have to pay to get in your face. Google and Facebook are already there.

What if instead of telling everyone to vote, they were to target voters by region? Let’s say Google and Facebook support a fictitious party we’ll call Fuchsia. In districts that swing heavily Fuchsia, they push notifications saying “go vote.” In districts that go for the other guys, they simply don’t send vote notifications and ads and instead provide scant information on polling station locations. That alone could swing some areas.

Targeted notifications could have an even more dramatic effect in districts that could go either way. Google and Facebook collect tons of psychometric data; Facebook even got caught doing it. Facebook and Google don’t only know what you “like” but what you hate and what you fear. Existing political operations know this too, but Google and Facebook have it at a much more granular level.

To go a step further, what if Facebook manipulated your feed to increase your fear level if fear is the main reason you vote? What if your personalized Google News focused on your candidates’ positives or negatives depending on whether they want you to stay home or go to the polls? In fact, if you incorporate search technology against current events and the news, you could even have articles on other topics that passively mention either your candidate or the candidate you fear.

The point I’m trying to make is that the same technology used to manipulate you into buying stuff can be used to manipulate how or if you vote. We’re still a little away from this, but not far. Even a small amount of targeting could turn a close vote in a key state.

Review: Microsoft takes on TensorFlow

Like Google, Microsoft has been differentiating its products by adding machine learning features. In the case of Cortana, those features are speech recognition and language parsing. In the case of Bing, speech recognition and language parsing are joined by image recognition. Google’s underlying machine learning technology is TensorFlow. Microsoft’s is the Cognitive Toolkit. 

Both TensorFlow and Cognitive Toolkit have been released to open source. Both are complex frameworks that implement many neural network and deep learning algorithms. Both present challenges to developers new to the area. Cognitive Toolkit has recently become easier to install and deploy than it was, thanks to an automatic installation script. Cognitive Toolkit may be a little easier to use than TensorFlow right now, but that is balanced by TensorFlow’s wider applicability.

Beware dodgy data analysis

Data science is having its 15 minutes of fame.

Everyone from John Oliver of HBO’s “Last Week Tonight” to famed election statistician Nate Silver of 538.com is getting on a soapbox about the perils of believing data-based findings that lead to seemingly crazy conclusions.

John Oliver noted one particularly dodgy finding that a glass of wine was as healthy as an hour at the gym. Another “study” supposedly proved the benefits of a chocolate diet for pregnant moms. And other studies have found that the number of suicides by hanging, strangulation and suffocation is highly correlated with U.S. spending on science, space and technology.

As those of us working in the business/data analytics field know only too well, the thing that each of these strange-but-unfortunately-true studies has in common is a failure to differentiate between data that shows correlations between variables — which is a statistician’s bread and butter — and data that establishes causality — data-tested conclusions that one thing actually causes another.
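A few lines of pure Python make the trap concrete: any two series that merely trend in the same direction will score a high Pearson correlation, causal link or not. The numbers below are synthetic, invented for illustration:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two made-up series that both simply drift upward over ten years --
# say, science spending and cheese consumption. No causal link required.
spending = [18.1, 18.6, 19.0, 19.3, 20.1, 20.6, 21.0, 21.7, 22.3, 23.0]
cheese   = [29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 32.7, 33.1, 33.5]

print(round(pearson(spending, cheese), 3))   # close to 1.0
```

Establishing causality takes more than this arithmetic: controlled experiments, randomization, or at minimum controlling for the shared trend that drives both series.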

And while such confusion may not matter much if it leads to a pregnant mom eating an extra Hershey bar or two, it could be deadly to your company’s bottom line.

Tame unruly big data flows with StreamSets

Internet of things (IoT) data promises to unlock unique and unprecedented business insights, but only if enterprises can successfully manage the data flowing into their organizations from IoT sources. One problem enterprises will encounter as they try to elicit value from their IoT initiatives is data drift: changes to the structure, content, and meaning of data that result from frequent and unpredictable changes to source devices and data processing infrastructure.

Whether processed in stream or batch form, data typically moves from source to final storage locations through a variety of tools. Changes anywhere along this chain — be they schema changes to source systems, shifts in the meaning of coded field values, or an upgrade or addition to the software components involved in data production — can result in incomplete, inaccurate, or inconsistent data in downstream systems.

The effects of this data drift can be especially pernicious because they often go undetected for long periods of time, polluting data stores and subsequent analyses with low-fidelity data. Until detected, the use of this problematic data can lead to false findings and poor business decisions. When the problem is finally detected, it is usually fixed through manual data cleanup and preparation by data scientists, which adds hard costs, opportunity costs, and delays to the analysis.

StreamSets Data Collector

Using StreamSets Data Collector to build and manage big data ingest pipelines will help mitigate the effects of data drift while vastly reducing the amount of time spent cleansing data. In this article, we will walk through a typical use case of real-time data ingest of IoT sensor data into HDFS for analysis and visualization using Impala or Hive.

Without writing a single line of code, StreamSets Data Collector can ingest streaming and batch data from a large number of sources. StreamSets Data Collector can perform transformations and sanitize the data in-stream, then write to a large number of destinations. When the pipeline is placed in operation, you get fine-grained data flow metrics, detection of anomalous data, and alerting so that you can stay on top of pipeline performance. StreamSets Data Collector can run standalone or be deployed onto a Hadoop cluster, and it offers connectors to a variety of data source and destination types.

The following use case involves data generated in real time from shipping containers.

The first example of data drift manifests itself in the IoT sensors that the shipping company uses. Due to upgrades over time, the sensors in the field run one of three different firmware versions. Each revision adds new data fields and changes the schema. To derive value from that sensor data, the system we use to ingest the information must be able to handle this diversity.

Cleanse and route the data

Our pipeline reads data from a RabbitMQ system that receives MQTT messages from the sensors out in the field. We first verify that the messages we are receiving are those we want to work with. To do so, we use a stream selector processor to specify a data rule for the incoming messages. We then use this rule to declare that all data matching the rule’s criteria is routed downstream, while anything that doesn’t match is discarded.

[StreamSets figure 1]

We then use another stream selector to route data based on the firmware version of the device. All records matching firmware version 1 go to one path, those matching version 2 go to another, and so forth. We also specify a default catch-all rule to send any outliers to an “error” path. With modern data streams, we fully expect that the data will unexpectedly change, so we can set up graceful error handling that shunts anomalous records to a local file, a Kafka stream, or a secondary pipeline. That way we can keep the pipeline running and simultaneously reprocess data that doesn’t fit the primary intents after the fact.
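In StreamSets this routing is configured in the UI rather than coded, but the logic is easy to picture. A plain-Python sketch, with field names and version values invented for illustration:

```python
def route(record):
    """Mimic a stream selector: dispatch by firmware version, else 'error'."""
    version = record.get("firmware_version")
    if version in (1, 2, 3):
        return "firmware_v%d" % version
    return "error"   # outliers go to a secondary path for reprocessing

records = [{"firmware_version": 3, "temp_f": 66.2},
           {"firmware_version": 1, "temp_f": 71.0},
           {"firmware_version": 9, "temp_f": 68.5}]   # unexpected version

for r in records:
    print(route(r))
```

The catch-all branch is the important part: a record with an unrecognized firmware version is preserved on the error path rather than dropped, so it can be reprocessed once the pipeline is taught about the new schema.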

[StreamSets figure 2]

Let’s start with handling data for firmware version 3, which added latitude/longitude data. Right away we want to make sure those fields exist in the data set, and the data contains valid values. Because the location field is a nested structure, we want to flatten it and eventually discard the nested data.
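That validation-and-flattening step, done in StreamSets with pipeline processors, can be sketched in plain Python. The field names here are assumptions for illustration:

```python
def flatten_location(record):
    """Promote nested location fields to top level and drop the struct."""
    loc = record.pop("location", None)
    if loc is None or "lat" not in loc or "lon" not in loc:
        raise ValueError("missing or incomplete location data")
    # Validate by conversion: non-numeric values raise here too.
    record["latitude"] = float(loc["lat"])
    record["longitude"] = float(loc["lon"])
    return record

rec = {"device_id": "c-17", "location": {"lat": "40.71", "lon": "-74.01"}}
print(flatten_location(rec))
```

Flattening up front means downstream destinations such as Hive or Impala see simple top-level columns instead of a nested structure they would each have to unpack.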

[StreamSets figure 3]

Similarly, firmware version 2 contains new orientation fields (yaw, pitch, roll), which we can verify and sanitize in the same fashion.

Finally, all device versions contain temperature and humidity readings. First, we convert the data types of these readings. Temperature gets converted to a double, humidity to an integer, and the date to a Unix timestamp.

[StreamSets figure 4]

We then use a scripting processor to write some custom logic — such as to convert Fahrenheit values to Celsius. StreamSets scripting processors support Jython, Groovy, and JavaScript.
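Because the scripting processor accepts Jython, the custom logic can be ordinary Python. A minimal sketch of the Fahrenheit-to-Celsius step, with the record layout assumed for illustration:

```python
def to_celsius(record):
    """Convert a record's Fahrenheit temperature reading to Celsius in place."""
    f = float(record["temperature"])
    record["temperature"] = round((f - 32.0) * 5.0 / 9.0, 2)
    record["temperature_unit"] = "C"
    return record

print(to_celsius({"temperature": "98.6"}))   # temperature becomes 37.0
```

Keeping conversions like this in one scripted stage, rather than scattered across consumers, is exactly the kind of centralization that limits drift when a sensor's units change.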

After cleansing the data (that is, routing it based on firmware version and eventual use) we send it into a couple of HDFS destinations.

Configure the destination

StreamSets natively supports a large number of data formats, such as plain text, delimited, JSON, Protobuf, and Avro. In this example we will write data to a Snappy-compressed Avro file.

The HDFS destination is highly configurable. You can configure security as required by your enterprise policies, dynamically configure the path and location of your output files, and even write output compatible with multiple Cloudera CDH versions.

[StreamSets figure 5]

Once you’ve designed the pipeline, you can switch to preview mode to test and debug the data flow using a sample of the data. You can step through each processor and examine the state of the data at any stage.

[StreamSets figure 6]

For example, we see below that the data types for reading_date and temperature were converted to long and double. StreamSets will also alert you if a calculation was performed to convert the data.

[StreamSets figure 7]

You can also inject outlier or “corner case” data into the stream to see what impact it has on your flow. Preview mode gives you an easy way to debug complex pipelines without putting them into production.

Execute the pipeline

Now we’re ready to execute the pipeline and start ingesting data into our cluster. Hit the Start button and the UI will switch to execute mode.

[StreamSets figure 8]

At this point, the StreamSets Data Collector starts ingesting data, processing it in memory, and sending data into the destination. The monitoring window at the bottom of the screen displays various real-time metrics such as how many records came in and how many were written out. You can also see how much time is spent on each processor and how much memory it consumes. These metrics and a lot more are also accessible via Java Management Extensions (JMX).

As we drop data into HDFS, we can immediately start querying it with Impala and running analytics, machine learning, or visualizations.

[StreamSets figure 9]

Today, IoT devices, sensor logs, web clickstreams, and other sources of important data are constantly changing as systems are tweaked, updated, or even replatformed by their owners. These changes to data content, structure, behavior, and meaning are unpredictable, unannounced, and never-ending, and they wreak havoc with data processing and analytics systems and operations. StreamSets Data Collector helps manage the constant changes in your data infrastructure, taming data drift and preserving the integrity of your data processing systems.

Arvind Prabhakar is co-founder and CTO at StreamSets, a data performance management platform. He is an Apache Software Foundation member and an Apache Project Management Committee member of the Flume, Sqoop, Storm, and MetaModel projects. Prior to StreamSets, he held engineering roles at Cloudera, Informatica, and Sun Microsystems.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Apache Mesos users focus on big data, containers

Mesosphere, the main commercial outfit behind the Apache Mesos datacenter and container orchestration project, has taken a good look at its user base and found that they gravitate toward a few fundamental use cases.

Survey data released recently by Mesosphere in the “Apache Mesos 2016 Survey Report” indicates that Mesos users focus on running containers at scale, using Mesos to deploy big data frameworks, and relying heavily on the core tool set that Mesos and DC/OS provide rather than using substitutes.

We got this contained

Created in 2009, Mesos was built to run workloads of all types and sizes across clusters of systems. DC/OS, released by Mesosphere in 2015, automates the deployment, provisioning, and scaling of applications with Mesos as the underlying technology. In doing so, it packages Mesos for easy consumption, much as Docker made long-standing containerization techniques easy to work with.

The Mesosphere survey doesn’t cover a very large sample of users — fewer than 500, with 63 percent of those surveyed running Mesos for less than a year. Deployments are also modest — the overwhelming majority are fewer than 100 nodes — and by and large favor generic software/IT industry settings. Retail, e-commerce, telecom, and finance made up about 19 percent of the total combined.

Among the workloads deployed in Mesos, the largest slice (85 percent) covers containers and microservices, with 62 percent of all users deploying containers in production. Containers have long been a major part of Mesos’ and DC/OS’s focus, but Mesos sets itself apart from other container projects by providing a robust solution to container management, including native support for GPU-powered applications.

Do it yourself

The second biggest slice of the pie is data-centric applications. No prizes for guessing the top entry in that category: Apache Spark (43 percent of users), followed by other major big data infrastructure components like the Kafka messaging system (32 percent), the Elasticsearch search engine (26 percent), and the Cassandra NoSQL database (24 percent). Hadoop is in the mix as well, but only at 11 percent.

If there’s a takeaway to be found, it’s that specific solutions like Spark demonstrate more immediate payoffs than general solutions like Hadoop, especially when projects like DC/OS make them easier to deploy.

The survey also makes clear that Mesos users have historically put together projects themselves, but they like the idea of having the option to not have to. Of those who use Mesos, few currently do so with DC/OS’s automated deployment. Only 26 percent of those surveyed are running it in a production context, with another 12 percent “piloting for broader deployment.” That implies that many existing Mesos-powered deployments are hand-built.

However, newly minted Mesos users are going straight to DC/OS to get their Mesos fix. Eighty-seven percent of users who started with Mesos in the past six months did so through DC/OS. Thus, it’s safe to assume as DC/OS becomes more widely used and Mesos continues to evolve (it recently hit a 1.0 release), DC/OS will become the predominant preference to deploy both Mesos and all the apps that run with it.

It’s important to think about Mesos and DC/OS as complementary technologies to the rest of the container world, not total replacements for it. Kubernetes, for instance, can run in Mesos (and 8 percent of the respondents do use Kubernetes somewhere, according to the survey). Rather than eclipsing such arrangements outright, it’s more likely that DC/OS and Mesos will provide a more convenient option to build with them.

Redis module speeds Spark-powered machine learning

In-memory data store Redis recently acquired a module architecture to expand functionality. The latest module is a machine learning add-on that accelerates delivery of results from trained data rather than training itself.

Redis-ML, or the Redis Module for Machine Learning, comes courtesy of the commercial outfit that drives Redis development, Redis Labs. It speeds the execution of machine learning models while still allowing those models to be trained in familiar ways. Redis works as an in-memory cache backed by disk storage, and its creators claim machine learning models can be executed orders of magnitude more quickly with it.

The module works in conjunction with Apache Spark, another in-memory data-processing tool with machine learning components. Spark handles the data-gathering and training phases, and Redis plugs into the Spark cluster through the pre-existing Redis Spark-ML module. The model generated by Spark’s training is then saved to Redis, rather than to an Apache Parquet or HDFS data store. To execute the models, you run the queries on the Redis-ML module, not Spark itself.

In the big picture, Redis-ML offers speed: faster responses to individual queries, smaller penalties for large numbers of users making requests, and the ability to provide high availability of the results via a scale-out Redis setup. Redis Labs claims the prediction process shows “5x to 10x latency improvement over the standard Spark solution in real-time classifications.”

Another boon is specifically for developers, as Redis-ML interoperates with Scala, JavaScript (via Node.js), Python, and the .Net languages. Models “are no longer restricted to the language they were developed in,” states Redis Labs, but “can be accessed by applications written in different languages concurrently using the simple [Redis-ML] API.” Redis Labs also claims the resulting trained model is easier to deploy, since it can be accessed through said APIs without custom code or infrastructure.
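The serving pattern described here (train in Spark, persist the model in the in-memory store, evaluate per query) can be sketched in pure Python. A plain dict stands in for Redis, and the toy tree format is invented; the real Redis-ML commands and model encoding differ:

```python
# A toy decision tree serialized as nested dicts, the way a trained model
# might be saved out of Spark. `store` is a stand-in for Redis.
store = {}

tree = {"feature": "humidity", "threshold": 60,
        "left": {"leaf": "no_rain"},                 # humidity <= 60
        "right": {"feature": "temp", "threshold": 20,
                  "left": {"leaf": "rain"},          # temp <= 20
                  "right": {"leaf": "storm"}}}

store["model:weather"] = tree    # "save the trained model to Redis"

def run_model(key, sample):
    """Walk the stored tree for one sample: the low-latency query path."""
    node = store[key]
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(run_model("model:weather", {"humidity": 75, "temp": 25}))   # storm
```

Because the model lives in the shared store rather than in any one application's process, clients in any language can evaluate it concurrently through the store's API, which is the portability point Redis Labs is making.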

Accelerating Spark with other technologies isn’t a new idea. Previously, the idea was to speed up the storage back ends that Spark talks to. In fact, Redis’ engineers herald it as one such solution. Another project, Apache Arrow, speeds Spark execution (and other big data projects) by transforming data into a columnar format that can be processed more efficiently.

Redis Labs is pushing Redis as a broad solution to these problems, since its architecture (what its creators call a “structure store”) permits more free-form storage than competing database solutions. Redis VP of Product Management Cihan Biyikoglu noted in a phone interview that other databases attempt to adapt data types to the problems at hand; Redis, by contrast, instead of “shackling [you] to one data model, type, or representation,” allows “an abstraction that can house any type of data.”

If Redis Labs has a long-term plan, it’s to inch Redis toward becoming an all-in-one solution for machine learning — to provide a data-gathering and data-querying mechanism along with the machine learning libraries under one roof. To wit: Another Redis module, for Google’s TensorFlow framework, not only allows Redis to serve as backing for TensorFlow, but allows training TensorFlow models directly inside Redis.
