Q&A: Hortonworks and IBM double down on Hadoop

Hortonworks and IBM recently announced an expanded partnership. The deal pairs IBM’s Data Science Experience (DSX) analytics toolkit with the Hortonworks Data Platform (HDP), with the goal of extending machine learning and data science tools to developers across the Hadoop ecosystem. IBM’s Big SQL, a SQL engine for Hadoop, will be leveraged as well.

InfoWorld Editor at Large Paul Krill recently met with Hortonworks CEO Rob Bearden and IBM Analytics general manager Rob Thomas at the DataWorks Summit conference in Silicon Valley, to talk about the state of big data analytics, machine learning, and Hadoop’s standing among the expanding array of technologies available for large-scale data processing.

InfoWorld: What does IBM Data Science Experience bring to the Hortonworks Data Platform?

Thomas: We launched Data Science Experience last year and the idea was we saw a change coming in the data science market. Traditionally, organizations were either SPSS users or SAS users, but the whole market was moving toward open languages. We built Data Science Experience on Jupyter. It’s focused on Python data scientists and on R, Spark, and Scala programmers. You can use whatever language you want.

And you can use whatever framework you want for the machine learning underneath. You can use TensorFlow or Caffe or Theano … It’s really an open platform for data science. We focus on the collaboration, how you get data scientists working as a team as part of doing that. Think about Hadoop. Hadoop has had an enormous run in the last five to six years in enterprises. There is a lot of data in Hadoop now. There is not super value for the client by just having data there. Sometimes, there is some cost savings. Where there is super value for the client is they actually start to change how they’re interacting with that data, how they’re building models, discovering what’s happening in there.
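
To make the workflow Thomas describes concrete, here is a minimal sketch of a notebook-style PySpark job that reads data already landed in Hadoop and fits a model with Spark ML. The Hive table and column names are hypothetical placeholders, not anything from DSX or HDP specifically.

```python
# Minimal sketch of the kind of notebook workflow described above:
# reading data already landed in Hadoop and fitting a model with Spark ML.
# The table name and feature columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = (SparkSession.builder
         .appName("churn-model-sketch")
         .enableHiveSupport()          # read tables managed by Hive on the cluster
         .getOrCreate())

# Pull a (hypothetical) Hive table that was landed in the Hadoop cluster.
df = spark.sql("SELECT tenure, monthly_spend, churned FROM customer_activity")

# Assemble raw columns into a feature vector and train a simple classifier.
features = VectorAssembler(inputCols=["tenure", "monthly_spend"],
                           outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features",
                           labelCol="churned").fit(features)

print(model.coefficients)
spark.stop()
```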

InfoWorld: IBM has well-known machine learning experience with Watson. Hortonworks has positioned Apache Spark and Hadoop as its entrance into the machine learning space. Can you discuss the company’s future plans for machine learning, AI, and data science?

Bearden:  It’s going to be through the DSX framework and the IBM platforms that come through that. Hadoop and HDP will continue to be the platform. We’ll leverage some of the other processing platforms collectively like Spark and there’s a tremendous amount of work that IBM’s done to advance Spark. We’ll continue to embody that inside of HDP through YARN but then on top of all of these large data sets, we’ll leverage DSX and the rest of the IBM tool suite. We expressed that DSX and the rest of the tool suite from IBM for machine learning, deep learning, and AI will be our strategic platforms going forward and we’re going to co-invest very deeply to make sure all the integration is done properly. That goes back to being able to bring all resources into a focused distribution so that we can not only innovate horizontally but integrate vertically.

InfoWorld: InfoWorld ran a story late last year claiming that Hadoop had peaked, that other big data technologies, including Spark, MongoDB, Cassandra, and Kafka, were marching past it. InfoWorld asked Hortonworks CTO Scott Gnau a similar question last year. What can you say about the continued vitality of Hadoop?

Bearden: We’re a public company and we’re continuing to grow at 24 to 30 percent a year. The way we get paid is by bringing data under management. That’s one vector and it’s just a quantitative data point. I think what you have to then revert backwards to is, is the volume of data growing in the enterprise? According to just about any CIO you’ll speak with or any of the traditional industry analysts, and I think Rob will back this up, about every 18 months the volume of data doubles across the enterprise. About 70 to 80 percent of that data is not going to go into the traditional data platforms, the traditional SQL transactional EDW, etc., and they’re looking for that new area to come to rest, if you will. Hadoop is the right platform and architecture for that to happen. That’s why this partnership is so important. We’re great at landing that data, bringing it under management, securing it, providing the governance, etc., and being able to drive mission-critical marks on some pretty good economics. But what the enterprise really wants is the ability to gain insight from it, to get access to it, to have visibility, to be able to act on a decision and create an action that drives value for an application.
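
Taking Bearden’s 18-month doubling figure at face value, a quick back-of-the-envelope calculation shows what it implies for annual and five-year growth:

```python
# Back-of-the-envelope: if data volume doubles every 18 months,
# the implied annual growth rate and five-year multiple are:
annual_growth = 2 ** (12 / 18)          # ~1.59x per year (~59% growth)
five_year_multiple = 2 ** (60 / 18)     # ~10x over five years
print(f"{annual_growth:.2f}x per year, {five_year_multiple:.1f}x over five years")
```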

Thomas: Maybe the hype peaked but the hype always peaks when the hard work starts. I think Hadoop is still in its early days. We’ll look back at some point and it will be like sitting here in 1992 saying relational warehouses have peaked. It was just the start. We’re in the same place but the hard work has begun, which is—all right, now we’ve got the data there, how do I actually integrate this across my whole data landscape, which is why Scott talked a lot about Big SQL and what we’re doing there. That’s a really hard problem and if people don’t solve that then there’s probably a natural limitation to how much they could do with Hadoop. But together we solve that problem to the point of the whole discussion on data science, data governance. When you bring those things to Hadoop and you do it at scale, it again changes the opportunity for how fast and how widely Hadoop can be deployed.

InfoWorld: What’s going to happen with the evolution of YARN? What’s next on the roadmap for it?

Bearden: The notion of containers and having the ability to then take a container-based approach to applications and being able to do that as an extension through YARN is actually part of the roadmap today. We published that and we think that opens up new use cases and applications that can leverage Hadoop.

You go back to the ability to get to existing applications, whether it be fraud detection, money laundering, two of the typical ones that you look at in financial services. Rapid diagnostics in the healthcare world, being able to get to better processing for genomics… analyzing the genome for certain kinds of diseases and being able to take those existing algorithms or applications and moving them over to the data via a container approach. You can do that much cleaner with YARN.
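
As a rough illustration of the container-based approach Bearden describes, here is a hypothetical service definition expressed as a Python dict. The field names are loosely modeled on the YARN Services API but are assumptions for illustration; the actual spec should be checked against the Hadoop documentation.

```python
# Hypothetical sketch of a container-based service definition of the kind
# YARN's services work is aimed at; field names are illustrative assumptions
# loosely modeled on the YARN Services API, not a verified spec.
import json

fraud_scoring_service = {
    "name": "fraud-scoring",
    "components": [
        {
            "name": "scorer",
            "number_of_containers": 4,
            "artifact": {"id": "example/fraud-scorer:1.0", "type": "DOCKER"},
            "launch_command": "python score.py",
            "resource": {"cpus": 2, "memory": "4096"},
        }
    ],
}

print(json.dumps(fraud_scoring_service, indent=2))
```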

InfoWorld: Is there anything else you want to mention?

Thomas: I’d mention just one more point around data governance. We started working with Hortonworks over the last, oh, 18 months around a project called Atlas. I’d say it’s just coming into form as we’ve both been working with a lot of clients and we view it as a key part of our joint strategy around how we’re going to approach data governance. You use data governance for compliance. You use data governance for insights. There’s a big compliance mandate with things like GDPR (General Data Protection Regulation) that’s happening right now in Europe. I think you’ll see more and more on this topic in the future from us.
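
For a sense of what using Atlas for governance can look like in practice, here is a hedged sketch that registers a data asset and tags it over Atlas’s REST API. The endpoint path, payload shape, host, credentials, and classification name are assumptions for illustration and should be verified against the Atlas documentation shipped with HDP.

```python
# Hedged sketch: registering a data asset with Apache Atlas over its REST API
# so it can be tracked for governance. The endpoint path and payload shape
# are assumptions based on Atlas's v2 API; check the docs for the version
# shipped with your distribution before relying on them.
import requests

entity = {
    "entity": {
        "typeName": "hdfs_path",
        "attributes": {
            "qualifiedName": "hdfs://prod/data/customers@cluster1",  # hypothetical
            "name": "customers",
            "path": "/data/customers",
        },
        "classifications": [{"typeName": "PII"}],  # e.g., flag for GDPR handling
    }
}

resp = requests.post(
    "http://atlas.example.com:21000/api/atlas/v2/entity",  # placeholder host
    json=entity,
    auth=("admin", "admin"),  # placeholder credentials
)
resp.raise_for_status()
print(resp.json())
```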

Source: InfoWorld Big Data

Q&A: Hortonworks CTO unfolds the big data road map

Hortonworks has built its business on big data and Hadoop, but the Hortonworks Data Platform provides analytics and supports a range of technologies beyond Hadoop, including MapReduce, Pig, Hive, and Spark. Hortonworks DataFlow, meanwhile, offers streaming analytics and uses technologies like Apache NiFi and Kafka.

InfoWorld Executive Editor Doug Dineley and Editor at Large Paul Krill recently spoke with Hortonworks CTO Scott Gnau about how the company sees the data business shaking out, the Spark vs. Hadoop face-off, and Hortonworks’ release strategy and efforts to build out the DataFlow platform for data in motion.

InfoWorld: How would you define Hortonworks’ present position?

Gnau: We sit in a sweet spot where we want to leverage the community for innovation. At the same time, we also have to be somewhat the adult supervision to make sure that all this new stuff, when it gets integrated, works. That gets to one core belief that we have, that we really are responsible for a platform and not just a collection of tech. We’ve modified the way that we bring new releases to market such that we only rebase the core. When I say “rebase the core,” that means new HDFS, new YARN. We only rebase the core once a year, but we will integrate new versions of projects on a quarterly basis. When you rebase the core or bring in changes to the core Hadoop functionality, there’s a lot of interaction with the different projects. There’s a lot of testing, and it introduces instability. It’s software development 101. It’s not that it’s bad tech or bad developers. It introduces instability.

InfoWorld: This rebasing event, do you aim to do that at the same time each year?

Gnau: If we do it annually, yes, it will be at the same time each year. That would be the goal. The next target will be in the second half of 2017. In between, up to as frequently as quarterly, we will have nonrebasing releases where we’ll either add new projects or add new functionality or newer versions of projects to that core.

How that manifests itself is in a couple of advantages. Number one is we think we can get newer stuff out faster in a way that’s more consumable because of the stability that it implies for our customers. We also think, conversely, that our customers will be more amenable to staying closer to the latest release because it’s very understandable what’s in and what changed.

The example I have for that is we recently did the 2.5 release, and basically in 2.5, there were only two things we changed: Hive and Spark. It makes it very easy if you think about a customer who has their operations staff running around doing change management. Inside of it, we actually allowed for the first time that customers could choose a new version of Spark or the old version of Spark or actually run both at the same time. Now if you’re running change management, you’re saying, “OK, I can install all the new software, and I can default it to run on the old version of Spark, so I don’t have to go test anything.” Where I have feature functionality that wants to take advantage of the new version of Spark, I can simply have them use that version for those applications.
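
HDP’s documentation describes an environment variable, SPARK_MAJOR_VERSION, for choosing which installed Spark version spark-submit targets. Assuming that mechanism, a minimal sketch of the change-management scenario Gnau describes might look like this (application paths are placeholders):

```python
# Minimal sketch of the side-by-side Spark scenario described above, assuming
# HDP's documented SPARK_MAJOR_VERSION switch: default jobs stay on Spark 1.x,
# and only applications that need Spark 2 opt in. Paths are placeholders.
import os
import subprocess

def submit(app_path: str, spark_major: str = "1") -> None:
    env = dict(os.environ, SPARK_MAJOR_VERSION=spark_major)
    subprocess.run(["spark-submit", app_path], env=env, check=True)

submit("/apps/legacy_etl.py")                           # untouched, runs on Spark 1.x
submit("/apps/new_streaming_job.py", spark_major="2")   # opts in to Spark 2
```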

InfoWorld: There’s been talk that Spark is displacing Hadoop. What’s happening as far as Spark versus Hadoop?

Gnau: I don’t think it’s Spark versus Hadoop. It’s Spark and Hadoop. We’ve been very successful and a lot of customers have been very successful down that path. I mentioned that even in our new release where, when the latest version of Spark came out, within 90 minutes of it being published to Git, it was in our distribution. We’re highly committed to that as an execution engine for the use cases where it’s popular, so we’ve invested not only in the packaging, but also with the contributions and committers we have, and in tools like Apache Zeppelin, which enables data scientists and Spark users to create notebooks and be more efficient about how they share algorithms and how they optimize the algorithms that they’re writing against those data sets. I don’t view it as either/or but more as an “and.”

In the end, for business-critical applications that are making a difference and are customer-facing, there is a lot of value behind the platform in security, operationalization, backup and recovery, business continuity, and all those things that come with a platform. Again, I think the “and” becomes more important than the “or.” Spark is really good for some workloads and really horrible for others, so I don’t think it’s Spark versus the world. I think it’s Spark and the world for the use cases where it makes sense.

InfoWorld: Where does it make sense? Obviously you’re committed to Hive for SQL. Spark also offers a SQL implementation. Do you make use of that? This space is interesting in that all these platform vendors want to offer every tool for basically every kind of processing.

Gnau: There are Spark vendors that want to offer only Spark.

InfoWorld: That’s true. I’m thinking of Cloudera, you and MapR, the established Hadoop vendors. These platforms have lots of tools, and we’d like to understand which of those tools are being used for what sorts of analytics.

Gnau: Simplistic, interactive workloads on reasonably small sets of data fit Spark. If you get into petabytes, you’re not going to be able to buy enough memory to make Spark work effectively. If you get into very sophisticated SQL, it’s not going to run. Yes, there are many tools for many things, and ultimately there is that interactive, simplistic, memory-resident, small-data use case that Spark fits. When you start to get to the bleeding edge of any of those parameters, it’s going to be less effective, and the goal is to have that then bleed into Hive.
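
The memory point is easy to see with rough numbers. The cluster size and per-node RAM below are illustrative assumptions, not figures from Hortonworks:

```python
# Rough arithmetic behind the "you can't buy enough memory" point.
# Cluster size and per-node RAM are illustrative assumptions.
nodes = 100
ram_per_node_gb = 256
dataset_pb = 1

cluster_ram_tb = nodes * ram_per_node_gb / 1024      # 25 TB of RAM
dataset_tb = dataset_pb * 1024                        # 1,024 TB of data
print(f"{cluster_ram_tb:.0f} TB of RAM vs {dataset_tb} TB of data "
      f"-> {dataset_tb / cluster_ram_tb:.0f}x short")
```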

InfoWorld: How opinionated can you be about your platform and how free are you in deciding you are no longer going to support a tool or are retiring a tool?

Gnau: The hardest thing any product company can do is retire a product, the most horrid thing in the world. I don’t know that you will see us retire a whole lot, but maybe there will be things that get put out to pasture. The nice thing is that there is still a live community out there, so even though we may not be focused on trying to drive investment because we’re not seeing demand in the market, there will still be a community [that] can go out and pick up things, so I see it more as an out to pasture.

InfoWorld: To take one example, Storm is still obviously a core element and I assume that’s because you’ve decided it’s a better way to do stream processing than Spark or others.

Gnau: It’s not a better way. It provides windowing functions, which are important to a number of use cases. I can imagine a world where you’ll write SQL and you’ll send that SQL off, and we’ll grab it and we’ll actually help decide how it should run and where it should run. That’s going to be necessary for the thing itself to be sustainable.

There are some capabilities along those lines that we’re doing here and there as placeholders, but I think as an industry, if we don’t make it simpler to consume, there will be a problem industry-wide, regardless of whether we’re smart or Cloudera is smart, whatever. It will be an industry problem because it won’t be consumable by the masses. It’s got to be consumable and easy. We’re going to create some tools that will help you decide how you deploy and help you manage things, so you can have an application that thinks it’s talking to an API rather than “I’ve got to run Hive for this and HBase for this” and having to understand all those different things.
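
Gnau’s reference to windowing functions is worth unpacking for readers who haven’t worked with stream processing. The following is a plain-Python sketch of a tumbling-window count; it illustrates the concept only and is not Storm’s actual API:

```python
# Plain-Python illustration of a tumbling window (the kind of windowing
# function Gnau refers to); this is a generic sketch, not Storm's API.
from collections import defaultdict

WINDOW_SECONDS = 60

def tumbling_window_counts(events):
    """events: iterable of (timestamp_seconds, key) tuples, in arrival order."""
    counts = defaultdict(int)
    window_start = None
    for ts, key in events:
        if window_start is None:
            window_start = ts
        if ts - window_start >= WINDOW_SECONDS:   # window closed: emit and reset
            yield window_start, dict(counts)
            counts.clear()
            window_start = ts
        counts[key] += 1
    if counts:
        yield window_start, dict(counts)

# Example: count login events per user in one-minute windows.
stream = [(0, "alice"), (10, "bob"), (65, "alice"), (70, "alice")]
for start, window in tumbling_window_counts(stream):
    print(start, window)
```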

InfoWorld: Can you identify technologies that are emerging that you expect to be in the platform in the coming year or so?

Gnau: The biggest thing that is important is the whole notion of data in motion versus data at rest. When I say “data in motion,” I’m not talking about just streaming. I’m not talking about just data flow. I’m talking about data that’s moving and how do you do all of those things? How do you apply complex event processing, simple event processing? How do you actually guarantee delivery? How do you encrypt and protect, and how do you validate and create provenance for data in motion? I see that as a huge bucket of opportunity.

Obviously, we made the acquisition of Onyara and released Hortonworks DataFlow based on Apache NiFi. Certainly that’s one of the most visible things. I would say that it is not NiFi alone; what you see inside Hortonworks DataFlow includes NiFi and Storm and Kafka, a bunch of components. You’ll see us building out DataFlow as a platform for data in motion, and we already have and will continue to invest along those lines. When I’m out and about and people say, “What do you think about streaming?” I say, well, streaming is a very small subset of the data-in-motion problem. It’s an important thing to solve, but we need to think about it as a bigger opportunity because we don’t want to solve just one problem and then have six other problems that prevent us from being successful. That’s going to be driven by devices, IoT, all the buzzwords out there.
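
Guaranteed delivery is one concrete slice of the data-in-motion problem Gnau describes. As an illustration, here is a Kafka producer configured for stronger delivery guarantees using the kafka-python client; the broker address and topic are placeholders:

```python
# One small slice of the "data in motion" problem described above: a Kafka
# producer configured for stronger delivery guarantees. Broker address and
# topic are placeholders; this assumes the kafka-python client library.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1.example.com:9092"],
    acks="all",                      # wait for all in-sync replicas to acknowledge
    retries=5,                       # retry transient broker failures
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"device_id": "sensor-42", "temp_c": 21.7, "ts": 1497571200}
# .get() blocks until the broker acknowledges, surfacing delivery errors.
producer.send("device-telemetry", value=event).get(timeout=10)
producer.flush()
```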

InfoWorld: In this data-in-motion future, how central or how important is a time series database, a database built to store time series data as opposed to using something else?

Gnau: Time series analytics are important. I would submit that there are a number of ways that those analytics can be engineered. Time series database is one of the ways. I don’t know that a specific time series database is required for all the use cases. There may be other ways to get to the same answer, but time series and the temporal nature of data are increasingly important, and I think you will see some successful projects come up along those lines.
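
As an example of time-series analytics done without a dedicated time-series database, here is a small pandas rollup over synthetic sensor readings; the same bucketing could be pushed down to Hive or Spark on data stored in Hadoop:

```python
# Illustration of the point that time-series analytics don't necessarily
# require a dedicated time-series database: a windowed rollup done in pandas.
# The readings below are synthetic.
import pandas as pd
import numpy as np

readings = pd.DataFrame({
    "ts": pd.date_range("2017-06-01", periods=1440, freq="min"),
    "temp_c": 20 + np.random.randn(1440).cumsum() * 0.05,
})

hourly = (readings.set_index("ts")
                  .resample("1H")          # temporal bucketing by hour
                  .agg(["mean", "min", "max"]))
print(hourly.head())
```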

Source: InfoWorld Big Data

Facebook taps deep learning for customized feeds

Serving more than a billion people a day, Facebook has its work cut out for it when providing customized news feeds. That is where the social network giant takes advantage of deep learning to serve up the most relevant news to its vast user base.

Facebook is challenged with finding the best personalized content, Andrew Tulloch, Facebook software engineer, said at the company’s recent @scale conference in Silicon Valley. “Over the past year, more and more, we’ve been applying deep learning techniques to a bunch of these underlying machine learning models that power what stories you see.”

Applying such concepts as neural networks, deep learning is used in production in event prediction, machine translation models, natural language understanding, and computer vision services. Event prediction, in particular, is one of the largest machine learning problems at Facebook, which must serve the top couple of stories out of thousands of possibilities for users, all in a few hundred milliseconds. “Predicting relevance in and of itself is a very challenging problem in general and relies on understanding multiple content modalities like text, pixels from images and video, and the social context,” Tulloch said.
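
The serving problem Tulloch describes, scoring thousands of candidates and keeping only the top few within a latency budget, can be sketched schematically. The following is a generic illustration with a linear scoring model standing in for Facebook’s deep models; it is not Facebook’s actual system:

```python
# Schematic of the ranking problem described above: score thousands of
# candidate stories with a trained model and keep the top few. This is a
# generic illustration, not Facebook's actual system.
import numpy as np

def rank_stories(candidate_features: np.ndarray, weights: np.ndarray, k: int = 3):
    """candidate_features: (n_stories, n_features); returns indices of the top-k."""
    # A linear relevance model stands in for the deep models described above.
    scores = candidate_features @ weights
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
features = rng.random((2000, 16))        # ~thousands of candidate stories
weights = rng.random(16)
print(rank_stories(features, weights))   # indices of the 3 highest-scoring stories
```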

The company must also deal with content posted in more than 100 languages daily, thus complicating classic machine learning, Tulloch said. Text must be understood at a deep level for proper ranking and display. In its deep learning efforts, Facebook has gone with its DeepText text understanding engine, which reads and understands users’ posts and has been open-sourced in part.

Tech jobs report: Security, devops, and big data stay hot

If you’re wondering what IT skill sets to acquire, security and devops are doing well in the job market. Pay for cloud skills, however, is eroding.

Research firm Foote Partners’ latest quarterly IT Skills and Certifications Pay Index determined that the market value for 404 of the 450 IT certifications it tracks had increased for 12 consecutive quarters. Market values rose for noncertified IT skills for the fifth consecutive quarter.

Foote’s report is based on data provided by 2,845 North American private and public sector employers, with data compiled from January to April 1. (Noncertified skills include skills that are in demand but for which there is no official certification, Foote spokesman Ted Lane noted.)

Security skills command increasing salaries, with no end in sight

In the security space, Foote found that values for 76 certifications have been on a slow and steady path upward for two years, with an 8.7 percent average increase. The certifications’ values have risen 6.3 percent in the past year. “Strong-performing certifications in the first three months of 2016 were those in IT security management and architecture, penetration testing, forensics, and cybersecurity,” the report said.