The era of the cloud database has finally begun

Folks, it’s happening. Although enterprises have spent the last few years shifting on-premises workloads to the public cloud, databases have been a sticking point. Sure, Amazon Web Services can point to 64,000 database migrations over the last two years, but that still leaves millions more stuck in corporate datacenters.

But not, it would appear, for long.

Ryanair, Europe’s largest airline, just signaled a significant shift in cloud migrations, announcing that it is “going all-in” on AWS, moving its infrastructure to the cloud leader. But what makes this so important is that it also includes mention of Ryanair moving away from Microsoft SQL Server and replacing it with Amazon Aurora, “standardizing on … AWS databases.”

When companies embrace cloud databases wholesale, it’s effectively game over.

Why migrating databases to the cloud has been so hard

Source: InfoWorld Big Data

All your streaming data are belong to Kafka

Apache Kafka is on a roll. Last year it registered a 260 percent jump in developer popularity, as Redmonk’s Fintan Ryan highlights, a number that has only ballooned since then as IoT and other enterprise demands for real-time, streaming data become common. Hatched at LinkedIn, Kafka’s founding engineering team spun out to form Confluent, which has been a primary developer of the Apache project ever since.

But not the only one. Indeed, given the rising importance of Kafka, more companies than ever are committing code, including Eventador, started by Kenny Gorman and Erik Beebe, both co-founders of ObjectRocket (acquired by Rackspace). Whereas ObjectRocket provides the MongoDB database as a service, Eventador offers a fully managed Kafka service, further lowering the barriers to streaming data.

Talking with the Eventador co-founders, it became clear that streaming data is different, requiring “fresh eyes” because “data being mutated in real time enables new use cases and new possibilities.” Once an enterprise comes to depend on streaming data, it’s hard to go back. Getting to that point is the key.

Kafka vs. Hadoop

As popular as Apache Hadoop has been, the Hadoop workflow is simply too slow for the evolving needs of modern enterprises. Indeed, as Gorman tells it, “Businesses are realizing that the value of data increases as it becomes more real-time.” Companies that prefer to wait on adding a real-time data flow to their products and services face the very real likelihood that their competitors are not content to sit on their batchy laurels.

This trend is driving the adoption of technologies that can reliably and scalably deliver and process data as near real-time as possible. New frameworks dedicated to this architecture needed to exist. Hence, Apache Kafka was born.

What about Apache Spark? Well, as Gorman points out, Spark is capable of real-time processing, but isn’t optimally suited to it. The Spark streaming frameworks are still micro-batch by design.

This leaves Kafka, which “can offer a true exactly once, one-at-a-time processing solution for both the transport and the processing framework,” Gorman explains. Beyond that, additional components like Apache Flink, Beam, and others extend the functionality of these real-time pipelines to allow for easy mutation, aggregation, filtering, and more. All the things that make a mature, end-to-end, real-time data processing system.
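The difference between micro-batch and one-at-a-time processing can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it uses neither the Spark nor the Kafka API, the function names are hypothetical, and the batches here are cut by record count, whereas real micro-batch engines typically cut them by time interval.

```python
from typing import Callable, Iterable, List


def process_per_record(events: Iterable[int], handle: Callable[[int], None]) -> None:
    """One-at-a-time: each event reaches the handler as soon as it arrives."""
    for event in events:
        handle(event)


def process_micro_batches(
    events: Iterable[int],
    batch_size: int,
    handle_batch: Callable[[List[int]], None],
) -> None:
    """Micro-batch: events wait in a buffer until the batch is full."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []  # start a fresh buffer; the handed-off list is not reused
    if batch:  # flush the final, partially filled batch
        handle_batch(batch)


# Five events, handled one by one: the handler is called five times.
per_record_calls: List[int] = []
process_per_record(range(5), per_record_calls.append)

# The same five events in batches of two: the handler is called three times,
# and event 0 is not seen until event 1 has also arrived.
batch_calls: List[List[int]] = []
process_micro_batches(range(5), 2, batch_calls.append)
```

The point of the sketch is the added wait: in the batched version, the first event in every batch sits in the buffer until the batch closes, which is the latency that per-record processing avoids.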

Kafka’s pub-sub model

This wouldn’t matter if Kafka were a beast to learn and implement, but it’s not (on either count). As Gorman highlights, “The beauty of Apache Kafka is it exposes a powerful API yet has very simple semantics. It is all very approachable.” Not only that, but its API has been implemented in many different programming languages, so the odds are good that your favorite language has a driver available.

Kafka has the notion of a topic, which is simply a namespace for a stream of data. It’s very simple to publish data to a topic, and Kafka handles the routing, scalability, durability, availability, etc. Multiple consumers coordinate subscription to these topics, to fetch data and process or route it. Asked about how this translates into the application development experience, Gorman stressed that it’s not trivial but it’s straightforward: “Building applications that work with Kafka is fairly easy [as] the client libraries handle much of the nuances of the communication, and developers utilize the API to publish or subscribe to streams of data.”
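To make the topic idea concrete, here is a toy, in-memory model of a topic in Python. It is a sketch, not Kafka's client API: the `Topic` class, its `publish` and `poll` methods, and the group names are all hypothetical, and it assumes a single partition with no durability or networking. What it does show is the core semantics described above: publishers append to a named stream, and each group of consumers keeps its own read position.

```python
from collections import defaultdict
from typing import Any, Dict, List


class Topic:
    """A toy, single-partition stand-in for a topic: an append-only log."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._log: List[Any] = []
        # Each consumer group tracks its own read position (offset).
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: Any) -> int:
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group: str, max_messages: int = 10) -> List[Any]:
        """Return unread messages for a group and advance that group's offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


events = Topic("events")
events.publish("signup")
events.publish("login")

first_poll = events.poll("billing")   # billing reads both messages
events.publish("logout")
second_poll = events.poll("billing")  # billing sees only the new message
audit_poll = events.poll("audit")     # audit starts from the beginning
```

Because offsets are kept per group, adding a new consumer group never disturbs the existing ones; each simply reads the same log at its own pace.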

The problem, if any, isn’t the technology. Rather, it’s a question of paradigms.

The real trick for developers, Gorman tells me, is “to think about using streaming data with a fresh pair of eyes.” Why? Because “data being mutated in real time enables new use cases and new possibilities.”

Let’s look at a tangible example. Perhaps a client publishes data about ridership of a ride-sharing service. One set of consumers analyzes this stream to perform machine learning algorithms for dynamic pricing, then another set of consumers reads the data to provide location and availability of the cars to customers’ mobile devices. Yet another consumer feeds an aggregation framework for ridership data to internal dashboards. Kafka is at the core of a data architecture that can feed all kinds of business needs, all real-time.
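The ride-sharing scenario can be sketched as three independent consumers reading the same stream. Everything below is illustrative: the event fields, the zone names, and the pricing formula are assumptions made up for the example, and a plain Python list stands in for the Kafka topic. In a real deployment each function would presumably run as its own consumer group.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical event shape; the field names are not from any real service.
ride_events: List[dict] = [
    {"car": "A", "zone": "downtown", "lat": 40.71, "lon": -74.00, "status": "occupied"},
    {"car": "B", "zone": "downtown", "lat": 40.72, "lon": -74.01, "status": "occupied"},
    {"car": "C", "zone": "airport", "lat": 40.64, "lon": -73.78, "status": "available"},
    {"car": "A", "zone": "downtown", "lat": 40.73, "lon": -74.02, "status": "available"},
]


def pricing_consumer(events: List[dict]) -> Dict[str, float]:
    """Stand-in for the pricing model: the multiplier grows with occupied rides per zone."""
    demand = Counter(e["zone"] for e in events if e["status"] == "occupied")
    return {zone: 1.0 + 0.25 * count for zone, count in demand.items()}


def availability_consumer(events: List[dict]) -> Dict[str, Tuple[float, float]]:
    """Latest position of every car whose most recent status is 'available'."""
    latest: Dict[str, dict] = {}
    for e in events:
        latest[e["car"]] = e  # later events overwrite earlier ones
    return {
        car: (e["lat"], e["lon"])
        for car, e in latest.items()
        if e["status"] == "available"
    }


def dashboard_consumer(events: List[dict]) -> Dict[str, int]:
    """Events seen per zone, as an internal dashboard might aggregate them."""
    return dict(Counter(e["zone"] for e in events))


surge = pricing_consumer(ride_events)
cars_for_app = availability_consumer(ride_events)
zone_totals = dashboard_consumer(ride_events)
```

Each consumer derives a different view from the same events without coordinating with the others, which is what lets one stream feed pricing, the customer app, and reporting at once.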

Kafka in the cloud

This is great for developers and the companies for which they work, but Kafka demand is no guarantee of Eventador’s success, given that it has to compete with Confluent, which has the distinction of having been founded by Kafka’s original engineering team. What’s more, Confluent, too, has announced a cloud offering that likely will compete with Eventador’s Kafka service.

Gorman is not bothered. As he describes,

The real difference is that we aren’t limited just to Kafka. We use Kafka where it makes the most sense. We are an end-to-end, enterprise-grade, stream processing framework built on Apache Kafka and Apache Flink. We have connectors for AWS S3, a REST interface, integration with PrestoDB and Jupyter notebooks, as well as connections for popular databases and even other streaming systems like AWS Kinesis. We offer plans from a simple single node to full on-prem enterprise configurations.

Besides, given the booming demand for real-time data, Gorman believes there is room for many different players. Not only does Eventador complement Kafka with Flink and more, it has taken to heart Rackspace’s mantra for “fanatical customer support,” which starts with a well-built, fully integrated product. Having spent decades doing operations for some of the world’s largest companies, Gorman continues, “We know what it means to run a first class, professional quality, rock solid, as-a-service offering.”

He’s absolutely right that the market is still young. Developers are still working to understand how Kafka can be integrated into their projects. The use cases are expanding every day, driven by this need to compete with data.

Years from now, however, “It will be common to rely on streaming data in your infrastructure,” Gorman points out, “and not just some ancillary workload.” This is the future they’re building for. “Once you start expecting data to be more real-time, it’s hard to stop.” Eventador, Confluent, and undoubtedly others are building for this real-time, streaming data future. For some, that future is now. For others, these startups hope to get them there sooner.

Source: InfoWorld Big Data

NoSQL, no problem: Why MySQL is still king

MySQL is a bit of an attention hog. With relational databases supposedly put on deathwatch by NoSQL, MySQL should have been edging gracefully to the exit by now (or not so gracefully, like IBM’s DB2).

Instead, MySQL remains neck-and-neck with Oracle in the database popularity contest, despite nearly two decades less time in the market. More impressive still, while Oracle’s popularity keeps falling, MySQL is holding steady. Why?

An open gift that keeps on giving

While both MySQL and Oracle lost favor relative to their database peers, as measured by DB-Engines, MySQL remains hugely popular, second only to Oracle (and not by much):

[Chart: MySQL ranking. Source: DB-Engines]

Looking at how these two database giants are trending and adding in Microsoft SQL Server, only MySQL continues to consistently grow in popularity:

[Chart: MySQL search interest. Source: Google]

While general search interest in MySQL has fallen over the years, roughly in line with falling general search interest in Oracle and Microsoft SQL Server, professional interest (as measured by Stack Overflow mentions) has remained relatively firm. More intriguing, it dwarfs every other database:

[Chart: MySQL mentions on Stack Overflow. Source: Stack Overflow]

The script wasn’t written this way. NoSQL, as I’ve written, boomed in the enterprise as companies struggled to manage the volume, velocity, and variety of modern data (the three V’s of big data, according to Gartner). Somehow MySQL not only survived, but thrived.

Like a comfortable supershoe

Sure, NoSQL found a ready audience. MongoDB, in particular, has attracted significant interest, so much so that the company is now reportedly past $100 million in revenue and angling to IPO later this year.

Yet MongoDB hasn’t toppled MySQL, nor has Apache Cassandra or Apache Hadoop, as former MySQL executive Zack Urlocker told me: “MongoDB, Cassandra, and Hadoop all have worthwhile specialized use cases that are sufficiently hard to do in [a] relational database. So they can be decent sized businesses (less than $100 million) but they are unlikely to be as common as relational.” Partly this stems from the nature of most big data today: still transactional in nature, and hence well-suited to the neat rows and columns of an RDBMS.

This coincides with the heart of MySQL’s popularity: It’s a great database that fits the skill sets of the broadest population of database professionals. Even better, they can take all they learned growing up with Oracle, IBM DB2, and Microsoft SQL Server and apply it to an omnipresent, free, and open source database. What’s not to love?

Scale, for one.

Actually, that was the original rap against MySQL and all relational databases: They could scale up but not out, and we live in a scale-out world. As it turns out, “It actually can scale” quite well, Linux Foundation executive Chris Aniszczyk affirmed to me. While it may have started from an architecturally underprivileged standpoint, engineers at the major web companies like Google and Facebook had huge incentives to engineer scale into it. As examples of MySQL at scale proliferated, Pivotal vice president James Bayer suggested to me, it bred confidence that MySQL was a strong go-to option for demanding workloads.

This isn’t to suggest that MySQL is an automatic winner when it comes to scale. As developer DJ Walker-Morgan puts it, “NoSQL takes care of scaling like me buying diet food takes care of weight loss: only if strict disciplines and careful management is applied.” Again, enough examples exist that developers are motivated to give it a try, especially since it’s so familiar to a broad swath of the DBA community. Also, as Server Density CEO David Mytton underscored to me, “[M]anaged services like RDS … [and] Aurora in particular solve[] a lot of scale pain” for MySQL.

Which is why, 22 years after it first hit the proverbial shelves, MySQL is arguably the most popular database on earth. It doesn’t have the “enterprise grade” label that Oracle likes to slap on its database, and it doesn’t have the “built for horizontal scale” marketing that carried NoSQL so far, but it’s the default choice for yesterday’s and today’s generation of developers.

The fact that it’s free doesn’t hurt, but the fact that it’s a free, powerful, familiar relational database? That’s a winning combination.

Source: InfoWorld Big Data

Why Splunk keeps beating open source competitors

All essential data infrastructure these days is open source. Or rather, nearly all — Splunk, the log analysis tool, remains stubbornly, happily proprietary. Despite a sea of competitors, the best of them open source, Splunk continues to generate mountains of cash.

The question is why. Why does Splunk exist given that “no dominant platform-level software infrastructure has emerged in the last 10 years in closed-source, proprietary form,” as Cloudera co-founder Mike Olson has said? True, Splunk was founded in 2003, 10 years before Olson’s declaration, but the real answer for Splunk’s continued relevance may come down to both product completeness and industry inertia.

Infrastructure vs. solution

To the question of why Splunk still exists in a world awash in open source alternatives, Rocana CEO Omer Trajman didn’t mince words in an interview: “We could ask the same question of the other dinosaurs that have open source alternatives: BMC, CA, Tivoli, Dynatrace. These companies continue to sell billions of dollars a year in software license and maintenance despite perfectly good alternative open source solutions in the market.”

The problem is that these “perfectly good open source solutions” aren’t — solutions, that is.

As Trajman went on to tell me, open source software tends to “come as a box of parts and not as a complete solution. Most of the dollars being spent on Splunk are from organizations that need a complete solution and don’t have the time or the talent to build a do-it-yourself alternative.”

Iguazio founder and CTO Yaron Haviv puts it this way: “Many [enterprises] also look for integrated/turn-key [solutions] vs DIY,” with open source considered the ultimate do-it-yourself alternative.

Sure, the “path to filling gaps” between Elasticsearch and Splunk may be “obvious,” Trajman continues, but “executing on it is less than trivial.” Nor is this the hardest problem to overcome.

An industry filled with friction

That problem is inertia. As Trajman told me, “Every company that runs Splunk [13,000 according to their latest earnings report], was once not running Splunk. It’s taken nearly 14 years for those massive IT ships to incorporate Splunk into their tool chest, and they still continue to run BMC, CA, Tivoli and Dynatrace.” As such, “Even if the perfect out-of-the-box open source solution were to magically make its way onto every Splunk customer’s desks, they would still use Splunk, at least for some transitionary period.”

In other words, even if companies are embracing open source alternatives in droves, we’re still going to see healthy Splunk adoption.

It doesn’t hurt that Splunk, unlike its open source competitors, gets pulled into all sorts of jobs for which it offers a good enough, though not perfect, fit. According to Box engineer Jeff Weinstein, “misuse” is a primary driver of Splunk’s continued adoption, by which he means enterprises pushing data into Splunk for jobs it may not be particularly well-suited to manage. Splunk is flexible enough, he points out, that you “can abuse Splunk syntax to do anything and it kind [of] works on long historical time scale back data.” This means, Weinstein says, that “for many companies, [Splunk] is the ad hoc query system of last resort.” Open source options may abound, he notes, but don’t “give as much flexibility on query.”

Moreover, Splunk is “trusted,” Weinstein concludes, in an “old-school IBM style.” That is, not everyone may love it but at least “no one hates it.”

In short, while there are signs that open source alternatives like Elastic’s ELK will continue to progress, it’s unclear that any of these open offerings will seriously dent Splunk’s proprietary approach. Splunk simply offers too much in a world that prizes flexibility over an open license. This may not be the case five years from now, but for now Splunk stands supreme in a market that has otherwise gone wholesale for open source.

Source: InfoWorld Big Data

Unlike big data, IoT may live up to the hype

Big data has long promised more than it delivers, at least for most enterprises. While a shift to cloud pledges to help, big data deployments are still more discussed than realized, with Gartner insisting that only 14 percent of enterprises have gotten Hadoop off the ground.

Will the other darling of the chattering class, IoT (internet of things), meet the same fate? In fact, IoT might deliver, according to new data from Talend compiled in conjunction with O’Reilly. Dubbing 2016 “the year IoT ‘grew up,’” the report declares 2017 the year that “IoT starts to become essential to modern business.”

How and where IoT gets real, however, may surprise you.

The new hyped kid on the block

IoT has been proclaimed the $11 trillion savior of the global economy, which has translated into IoT becoming even bigger than big data, at least in terms of general interest. This Google Trends chart shows IoT surpassing big data in search instances around the middle of last year:

[Chart: IoT vs. big data search interest. Source: Google Trends]

If we get more specific on “big data” and instead use Apache Hadoop, Apache Spark, or MongoDB, all hugely popular big data technologies, the crossover is even more pronounced. IoT has arrived (without its security intact, but why quibble?). Indeed, as the Talend report avers, “[W]hile the buzz around big data is louder, the actual adoption of big data in industry isn’t much larger than the adoption of IoT.”

That’s right: IoT is newer, yet sees nearly as much adoption as big data. In fact, IoT, as the source for incredible amounts of data, could actually be what makes big data real. The question is where.

Betting on boring

The answer to that question, according to the Talend report, which trawled through more than 300TB of live data to glean its insights, is not where the analysts keep insisting:

We found that IoT spending today is for use cases that are much different than those predicted by McKinsey, Gartner, and others. For example, the greatest value/consumer surplus predicted by McKinsey was in factories around predictive maintenance and inventory management, followed by healthcare and smart city–related use cases like public safety and monitoring. While these use cases may be the top producers of surplus in 2025, we do not see much spend on those use cases today. In contrast, home energy and security is low on the McKinsey list, but that’s where the market is today, in addition to defense and retail.

It’s not that the analysts are wrong when they pick out details like industrial automation as incredibly ripe for IoT disruption, so long as we don’t assume “ripe” means “developed to the point of readiness for harvesting or eating.” Given the complexity of introducing significant changes into something like factory automation, such industries most definitely are not “ripe” for IoT. The potential is huge, but so are the pitfalls holding back change.

Home energy and security, by contrast, are relatively straightforward. Or, as the report continues, areas like health care are in desperate need of disruption, but the likes of online patient monitoring “seems 100 times more complex than simple home monitoring or personalized displays for in-store customers.”

Hence, home energy (9 percent) and security (25 percent) account for the biggest chunk of IoT deployments in 2016, with defense (14 percent) and retail (11 percent) also significant. Health care? A mere 4 percent.

Given that regulation and complexity are inimical to real-world IoT adoption, it’s perhaps not surprising that unlike big data, which is mostly a big company phenomenon, IoT shows “more continuous adoption … across large and small companies.” As such, IoT deployments are more evenly spread across geographies, rather than following big data’s concentration on the coasts.

In sum, IoT could well end up being a truly democratizing trend, a “bottom-up” approach to innovation.

Source: InfoWorld Big Data

Hadoop vendors make a jumble of security

A year ago a Deutsche Bank survey of CIOs found that “CIOs are now broadly comfortable with [Hadoop] and see it as a significant part of the future data architecture.” They’re so comfortable, in fact, that many CIOs haven’t thought to question Hadoop’s built-in security, leading Gartner analyst Merv Adrian to query, “Can it be that people believe Hadoop is secure? Because it certainly is not.”

That was then, this is now, and the primary Hadoop vendors are getting serious about security. That’s the good news. The bad, however, is that they’re approaching Hadoop security in significantly different ways, which promises to turn big data’s open source poster child into a potential pitfall for vendor lock-in.

Can’t we all get along?

That’s the conclusion reached in a Gartner research note authored by Adrian. As he writes, “Hadoop security stacks emerging from three independent distributors remain immature and are not comprehensive; they are therefore likely to create incompatible, inflexible deployments and promote vendor lock-in.” This is, of course, standard operating procedure in databases or data warehouses, but it calls into question some of the benefit of building on an open source “standard” like Hadoop.

Ironically, it’s the very openness of Hadoop that creates this proprietary potential.

It starts with the inherent insecurity of Hadoop, which has come to light with recent ransomware attacks. Hadoop hasn’t traditionally come with built-in security, yet Hadoop systems “increase utilization of file system-based data that is not otherwise protected,” as Adrian explains, allowing “new vulnerabilities [to] emerge that compromise carefully crafted data security regimes.” It gets worse.

Organizations are increasingly turning to Hadoop to create “data lakes.” Unlike databases, which Adrian says tend to contain “known data that conforms to predetermined policies about quality, ownership, and standards,” data lakes encourage data of indeterminate quality or provenance. Though the Hadoop community has promising projects like Apache Eagle (which uses machine intelligence to identify security threats to Hadoop clusters), it has yet to offer a unified solution to lock down such data and, worse, is offering a mishmash of competing alternatives, as Adrian describes:

Big data security, in short, is a big mess.

Love that lock-in

The specter of lock-in is real, but is it scary? I’ve argued before that lock-in is a fact of enterprise IT, made no better (or worse) by open source … or cloud or any other trend in IT. Once an enterprise has invested money, people, and other resources into making a system work, it’s effectively locked in.

Still, there’s arguably more at stake when a company puts petabytes of data into a Hadoop data lake versus running an open source content management system or even an operating system. The heart of any business is its data, and getting boxed into a particular Hadoop vendor because an enterprise becomes dependent on its particular approach to securing Hadoop clusters seems like a big deal.

But is it really?

Oracle, after all, makes billions of dollars “locking in” customers to its very proprietary database, so much so that it had double the market share (41.6 percent) of its nearest competitor (Microsoft at 19.4 percent) as of April 2016, according to Gartner’s research. If enterprises are worried about lock-in, they have a weird way of showing it.

For me the bigger issue isn’t lock-in, but rather that the competing approaches to Hadoop security may actually yield poorer security, at least in the short term. The enterprises that deploy more than one Hadoop stack (a common occurrence) will need to juggle the conflicting security approaches and almost certainly leave holes. Those that standardize on one vendor will be stuck with incomplete security solutions.

Over time, this will improve. There’s simply too much money at stake for the on-prem and cloud-based Hadoop vendors. But for the moment, enterprises should continue to worry about Hadoop security.

Source: InfoWorld Big Data

Devs will lead us to the big data payoff at last

In 2011, McKinsey & Co. published a study trumpeting that “the use of big data will underpin new waves of productivity growth and consumer surplus” and called out five areas ripe for a big data bonanza. In personal location data, for example, McKinsey projected a $600 billion increase in economic surplus for consumers. In health care, $300 billion in additional annual value was waiting for that next Hadoop batch process to run.

Five years later, according to a follow-up McKinsey report, we’re still waiting for the hype to be fulfilled. A big part of the problem, the report intones, is, well, us: “Developing the right business processes and building capabilities, including both data infrastructure and talent” is hard and mostly unrealized. All that work with Hadoop, Spark, Hive, Kafka, and so on has produced less benefit than we thought it would.

In part that’s because keeping up with all that open source software and stitching it together is a full-time job in itself. But you can also blame the bugbear that stalks every enterprise: institutional inertia. Not to worry, though: The same developers who made open source the lingua franca of enterprise development are now making big data a reality through the public cloud.

Paltry big data progress

On the surface the numbers look pretty good. According to a recent SyncSort survey, a majority of respondents (62 percent) are looking to Hadoop for advanced/predictive analytics, with data discovery and visualization (57 percent) also commanding attention.

Yet when you examine this investment more closely, a comparatively modest return emerges in the real world. By McKinsey’s estimates, we’re still falling short for a variety of reasons:

  • Location-based data has seen 50 to 60 percent of potential value captured, mainly because not everyone can afford a GPS-enabled smartphone
  • In U.S. retail, we’re seeing 30 to 40 percent, due to a lack of analytical talent and an abundance of still-siloed data
  • Manufacturing comes in at 20 to 30 percent, again because data remains siloed in legacy IT systems and because management remains unconvinced that big data will drive big returns
  • U.S. health care limps along at a dismal 10 to 20 percent, beset by poor interoperability and data sharing, along with a paucity of proof that clinical utility will result
  • The E.U. public sector also lags at 10 to 20 percent, thanks to an analytics talent shortage and data siloed in various government agencies

These aren’t the only areas measured by McKinsey, but they provide a good sampling of big data’s impact across a range of industries. To date, that impact has been muted. This brings us to the most significant hole in big data’s process: culture. As the report authors describe:

Adapting to an era of data-driven decision making is not always a simple proposition. Some companies have invested heavily in technology but have not yet changed their organizations so they can make the most of these investments. Many are struggling to develop the talent, business processes, and organizational muscle to capture real value from analytics.

Given that people are the primary problem holding up big data’s progress, you could be forgiven for abandoning all hope.

Big data’s cloudy future

Nonetheless, things may be getting better. For example, in a recent AtScale survey of more than 2,500 data professionals across 1,400 companies and 77 countries, roughly 20 percent of respondents reported clusters of more than 100 nodes, a full 74 percent of which are in production. This represents double-digit year-over-year growth.

It’s even more encouraging to see where these nodes are running, which probably accounts for the increase in success rates. According to the same survey, more than half of respondents run their big data workloads in the cloud today and 72 percent plan to do so going forward. This aligns with anecdotal data from Gartner that interest in data lakes has mushroomed along with a propensity to build those lakes in public clouds.

This makes sense. Given that the very nature of data science — asking questions of our data to glean insight — requires a flexible approach, the infrastructure powering our big data workloads needs to enable this flexibility. In an interview, AWS product chief Matt Wood makes it clear that because “your resource mix is continually evolving, if you buy infrastructure it’s almost immediately irrelevant to your business because it’s frozen in time.”

Infrastructure elasticity is imperative to successful big data projects. Apparently more and more enterprises got this memo and are building accordingly. Perhaps not surprising, this shift in culture isn’t happening top-down; rather, it’s a bottom-up, developer-driven phenomenon.

What should enterprises do? Ironically, it’s more a matter of what they shouldn’t do: obstruct developers. In short, the best way to ensure an enterprise gets the most from its data is to get out of the way of its developers. They’re already taking advantage of the latest and greatest big data technologies in the cloud.

Source: InfoWorld Big Data

Who took the 'no' out of NoSQL?

For years we’ve seen the database market split between the traditional relational database and new-school NoSQL databases. According to Gartner, however, these two worlds are heading toward further consolidation. As Gartner analyst Nick Heudecker opines, “Each week brings more SQL into the NoSQL market subsegment. The NoSQL term is less and less useful as a categorization.”

Yet that promised “consolidation” may not be all that Gartner predicts. If anything, we may be seeing NoSQL databases—rich in flexibility, horizontal scalability, and high performance—don enough of the RDBMS’s SQL clothing to ultimately displace the incumbents. But the “NoSQL vendor” most likely to dominate over the long term may surprise you.

NoSQL: Wrong name, right idea

“NoSQL” has always been somewhat of a misnomer, both because it purports to exclude SQL and because it lumps together very different databases under a common framework. A graph database like Neo4j, for example, is completely different from a columnar database like Cassandra.

What they share, however, is a three-fold focus, as Kelly Stirman, CMO at a stealth analytics startup and former MongoDB executive, told me in an interview. In his words, “NoSQL introduced three key innovations that the market has embraced and that the traditional vendors are working to add: 1) flexible data model, 2) distributed architecture (critical for cloud), and 3) flexible consistency models (critical for performance).”

Each element was critical to enabling modern, increasingly cloud-based applications, and each has presented traditional RDBMSes with a host of problems. Yes, most RDBMSes have implemented good enough but not great flexible data models. Yes, they’re also attempting flexible consistency models, with varying levels of (non)success. And, yes, they’re all trying to embrace a distributed architecture and finding it a brutally tough slog.

Even so, these attempts by the RDBMSes to become more NoSQL-like have led, in the words of DataStax chief evangelist Patrick McFadin in a conversation, to a “great convergence” that ultimately yields “multimodel” databases. Importantly, McFadin continued, this same convergence is taking place among the NoSQL databases as they add various components of the RDBMS in an attempt to hit massive mainstream adoption.

But make no mistake, such convergence is not without its problems.

Convergence interrupted

As Rohit Jain, CTO at Esgyn, describes it:

It is difficult enough for a query engine to support single operational, BI, or analytical workloads (as evidenced by the fact that there are different proprietary platforms supporting each). But for a query engine to serve all those workloads means it must support a wider variety of requirements than has been possible in the past. So, we are traversing new ground, one that is full of obstacles.

This inability to have one data model rule them all afflicts the RDBMS more than NoSQL, Mat Keep, director of product and market analysis at MongoDB, told me: “Relational databases have been trying to keep up with the times as well. But most of the changes they’ve made have been stopgaps–adding new data types rather than addressing the core inflexibility of the relational data model, for example.”

Meanwhile, he notes, “Our customers have a desire to stop managing many special snowflakes and converge on a single, integrated platform that provides all the new capabilities they want with the reliability and full features that they need.” DataStax has been doing the same with Cassandra. Both companies are expanding their NoSQL footprints with support for the likes of graph databases, while also going deeper on SQL with connectors that translate SQL queries into a language that document and columnar databases can understand.

None of these efforts really speaks to NoSQL’s long-term advantage over the venerable RDBMS. Everybody wants to speak SQL because that’s where the primary body of skills resides, given decades of enterprise build-up around SQL queries. But the biggest benefit of NoSQL, and the one that RDBMSes have failed to master, according to Stirman, is its distributed architecture.

Jared Rosoff, chief technologist of Cloud Native Apps at VMware, underlines this point: “Even if all the databases converged on SQL as query language, the NoSQL crowd benefits from a fundamentally distributed architecture that is hard for legacy engines to replace.” He continues, “How long is it going to take to get MySQL or Postgres or Oracle or SQL Server to support a 100-node distributed cluster?”

Though both the RDBMS and NoSQL camps have their challenges with convergence, Rosoff argues, “It’s way easier for the NoSQL crowd to become more SQL-like than it is for the SQL crowd to become more distributed,” and “a fully SQL compliant database that doesn’t scale that well” will be inferior to “a fully distributed database that supports only some subset of SQL.”

In short, SQL is very useful but replaceable. Distributed computing in our big data world, quite frankly, is not.

Winner take some

In this world of imperfect convergence, NoSQL seems to have the winning hand. But which NoSQL vendor will ultimately dominate?

Early momentum goes to MongoDB and DataStax-fueled Cassandra, but Stirman suggests a different winner entirely:

What the market really wants is an open source database that is easy to use and flexible like MongoDB, scales like Cassandra, is battle hardened like Oracle, all without changing their security and tooling. MongoDB is best positioned to deliver this, but AWS is most likely to capture the market long term.

Yes, AWS, the same company that most threatens to own the Hadoop market, not to mention enterprise infrastructure generally. Amazon, the dominant force in the public cloud, is best positioned to capitalize on the enterprise shift toward the cloud and the distributed applications that live there. Database convergence, in sum, may ultimately be Bezos’ game to lose.

Source: InfoWorld Big Data

Hadoop, we hardly knew ye

It wasn’t long ago that Hadoop was destined to be the Next Big Thing, driving the big data movement into every enterprise. Now there are clear signs that we’ve reached “peak Hadoop,” as Ovum analyst Tony Baer styles it. But the clearest indicator of all may simply be that “Hadoop” doesn’t actually have any Hadoop left in it.

Or, as InfoWorld’s Andrew Oliver puts it, “The biggest thing you need to know about Hadoop is that it isn’t Hadoop anymore.”

Nowhere is this more true than in newfangled cloud workloads, which eschew Hadoop for fancier options like Spark. Indeed, as with so much else in enterprise IT, the cloud killed Hadoop. Or perhaps Hadoop, by moving too fast, killed Hadoop. Let me explain.

Is Hadoop and the cloud a thing of the past?

The fall of Hadoop has not been total, to be sure. As Baer notes, Hadoop’s “data management capabilities are not yet being matched by Spark or other fit-for-purpose big data cloud services.” Furthermore, as Oliver describes, “Even when you’re not using Hadoop because you’re focused on in-memory, real-time analytics with Spark, you still may end up using pieces of Hadoop here and there.”

By and large, however, Hadoop is looking decidedly retro in these cloudy days. Even the Hadoop vendors seem to have moved on. Sure, Cloudera still tells the world that Cloudera Enterprise is “powered by Apache Hadoop.” But if you look at the components of its cloud architecture, it’s not Hadoop all the way down. IBM, for its part, still runs Hadoop under the hood of its BigInsights product line, but if you use its sexier new Watson Data Platform, Hadoop is missing in action.

The reason? Cloud, of course.

As such, Baer is spot on to argue, “The fact that IBM is creating a cloud-based big data collaboration hub is not necessarily a question of Spark vs. Hadoop, but cloud vs. Hadoop.” Hadoop still has brand relevance as a marketing buzzword that signifies “big data,” but its component parts (HDFS, MapReduce, and YARN) are largely cast aside for newer and speedier cloud-friendly alternatives as applications increasingly inhabit the cloud.

Change is constant, but should it be?

Which is exactly as it should be, argues Hadoop creator Doug Cutting. Though Cutting has pooh-poohed the notion that Hadoop has been replaced by Spark or has lost its relevance, he also recognizes the strength that comes from software evolution. Commenting on someone’s observation that Cloudera’s cloud stack no longer has any Hadoop components in it, Cutting tweeted: “Proof that an open source platform evolves and improves more rapidly. Entire stack replacement in a decade! Wonderful to see.”

It’s easy to overlook what a powerful statement this is. If Cutting were a typical enterprise software vendor, not only would he not embrace the implicit accusation that his Hadoop baby is ugly (requiring replacement), but also he’d do everything possible to lock customers into his product. Software vendors get away with selling Soviet-era technology all the time, even as the market sweeps past them. Customers locked into long-term contracts simply can’t or don’t want to move as quickly as the market does.

For an open source project like Hadoop, however, there is no inhibition to evolution. In fact, the opposite is true: Sometimes the biggest problem with open source is that it moves far too quickly for the market to digest.

We’ve seen this to some extent with Hadoop, ironically. A year and a half ago, Gartner called out Hadoop adoption as “fairly anemic,” despite its outsized media attention. Other big data infrastructure quickly marched past it, including Spark, MongoDB, Cassandra, Kafka, and more.

Yet there’s a concern buried in this technological progress. One of the causes of Hadoop’s market adoption anemia has been its complexity. Hadoop skills have always fallen well short of Hadoop demand. Such complexity is arguably exacerbated by the fast-paced evolution of the big data stack. Yes, some of the component parts (like Spark) are easier to use, but not if they must be combined with an ever-changing assortment of other component parts.

In this way, we might have been better off with a longer shelf life for Hadoop, as we’ve had with Linux. Yes, in Linux the modules are constantly changing. But there’s a system-level fidelity that has enabled a “Linux admin” to actually mean something over decades, whereas keeping up with the various big data projects is much more difficult. In short, rapid Hadoop evolution is both testament to its flexibility and cause for concern.

Source: InfoWorld Big Data

Big data grab: Now they want your car's telemetry

A year ago the management consulting giant McKinsey & Co. predicted that the internet of things (IoT) could unlock $11 trillion in economic value by 2025. It’s a bold claim, particularly given that IoT currently proves more useful in launching massive DDoS attacks than in recognizing that I need to buy more milk.

Now, McKinsey has a new projection. It involves cars, and it declares that data “exhaust” from autos will be worth $750 billion by 2030. The consulting firm even goes so far as to lay out exactly how we can grab that revenue. If only it were as easy to make money off car data — which consumers may not want to share — as it is to prognosticate about it.

Follow these two easy steps

The automotive industry is huge, which is a big reason that Google, Apple, and others have been looking for opportunities to disrupt it in their favor. Given how much time we spend in our cars, particularly in North America, and how much data those cars generate, it’s easy to imagine massive new auto-related businesses built entirely on data. After all, Uber is a giant data-crunching company, not a cab company.

This isn’t simply a market for one Uber to dominate, suggests McKinsey in its new report, “Monetizing Car Data.” As the report authors conclude, the opportunity to monetize car data could be worth $450 billion to $750 billion within the next 13 years.

The hitch is getting there. According to McKinsey’s analysis:

The opportunity for industry players hinges on their ability to 1) quickly build and test car data-driven products and services focused on appealing customer propositions and 2) develop new business models built on technological innovation, advanced capabilities, and partnerships that push the current boundaries of the automotive industry.

Let me paraphrase: $750 billion can be had by anyone who can 1) figure out cool new products that lots of people will want to buy and 2) sell those products in such a manner that people will pay for them. Um, thanks, McKinsey!

What the report doesn’t say is that auto exhaust, the data on which these hypothetical businesses will be based, may be a little more closely guarded than web exhaust.

Ideas are easy, execution is hard

In a rather blithe and generic manner, McKinsey gets one thing right about this new market: “The first challenge on the path towards car data monetization is communicating to the end customers exactly what is in it for them.”

On the web, the value proposition of giving up personal data in exchange for free stuff has simply become part of the furniture. The tech industry has no problem treating consumers as products. Last week, for example, Google (very quietly) changed its ad policies to enable much more invasive tracking of consumer behavior.

Will it be any different in Autopia?

Let’s assume for a minute that it will be. After all, data about where you go and how you drive generally has more serious implications than which websites you visit.

What’s the incentive for consumers to share that data? McKinsey lists a range of reasons, from proactive maintenance to better insurance rates. However, these suggestions tend to overlook history: We haven’t generally been willing to proactively pay for security, we don’t like the idea of giving insurance companies the ability to lower our rates through data (because it will more likely result in raising our rates through that same data), and so on.

On the other hand, we may simply not care enough to stop it. The younger the demographic, the less likely it is to be concerned about privacy, the report unsurprisingly finds, while 90 percent of those surveyed by McKinsey are already aware that “data is openly accessible to applications and third parties.” Given that Pandora’s box filled with data is already open, it’s not surprising that 79 percent of those surveyed are “willing to consciously grant access to their data,” a percentage that has climbed 11 points since 2015.

Yet businesses still need to figure out how to monetize this willingness to trade data for services. Uber has already figured it out and presumably plenty more such companies are waiting to be born. The market for car data will likely be big, but capitalizing on it will plow through consumer privacy in ways hitherto unimagined.

Source: InfoWorld Big Data