MXNet review: Amazon's scalable deep learning

Deep learning, which is basically neural network machine learning with multiple hidden layers, is all the rage—both for problems that justify the complexity and high computational cost of deep learning, such as image recognition and natural language parsing, and for problems that might be better served by careful data preparation and simple algorithms, such as forecasting the next quarter’s sales. If you actually need deep learning, there are many packages that could serve your needs: Google TensorFlow, Microsoft Cognitive Toolkit, Caffe, Theano, Torch, and MXNet, for starters.
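That definition — a neural network with multiple hidden layers — can be made concrete with a toy forward pass in plain NumPy (none of the frameworks named above, just an illustration of the structure):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """Forward pass through a small fully connected network.
    Every layer before the last is a 'hidden' layer; stacking
    more than one of them is what makes the network 'deep'."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)          # hidden layer: affine transform + nonlinearity
    W_out, b_out = weights[-1], biases[-1]
    return a @ W_out + b_out         # output layer: plain affine transform

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]                 # input -> two hidden layers -> output
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

y = forward(rng.normal(size=(3, 4)), weights, biases)
print(y.shape)                       # (3, 1): one prediction per input row
```

Frameworks like MXNet add what this sketch lacks: automatic differentiation for training, GPU execution, and the multi-host scaling Vogels highlights.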

I confess that I had never heard of MXNet (pronounced “mix-net”) before Amazon CTO Werner Vogels noted it in his blog. There he announced that in addition to supporting all of the deep learning packages I mentioned above, Amazon decided to contribute significantly to one in particular, MXNet, which it selected as its deep learning framework of choice. Vogels went on to explain why: MXNet combines the ability to scale to multiple GPUs (across multiple hosts) with good programmability and good portability.

Source: InfoWorld Big Data

8 big data predictions for 2017

Market research and advisory firm Ovum estimates the big data market will grow from $1.7 billion in 2016 to $9.4 billion by 2020. As the market grows, enterprise challenges are shifting, skills requirements are changing, and the vendor landscape is morphing. The coming year promises to be a busy one for big data pros. Here are some predictions from industry watchers and technology players.

1. Data scientist demand will wane

Demand for data scientists is softening, suggests Ovum in its report on big data trends. The research firm cites data from Indeed.com that shows flat demand for data scientists over the past four years. At the same time, colleges and universities are turning out a greater number of graduates with data science credentials.

Source: InfoWorld Big Data

Get started with Azure Machine Learning

Machine learning is fast becoming the go-to predictive paradigm for data scientists and developers alike. Of the many tools available for tapping neural networks, Microsoft’s Azure ML Studio offers a quick learning curve that won’t take deep data or coding chops to get up and running.

Microsoft Azure Machine Learning Studio is a cloud service for performing value prediction (regression), anomaly detection, structure discovery (clustering), and category prediction (classification). While my previous tutorial for TensorFlow revealed how Google’s open source machine learning and deep neural network library requires you to roll up your sleeves a bit before digging in, Azure ML Studio’s graphical, modular approach will have you testing machine learning models quickly, as you will see below.

Let’s get started.

Source: InfoWorld Big Data

Move over Memcached and Redis, here comes Netflix's Hollow

After two years of internal use, Netflix is offering a new open source project as a powerful option to cache data sets that change constantly.

Hollow is a Java library and toolset aimed at in-memory caching of data sets up to several gigabytes in size. Netflix says Hollow’s purpose is threefold: It’s intended to be more efficient at storing data; it can provide tools to automatically generate APIs for convenient access to the data; and it can automatically analyze data use patterns to more efficiently synchronize with the back end.

Let’s keep this between us

Most of the scenarios for caching data on a system where it isn’t stored—a “consumer” system rather than a “producer” system—involve using a product like Memcached or Redis. Hollow is reminiscent of both products since it uses in-memory storage for fast access, but it isn’t an actual data store like Redis.

Unlike many other data caching systems, Hollow is intended to be coupled to a specific data set—a given schema with certain fields, typically a JSON stream. This requires some prep work, although Hollow provides some tools to partly automate the process. The reason for doing so: Hollow can store the data in-memory as fixed-length, strongly typed chunks that aren’t subject to Java’s garbage collection. As a result, they’re faster to access than conventional Java objects.
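Hollow's actual encoding is more sophisticated than this, but the core idea — fixed-length, strongly typed records packed into one contiguous buffer instead of thousands of individual heap objects — can be sketched in a few lines of Python (the schema here is hypothetical):

```python
import struct

# Hypothetical schema: a record with a 32-bit id and a 64-bit timestamp.
RECORD = struct.Struct("<iq")        # fixed length: 4 + 8 = 12 bytes per record

def pack_records(records):
    """Pack (id, timestamp) tuples into one contiguous buffer.
    The buffer is a single allocation, so the records inside it are
    never individually tracked or visited by the garbage collector."""
    buf = bytearray(RECORD.size * len(records))
    for i, rec in enumerate(records):
        RECORD.pack_into(buf, i * RECORD.size, *rec)
    return bytes(buf)

def get_record(buf, i):
    """Random access by index: pure offset arithmetic, no per-object lookup."""
    return RECORD.unpack_from(buf, i * RECORD.size)

buf = pack_records([(1, 1483228800), (2, 1485907200)])
print(get_record(buf, 1))            # (2, 1485907200)
```

Because every record has the same known length, the i-th record is always at byte offset `i * 12` — which is what makes this layout both compact and fast to access.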

Another purported boon with Hollow is that it provides a gamut of tooling for working with the data. Once you’ve defined a schema for the data, Hollow can automatically produce a Java API that can supply autocomplete data to an IDE. The data can also be tracked as it changes, so developers have access to point-in-time snapshots, differences between snapshots, and data rollbacks.

Faster all around

A lot of the advantages Netflix claims for Hollow involve basic operational efficiency—namely, faster startup time for servers and less memory churn. But Hollow’s data modeling and management tools are also meant to help with development, not simply speed production.

“Imagine being able to quickly shunt your entire production data set—current or from any point in the recent past—down to a local development workstation, load it, then exactly reproduce specific production scenarios,” Netflix says in its introductory blog post.

One caveat is that Hollow isn’t suited for data sets of all sizes—“KB, MB, and GB, but not TB,” is how the company puts it in its documentation. That said, Netflix also implies that Hollow reduces the amount of sprawl required by a cached data set. “With the right framework, and a little bit of data modeling, that [memory] threshold is likely much higher than you think,” Netflix writes.

Source: InfoWorld Big Data

AI is coming, and will take some jobs, but no need to worry

The capabilities of artificial intelligence and machine learning are accelerating, and many cybersecurity tasks currently performed by humans will be automated. There will still be plenty of work to go around, so job prospects should remain good, especially for those who keep up with technology, broaden their skill sets, and get a better understanding of their company’s business needs.

Cybersecurity jobs won’t go the way of telephone operators. Take, for example, Spain-based antivirus company Panda Security. When the company first started, there were a number of people reverse-engineering malicious code and writing signatures.

“If we still were working in the same way, we’d need hundreds of thousands of engineers,” said Luis Corrons, technical director at PandaLabs.

Instead, the company’s researchers created tools that do most of those jobs.

“That means that nowadays we only have to take a look at a tiny portion of the new malicious code that shows up every day—more than 200,000 new malware samples per day. I cannot imagine how we could do our main task, protecting our customers, without AI.”

Does that mean that hundreds of thousands of engineering jobs have been destroyed? Of course not, he said.

“Being realistic, no company could afford that,” he said.

In fact, AI has actually created new jobs, he said, including roles improving internal systems and creating new ones, and jobs for mathematicians applying AI to those systems.

“I get asked a lot by parents and college students about where they should be focusing, and security is where I think there are a lot of opportunities,” said Karin Klein, founding partner at Bloomberg Beta, Bloomberg’s venture fund that invests in early-stage tech companies.

There’s a great shortage of talent in the industry, and a growing need for security professionals, she said.

AI tools will put more power in your hands

AI promises to automate repetitive tasks and those that require the processing of large amounts of information.

But the industry needs that, since there’s too much for humans to process on their own.

“It’s more about augmentation rather than automation,” said Klein.

That’s been a common theme for the cybersecurity companies she’s been investing in, she said, adding that she is very optimistic about what the AI technology will bring.

“It’s going to help that over-stressed IT guy who is trying to manage everything,” said Dale Meredith, author and cybersecurity trainer at Pluralsight. “It’s going to help him have more time to look at what’s important for the company.”

AI is just another tool, he said.

“And it’s coming along at the right time,” he added. “Think of the amount of data we have now compared to just five years ago.”

New technologies, like the Internet of Things, promise to generate even more data, said Jason Hong, a professor in Carnegie Mellon’s School of Computer Science and an expert in AI and cyber security.

“Almost every aspect, every dimension of society now relies on computers, and the need for security keeps on growing,” he said.

That will allow individual analysts to do more than they can today, and do it more effectively.

“In the near term there are still plenty of positions and not enough professionals,” said Bryan Ware, CEO at Haystax Technology. “But over time, AI will allow analysts to be more productive, automating low-level tasks and intelligently alerting the analyst.”

For example, better AI will make it easier for security professionals to sort through mountains of noise to find actual indicators of compromise, said David Campbell, CSO at SendGrid, a Denver marketing company that suffered a breach last year.

“AI will help speed the identification and prediction of security breaches,” he said. “This will bolster career prospects for security professionals that are adept at divergent thinking, and limit career prospects for more traditional SOC analysts that respond to alerts without considering the larger picture.”

With AI automating out the horrible, routine, cutting-and-pasting jobs, most of the growth in the cybersecurity profession will be in forensic investigations, said Kris Lovejoy, CEO at security firm Acuity Solutions.

That may require additional training, she said—not necessarily a full university course, but something like a SANS training program.

“The security field currently requires lots and lots of manual labor,” she said. “You’ve got folks doing either very entry-level jobs, almost IT administration, and very sophisticated folks with lots of education spending 80 percent of their time waiting for something to load.”

That gets frustrating and burns people out. With automation, the jobs are going to become more interesting—and there might be less churn in the profession as a result, she said.

There will also be new job opportunities when it comes to properly deploying AI tools.

“AI isn’t free,” said Haystax’s Ware. “Many techniques require significant algorithm training, data mark up, and testing that has to be done by humans.”

The care and feeding of AI also involves ensuring that the AIs have highly available, highly secure infrastructure on which to run, said David Molnar, IEEE member and senior researcher at Microsoft.

“Highly available infrastructure because if the AI stops, the business suffers,” he said. “Security, because if the AI gets bad data or the AI is hacked, then the business makes bad decisions.”

The CSO’s job will increasingly be about protecting the AI’s role in business, and understanding the processes around the AI.

And the CSO might also need to act as a mediator between the AI and the rest of the company.

“To establish legitimacy for an AI driven decision, the CSO must help the rest of the business leaders advocate and explain that process to the world,” he said. “It isn’t going to be easy, but it will put the CSO at the heart of every business.”

Finding ways to apply AI to a business will also require a different way of thinking.

“A successful AI strategy requires very multi-disciplinary skills,” said Hossein Rahnama, CEO and founder at Flybits, and a visiting scholar at the Human Dynamics group at the MIT Media Lab.

“Many AI experts are very much siloed in the past, and lack the experience of communicating business use cases. Translating AI research into business value is something very important.”

To get training in this area, he recommends looking at programs that combine a foundational understanding of AI with an understanding of public policy implications.

“There are a number of universities working on programs directly addressing those needs,” he said. “Stanford is looking there, and there are some interesting initiatives at MIT.”

There are also learning opportunities available beyond traditional colleges and universities and training institutes, said Kunal Anand, CTO and co-founder at security firm Prevoty.

He recommends attending conferences around machine learning and data science, and subscribing to blogs and mailing lists.

“And look at open source projects,” he added. “The best way to learn is to build.”

Branching out

Security analysts typically don’t have to write new code at their jobs. But there could be more opportunities to do that in the future.

“Learn to code,” said SendGrid’s Campbell. “Professionals seeking careers in security will need to be able to code in order to be successful.”

He suggested languages like Python, Ruby and Node.js.

“Being able to code and interpret these languages will help career prospects differentiate themselves and provide greater value for organizations looking to automate security tasks,” he said.

On a higher level as well, security professionals can help improve their companies’ software. Automated tools can spot common vulnerabilities, but it takes a human to understand logical flaws, said Giovanni Vigna, co-founder and CTO at Lastline.

“For example, the fact that a coupon in an e-commerce application should be applicable only once is something that is immediately obvious to a human,” he said.

That might not be, strictly speaking, a technical vulnerability, but it is a security issue, and requires human judgment, and imagination, to understand.

“No amount of AI would allow a program to understand what a program does in every case,” Vigna said. “It’s actually a fundamental theorem of computer science, called ‘The Halting Problem’.”
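Vigna's point rests on the standard diagonalization argument, which can be written out as a short sketch (the `halts` oracle here is hypothetical by construction — the argument shows it cannot exist):

```python
def halts(program, argument):
    """Hypothetical oracle: returns True iff program(argument) halts.
    No total, always-correct implementation of this can exist."""
    raise NotImplementedError

def paradox(program):
    # Do the opposite of whatever the oracle predicts about
    # running the program on its own source.
    if halts(program, program):
        while True:
            pass                     # loop forever
    return "halted"

# Now ask: does paradox(paradox) halt? If halts() says yes, paradox
# loops forever; if it says no, paradox halts. Either answer is wrong,
# so halts() cannot exist -- and, by extension (Rice's theorem), no
# tool can decide every behavioral property of every program.
```

That is why spotting the coupon-reuse flaw Vigna describes — a question about what the program *should* do, not just what it does — remains human work.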

Computers will also lag behind in leading and innovating, said Peter Metzger, vice chairman and cybersecurity and business risk expert at DHR International, an executive search firm.

“We’re still going to need people to lead, decide, and get things done,” he said.

Providing business value

As the routine tasks get automated, humans will be able to focus on making strategic, values-driven decisions.

That will require a true understanding of the business, and how that is intertwined with technology, said Diana Kelley, global executive security adviser for IBM Security.

“I recommend that cybersecurity pros beef up their 360-degree skills,” she said. “Get an understanding of the business, an understanding of the stakeholders in their work.”

That could involve working closely with the legal department and understanding what they do, or helping with media outreach or marketing.

“This is extremely challenging and difficult,” she said. “But to be valuable, you need to understand how people are interacting with their technology. Cybersecurity is a very fascinating area that is very horizontal, it goes through all the areas of a business.”

Another area that cybersecurity pros can look at is that of education.

“Humans are great at explaining things to other humans,” she said. “That is something that we see at IBM. Someone who can explain things in a clear way that someone else can understand can be very valuable, not just for other security professionals, but also for a general audience, too.”

And if the education task involves teaching the AI systems how to do cybersecurity, InfoSec experts shouldn’t be worried that they are a traitor to humanity, she added.

“You’re a helper to humanity,” she said. “There is so much data and it’s so hard to keep up with it that this is about throwing out that life jacket, helping people to float. It’s not about getting rid of humans. It’s about making our existing humans super-human.”

This story, “AI is coming, and will take some jobs, but no need to worry” was originally published by CSO.

Source: InfoWorld Big Data

Who took the 'no' out of NoSQL?

For years we’ve seen the database market split between the traditional relational database and new-school NoSQL databases. According to Gartner, however, these two worlds are heading toward further consolidation. As Gartner analyst Nick Heudecker opines, “Each week brings more SQL into the NoSQL market subsegment. The NoSQL term is less and less useful as a categorization.”

Yet that promised “consolidation” may not be all that Gartner predicts. If anything, we may be seeing NoSQL databases—rich in flexibility, horizontal scalability, and high performance—don enough of the RDBMS’s SQL clothing to ultimately displace the incumbents. But the “NoSQL vendor” most likely to dominate over the long term may surprise you.

NoSQL: Wrong name, right idea

“NoSQL” has always been somewhat of a misnomer, both because it purports to exclude SQL and because it lumps together very different databases under a common framework. A graph database like Neo4j, for example, is completely different from a columnar database like Cassandra.

What they share, however, is a three-fold focus, as Kelly Stirman, CMO at a stealth analytics startup and former MongoDB executive, told me in an interview. In his words, “NoSQL introduced three key innovations that the market has embraced and that the traditional vendors are working to add: 1) flexible data model, 2) distributed architecture (critical for cloud), and 3) flexible consistency models (critical for performance).”

Each element was critical to enabling modern, increasingly cloud-based applications, and each has presented traditional RDBMSes with a host of problems. Yes, most RDBMSes have implemented good enough but not great flexible data models. Yes, they’re also attempting flexible consistency models, with varying levels of (non)success. And, yes, they’re all trying to embrace a distributed architecture and finding it a brutally tough slog.

Even so, these attempts by the RDBMSes to become more NoSQL-like have led, in the words of DataStax chief evangelist Patrick McFadin in a conversation, to a “great convergence” that ultimately yields “multimodel” databases. Importantly, McFadin continued, this same convergence is taking place among the NoSQL databases as they add various components of the RDBMS in an attempt to hit massive mainstream adoption.

But make no mistake, such convergence is not without its problems.

Convergence interrupted

As Rohit Jain, CTO at Esgyn, describes it:

It is difficult enough for a query engine to support single operational, BI, or analytical workloads (as evidenced by the fact that there are different proprietary platforms supporting each). But for a query engine to serve all those workloads means it must support a wider variety of requirements than has been possible in the past. So, we are traversing new ground, one that is full of obstacles.

This inability to have one data model rule them all afflicts the RDBMS more than NoSQL, Mat Keep, director of product and market analysis at MongoDB, told me: “Relational databases have been trying to keep up with the times as well. But most of the changes they’ve made have been stopgaps–adding new data types rather than addressing the core inflexibility of the relational data model, for example.”

Meanwhile, he notes, “Our customers have a desire to stop managing many special snowflakes and converge on a single, integrated platform that provides all the new capabilities they want with the reliability and full features that they need.” DataStax has been doing the same with Cassandra, as both companies expand their NoSQL footprints with support for the likes of graph databases, but also going deeper on SQL with connectors that allow SQL queries to be translated into a language that document and columnar databases can understand.
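The connectors mentioned above are proprietary and handle full SQL grammars, but the shape of the translation — SQL text in, document-database query out — can be sketched for one simple pattern (the function and the Mongo-style output format are illustrative, not any vendor's actual API):

```python
import re

def translate_simple_select(sql):
    """Translate 'SELECT <col> FROM <table> WHERE <field> > <n>' into a
    MongoDB-style (collection, filter, projection) triple. A real SQL
    connector parses the whole grammar; this handles one pattern to
    show the shape of the mapping."""
    m = re.fullmatch(
        r"SELECT (\w+) FROM (\w+) WHERE (\w+) > (\d+)", sql.strip())
    if not m:
        raise ValueError("unsupported query")
    col, table, field, n = m.groups()
    return table, {field: {"$gt": int(n)}}, {col: 1}

print(translate_simple_select("SELECT name FROM users WHERE age > 30"))
# ('users', {'age': {'$gt': 30}}, {'name': 1})
```

The appeal is obvious: analysts keep writing the SQL they know, while the query executes against a document or columnar store underneath.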

None of these efforts really speaks to NoSQL’s long-term advantage over the venerable RDBMS. Everybody wants to speak SQL because that’s where the primary body of skills resides, given decades of enterprise build-up around SQL queries. But the biggest benefit of NoSQL, and the one that RDBMSes have failed to master, according to Stirman, is its distributed architecture.

Jared Rosoff, chief technologist of Cloud Native Apps at VMware, underlines this point: “Even if all the databases converged on SQL as query language, the NoSQL crowd benefits from a fundamentally distributed architecture that is hard for legacy engines to replace.” He continues, “How long is it going to take MySQL or Postgres or Oracle or SQL Server to support a 100-node distributed cluster?”

Though both the RDBMS and NoSQL camps have their challenges with convergence, “It’s way easier for the NoSQL crowd to become more SQL-like than it is for the SQL crowd to become more distributed” and “a fully SQL compliant database that doesn’t scale that well” will be inferior to “a fully distributed database that supports only some subset of SQL.”

In short, SQL is very useful but replaceable. Distributed computing in our big data world, quite frankly, is not.

Winner take some

In this world of imperfect convergence, NoSQL seems to have the winning hand. But which NoSQL vendor will ultimately dominate?

Early momentum goes to MongoDB and DataStax-fueled Cassandra, but Stirman suggests a different winner entirely:

What the market really wants is an open source database that is easy to use and flexible like MongoDB, scales like Cassandra, is battle hardened like Oracle, all without changing their security and tooling. MongoDB is best positioned to deliver this, but AWS is most likely to capture the market long term.

Yes, AWS, the same company that most threatens to own the Hadoop market, not to mention enterprise infrastructure generally. Amazon, the dominant force in the public cloud, is best positioned to capitalize on the enterprise shift toward the cloud and the distributed applications that live there. Database convergence, in sum, may ultimately be Bezos’ game to lose.

Source: InfoWorld Big Data

AeroVironment's Quantix drone is all about the data

In the age of technology, businesses are all chasing efficiency. That’s exactly what AeroVironment promises to deliver with its new Quantix drone.

The technology, a combination of a drone and cloud-based analysis service, can be useful for farmers, says Steve Gitlin, vice president of corporate strategy at AeroVironment.

“In many cases, farmers rely on themselves or their people to walk the fields, and if they’re managing large fields in excess of 100 acres or so, then it’s very difficult to walk the entire field in any given unit of time. So they have to rely on their deep experience and sampling.”

Equipped with RGB and multispectral cameras, Quantix is capable of covering 400 acres of land during a single flight, all the while collecting high-resolution images. The data can be instantly analyzed on the included tablet, which is also used to launch and land the drone with the click of a button.
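AeroVironment hasn't published the details of its analysis, but the standard computation on multispectral crop imagery of this kind is the Normalized Difference Vegetation Index (NDVI), which compares near-infrared and red reflectance per pixel (the sample values below are made up):

```python
import numpy as np

def ndvi(nir, red):
    """NDVI from near-infrared and red reflectance bands -- a standard
    crop-health metric from multispectral imagery. Values near 1 mean
    dense, healthy vegetation; values near 0 mean bare soil."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / np.maximum(nir + red, 1e-9)  # avoid divide-by-zero

# Two hypothetical pixels: healthy crop (high NIR) vs. bare soil.
nir = np.array([0.50, 0.30])
red = np.array([0.08, 0.25])
print(ndvi(nir, red).round(2))       # [0.72 0.09]
```

Mapping an index like this over every pixel of a 400-acre flight is what turns raw images into the field-level guidance that replaces walking the rows.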

For a deeper analysis, customers can log into AeroVironment’s cloud service called Decision Support System (DSS), which is compatible with many of the company’s other unmanned systems.

Quantix takes off and lands vertically, making it easy to operate, but transitions to horizontal flight in the air, which gives it a longer range. In the United States, the Federal Aviation Administration still requires that drones fly in operators’ line of sight, but if the regulations are loosened, Quantix could be useful for pipeline, road, and power line inspections, Gitlin says, because the drone can cover 40 linear miles in less than an hour. 

Quantix will be available in the spring of 2017. A price has not yet been announced. 

Source: InfoWorld Big Data

10 things you need to worry about in 2017

Each year, including last year, I’ve supplied you with “areas of concern”—that is, stuff that might not go well for you or our comrades in the coming 12 months. I’m happy to oblige once again this year with 10 items that may go bump in the night.

Hadoop distributions

Big data, analytics, and machine learning are alive and well, and they’ll eventually transform business in most of the ways they’ve promised. But the big, fat Hadoop distribution is probably toast.

This isn’t to say everyone involved is in trouble, but we’re looking at more of an à la carte situation, or at least a buffet, where you don’t have to swallow the whole elephant. Burned by projects that never completed or met their promise in previous years, companies will be more reluctant to bite off the whole dish and instead look at what they’re trying to do and actually need at the infrastructure level. Technology companies that can adapt to this reality will make even more money.

Hadoop vendors

Three major Hadoop vendors along with big “do everything companies” (especially the Big Blue one) are in this game. We already saw Pivotal essentially exit. It’s hard to see the market continue to support three Hadoop vendors. See the above item to figure out who I’m betting on.

Oracle 

Oracle likes to buy companies. It helps make up for the fact that the core Oracle database is old and clunky, and Oracle doesn’t make anything new or great. If it buys something you use, expect the price to go up. Oracle loves the long tail, particularly entrenched, hard-to-remove, older technology. Once it’s in the company’s clutches, you get that famed Oracle technical support, too.

Databricks

Something will change at Databricks, the cloud company built around Spark, the open source distributed computing framework that has essentially supplanted Hadoop. While Spark is great, the Databricks business model isn’t as compelling, and it seems easily disrupted by one of the big three cloud vendors. The company is run by academics, and it needs hard-knuckled business types to sort out its affairs. I hope the change won’t be too disruptive to Spark’s development—and can be accomplished without hurt feelings, so we don’t lose progress.

Deregulation

Now that we have the Trumpocalypse to look forward to, you can expect “deregulation” of everything, from unlimited poison in your groundwater to the death of Net neutrality. Lest you think that will boost the tech economy, note that software vendors make big money selling compliance solutions, fewer of which will be necessary. Also, the Affordable Care Act (Obamacare) and electronic medical/health records have been a boon for tech. Some of Obamacare may remain, but very likely the digital transformation of health will be scaled way back.

Clinton’s plans had their own problems, but regardless of where you stand politically, the Trump presidency will hit us where it hurts—especially after California secedes. (Or will there be six Californias?)

Game consoles

How is this related to enterprise software? Well, the game industry is a good chunk of the tech sector, and some giants depend on console games as blockbusters. Game consoles are specialized computers with very specific programming models and guaranteed upgrades. Everyone is doing “pro” versions to get shorter-term revenue grabs—instead of waiting, say, seven years to sell new consoles—which comes at the cost of a stable platform that game developers can depend on.

Meanwhile, mobile games are huge, Steam keeps rising, and people are playing computer games again. I suspect this will start to depress the console business. Game developers will struggle with how many platforms they need to keep up with, and some giants will stumble.

Yet another hacking scandal

Once again, tech, government, and business will fail to learn the lesson that security can’t be bought and deployed like a product. They will persist in hiring the cheapest developers they can find, flail at project management, and suffer nonexistent or hapless QA. If a program runs, then it has stmt.execute(“select something from whatever where bla =”+ sql_injection_opportunity) throughout the code. That’s in business—government is at least 20 years behind. Sure, we’re giving Putin a big hug, but don’t expect him to stop hacking us.
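The anti-pattern named above — concatenating user input straight into the query string — and its fix look like this in Python's sqlite3 (a sketch; the table and payload are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'user')")

user_input = "x' OR '1'='1"          # classic injection payload

# Vulnerable pattern: concatenation lets the payload rewrite the query.
rows_bad = conn.execute(
    "SELECT name FROM users WHERE role = '" + user_input + "'").fetchall()
print(rows_bad)                      # every row leaks -- the OR is now SQL

# Safe pattern: a placeholder keeps the payload as data, not SQL.
rows_ok = conn.execute(
    "SELECT name FROM users WHERE role = ?", (user_input,)).fetchall()
print(rows_ok)                       # [] -- no role equals the literal string
```

Every mainstream database driver has supported placeholders like this for decades, which is exactly why shipping the concatenated version remains an unforced error.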

The economy

It seems like the Great Recession was just yesterday, but we’re due for another. At the same time, we don’t have a lot of big, new enterprise tech to brag about. I’m not saying it’s time to climb in the lifeboat, but you might want to make sure you have a safety net in case we’re hit with another downturn. My guess is it will be smaller than the dot-bomb collapse, so don’t fret too much.

Telco-cable mergers

With Google dialing back Google Fiber and an impending AT&T-Time Warner merger, our overpriced connections to the internet are unlikely to get cheaper—and speed increases will probably be less frequent.

Your math skills

Thanks to machine learning, it will be harder to command a six-figure developer salary without a mathematical background. As companies figure out what machine learning is and what it can do, before paying a premium for talent, they’ll start to require that developers understand probability, linear algebra, multivariable calculus, and all that junk. For garden-variety programming, they’ll continue to accelerate their plan to buy talent in “low-cost countries.”

Now let’s crank it to 11: As you may have heard, we’ve elected a narcissistic agent of the white supremacist (now rebranded “alt-right”) movement who doesn’t even know how to use a computer, and we’ve put him in charge of the nukes. This is going to be a disaster for everyone, of course, but for tech in particular if we all survive. But hey, next week I’ll try looking on the bright side.

Source: InfoWorld Big Data

Hadoop, we hardly knew ye

It wasn’t long ago that Hadoop was destined to be the Next Big Thing, driving the big data movement into every enterprise. Now there are clear signs that we’ve reached “peak Hadoop,” as Ovum analyst Tony Baer styles it. But the clearest indicator of all may simply be that “Hadoop” doesn’t actually have any Hadoop left in it.

Or, as InfoWorld’s Andrew Oliver puts it, “The biggest thing you need to know about Hadoop is that it isn’t Hadoop anymore.”

Nowhere is this more true than in newfangled cloud workloads, which eschew Hadoop for fancier options like Spark. Indeed, as with so much else in enterprise IT, the cloud killed Hadoop. Or perhaps Hadoop, by moving too fast, killed Hadoop. Let me explain.

Is Hadoop and the cloud a thing of the past?

The fall of Hadoop has not been total, to be sure. As Baer notes, Hadoop’s “data management capabilities are not yet being matched by Spark or other fit-for-purpose big data cloud services.” Furthermore, as Oliver describes, “Even when you’re not using Hadoop because you’re focused on in-memory, real-time analytics with Spark, you still may end up using pieces of Hadoop here and there.”

By and large, however, Hadoop is looking decidedly retro in these cloudy days. Even the Hadoop vendors seem to have moved on. Sure, Cloudera still tells the world that Cloudera Enterprise is “powered by Apache Hadoop.” But if you look at the components of its cloud architecture, it’s not Hadoop all the way down. IBM, for its part, still runs Hadoop under the hood of its BigInsights product line, but if you use its sexier new Watson Data Platform, Hadoop is missing in action.

The reason? Cloud, of course.

As such, Baer is spot on to argue, “The fact that IBM is creating a cloud-based big data collaboration hub is not necessarily a question of Spark vs. Hadoop, but cloud vs. Hadoop.” Hadoop still has brand relevance as a marketing buzzword that signifies “big data,” but its component parts (HDFS, MapReduce, and YARN) are largely cast aside for newer and speedier cloud-friendly alternatives as applications increasingly inhabit the cloud.

Change is constant, but should it be?

Which is exactly as it should be, argues Hadoop creator Doug Cutting. Though Cutting has pooh-poohed the notion that Hadoop has been replaced by Spark or has lost its relevance, he also recognizes the strength that comes from software evolution. Commenting on someone’s observation that Cloudera’s cloud stack no longer has any Hadoop components in it, Cutting tweeted: “Proof that an open source platform evolves and improves more rapidly. Entire stack replacement in a decade! Wonderful to see.”

It’s easy to overlook what a powerful statement this is. If Cutting were a typical enterprise software vendor, not only would he not embrace the implicit accusation that his Hadoop baby is ugly (requiring replacement), but also he’d do everything possible to lock customers into his product. Software vendors get away with selling Soviet-era technology all the time, even as the market sweeps past them. Customers locked into long-term contracts simply can’t or don’t want to move as quickly as the market does.

For an open source project like Hadoop, however, nothing inhibits evolution. In fact, the opposite is true: Sometimes the biggest problem with open source is that it moves far too quickly for the market to digest.

We’ve seen this to some extent with Hadoop, ironically. A year and a half ago, Gartner called out Hadoop adoption as “fairly anemic,” despite its outsized media attention. Other big data infrastructure quickly marched past it, including Spark, MongoDB, Cassandra, Kafka, and more.

Yet there’s a concern buried in this technological progress. One of the causes of Hadoop’s market adoption anemia has been its complexity. Hadoop skills have always fallen well short of Hadoop demand. Such complexity is arguably exacerbated by the fast-paced evolution of the big data stack. Yes, some of the component parts (like Spark) are easier to use, but not if they must be combined with an ever-changing assortment of other component parts.

In this way, we might have been better off with a longer shelf life for Hadoop, as we’ve had with Linux. Yes, in Linux the modules are constantly changing. But there’s a system-level fidelity that has enabled a “Linux admin” to actually mean something over decades, whereas keeping up with the various big data projects is much more difficult. In short, rapid Hadoop evolution is both testament to its flexibility and cause for concern.

Source: InfoWorld Big Data

Review: Spark lights up machine learning


As I wrote in March of this year, the Databricks service is an excellent product for data scientists. It has a full assortment of ingestion, feature selection, model building, and evaluation functions, plus great integration with data sources and excellent scalability. The Databricks service provides a superset of Spark as a cloud service. Databricks the company was founded by the original developer of Spark, Matei Zaharia, and others from U.C. Berkeley’s AMPLab. Meanwhile, Databricks continues to be a major contributor to the Apache Spark project.

In this review, I’ll discuss Spark ML, the open source machine learning library for Spark. To be more accurate, Spark ML is the newer of two machine learning libraries for Spark. As of Spark 1.6, the DataFrame-based API in the Spark ML package was recommended over the RDD-based API in the Spark MLlib package for most functionality, but was incomplete. Now, as of Spark 2.0, Spark ML is primary and complete and Spark MLlib is in maintenance mode.

Source: InfoWorld Big Data