3 big data platforms look beyond Hadoop

A distributed file system, a MapReduce programming framework, and an extended family of tools for processing huge data sets on large clusters of commodity hardware, Hadoop has been synonymous with “big data” for more than a decade. But no technology can hold the spotlight forever.

While Hadoop remains an essential part of these big data platforms, the major Hadoop vendors—namely Cloudera, Hortonworks, and MapR—have changed their platforms dramatically. Once-peripheral projects like Apache Spark and Apache Kafka have become the new stars, and the focus has turned to other ways to drill into data and extract insight.

Let’s take a brief tour of the three leading big data platforms, what each adds to the mix of Hadoop technologies to set it apart, and how they are evolving to embrace a new era of containers, Kubernetes, machine learning, and deep learning.

Cloudera Enterprise Data Hub

Cloudera was the first to market with a Hadoop distribution—not surprising given that its core team consisted of engineers who had leveraged Hadoop in places like Yahoo, Google, and Facebook. Hadoop co-creator Doug Cutting serves as chief architect. 

Data lakes: Just a swamp without data governance and catalog

The big data landscape has exploded in an incredibly short amount of time. It was just in 2013 that the term “big data” was added to the pages of the Oxford English Dictionary. Fewer than five years later, 2.5 quintillion bytes of data are being generated every day. In response to such vast amounts of raw data, many businesses rushed to deploy large-scale storage solutions such as data warehouses and data lakes without much thought.

On the surface, modern data lakes hold an ocean of possibility for organizations eager to put analytics to work. They offer a storage repository for those pursuing transformative data initiatives and capturing vast amounts of data from disparate sources (including social, mobile, cloud applications, and the internet of things). Unlike the old data warehouse, the data lake holds “raw” data in its native format, including structured, semistructured, and unstructured data. The data structure and requirements are not defined until the data is needed.
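To make that schema-on-read idea concrete, here is a minimal sketch using PySpark: raw JSON lands in the lake untouched, and a structure is declared only when the data is queried. The lake path and the field names (user_id, action, ts) are hypothetical.

```python
# Minimal schema-on-read sketch with PySpark. The lake path and the
# field names (user_id, action, ts) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Raw JSON events were dropped into the lake as-is; no schema was
# imposed when the data was written.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

# The structure is declared only now, at read time, when the data
# is actually needed for a specific analysis.
events = spark.read.schema(schema).json("s3://my-lake/raw/events/")
events.where(events.action == "purchase").show()
```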

One of the most common challenges organizations face with their data lakes, though, is the inability to find, understand, and trust the data they need to generate business value or gain a competitive edge. That’s because the data, in its native format, might be gibberish—or even conflicting. A data scientist who wants to access enterprise data for modeling, or to deliver insights to analytics teams, is forced to dive into the depths of the data lake and wade through the murkiness of undefined data sets from multiple sources. As data becomes an increasingly important tool for businesses, this scenario is clearly not sustainable in the long run.

To be clear, for businesses to effectively and efficiently maximize the data stored in their data lakes, they need to add context to it by implementing policy-driven processes that classify and identify what information is in the lake, why it’s there, what it means, who owns it, and who is using it. This is best accomplished through data governance integrated with a data catalog. Once that is done, the murky data lake becomes crystal clear, particularly for the users who need it most.

Avoiding the data swamp

The potential of big data is virtually limitless. It can help businesses scale more efficiently, gain an advantage over their competitors, enhance customer service, and more. It may seem that the more data an organization has at its fingertips, the better. Yet that’s not necessarily the case—especially if that data is hidden in the data lake with no governance in place. A data lake without data governance will ultimately end up as a collection of disconnected data pools or information silos—just all in one place.

Data dumped into a data lake has no business value without structure, processes, and rules around it. Ungoverned, uncataloged data leaves businesses vulnerable. Users won’t know where the data comes from, where it’s been, with whom they can share it, or whether it’s certified. Regulatory and privacy compliance risks are magnified, and data definitions can change without any user’s knowledge. The data may be impossible to analyze, or may be used inappropriately, because it is inaccurate or missing context.

The impact: stakeholders won’t trust results gathered from the data. A lack of data governance transforms a data lake from a business asset to a murky business liability.

The value of a data catalog in maintaining a crystal-clear data lake

The tremendous volume and variety of big data across an enterprise make it difficult to understand the data’s origin, format, and lineage, and how it is organized, classified, and connected. Because data is dynamic, understanding all of its features is essential to its quality, usage, and context. Data governance provides systematic structure and management for the data residing in the data lake, making it more accessible and meaningful.

An integrated data governance program that includes a data catalog turns a dark, gloomy data lake into a crystal-clear body of data that can be consistently consumed, analyzed, and used, allowing a wide audience of users to glean new insights and solve problems across their organization. A data catalog’s tagging system methodically unites all the data through the creation and implementation of a common language, which encompasses data and data sets, glossaries, definitions, reports, metrics, dashboards, algorithms, and models. This unifying language allows users to understand the data in business terms, while also establishing relationships and associations between data sets.

Data catalogs make it easier for users to drive innovation and achieve groundbreaking results. Users are no longer forced to play hide-and-seek in the depths of a data lake to uncover data that fits their business purpose. Intuitive data search through a data catalog enables users to find and “shop” for data in one central location, using familiar business terms and filters that narrow results to isolate the right data. Much as sites like Amazon.com do, enhanced data catalogs incorporate machine learning that learns from past user behavior to recommend other valuable data sets for users to consider. Data catalogs can even alert users when data relevant to their work is ingested into the data lake.

A data catalog combined with governance also ensures the trustworthiness of the data. A data lake with governance provides assurance that the data is accurate, reliable, and of high quality. The catalog authenticates the data stored in the lake using structured workflows and role-based approvals of data sources. And it helps users understand the data’s journey—its source, lineage, and transformations—so they can assess its usefulness.
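To make the tagging, lineage, and approval ideas above concrete, here is a toy sketch in Python of what a catalog entry might carry and how tag-based search could work. The fields and the example record are hypothetical illustrations, not the data model of any particular catalog product.

```python
# Toy data-catalog sketch: entries carry business tags, ownership,
# lineage, and an approval status, and can be searched by tag.
# All fields and the example record are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str                                    # data set name in business terms
    location: str                                # where it lives in the lake
    owner: str                                   # who is accountable for it
    lineage: list = field(default_factory=list)  # upstream sources
    tags: set = field(default_factory=set)       # glossary terms
    approved: bool = False                       # passed role-based review?

class DataCatalog:
    def __init__(self):
        self.entries = []

    def register(self, entry: CatalogEntry) -> None:
        self.entries.append(entry)

    def search(self, tag: str, approved_only: bool = True) -> list:
        """Let users 'shop' for data sets by familiar business terms."""
        return [e for e in self.entries
                if tag in e.tags and (e.approved or not approved_only)]

catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="Customer Orders",
    location="s3://my-lake/curated/orders/",
    owner="sales-ops",
    lineage=["s3://my-lake/raw/orders/"],
    tags={"orders", "revenue"},
    approved=True,
))
print([e.name for e in catalog.search("revenue")])  # ['Customer Orders']
```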

A data catalog helps data citizens (anyone within the organization who uses data to perform their job) gain control over the glut of information stuffed into their data lakes. By indexing the data and linking it to agreed-upon definitions about quality, trustworthiness, and use, a catalog helps users determine which data is fit to use—and which they should discard because it’s incomplete or irrelevant to the analysis at hand.

Whether users are looking to preview sample data or to determine how new data projects might impact downstream processes and reports, a data catalog gives them confidence that they’re using the right data and that it adheres to provider and organizational policies and regulations. Added protections allow sensitive data to be flagged within a data lake, and security protocols can prevent unauthorized users from accessing it.

Realizing data’s potential requires more than just collecting it in a data lake. Data must be meaningful, consistent, clear, and, most important, cataloged for the users who need it most. Proper data governance and a first-rate data catalog will transform your data lake from a simple data repository into a dynamic tool and collaborative workspace that empowers digital transformation across your enterprise.

How to get real value from big data in the cloud

According to a recent report from IDC, “worldwide revenues for big data and business analytics will grow from nearly $122 billion in 2015 to more than $187 billion in 2019, an increase of more than 50 percent over the five-year forecast period.”

Anyone in enterprise IT already knows that big data is a big deal. If you can manage and analyze massive amounts of data—I’m talking petabytes—you’ll have access to all sorts of information that will help you run your business better. 

Right? Sadly, for most enterprises, no. 

Here are some hard facts: Cloud computing made big data affordable. Before, you would have had to build a new datacenter to house the consolidated data. Now you can consolidate data in the cloud at bargain prices.

How’s that working out? I’m finding that it’s one thing to have both structured and unstructured data in a central location. It’s another thing to make good use of that data for both tactical and strategic reasons.

Too often, enterprises pull together the data but don’t know what to do with it. They lack a systemic understanding of the business opportunities and values that could be gained by leveraging this data. 

What’s often lacking is a data plan. I recommend that every enterprise have a completed data plan before the data is even consolidated in the cloud. This means having a clear and detailed set of use cases for the data (including purpose and value), as well as a list of tools and technologies (such as machine learning and data analytics) that will be used to get the business value out of the data.

The data plan needs to be done before the consolidation for several reasons:

  • Know what data will be leveraged for analytical purposes. I find that some of the data that gets consolidated is not needed. You end up paying for database storage with no sound business purpose, and you hurt analysis performance because the unnecessary data must be processed as well. 
  • Understand the meaning of the data, including metadata. This assures that you’re analyzing the right data for the use cases. 
  • Consider a performance plan. If you sort through petabytes of data, that’s a lot of time and cloud dollars spent. How can you optimize? (See the sketch after this list.) 
  • Have a sound list of data analytics tools. Although many enterprises purchase the most popular tools, you may find that your big data journey takes you to less popular technology that is a better fit. Be sure to explore the market before deciding on your tool set.
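As a rough illustration of the performance point above, the PySpark sketch below scans one partition and two columns instead of the entire lake. The paths, partition column (event_date), and field names are hypothetical.

```python
# Sketch: limit what gets scanned rather than sorting through petabytes.
# The paths, partition column (event_date), and fields are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scoped-analysis").getOrCreate()

orders = spark.read.parquet("s3://my-lake/curated/orders/")

# Partition and column pruning: touch one day of data and two columns,
# not every byte in the lake, which cuts both runtime and cloud spend.
daily_revenue = (
    orders
    .where(F.col("event_date") == "2018-06-01")
    .select("region", "amount")
    .groupBy("region")
    .agg(F.sum("amount").alias("revenue"))
)
daily_revenue.show()
```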

A little planning goes a long way. Your business is worth that investment. 
