IDG Contributor Network: Use the cloud to create open, connected data lakes for AI, not data swamps

Produced by every organization, data is the common denominator across industries as we look to advance how cloud and AI are incorporated into our operations and daily lives. Before the potential of cloud-powered data science and AI can be fully realized, however, we first face the challenge of grappling with the sheer volume of data. That means figuring out how to turn its velocity and mass from an overwhelming firehose into an organized stream of intelligence.

To capture all the complex data streaming into their systems from various sources, businesses have turned to data lakes. Often hosted in the cloud, these are storage repositories that hold enormous amounts of data, raw or refined, structured or unstructured, until it is ready to be analyzed. The concept seems sound: the more data companies can collect, the less likely they are to miss important patterns and trends in it.

However, a data scientist will quickly tell you that the data lake approach is a recipe for a data swamp, and there are a few reasons why. First, much of the data is hastily stored, without a consistent strategy for how to organize, govern, and maintain it. Think of the junk drawer at home: items get thrown in at random over time, until it becomes nearly impossible to find what you’re looking for because it has been buried.

This disorganization leads to the second problem: users often cannot find a dataset once it has been ingested into the data lake. Without a way to easily search for data, it’s nearly impossible to discover and use it, which makes it difficult for teams to keep data within compliance or get it to the right knowledge workers. Together, these problems create a breeding ground for dark data: unorganized, unstructured, and unmanageable data.

Many companies have invested in growing their data lakes, but what they soon realize is that having too much information is an organizational nightmare. Multiple channels of data in a wide range of formats can cause businesses to quickly lose sight of the big picture and how their datasets connect.

Compounding the problem further, incomplete or inadequate datasets add even more noise when data scientists are searching for specific data. It’s like trying to solve a riddle without a critical clue. The result is a major issue: data scientists spend, on average, only 20 percent of their time on actual data analysis and 80 percent of their time finding, cleaning, and reorganizing data.

The power of the cloud

One of the most promising elements of the cloud is that it offers capabilities to reach across open and proprietary platforms to connect and organize all a company’s data, regardless of where it resides. This equips data science teams with complete visibility, helping them to quickly find the datasets they need and better share and govern them.

Accessing and cataloging data via the cloud also offers the ability to use and connect into new analytical techniques and services, such as predictive analytics, data visualization and AI. These cloud-fueled tools help data to be more easily understood and shared across multiple business teams and users—not just data scientists.

It’s important to note that the cloud has evolved. Preliminary cloud technologies required some assembly and self-governance, but today’s cloud allows companies to subscribe to an instant operating system in which data governance and intelligence are native. As a result, data scientists can get back to what’s important: developing algorithms, building machine learning models, and analyzing the data that matters.

For example, an enterprise can augment its data lake with cloud services that use machine learning to classify and cleanse incoming datasets, organizing and preparing them for ingestion into AI applications. The metadata from this process builds an index of all data assets, and data stewards can apply governance policies to ensure that only authorized users can access sensitive resources.
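
As a rough illustration of that catalog-and-govern pattern, here is a minimal Python sketch, not any vendor’s API: a classification assigned by a machine learning model or a data steward is recorded as metadata, the metadata is indexed so datasets can be found, and a simple policy check gates access to sensitive assets. Every class, field, and role name below is hypothetical.

    # Hypothetical sketch: classify, index, and govern datasets in a data lake.
    from dataclasses import dataclass, field

    @dataclass
    class DatasetMetadata:
        name: str
        source: str
        tags: set = field(default_factory=set)    # e.g. {"insurance", "pii"}
        classification: str = "unclassified"      # assigned by an ML classifier or a steward

    class DataCatalog:
        """A toy metadata index; real catalogs persist this in a shared service."""

        def __init__(self):
            self._index = {}                       # dataset name -> metadata

        def register(self, meta: DatasetMetadata):
            """Index a newly ingested dataset so it can be discovered later."""
            self._index[meta.name] = meta

        def search(self, tag: str):
            """Findability: return every dataset carrying a given tag."""
            return [m for m in self._index.values() if tag in m.tags]

        def can_access(self, user_roles: set, name: str) -> bool:
            """Governance: only authorized roles may read sensitive datasets."""
            meta = self._index[name]
            if meta.classification == "sensitive":
                return bool(user_roles & {"data_steward", "approved_analyst"})
            return True

    catalog = DataCatalog()
    catalog.register(DatasetMetadata(
        name="claims_2017", source="s3://lake/raw/claims",
        tags={"insurance", "pii"}, classification="sensitive"))
    print([m.name for m in catalog.search("pii")])            # ['claims_2017']
    print(catalog.can_access({"marketing"}, "claims_2017"))   # False

In a real data lake the index would live in a metadata catalog service rather than in memory, but the flow is the same: classify, index, then enforce policy at the point of access.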

These actions set a data-driven culture in motion by giving teams the ability to access the right data at the right time. In turn, this gives them the confidence that all the data they share will only be viewed by appropriate teams.

Disillusioned with data? You’re not the only one

Even with cloud services and the right technical infrastructure, different teams are often reluctant to share their data. It’s all about trust. Most data owners are worried about a lack of data governance—the management of secure data—since they have no way of knowing who will use their data, or how they will use it. Data owners don’t want to take this risk, so they choose to hold onto their data, rather than share it or upload it into the data lake.

This can change. By shifting the focus from restricting the use of data to enabling access, sharing, and reuse, organizations will realize the value that good governance and strong security deliver to a data lake, which can then serve as the intelligent backbone of every decision and initiative a company undertakes.

Overall, the amount of data that enterprises need to collect and analyze will continue to grow unabated. If nothing is done differently, so will the problems associated with it. Instead, there needs to be a material change in the way people think about solving complex data problems. It starts with solving data findability, management, and governance issues through a detailed data index. That way, data scientists can navigate the deepest depths of their data lakes and unlock the value of organized, indexed data—the foundation for AI innovation.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

Spark tutorial: Get started with Apache Spark

Apache Spark has become the de facto standard for processing data at scale, whether for querying large datasets, training machine learning models to predict future trends, or processing streaming data. In this article, we’ll show you how to use Apache Spark to analyze data in both Python and Spark SQL. We’ll also extend our code to support Structured Streaming, the current state of the art for handling streaming data within the platform. We’ll be using Apache Spark 2.2.0 here, but the code in this tutorial should also work on Spark 2.1.0 and above.
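
As a preview of the kind of code this tutorial builds toward, here is a minimal PySpark sketch that runs the same aggregation twice, once with the DataFrame API and once in Spark SQL. The file path and column names are placeholders, not a dataset used later in the article.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # The SparkSession is the entry point to DataFrame and SQL functionality in Spark 2.x.
    spark = SparkSession.builder.appName("spark-tutorial").getOrCreate()

    # Load a CSV file into a DataFrame; the path and schema here are placeholders.
    df = spark.read.csv("data/events.csv", header=True, inferSchema=True)

    # DataFrame API: count events per category.
    df.groupBy("category").agg(F.count("*").alias("events")).show()

    # Spark SQL: the same query against a temporary view.
    df.createOrReplaceTempView("events")
    spark.sql("SELECT category, COUNT(*) AS events FROM events GROUP BY category").show()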

How to run Apache Spark

Before we begin, we’ll need an Apache Spark installation. You can run Spark in a number of ways. If you’re already running a Hortonworks, Cloudera, or MapR cluster, you might have Spark installed already; if not, you can install it easily through Ambari, Cloudera Manager, or the MapR custom packages.

If you don’t have such a cluster at your fingertips, then Amazon EMR or Google Cloud Dataproc are both easy ways to get started. These cloud services allow you to spin up a Hadoop cluster with Apache Spark installed and ready to go. You’ll be billed for compute resources with an extra fee for the managed service. Remember to shut the clusters down when you’re not using them!

Of course, you could instead download the latest release from spark.apache.org and run it on your own laptop. You will need a Java 8 runtime installed (Java 7 will work, but is deprecated). Although you won’t have the compute power of a cluster, you will be able to run the code snippets in this tutorial.
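
If you take the laptop route, a local SparkSession is all you need to follow along. The short sketch below simply verifies the installation by running Spark in local mode against a DataFrame built in memory; the application name is arbitrary.

    from pyspark.sql import SparkSession

    # "local[*]" runs Spark on the local machine, using all available cores.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-check")
             .getOrCreate())

    # A tiny sanity check: build a DataFrame in memory and run a query against it.
    df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
    df.filter(df.id > 1).show()

    spark.stop()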

Source: InfoWorld Big Data

SolarWinds Updates Its SaaS Portfolio

SolarWinds has announced an all-new, breakthrough product and two advanced product updates in a major evolution of its SolarWinds Cloud® Software as a Service (SaaS) portfolio. The new offerings expand the company’s current capabilities for comprehensive, full-stack monitoring with the introduction of AppOptics™, a new application and infrastructure monitoring solution; significant updates to Papertrail™, providing faster search speeds and new log velocity analytics; and enhanced digital experience monitoring (DEM) functionality within Pingdom®.

Collectively, the new SolarWinds Cloud portfolio gives customers broad and unmatched visibility into logs, metrics, and tracing, as well as the digital experience. It will enable developers, DevOps engineers, and IT professionals to simplify and accelerate management and troubleshooting, from the infrastructure and application layers to the end-user experience. In turn, it will allow customers to focus on building the innovative capabilities businesses need for today’s on-demand environments.

“Application performance and the digital experience of users have a direct and significant impact on business success,” said Christoph Pfister, executive vice president of products, SolarWinds. “With the stakes so high, the ability to monitor across the three pillars of observability — logs, metrics, and tracing — is essential. SolarWinds Cloud offers this comprehensive functionality with industry-best speed and simplicity. With AppOptics and the enhancements to Papertrail and Pingdom, we’re breaking new ground by delivering even greater value to our customers in an incredibly powerful, disruptively affordable SaaS portfolio.”

AppOptics: Simple, unified monitoring for the modern application stack

Available today, AppOptics addresses the challenges customers face when forced to use disparate solutions for application and infrastructure performance monitoring. To do so, it offers broad application performance monitoring (APM) language support with auto-instrumentation, distributed tracing functionality, and a host agent supported by a large open community, enabling expanded infrastructure monitoring capabilities and comprehensive visibility through converged dashboards.

For a unified view, AppOptics’ distributed tracing, host and IT infrastructure monitoring, and custom metrics all feed the same dashboarding, analytics, and alerting pipelines. SolarWinds designed the solution to simplify and unify the management of complex modern applications, infrastructure, or both. This allows customers to solve problems and improve performance across the application stack, in an easy-to-use, as-a-service platform.

For application performance monitoring, the powerful distributed tracing functionality can follow requests across any number of hosts, microservices, and languages without manual instrumentation. Users can move quickly from visualizing trends to deep, code-level, root cause analysis.

AppOptics bridges the traditional divide between application and infrastructure health metrics with unified dashboards, alerting, and management features. The host agent runs Snap™ and Telegraf™ plug-ins, enabling drop-in monitoring of key systems. The solution integrates with a wide range of systems to support the heterogeneous infrastructure environments dominating today’s IT landscape.

AppOptics serves as a highly extensible custom metrics and analytics platform that brings together application, infrastructure, and business data to deliver deep insights and fast problem resolution. Finally, with pricing starting at $7.50 USD per host per month, AppOptics delivers an unmatched combination of deep functionality and affordability, a breakthrough that makes powerful application performance monitoring accessible to virtually all organizations.

Papertrail: Faster, smarter troubleshooting with log velocity analytics and ‘lightning search’

Papertrail is a cloud-hosted log management solution that helps users troubleshoot infrastructure and application problems. The latest version introduced today includes log velocity analytics, which can instantly visualize log patterns and help identify anomalies. For example, customers now can visualize an increase in total logs sent by a server, a condition that could indicate imminent failure, or something out of the norm.

Also, new to Papertrail is “lightning search,” which will enable developers, support engineers, and systems administrators to search millions or billions of log messages faster than ever before, and then immediately act on information found within the log messages. Together, Papertrail’s latest enhancements empower customers to troubleshoot complex problems, error messages, application server errors, and slow database queries, faster and smarter, with full visibility across all logs.

Pingdom digital experience monitoring

Research firm Gartner estimates that, “by 2020, 30 percent of global enterprises will have strategically implemented DEM technologies or services, up from fewer than 5 percent today.” Pingdom, a market leader in the DEM arena, helps make websites faster and more reliable with powerful, easy-to-use uptime and performance monitoring functionality. Available on November 27, the Pingdom solution’s latest enhancements for digital experience monitoring include three new dashboard views that provide the ability to continuously enhance user experience on websites or web applications:

  • Sites View: Customers can quickly locate a user experience issue on any monitored website
  • Experience View: Customers can filter users and identify those affected by performance issues
  • Performance View: Customers can explore the technical cause of an issue and quickly and easily identify opportunities for performance improvements

The latest updates to the Pingdom solution’s digital experience monitoring will empower customers to know first when issues affect their site visitors’ experience, and quickly surface critical information needed to enhance the overall experience.

SolarWinds Cloud: The next evolution of SaaS-based full-stack monitoring

Today’s announcement of SolarWinds Cloud is another important milestone in the company’s drive to deliver a set of comprehensive, simple, and disruptively affordable full-stack monitoring solutions built upon a common, seamlessly integrated, SaaS-based platform. Since 2014, SolarWinds has dramatically expanded its cloud portfolio and capabilities through a series of acquisitions, while making significant progress integrating these acquired solutions, including Pingdom, Librato®, Papertrail, and TraceView™, under a common sales and operational model.

AppOptics builds on the technology and feedback SolarWinds put into Librato and TraceView since their introductions. Now, the company has integrated and enhanced this functionality within a single solution, taking another big step forward in advancing its strategy to unify full-stack monitoring across the three pillars of observability on a common SaaS-based platform.  SolarWinds’ ultimate goal is to enable a single view of infrastructure, applications, and digital experience, which will help customers solve their most complex performance and reliability problems quickly, with unexpected simplicity and industry-leading affordability.

 

Source: CloudStrategyMag

Cambridge Semantics Announces Semantic Layer For Multi-Cloud Environments

Cambridge Semantics has announced multi-cloud support for Anzo Smart Data Lake (SDL) 4.0, its flagship product that brings business meaning to all enterprise data.

Incorporating several new technical advancements designed to deliver a generational shift beyond current data lake, data management, and data analytics offerings, Anzo SDL 4.0 now supports all three major cloud platforms — Google Cloud Platform, Amazon Web Services (AWS), and Microsoft Azure. The vision for multi-cloud capability enabled by Anzo will allow enterprises to choose whatever combination of on-premises, hybrid cloud, or multi-cloud solutions makes the most sense for their business environment.

“Organizations today view their data assets as key business drivers for competitive advantage,” said Sean Martin, CTO of Cambridge Semantics.  “However, for many, the cost of running analytic solutions is drastically increasing, while speed-to-deployment remains a major challenge. Therefore, we are seeing an accelerated movement to the cloud and its variable cost model.”

As renting on-demand computing power becomes the largest cost center in most enterprise connected data analytics and machine learning programs, many enterprises are actively planning to work with multiple cloud vendors so they can take advantage of price fluctuations in today’s increasingly commoditized cloud computing market, according to Martin.

Cambridge Semantics’ multi-cloud model for service-based cloud compute consumption is abstracted and completely automated, eliminating cloud infrastructure provider lock-in and securely shifting compute consumption among the different cloud vendors dynamically to achieve the most competitive pricing at any given moment.

“Our customers want to be able to decide where to place their analytics compute spend globally on an hour-by-hour or even a minute-by-minute basis,” Martin said. “Not only does our open standards-based semantic layer provide business understandable meaning to all enterprise data, but the same metadata driven approach is essential in enabling customers to describe the policies that determine where that data is both securely stored and processed.”

“Cambridge Semantics offers the only semantically-driven smart data lake big data management and connected data analytics solution that entirely insulates enterprises from the different cloud vendor APIs,” Martin said. “It won’t be long before our customers will be able to see multiple vendors’ quotes for exactly how much the same analytics dashboard is going to cost them to compute before they click the button to select the cloud provider that will get to run that specific job.”

 

Source: CloudStrategyMag

IDG Contributor Network: A speedy recovery: the key to good outcomes as health care’s dependence on data deepens

It may have been slow to catch on compared to other industries, but the health care sector has developed a voracious appetite for data. Digital transformation topped the agenda at this year’s Healthcare Information and Management Systems Society (HIMSS) conference in Florida, and big data analytics in health care is on track to be worth more than $34 billion globally within the next five years—possibly sooner.

Electronic health records are growing in importance to enable more interdisciplinary collaboration, speed up communication on patient cases, and drive up the quality of care. Enhanced measurement and reporting have become critical for financial management and regulatory compliance, and to protect organizations from negligence claims and fraud. More strategically, big data is spurring new innovation, from smart patient apps to complex diagnostics driven by machine learning. Because of their ability to crunch big data and build knowledge at speed, computers could soon take over from clinicians in identifying patient conditions—in contrast to doctors relying on their clinical experience to determine what’s wrong.

But as health care providers come to rely increasingly on their IT systems, their vulnerability to data outages grows exponentially. If a planned surgery can’t go ahead due to an inability to look up case information, lab results, or digital images, the patient’s life might be put at risk.

Symptoms of bigger issues

Even loss of access to administrative systems can be devastating. The chaos inflicted across the UK National Health Service in May following an international cyberattack—which took down 48 of the 248 NHS trusts in England—gave a glimpse into health care’s susceptibility to paralysis when key systems become inaccessible, even for a short time. In the NHS’s case, out-of-date security settings were to blame for leaving systems at risk. But no one is immune to system downtime, as was highlighted recently by the outage at British Airways, which grounded much of its fleet for days, at great cost, not to mention severe disruption for passengers.

Although disastrous events like these instill fear in CIOs, they can—and should—also serve as a catalyst for positive action. The sensible approach is to design data systems for failure—for times when, like patients, they are not firing on all cylinders. Even with the best intentions, biggest budgets and most robust data center facilities in the world, something will go wrong at some point according to the law of averages. So, it’s far better to plan for that than to assume an indefinitely healthy prognosis.

If the worst happens, and critical systems go down, recovery is rarely a matter of switching over to backup infrastructure and data—particularly if we’re talking about live records and information, which are currently in use and being continuously updated. Just think of the real-time monitoring of the vital signs of patients in intensive care units.

If a contingency data-set exists (as it should) in another location, the chances are that the original and the backup copy will be out of sync for much of the time, because of ongoing activity involving those records. In the event of an outage, the degree to which data is out of step will have a direct bearing on the organization’s speed of recovery.

To ensure continuous care and patient safety, health care organizations need the fastest possible recovery time. But how many organizations have identified and catered for this near-zero tolerance for downtime in their contingency provisions?

Emergency protocol

The issue must be addressed as data becomes an integral part of medical progress. Already, data is not just a key to better operational and clinical decisions, but also an intrinsic part of treatments—for example in processing the data that allows real-time control and movement in paralyzed patients. Eventually, these computer-assisted treatments will also come to rely on external servers, because local devices are unlikely to have the computing power to process all the data. They too will need live data backups to ensure the continuity and safety of treatment.

On a broader scale, data looks set to become pivotal to new business models (for example, determining private health care charges based on patient outcomes, otherwise known as “value-based medicine”).

While technology companies will be pulling out all the stops to keep up with these grander plans, maintaining live data continuity is already possible. So that’s one potential barrier to progress that can be checked off the list.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data