IDG Contributor Network: Are you treating your data as an asset?

It’s a phrase we constantly hear, isn’t it? Data is a crucial business asset from which we can extract value and gain competitive advantage. Those who use data well will be the success stories of the future.

This got me thinking: If data is such a major asset, why do we hear so many stories about data leaks? Would these companies be quite so loose with other assets? You don’t hear about businesses losing hundreds of company cars or half a dozen buildings, do you?

If data is a potential asset, why aren’t companies treating it as such?

The reality is many businesses don’t treat data as an asset. In fact, it’s treated so badly there is increasing regulation forcing organizations to take better care of it. These external pressures have the potential to provide significant benefits, forcing a change in the way data is viewed across organizations from top to bottom. Forcing data to be treated as the asset it is.

If you can start to treat data as an asset, you can put yourself in a position where data really can provide a competitive advantage.

Where to start?

Clean up the mess

Do you have too little data in your organization? Probably not. In data discussion groups, a common refrain is that companies “have too much” and “it’s out of control.” Organizations are spending more and more resources on storing, protecting and securing it, but it’s not only the cost of keeping data that’s a problem. Tightening regulation will force you to clean up what you have.

It’s not an asset if you just keep collecting it and never do the housekeeping and maintenance that you should with any asset. If you don’t look after it, you will find it very difficult to realize value.

Your starting point is to ask yourself what you have, why you have it, and why you need it.

Gain some control

I talk regularly with people about the what, where, who and why of data. Understanding this will allow you to start to gain control of your asset.

Once it’s decided what your organization should have, and what you should be keeping, you need to understand exactly what you do have and, importantly, where it is stored: in data centers, on laptops, on mobile devices or with cloud providers.

Next, the who and why. What other business asset does your company own that you wouldn’t know who’s using it and why? Yet companies seem to do this with data all the time. Look inside your own organization: Do you have a full understanding of who’s accessing your data…and why?

To treat our data like an asset, it’s crucial to understand how our data is being treated.
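One way to make the what, where, who and why concrete is to keep a simple inventory record for each dataset. The following is a minimal sketch in Python; the `InventoryEntry` class, its field names and the sample entries are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryEntry:
    """One row of a hypothetical data inventory: what, where, who, why."""
    what: str                                 # description of the dataset
    where: str                                # storage location (data center, laptop, cloud...)
    who: list = field(default_factory=list)   # people or teams known to access it
    why: str = ""                             # business reason for keeping it


def unexplained(entries):
    """Return entries with no recorded user or no recorded purpose."""
    return [e for e in entries if not e.who or not e.why.strip()]


# Illustrative sample data only.
inventory = [
    InventoryEntry("customer orders", "cloud provider", ["sales"], "billing and reporting"),
    InventoryEntry("old marketing exports", "laptop", [], ""),
]

for entry in unexplained(inventory):
    print(f"Review needed: {entry.what} ({entry.where})")
```

Entries that cannot answer the who or the why are the natural candidates for the cleanup described earlier.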

Build it the right home

As with any asset, data needs the right environment in which to thrive. Your organization no doubt offers decent working conditions for your employees, has a parking lot, provides regular maintenance for your car fleets and so on, doesn’t it? The same should be true for your data.

Consider your data strategy. Is it focused on the storage, media type or a particular vendor? Or are you building a modern, forward-thinking strategy focused on the data itself, and not the technology? This includes looking at how to ensure data is never siloed, can be placed in the right repository as needed, and can move seamlessly between repositories, whether on-prem, in the cloud or elsewhere. Is your data always available? Can it be recovered quickly?

Build a strategy with a focus on the asset itself: the data.

Be ready to put it to work

To truly treat data as an asset, be prepared to sweat it like you would any other. If you can apply the things I’ve mentioned—cleanse it, gain control of it, have a data-focused strategy and have the right data in the right place—you can start to take advantage of tools that will allow you to gain value from it.

The ability to apply data analytics, machine learning, artificial intelligence and big data techniques to your assets allows you to not only understand your data better, but to begin to learn things from your data that you’d never previously been aware of…which is the most exciting opportunity data presents you.


All the above said, perhaps the best thing you can do for your data is to encourage a culture that is data-focused, one that realizes the importance of security and privacy, as well as understanding that data is crucial to your organization’s success.

If you can encourage and drive that cultural shift, there is every chance that your data will be treated as the asset it truly is—and you and your organization will be well-placed to reap the rewards that taking care of your data can bring.

This article is published as part of the IDG Contributor Network. Want to Join?

Source: InfoWorld Big Data

Azure Databricks: Fast analytics in the cloud with Apache Spark

We’re living in a world of big data. The current generation of line-of-business computer systems generate terabytes of data every year, tracking sales and production through CRM and ERP. It’s a flood of data that’s only going to get bigger as we add the sensors of the industrial internet of things, and the data that’s needed to deliver even the simplest predictive-maintenance systems.

Having that data is one thing; using it is another. Big data is often unstructured, spread across many servers and databases. You need something to bring it together. That’s where big data analysis tools like Apache Spark come into play; these distributed analytical tools work across clusters of computers. Building on techniques developed for the MapReduce algorithms used by tools like Hadoop, today’s big data analysis tools go further to support more database-like behavior, working with in-memory data at scale, using loops to speed up queries, and providing a foundation for machine learning systems.
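For readers unfamiliar with MapReduce, the idea can be sketched on a single machine in plain Python. This is only an illustration of the map, shuffle and reduce phases using a word count; the function names are hypothetical, and neither Hadoop nor Spark is involved.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(documents):
    # On a real cluster, each document (or block of data) would be mapped
    # on a different worker; here everything runs in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data is big", "data at scale"])
print(counts)
```

Because each map call depends only on its own input, the work can be spread across many machines, which is the property distributed tools build on.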

Apache Spark is fast, but Databricks is faster. Founded by the Spark team, Databricks is a cloud-optimized version of Spark that takes advantage of public cloud services to scale rapidly and uses cloud storage to host its data. It also offers tools to make it easier to explore your data, using the notebook model popularized by tools like Jupyter Notebooks.

Microsoft’s new support for Databricks on Azure, called Azure Databricks, signals a new direction for its cloud services, bringing Databricks in as a partner rather than through an acquisition.

Although you’ve always been able to install Spark or Databricks on Azure, Azure Databricks makes it a one-click experience, driving the setup process from the Azure Portal. You can host multiple analytical clusters, using autoscaling to minimize the resources in use. You can clone and edit clusters, tuning them for specific jobs or running different analyses on the same underlying data.

Configuring the Azure Databricks virtual appliance

The heart of Microsoft’s new service is a managed Databricks virtual appliance built using containers running on Azure Container Services. You choose the number of VMs in each cluster that it controls and uses, and then the service handles load automatically once it’s configured and running, loading new VMs to handle scaling.

Databricks’ tools interact directly with the Azure Resource Manager, which adds a security group and a dedicated storage account and virtual network to your Azure subscription. It lets you use any class of Azure VM for your Databricks cluster – so if you’re planning on using it to train machine learning systems, you’ll want to choose one of the latest GPU-based VMs. And of course, if one VM model isn’t right for your problem, you can switch it out for another. All you need to do is clone a cluster and change the VM definitions.

Querying in Spark brings engineering to data science

Spark has its own query language based on SQL, which works with Spark DataFrames to handle both structured and unstructured data. DataFrames are the equivalent of a relational table, constructed on top of collections of distributed data in different stores. Using named columns, you can construct and manipulate DataFrames with languages like R and Python; thus, both developers and data scientists can take advantage of them.

DataFrames is essentially a domain-specific language for your data, a language that extends the data analysis features of your chosen platform. By using familiar libraries with DataFrames, you can construct complex queries that take data from multiple sources, working across columns.

Because Azure Databricks is inherently data-parallel, and its queries are evaluated only when called to deliver actions, results can be delivered very quickly. Because Spark supports most common data sources, either natively or through extensions, you can add Azure Databricks DataFrames and queries to existing data relatively easily, reducing the need to migrate data to take advantage of its capabilities.
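The idea that queries run only when an action asks for results is usually called lazy evaluation. The toy model below sketches it in plain Python. It is not Spark’s API: the `LazyQuery` class and its methods are hypothetical names chosen for illustration.

```python
class LazyQuery:
    """Toy model of lazy evaluation; this is NOT Spark's API."""

    def __init__(self, rows, steps=None):
        self._rows = rows
        self._steps = steps or []   # recorded transformations, not yet run

    def filter(self, predicate):
        # Transformation: record the step and return a new query.
        return LazyQuery(self._rows, self._steps + [("filter", predicate)])

    def map(self, func):
        # Transformation: record the step and return a new query.
        return LazyQuery(self._rows, self._steps + [("map", func)])

    def collect(self):
        # Action: only now is the recorded plan applied to the data.
        result = iter(self._rows)
        for kind, func in self._steps:
            result = filter(func, result) if kind == "filter" else map(func, result)
        return list(result)


calls = []


def expensive(x):
    """Stand-in for costly work; records each value it is called with."""
    calls.append(x)
    return x * 10


query = LazyQuery([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).map(expensive)
calls_before = len(calls)   # no work has been done at this point
result = query.collect()    # the action triggers the whole plan
print(calls_before, result)
```

Deferring the work lets an engine see the whole plan before running it, so rows that the filter discards never reach the expensive step.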

Although Azure Databricks provides a high-speed analytics layer across multiple sources, it’s also a useful tool for data scientists and developers trying to build and explore new models, turning data science into data engineering. Using Databricks Notebooks, you can develop scratchpad views of your data, with code and results in a single view.

The resulting notebooks are shared resources, so anyone can use them to explore their data and try out new queries. Once a query is tested and turned into a regular job, its output can be exposed as an element of a Power BI dashboard, making Azure Databricks part of an end-to-end data architecture that allows more complex reporting than a simple SQL or NoSQL service, or even Hadoop.

Microsoft plus Databricks: a new model for Azure Services

Microsoft hasn’t yet detailed its pricing for Azure Databricks, but it does claim that it can improve performance and reduce cost by as much as 99 percent compared to running your own unmanaged Spark installation on Azure’s infrastructure services. If Microsoft’s claim bears out, that promises to be a significant saving, especially when you factor in no longer having to run your own Spark infrastructure.

Azure’s Databricks service will connect directly to Azure storage services, including Azure Data Lake, with optimizations for queries and caching. There’s also the option of using it with Cosmos DB, so you can take advantage of global data sources and a range of NoSQL data models, including MongoDB and Cassandra compatibility—as well as Cosmos DB’s graph APIs. It should also work well with Azure’s data-streaming tools, giving you a new option for near real-time IoT analytics.

If you’re already using Databricks’ Spark tools, this new service won’t affect you or your relationship with Databricks. It’s only if you take the models and analytics you’ve developed on-premises to Azure’s cloud that you’ll get a billing relationship with Microsoft. You’ll also have fewer management tasks, leaving you more time to work with your data.

Microsoft’s decision to work with an expert partner on a new service makes a lot of sense. Databricks has the expertise, and Microsoft has the platform. If the resulting service is successful, it could set a new pattern for how Azure evolves in the future, building on what businesses are already using and making them part of the Azure hybrid cloud without absorbing those services into Microsoft.

Source: InfoWorld Big Data

IDG Contributor Network: Use the cloud to create open, connected data lakes for AI, not data swamps

Produced by every single organization, data is the common denominator across industries as we look to advance how cloud and AI are incorporated into our operations and daily lives. Before the potential of cloud-powered data science and AI is fully realized, however, we first face the challenge of grappling with the sheer volume of data. This means figuring out how to turn its velocity and mass from an overwhelming firehose into an organized stream of intelligence.

To capture all the complex data streaming into systems from various sources, businesses have turned to data lakes. Often on the cloud, these are storage repositories that hold an enormous amount of data until it’s ready to be analyzed: raw or refined, and structured or unstructured. This concept seems sound: the more data companies can collect, the less likely they are to miss important patterns and trends coming from their data.

However, a data scientist will quickly tell you that the data lake approach is a recipe for a data swamp, and there are a few reasons why. First, a good amount of data is often hastily stored, without a consistent strategy in place around how to organize, govern and maintain it. Think of your junk drawer at home: Various items get thrown in at random over time, until it’s often impossible to find something you’re looking for in the drawer, as it’s gotten buried.

This disorganization leads to the second problem: users are often unable to find a dataset once it has been ingested into the data lake. Without a way to easily search for data, it’s nearly impossible to discover and use it, making it difficult for teams to ensure it stays within compliance or is fed to the right knowledge workers. These problems mix and create a breeding ground for dark data: unorganized, unstructured, and unmanageable data.

Many companies have invested in growing their data lakes, but what they soon realize is that having too much information is an organizational nightmare. Multiple channels of data in a wide range of formats can cause businesses to quickly lose sight of the big picture and how their datasets connect.

Compounding the problem further, if datasets are incomplete or inadequate, they often add even more noise when data scientists are searching for specific datasets. It’s like trying to solve a riddle without a critical clue. This leads to a major issue: data scientists spend on average only 20 percent of their time on actual data analysis, and 80 percent of their time finding, cleaning, and reorganizing tons of data.

The power of the cloud

One of the most promising elements of the cloud is that it offers capabilities to reach across open and proprietary platforms to connect and organize all a company’s data, regardless of where it resides. This equips data science teams with complete visibility, helping them to quickly find the datasets they need and better share and govern them.

Accessing and cataloging data via the cloud also offers the ability to use and connect into new analytical techniques and services, such as predictive analytics, data visualization and AI. These cloud-fueled tools help data to be more easily understood and shared across multiple business teams and users—not just data scientists.

It’s important to note that the cloud has evolved. Preliminary cloud technologies required some assembly and self-governance, but today’s cloud allows companies to subscribe to an instant operating system in which data governance and intelligence are native. As a result, data scientists can get back to what’s important: developing algorithms, building machine learning models, and analyzing the data that matters.

For example, an enterprise can augment its data lake with cloud services that use machine learning to classify and cleanse incoming data sets. This helps organize and prepare the data for ingestion into AI apps. The metadata from this process builds an index of all data assets, and data stewards can apply governance policies to ensure only authorized users will be able to access sensitive resources.
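As a rough illustration of how a metadata index and a governance policy fit together, consider the following Python sketch. The `DatasetRecord` and `DataIndex` classes, the tags and the role names are all hypothetical; a real catalog service would be far richer.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Metadata for one dataset in a hypothetical index."""
    name: str
    tags: set
    sensitive: bool = False
    allowed_roles: set = field(default_factory=set)  # only checked when sensitive


class DataIndex:
    """A tiny in-memory index: search by tag, check access by role."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.name] = record

    def search(self, tag):
        # Findability: return the names of all datasets carrying the tag.
        return sorted(r.name for r in self._records.values() if tag in r.tags)

    def can_access(self, name, role):
        # Governance: sensitive data is limited to explicitly allowed roles.
        record = self._records[name]
        return (not record.sensitive) or role in record.allowed_roles


index = DataIndex()
index.register(DatasetRecord("web_clicks", {"marketing", "events"}))
index.register(DatasetRecord("patient_notes", {"health", "events"},
                             sensitive=True, allowed_roles={"clinical"}))

print(index.search("events"))
print(index.can_access("patient_notes", "marketing"))
```

The same metadata serves both goals: tags make a dataset discoverable, while the sensitivity flag and role list decide who may open it.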

These actions set a data-driven culture in motion by giving teams the ability to access the right data at the right time. In turn, this gives them the confidence that all the data they share will only be viewed by appropriate teams.

Disillusioned with data? You’re not the only one

Even with cloud services and the right technical infrastructure, different teams are often reluctant to share their data. It’s all about trust. Most data owners are worried about a lack of data governance—the management of secure data—since they have no way of knowing who will use their data, or how they will use it. Data owners don’t want to take this risk, so they choose to hold onto their data, rather than share it or upload it into the data lake.

This can change. By shifting the focus away from restricting usage of data to enabling access, sharing and reuse, organizations will realize the positive value that good governance and strong security delivers to a data lake, which can then serve as an intelligent backbone of every decision and initiative a company undertakes.

Overall, the amount of data that enterprises need to collect and analyze will continue to grow unabated. If nothing is done differently, so will the problems associated with it. Instead, there needs to be a material change in the way people think of solving complex data problems. It starts by solving data findability, management and governance issues with a detailed data index. This way, data scientists can navigate through the deepest depths of their data lakes and unlock the value of organized and indexed data lakes—the foundation for AI innovation.

This article is published as part of the IDG Contributor Network. Want to Join?

Source: InfoWorld Big Data

Spark tutorial: Get started with Apache Spark

Apache Spark has become the de facto standard for processing data at scale, whether for querying large datasets, training machine learning models to predict future trends, or processing streaming data. In this article, we’ll show you how to use Apache Spark to analyze data in both Python and Spark SQL. And we’ll extend our code to support Structured Streaming, the current state of the art for handling streaming data within the platform. We’ll be using Apache Spark 2.2.0 here, but the code in this tutorial should also work on Spark 2.1.0 and above.

How to run Apache Spark

Before we begin, we’ll need an Apache Spark installation. You can run Spark in a number of ways. If you’re already running a Hortonworks, Cloudera, or MapR cluster, then you might have Spark installed already, or you can install it easily through Ambari, Cloudera Navigator, or the MapR custom packages.

If you don’t have such a cluster at your fingertips, then Amazon EMR or Google Cloud Dataproc are both easy ways to get started. These cloud services allow you to spin up a Hadoop cluster with Apache Spark installed and ready to go. You’ll be billed for compute resources with an extra fee for the managed service. Remember to shut the clusters down when you’re not using them!

Of course, you could instead download the latest release and run it on your own laptop. You will need a Java 8 runtime installed (Java 7 will work, but is deprecated). Although you won’t have the compute power of a cluster, you will be able to run the code snippets in this tutorial.

Source: InfoWorld Big Data

SolarWinds Updates Its SaaS Portfolio

SolarWinds has announced an all-new, breakthrough product and two advanced product updates in a major evolution of its SolarWinds Cloud® Software as a Service (SaaS) portfolio. The new offerings expand the company’s current capabilities for comprehensive, full-stack monitoring with the introduction of AppOptics™, a new application and infrastructure monitoring solution; significant updates to Papertrail™, providing faster search speeds and new log velocity analytics; and enhanced digital experience monitoring (DEM) functionality within Pingdom®.

Collectively, the new SolarWinds Cloud portfolio gives customers broad and unmatched visibility into logs, metrics, and tracing, as well as the digital experience. It will enable developers, DevOps engineers, and IT professionals to simplify and accelerate management and troubleshooting, from the infrastructure and application layers to the end-user experience. In turn, it will allow customers to focus on building the innovative capabilities businesses need for today’s on-demand environments.

“Application performance and the digital experience of users have a direct and significant impact on business success,” said Christoph Pfister, executive vice president of products, SolarWinds. “With the stakes so high, the ability to monitor across the three pillars of observability — logs, metrics, and tracing — is essential. SolarWinds Cloud offers this comprehensive functionality with industry-best speed and simplicity. With AppOptics and the enhancements to Papertrail and Pingdom, we’re breaking new ground by delivering even greater value to our customers in an incredibly powerful, disruptively affordable SaaS portfolio.”

AppOptics: Simple, unified monitoring for the modern application stack

Available today, AppOptics addresses challenges customers face from being forced to use disparate solutions for applications and infrastructure performance monitoring. To do so, it offers broad application performance monitoring (APM) language support with auto-instrumentation, distributed tracing functionality, and a host agent supported by a large open community to enable expanded infrastructure monitoring capabilities and comprehensive visibility through converged dashboards.

For a unified view, AppOptics’ distributed tracing, host and IT infrastructure monitoring, and custom metrics all feed the same dashboarding, analytics, and alerting pipelines. SolarWinds designed the solution to simplify and unify the management of complex modern applications, infrastructure, or both. This allows customers to solve problems and improve performance across the application stack, in an easy-to-use, as-a-service platform.

For application performance monitoring, the powerful distributed tracing functionality can follow requests across any number of hosts, microservices, and languages without manual instrumentation. Users can move quickly from visualizing trends to deep, code-level, root cause analysis.

AppOptics bridges the traditional divide between application and infrastructure health metrics with unified dashboards, alerting, and management features. The host agent runs Snap™ and Telegraf™ plug-ins, enabling drop-in monitoring of key systems. The solution integrates with a wide range of systems to support the heterogeneous infrastructure environments dominating today’s IT landscape.

AppOptics serves as a highly extensible custom metrics and analytics platform that brings together applications, infrastructure, and business data to deliver deep insights that enable fast problem resolution. Finally, with pricing starting at $7.50 USD per host/month, AppOptics delivers an unmatched combination of deep functionality and very affordable pricing, a breakthrough that makes powerful application performance monitoring capabilities accessible to virtually all organizations.

Papertrail: Faster, smarter troubleshooting with log velocity analytics and ‘lightning search’

Papertrail is a cloud-hosted log management solution that helps users troubleshoot infrastructure and application problems. The latest version introduced today includes log velocity analytics, which can instantly visualize log patterns and help identify anomalies. For example, customers now can visualize an increase in total logs sent by a server, a condition that could indicate imminent failure, or something out of the norm.
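To give a sense of what spotting an unusual jump in log volume can involve, here is a minimal Python sketch that flags intervals whose log count rises well above the recent average. It is an assumption-laden illustration, not Papertrail’s algorithm: the `velocity_spikes` function, the window size, the threshold and the sample counts are all hypothetical.

```python
from statistics import mean, stdev


def velocity_spikes(counts, window=5, threshold=3.0):
    """Return indexes whose count exceeds the trailing mean by more than
    `threshold` standard deviations (with the deviation floored at 1.0)."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        # Floor the spread so a perfectly flat history does not flag
        # every tiny increase as an anomaly.
        limit = baseline + threshold * max(spread, 1.0)
        if counts[i] > limit:
            spikes.append(i)
    return spikes


# Hypothetical log messages per minute from one server.
logs_per_minute = [100, 104, 98, 101, 97, 102, 99, 450, 103]
print(velocity_spikes(logs_per_minute))
```

A trailing window keeps the baseline local, so a server that is normally noisy is judged against its own recent behavior rather than a fixed limit.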

Also, new to Papertrail is “lightning search,” which will enable developers, support engineers, and systems administrators to search millions or billions of log messages faster than ever before, and then immediately act on information found within the log messages. Together, Papertrail’s latest enhancements empower customers to troubleshoot complex problems, error messages, application server errors, and slow database queries, faster and smarter, with full visibility across all logs.

Pingdom digital experience monitoring

Research firm Gartner estimates that, “by 2020, 30 percent of global enterprises will have strategically implemented DEM technologies or services, up from fewer than 5 percent today.” Pingdom, a market leader in the DEM arena, helps make websites faster and more reliable with powerful, easy-to-use uptime and performance monitoring functionality. Available on November 27, the Pingdom solution’s latest enhancements for digital experience monitoring include three new dashboard views that provide the ability to continuously enhance user experience on websites or web applications:

  • Sites View: Customers can quickly locate a user experience issue on any monitored website
  • Experience View: Customers can filter users and identify those affected by performance issues
  • Performance View: Customers can explore the technical cause of an issue and quickly and easily identify opportunities for performance improvements

The latest updates to the Pingdom solution’s digital experience monitoring will empower customers to know first when issues affect their site visitors’ experience, and quickly surface critical information needed to enhance the overall experience.

SolarWinds Cloud: The next evolution of SaaS-based full-stack monitoring

Today’s announcement of SolarWinds Cloud is another important milestone in the company’s drive to deliver a set of comprehensive, simple, and disruptively affordable full-stack monitoring solutions built upon a common, seamlessly integrated, SaaS-based platform. Since 2014, SolarWinds has dramatically expanded its cloud portfolio and capabilities through a series of acquisitions, while making significant progress integrating these acquired solutions, including Pingdom, Librato®, Papertrail, and TraceView™, under a common sales and operational model.

AppOptics builds on the technology and feedback SolarWinds put into Librato and TraceView since their introductions. Now, the company has integrated and enhanced this functionality within a single solution, taking another big step forward in advancing its strategy to unify full-stack monitoring across the three pillars of observability on a common SaaS-based platform. SolarWinds’ ultimate goal is to enable a single view of infrastructure, applications, and digital experience, which will help customers solve their most complex performance and reliability problems quickly, with unexpected simplicity and industry-leading affordability.


Source: CloudStrategyMag