IDG Contributor Network: The clash of big data and the cloud

Recently, I attended a few conferences and noticed a somewhat hidden theme. While a lot of attention was being paid to moving to a (hybrid) cloud-based architecture and what you need for that (such as cloud management platforms), a few presentations pointed to an overall development that everybody acknowledges but that does not get a lot of close attention: the enormous growth of the amount of digital data stored in the world.

What especially caught my attention was a presentation from PureStorage (a storage vendor) that combined data points from two other vendors. The first was a June 2017 Cisco white paper, The Zettabyte Era: Trends and Analysis, which extrapolates the growth of internet bandwidth; the second was a Seagate-sponsored IDC study, Data Age 2025, which extrapolates the trend of data growth in the world. PureStorage combined both extrapolations in the following figure (reused with permission):

PureStorage’s depiction of the clash between world data growth and world internet bandwidth growth.

These trends—if they become reality, and there are reasons enough to think these predictions are reasonable—are going to have a major impact on the computing and data landscapes in the years to come. And they will especially impact the cloud hoopla that is still in full force. Note: The cloud is real and will be an important part of future IT landscapes, but simplistic ideas about it being a panacea for every IT ailment are strongly reminiscent of the “new economy” dreams of the dot-com boom. And we know how that ended.

The inescapable issue

Anyway, there are two core elements of all IT: the data and the logic working with/on the data. Big data is not just about the data. Data is useless (or as Uncle Ludwig would have it: meaningless) unless it can be used. What everybody working with big data already knows: To use huge amounts of data, you need to bring the processing to the data and not the data to the processing. Having the processing at any “distance” creates such a transport bottleneck that performance decreases to almost nothing and any function of that logic becomes a purely theoretical affair.

Even with small amounts of data, this can already happen because of latency. For instance, moving your application server to the cloud while retaining your database server on premises may work on paper, but if the application is sensitive to latency between it and the database, it doesn’t work at all. This is why many organizations are trying to adapt software so it becomes less latency-sensitive, thus enabling a move into the cloud. But with huge amounts of data, you need to bring processing and data close to each other, or it just does not work. Add the need for massive parallelism to handle that data and you get Hadoop and other architectures that tackle the problem of processing huge amounts of data.
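
To put rough numbers on that sensitivity, here is a minimal back-of-envelope sketch, with purely illustrative latency figures, of what happens when a “chatty” application that issues one query per row suddenly gets a WAN round trip between it and its database:

```python
# Back-of-envelope: why a "chatty" application becomes unusable once a
# WAN hop sits between the application server and the database.
# All numbers are illustrative assumptions, not measurements.

LAN_RTT_S = 0.0005   # ~0.5 ms round trip inside one datacenter
WAN_RTT_S = 0.030    # ~30 ms round trip between cloud and on-premises

def report_time(round_trips: int, rtt_s: float) -> float:
    """Time spent purely waiting on the network for one report."""
    return round_trips * rtt_s

rows = 10_000
# Naive ORM-style access: one query per row.
print("chatty, LAN :", report_time(rows, LAN_RTT_S), "s")   # ~5 s
print("chatty, WAN :", report_time(rows, WAN_RTT_S), "s")   # ~300 s
# Latency-tolerant version: fetch the same rows in 10 batched queries.
print("batched, WAN:", report_time(10, WAN_RTT_S), "s")     # ~0.3 s
```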

Now, the amount of data in the world is growing exponentially. If IDC is to be believed, in a few years’ time the world is expected to store about 50ZB (zettabytes, or 50,000,000,000,000,000,000,000 bytes). On the other hand, while the total capacity of the internet to move data around grows too, it does so at a far more leisurely pace. In the same period that world data size grows to 50ZB, the total internet bandwidth will reach something like 2.5ZB per year (if Cisco is to be believed).

The conclusion from those two (not unreasonable) expectations is that the available internet bandwidth is not nearly enough to move a sizeable fraction of the data around. And that ignores the fact that about 80 percent of the current bandwidth is used for streaming video. So, even if you have coded your way around the latency issues in your core application, for cases with larger amounts of data there will be a bandwidth issue as well.
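
A quick back-of-envelope check of those two figures (assuming, purely for illustration, that only 10 percent of the stored data ever needs to move and that video keeps its 80 percent share of bandwidth) makes the mismatch concrete:

```python
# Rough sanity check of the figures cited above (IDC / Cisco extrapolations).
stored_zb = 50.0            # projected world data, in zettabytes
bandwidth_zb_per_year = 2.5 # projected total internet transfer per year
video_share = 0.80          # share of bandwidth already used for streaming video

usable_zb_per_year = bandwidth_zb_per_year * (1 - video_share)  # 0.5 ZB/year

fraction_to_move = 0.10  # suppose only 10% of stored data ever needs to move
years_needed = stored_zb * fraction_to_move / usable_zb_per_year
print(f"Moving {fraction_to_move:.0%} of {stored_zb} ZB would take "
      f"~{years_needed:.0f} years of all non-video internet capacity.")
# -> ~10 years: the network cannot be the primary way big data gets combined.
```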

Now, is this issue actually a problem? Not if the processing or use of that data happens locally—that is, in the same datacenter that holds the data. But while the amount of data is growing exponentially, the world is also aggressively pursuing cloud strategies; that is, moving all kinds of workloads to the cloud, in the extreme even going “serverless” (for example, AWS Lambda).

Assuming that only small results (calculated from huge data sets) need to move around probably helps only a bit, because the real value of huge amounts of data comes from combining them. And that may mean combining data from different owners (your customer records with a feed from Twitter, for instance). It is the aggregation of all those different sets that is the issue.

So, what we see are two opposing developments. On the one hand, everybody is busy adapting to a cloud-based architecture that in the end is based on distributed processing of distributed data. On the other hand, the amount of data we use is getting so large that we have to consolidate data and its processing in a single physical location.

So, what does that imply?

Well, we may expect that what Hadoop does at the application architecture level will also happen on a world level: the huge data sets will be attractors for the logic that makes them meaningful. And those huge data sets will gravitate together.

Case in point: Many are now scrambling to minimize the need to move all that data around. In the IoT world, for instance, there is a lot of talk about edge computing: handling data locally, where the sensors and other IoT devices are. Of course, that means the processing must also be local, and you can safely assume that you will not be bringing the same level of computing power to bear in a (set of) sensors as you can in a big analytics setup. Or: you probably won’t see a Hadoop cluster under the hood of your car anytime soon. So, yes, you can minimize data traffic that way, but at the expense of how much you can compute.
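
As a minimal sketch of that trade-off (the function, window size, and alarm threshold below are hypothetical), an edge device can reduce a window of raw sensor readings to a small summary before anything crosses the network:

```python
# Minimal sketch of edge-side aggregation: keep raw sensor readings local,
# send only a compact summary upstream. Names and thresholds are illustrative.
from statistics import mean
from typing import Iterable

def summarize_window(readings: Iterable[float], alarm_threshold: float) -> dict:
    """Reduce one window of raw readings to a few numbers worth transmitting."""
    values = list(readings)
    return {
        "count": len(values),
        "mean": mean(values),
        "max": max(values),
        "alarm": max(values) > alarm_threshold,
    }

# 10,000 raw temperature samples stay on the edge device...
window = [20.0 + (i % 50) * 0.1 for i in range(10_000)]
summary = summarize_window(window, alarm_threshold=60.0)
print(summary)  # ...and only this small dict crosses the network
```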

There is another solution for this issue: Stick together in datacenters. And that is also what I see happening.  Colocation providers are on the rise. They offer large datacenters with optimized internal traffic capabilities where both cloud providers and large cloud users are sticking together. Logically, you may be in the cloud, but physically you are on the same premises as your cloud provider. You don’t want to run your logic just on AWS or Azure; you want to do that in a datacenter where you also have your own private data lake so all data is local to the processing and data aggregation is local as well. I’ve written elsewhere (Reverse Cloud | Enterprise Architecture Professional Journal) on the possibility that cloud providers might be extending into your datacenters, but the colocation pattern is another possible solution for solving the inescapable bandwidth and latency issues arising from the exponential growth of data.

The situation may not be as dire as I’m sketching it. For example, maybe the actual average volatility of all that data will ultimately be very low. On the other hand, you would not want to run your analytics on stale data. But one conclusion can be drawn already: Simply assuming that you can distribute your workloads to a host of different cloud providers (the “cloud … yippy!” strategy) is risky, especially if at the same time the amount of data you are working with grows exponentially (which it certainly will, if everyone wants to combine their own data with streams from Twitter, Facebook, etc., let alone if those combinations spawn all sorts of new streams).

Therefore, making good strategic design decisions about the locations of your data and processing (and what can and can’t be isolated from other data) is key. Strategic design decisions … hmm, that sounds like a job for architecture.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

Enter Fortifies Carrier-Neutral Interconnection Capabilities At Its Milan Data Center

Enter has announced enhanced connectivity capabilities at its Milan Caldera data center campus with the addition of its neutral interconnection facility, MIL2.

An expansion of Enter’s existing MIL1 data center, MIL2 is purpose-built to facilitate cost-effective cross-connects to a multitude of carriers and content providers via its Meet-Me Room.

Enter’s MIL1 and MIL2 data centers provide a reliable environment for telco colocation, with redundant power distribution and a generator that provides the facility with at least five days of backup power at full load for business continuity.

Leveraging Enter’s expansive backbone and metro dark fiber network, MIL2 customers can access hundreds of networks within the Caldera campus and seamlessly connect to additional local network providers and remote facilities in the Milan area.  Additionally, nearby landing points in Bari and Palermo enable access to submarine cables Southeast Asia-Middle East-Western Europe 5 (SEA-ME-WE 5) and Asia-Africa-Europe (AAE-1) by way of Enter’s strategic provider partnerships.

“We expanded our neutral interconnection facility to provide customers with cost-effective, reliable interconnection opportunities. Located in one of Italy’s key connectivity and fiber hubs, our Milan Caldera data centers offer a strategic and cost-effective alternative to Frankfurt and Marseille,” says Milko Ilari, Head of International Business & Strategy at Enter.  “In addition to serving as a bridge for operators looking to expand their reach to or within Western Europe, MIL2 is also designed to accommodate unique project requirements as well as facilitate mutually beneficial partnerships amongst our customers.”

Enter’s transparent, partner-centric approach is also evident in its recent Open Compute Project (OCP) hardware deployment for Enter Cloud Suite (ECS) in MIL2. The OCP was launched by Facebook and provides open source designs for servers, racks, and data center facilities, reducing vendor lock-in and increasing community participation in data center design. By adopting OCP, companies can dramatically reduce CAPEX and OPEX while driving innovation through the incremental contributions of the open source community.

ECS is the first European, OpenStack-based cloud Infrastructure-as-a-Service (IaaS) solution.  With one connection to Enter, small to mid-size communication service providers can affordably expand their network footprint and reach all of Europe’s leading IXs.  In addition to ECS, Enter also offers Colocation, Ethernet and internet access services, Virtual Private Networks (VPNs), dark fiber, and data center services at its MIL2 data center.

Source: CloudStrategyMag

iQuate Releases iQCloud

iQuate has announced the availability of iQCloud, the most advanced automated discovery and service mapping solution for digital enterprises.

As cloud computing becomes mainstream, business and IT professionals must understand how their IT services are delivered to run their business in a digital age. How can you harness the power of the cloud if you don’t understand how your existing IT services are delivered today?

“iQCloud gives organizations what they need for a smarter way to the cloud,” says Patrick McNally, CEO of iQuate. “We call it Discovery and Service Mapping 2.0 because it automatically discovers, maps, sizes, tracks, and enables dynamic service management with top-down application services visibility together with bottom-up infrastructure clarity.  We will tell you how your IT services are delivered now, and we’ll help you manage them wherever they are delivered in the future — across legacy, private and public cloud environments.”

McNally brought together a highly respected team with several decades of experience in discovery and service mapping to create iQCloud. The iQuate team worked with an existing global customer base to build a solution that reduces to minutes and hours what once required weeks or months of manual effort and deep in-house knowledge of IT resources.

iQCloud provides actionable information to IT and business professionals within the first hour of onboarding and doesn’t require installation or deep knowledge of the IT enterprise it exposes. “The technology has been designed to get more organizations into the cloud faster and with lower risk,” says McNally. “iQCloud automatically provides a holistic view across your entire estate, including highly dynamic, hybrid IT environments.”

Source: CloudStrategyMag

Vertiv Introduces Cloud Capabilities And IoT Gateway

Vertiv, formerly Emerson Network Power, has announced a significant cloud-based addition to the Vertiv portfolio that will empower customers with deeper insights across the data center. The Vertiv cloud offering will leverage the collective knowledge gleaned from decades of data center projects to deliver real-time analysis that will simplify and streamline data center operations and planning.

As part of the Vertiv cloud offering, now available is a new Internet of Things (IoT) gateway that provides added security with simple installation and commissioning to streamline data center connectivity. The Vertiv™ RDU300 gateway, a new entry in the Vertiv RDU family of monitoring and control products, integrates with building management systems and ensures that any data passed to the Vertiv cloud from the customer site is done securely and using minimum bandwidth. Together the Vertiv cloud offering and Vertiv RDU300 gateway enable remote visibility, collection, and analysis of critical infrastructure data across all Vertiv products.

“As an organization, we have designed and built data centers of all shapes and sizes and have millions of equipment installations in data centers and IT facilities in every corner of the globe,” said Patrick Quirk, vice president and general manager of Global Management Systems at Vertiv. “The accumulated knowledge from past, present and future deployments is a powerful resource, and this cloud-based initiative operationalizes that resource in a way that will bring unprecedented capabilities to our customers.”

The Vertiv cloud initiative unlocks the data and deep domain knowledge Vertiv has accrued from its history of monitoring and servicing hardware, software and sensors, including its Chloride®, Liebert®, and NetSure™ brands. With billions of historical uninterruptible power supply (UPS), battery and thermal system data points populating the Vertiv cloud, supplemented by the constant inflow of real-time data, operators will be able to make decisions and take actions based on data-based insight and best practices from across the industry.

Vertiv will use its cloud to aggregate, anonymize and analyze data from IT deployments around the world, identifying trends and patterns that will transform data center operation practices and virtually eliminate the traditional break/fix model and preventative maintenance. Starting with battery monitoring and monitoring for select UPS and power distribution unit (PDU) systems, Vertiv will leverage its cloud to continuously evaluate performance against billions of existing data points to anticipate everything from maintenance needs to efficiency improvements. The Vertiv cloud will synthesize that information and deliver preemptive prompts to data center managers, who can remotely trigger the appropriate actions through qualified personnel and eventually secure Vertiv gateway systems within their facilities and more effectively plan in the short and long term.

 

Source: CloudStrategyMag

No, you shouldn’t keep all that data forever

The modern ethos is that all data is valuable, that it should be stored forever, and that machine learning will one day magically find the value in it. You’ve probably seen that EMC picture about how there will be 44 zettabytes of data by 2020? Remember how everyone had Fitbits and Jawbone Ups for about a minute? Now Jawbone is out of business. Have you considered that this “all data is valuable” fad might be the corporate equivalent? Maybe we shouldn’t take a data storage company’s word for it that we should store all data and never delete anything.

Back in the early days of the web it was said that the main reasons people went there were for porn, jobs, or cat pictures. If we download all of those cat pictures and run a machine learning algorithm on them, we can possibly determine the most popular colors of cats, the most popular breeds of cats, and the fact that people really like their cats. But we don’t need to do this—because we already know these things. Type any of those three things into Google and you’ll find the answer. Also, with all due respect to cat owners, this isn’t terribly important data.

Your company has a lot of proverbial cat pictures. It doesn’t matter what your policies and procedures for inventory retention were in 1999. Any legal issues that gave you reason to store data back then have long since passed the statute of limitations. There isn’t anything conceivable that you could glean from that old data that could not be gleaned from any of the more recent revisions.

Machine learning or AI isn’t going to tell you anything interesting about any of your 1999 policies and procedures for inventory retention. That material might even be a kind of “dark data,” because your search tool probably boosts everything else above it, so unless someone queries for “inventory retention procedure for 1999,” it isn’t going to come up.

You’ve got logs going back to the beginning of time. Even the Jawbone UP didn’t capture my every breath, and it certainly didn’t store my individual steps for all time. Sure, each breath or step may have slightly different characteristics, but it isn’t important. Likewise, it probably doesn’t matter how many exceptions per hour your Java EE application server used to throw in 2006. You use Node.js now anyhow. If “errors per hour per year” is a useful metric, you can probably just summarize that. You don’t need to keep every log for all time. It isn’t reasonable to expect it to be useful.
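
As a sketch of that “summarize, then throw away” idea (the log format and field layout below are assumed examples), old logs can be collapsed into hourly error counts before the raw lines are discarded:

```python
# Sketch of "summarize, then delete": collapse old logs into hourly error
# counts before discarding the raw lines. The log format is an assumed example.
from collections import Counter
from datetime import datetime

def hourly_error_counts(lines):
    """Count ERROR lines per hour; assumes lines start with an ISO timestamp."""
    counts = Counter()
    for line in lines:
        if " ERROR " not in line:
            continue
        ts = datetime.fromisoformat(line.split(" ", 1)[0])
        counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return counts

old_logs = [
    "2006-03-14T02:15:00 ERROR NullPointerException in OrderService",
    "2006-03-14T02:47:10 INFO  heartbeat ok",
    "2006-03-14T02:59:59 ERROR NullPointerException in OrderService",
]
print(hourly_error_counts(old_logs))  # keep this summary, drop the raw lines
```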

Supposedly, we’re keeping this stuff around for the day when AI or machine learning finds something useful in it. But machine learning isn’t magical. Mostly, machine learning falls into classification, regression, and clustering. Clustering basically groups stuff that looks “similar”—but it isn’t very likely your 2006 app server logs have anything useful in them that can be found via clustering. The other two approaches require you to think of something and “train” the machine learning. This means you need a theory of what could be useful, and then you have to train the computer to find it. Don’t you have better things to do?
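
A small, made-up example of that point (the data, labels, and choice of scikit-learn are assumptions for illustration only): a supervised classifier finds only the pattern a human already decided to label as interesting:

```python
# Illustration of the point above: supervised learning only finds what you
# already decided to look for. Data and labels here are made-up examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

log_lines = [
    "OutOfMemoryError in checkout service",
    "request completed in 52 ms",
    "OutOfMemoryError in search service",
    "request completed in 48 ms",
]
labels = [1, 0, 1, 0]  # 1 = "interesting" -- a human had to decide this

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(log_lines)
model = LogisticRegression().fit(X, labels)

new_line = vectorizer.transform(["OutOfMemoryError in billing service"])
print(model.predict(new_line))  # finds only the pattern we told it to find
```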

Storage is cheap, but organization and insight are not. Just because you got a good deal on your SAN or have been running some kind of mirrored JBOD setup with a clustered file system doesn’t mean that storing noise is actually cheap. You need to consider the human costs of organizing, maintaining, and keeping all this stuff around. Moreover, while modern search technology is good at sorting relevant stuff from irrelevant, it does cost you something to do so. So while autumn is on the wane, go ahead and burn some proverbial corporate leaves.

It really is okay if you don’t keep it.

Source: InfoWorld Big Data

IDG Contributor Network: How in-memory computing drives digital transformation with HTAP

In-memory computing (IMC) is becoming a fixture in the data center, and Gartner predicts that by 2020, IMC will be incorporated into most mainstream products. One of the benefits of IMC is that it will enable enterprises to start implementing hybrid transactional/analytical processing (HTAP) strategies, which have the potential to revolutionize data processing by providing real-time insights into big data sets while simultaneously driving down costs.

Here’s why IMC and HTAP are tech’s new power couple.

Extreme processing performance with IMC

IMC platforms maintain data in RAM to process and analyze data without continually reading and writing data from a disk-based database. Architected to distribute processing across a cluster of commodity servers, these platforms can easily be inserted between existing application and data layers with no rip-and-replace.

They can also be easily and cost effectively scaled by adding new servers to the cluster and can automatically take advantage of the added RAM and CPU processing power. The benefits of IMC platforms include performance gains of 1,000X or more, the ability to scale to petabytes of in-memory data, and high availability thanks to distributed computing.
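
What “inserting an in-memory layer between the existing application and data layers” can look like is sketched below; a plain Python dict stands in for a distributed in-memory data grid, and load_customer_from_db is a hypothetical placeholder for a disk-based database call:

```python
# Hedged sketch of the "insert an in-memory layer between the application and
# the database" idea. A plain dict stands in for a distributed in-memory data
# grid; load_customer_from_db and the key scheme are hypothetical.
import time

in_memory_grid = {}  # stand-in for a partitioned, replicated RAM store

def load_customer_from_db(customer_id: int) -> dict:
    time.sleep(0.05)  # simulate a disk-based database round trip
    return {"id": customer_id, "name": f"customer-{customer_id}"}

def get_customer(customer_id: int) -> dict:
    """Cache-aside read: serve from RAM, fall back to the database once."""
    key = ("customer", customer_id)
    if key not in in_memory_grid:
        in_memory_grid[key] = load_customer_from_db(customer_id)
    return in_memory_grid[key]

get_customer(42)           # first read: hits the slow database
print(get_customer(42))    # repeat reads: served from memory
```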

In-memory computing isn’t new, but until recently, only companies with extremely high-performance, high-value applications could justify the cost of such solutions. However, the cost of RAM has dropped steadily, approximately 10 percent per year for decades. So today the value gained from in-memory computing and the increase in performance it provides can be cost-effectively realized by a growing number of companies in an increasing number of use cases.

Transactions and analytics on the same data set with HTAP

HTAP is a simple concept: the ability to process transactions (such as investment buy and sell orders) while also performing real-time analytics (such as calculating historical account balances and performance) on the operational data set.
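
As a toy illustration of the concept (Python’s built-in sqlite3 with an in-memory database standing in for a real distributed IMC store), the same data set takes the buy and sell orders and answers the analytical position query:

```python
# Toy illustration of HTAP: one in-memory store takes the transactions and
# answers the analytical query. sqlite3 stands in for a distributed IMC store.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (account TEXT, ticker TEXT, qty INT, price REAL)")

# Transactional side: record buy/sell orders as they arrive.
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [("acct-1", "ACME", 100, 12.5), ("acct-1", "ACME", -40, 13.1),
     ("acct-2", "GLOBEX", 10, 99.0)],
)

# Analytical side: real-time positions over the same live data set.
for row in db.execute(
    "SELECT account, ticker, SUM(qty) AS position FROM orders "
    "GROUP BY account, ticker"
):
    print(row)
```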

For example, in a recent In-Memory Computing Summit North America keynote, Rafique Awan from Wellington Management described the importance of HTAP to the performance of the company’s new investment book of record (IBOR). Wellington has more than $1 trillion in assets under management.

But HTAP isn’t easy. In the earliest days of computing, the same data set was used for both transaction processing and analytics. However, as data sets grew in size, queries started slowing down the system and could lock up the database.

To ensure fast transaction processing and flexible analytics for large data sets, companies deployed transactional databases, referred to as online transaction processing (OLTP) systems, solely for the purpose of recording and processing transactions. Separate online analytical processing (OLAP) databases were deployed, and data from an OLTP system was periodically (daily, weekly, etc.) extracted, transformed, and loaded (ETLed) into the OLAP system.
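
For contrast with the HTAP idea, here is a minimal sketch of that periodic copy, with both stores as in-memory SQLite databases and illustrative table names; until the ETL job runs, the analytical side cannot see the latest transactions:

```python
# Minimal sketch of the periodic OLTP -> OLAP copy described above.
# Both stores are in-memory sqlite databases; table names are illustrative.
import sqlite3

oltp = sqlite3.connect(":memory:")
olap = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INT, amount REAL, ts TEXT)")
olap.execute("CREATE TABLE daily_sales (day TEXT, total REAL)")

oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10.0, "2017-11-20"), (2, 25.0, "2017-11-20")])

def nightly_etl():
    # Extract from the OLTP system, transform (aggregate), load into OLAP.
    rows = oltp.execute(
        "SELECT ts, SUM(amount) FROM orders GROUP BY ts").fetchall()
    olap.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)

nightly_etl()  # until this runs, the OLAP side has no view of today's orders
print(olap.execute("SELECT * FROM daily_sales").fetchall())
```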

This bifurcated architecture has worked well for the last few decades. But the need for real-time transaction and analytics processing in the face of rapidly growing operational data sets has become crucial for digital transformation initiatives, such as those driving web-scale applications and internet of things (IoT) use cases. With separate OLTP and OLAP systems, however, by the time the data is replicated from the OLTP to the OLAP system, it is simply too late—real-time analytics are impossible.

Another disadvantage of the current strategy of separate OLTP and OLAP systems is that IT must maintain separate architectures, typically on separate technology stacks. This results in hardware and software costs for both systems, as well as the cost for human resources to build and maintain them.

The new power couple

With in-memory computing, the entire transactional data set is already in RAM and ready for analysis. More sophisticated in-memory computing platforms can co-locate compute with the data to run fast, distributed analytics across the data set without impacting transaction processing. This means replicating the operational data set to an OLAP system is no longer necessary.

According to Gartner, in-memory computing is ideal for HTAP because it supports real-time analytics and situational awareness on the live transaction data instead of relying on after-the-fact analyses on stale data. IMC also has the potential to significantly reduce the cost and complexity of the data layer architecture, allowing real-time, web-scale applications at a much lower cost than separate OLTP/OLAP approaches.

To be fair, not all data analytics can be performed using HTAP. Highly complex, long running queries must still be performed in OLAP systems. However, HTAP can provide businesses with a completely new ability to react immediately to a rapidly changing environment.

For example, for industrial IoT use cases, HTAP can enable the real-time capture of incoming sensor data and simultaneously make real-time decisions. This can result in more timely maintenance, higher asset utilization, and reduced costs, driving significant financial benefits. Financial services firms can process transactions in their IBORs and analyze their risk and capital requirements at any point in time to meet the real-time regulatory reporting requirements that impact their business.

Online retailers can transact purchases while simultaneously analyzing inventory levels and other factors, such as weather conditions or website traffic, to update pricing for a given item in real time. And health care providers can continually analyze the transactional data being collected from hundreds or thousands of in-hospital and home-based patients to provide immediate individual recommendations while also looking at trend data for possible disease outbreaks.

Finally, by eliminating the need for separate databases, an IMC-powered HTAP system can simplify life for development teams and eliminate duplicative costs by reducing the number of technologies in use and downsizing to just one infrastructure.

The fast data opportunity

The rapid growth of data and the drive to make real-time decisions based on the data generated as a result of digital transformation initiatives is driving companies to consider IMC-based HTAP solutions. Any business faced with the opportunities and challenges of fast data from initiatives such as web-scale applications and the internet of things, which require ever-greater levels of performance and scale, should definitely take the time to learn more about in-memory computing-driven hybrid transactional/analytical processing.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

Equinix Collaboration with AWS Expands to Additional Markets

Equinix, Inc. announced an expansion of its collaboration with Amazon Web Services (AWS) with the extension of direct, private connectivity to the AWS Direct Connect service to four additional Equinix International Business Exchange™ (IBX®) data centers in North America and Europe. The move advances the Equinix and AWS collaboration that enables businesses to connect their owned and managed infrastructure directly to AWS via a private connection, which helps customers reduce costs, improve performance and achieve a more consistent network experience.

“When businesses compete at the digital edge, proximity matters. To be successful, enterprises require superior interconnection. Together, Equinix and AWS are catalysts and enablers of this new digital and interconnected world. By offering AWS Direct Connect in our data centers across the globe, we are helping our customers solve their business challenges, drive better outcomes, and simplify their journey to the public cloud,” said Kaushik Joshi, global managing director, Strategic Alliances at Equinix.

Effective immediately, AWS Direct Connect will be available to customers in Equinix IBX data centers in Helsinki, Madrid, Manchester, and Toronto, bringing the total number of Equinix metros offering AWS Direct Connect to 21, globally. Customers can connect to AWS Direct Connect at all available speeds via Equinix Cloud Exchange™ (ECX), cross connects or Equinix-provided metro connectivity options. Additionally, with the recently announced AWS Direct Connect Gateway, Equinix customers can also access multiple AWS regions with a single connection to AWS Direct Connect.

In addition to the four new markets announced today, Equinix offers AWS Direct Connect in the Amsterdam, Chicago, Dallas, Frankfurt, Los Angeles, London, Munich, New York, Osaka, São Paulo, Seattle, Silicon Valley, Singapore, Sydney, Tokyo, Warsaw, and Washington, D.C. metro areas.

Direct connection to AWS inside Equinix IBX data centers is ideal for specific enterprise use cases, such as:

  • Securing and accelerating data flows: Applications such as business intelligence, pattern recognition and data visualization require heavy compute and low-latency connectivity to large data sets. Equinix Data Hub™ and Cloud Exchange can help enterprises control data movement and placement by enabling private, secure and fast connectivity between private data storage devices and AWS compute nodes, maintaining data residency and accelerating access between storage and compute resources.
  • Interconnecting to hybrid cloud and business ecosystems: Direct connection to AWS via the Equinix Cloud Exchange offers enterprises access to networks, IaaS, PaaS and SaaS providers and connectivity to thousands of other business ecosystems.
  • Direct and private connectivity to strategic cloud providers that avoids the public internet is a growing business practice for leading companies. According to the Global Interconnection Index, a market study published recently by Equinix, the capacity for private data exchange between enterprises and cloud providers is forecast to grow at 160% CAGR between now and 2020.

Source: CloudStrategyMag

Equinix Achieves AWS Networking Competency Status

Equinix, Inc. has announced it has achieved Amazon Web Services (AWS) Networking Competency status in the AWS Partner Network (APN), underscoring Equinix’s ongoing commitment to serving AWS customers by providing private and secure access inside its global footprint of International Business Exchange™ (IBX®) data centers. This distinction recognizes Equinix as a key Technology Partner in the APN, helping customers adopt, develop, and deploy networks on AWS.

“Equinix is proud to achieve AWS Networking Competency status. Together, Equinix and AWS Direct Connect accelerate Amazon Web Services adoption by making it easier to directly and securely connect to AWS and ensure the performance and availability of mission-critical applications and workloads,” said Kaushik Joshi, global managing director, Strategic Alliances at Equinix.

Achieving the AWS Networking Competency differentiates Equinix as an APN member that has demonstrated specialized technical proficiency and proven customer success, with a specific focus on networking based on AWS Direct Connect. To receive the designation, APN members must possess deep AWS expertise and deliver solutions seamlessly on AWS.

AWS is enabling scalable, flexible and cost-effective solutions from startups to global enterprises. To support the seamless integration and deployment of these solutions, AWS established the AWS Competency Program to help customers identify Consulting and Technology APN Partners with deep industry experience and expertise.

In April of this year, Equinix achieved Advanced Technology Partner status in the AWS Partner Network. To obtain this status, AWS requires partners to meet stringent criteria, including the ability to demonstrate success in providing AWS services to a wide range of customers and use cases. Additionally, partners must complete a technical solution validation by AWS.

To help customers reduce costs, improve performance and achieve a more consistent network experience, Equinix offers AWS Direct Connect service in its IBX data centers in 21 markets globally, including the Amsterdam, Chicago, Dallas, Frankfurt, Helsinki, Los Angeles, London, Madrid, Manchester, Munich, New York, Osaka, São Paulo, Seattle, Silicon Valley, Singapore, Sydney, Tokyo, Toronto, Warsaw and Washington, D.C. metro areas.

Direct and private connectivity to strategic cloud providers that avoids the public internet is a growing business practice for leading companies. According to the Global Interconnection Index, a market study published recently by Equinix, the capacity for private data exchange between enterprises and cloud providers is forecast to grow at 160% CAGR between now and 2020.

Source: CloudStrategyMag

IDG Contributor Network: Are you treating your data as an asset?

It’s a phrase we constantly hear, isn’t it? Data is a crucial business asset from which we can extract value and gain competitive advantage. Those who use data well will be the success stories of the future.

This got me thinking: If data is such a major asset, why do we hear so many stories about data leaks? Would these companies be quite so loose with other assets? You don’t hear about businesses losing hundreds of company cars or half a dozen buildings, do you?

If data is a potential asset, why aren’t companies treating it as such?

The reality is many businesses don’t treat data as an asset. In fact, it’s treated so badly there is increasing regulation forcing organizations to take better care of it. These external pressures have the potential to provide significant benefits, forcing a change in the way data is viewed across organizations from top to bottom. Forcing data to be treated as the asset it is.

If you can start to treat data as an asset, you can put yourself in a position where data really can provide a competitive advantage.

Where to start?

Clean up the mess

Do you have too little data in your organization? Probably not. In data discussion groups, a common refrain is that companies “have too much” and “it’s out of control.” Organizations are spending more and more resources on storing, protecting and securing it, but it’s not only the cost of keeping data that’s a problem. Tightening regulation will force you to clean up what you have.

It’s not an asset if you just keep collecting it and never do the housekeeping and maintenance that you should with any asset. If you don’t look after it, you will find it very difficult to realize value.

Your starting point is to ask yourself what you have, why you have it, and why you need it.

Gain some control

I talk regularly with people about the what, where, who and why of data. Understanding this will allow you to start to gain control of your asset.

Once it’s decided what your organization should have—and what you should be keeping—you need to understand exactly what you do have and, importantly, where it is stored: in data centers, on laptops, on mobile devices, or with cloud providers.

Next, the who and why. What other business asset does your company own without knowing who’s using it and why? Yet companies seem to do this with data all the time. Look inside your own organization: Do you have a full understanding of who’s accessing your data…and why?

To treat our data like an asset, it’s crucial to understand how our data is being treated.

Build it the right home

As with any asset, data needs the right environment in which to thrive. Your organization no doubt offers decent working conditions for your employees, has a parking lot, provides regular maintenance for your car fleets and so on, doesn’t it? The same should be true for your data.

Consider your data strategy. Is it focused on the storage, media type, or a particular vendor? Or are you building a modern, forward-thinking strategy focused on the data itself, and not the technology? This includes looking at how to ensure data is never siloed, can be placed in the right repository as needed, and can move seamlessly between repositories—be they on-prem, in the cloud, or elsewhere. Is your data always available? Can it be recovered quickly?

Build a strategy with a focus on the asset itself: the data.

Be ready to put it to work

To truly treat data as an asset, be prepared to sweat it like you would any other. If you can apply the things I’ve mentioned—cleanse it, gain control of it, have a data-focused strategy and have the right data in the right place—you can start to take advantage of tools that will allow you to gain value from it.

The ability to apply data analytics, machine learning, artificial intelligence and big data techniques to your assets allows you to not only understand your data better, but to begin to learn things from your data that you’d never previously been aware of…which is the most exciting opportunity data presents you.

Culture

All the above said, perhaps the best thing you can do for your data is to encourage a culture that is data-focused, one that realizes the importance of security and privacy, as well as understanding that data is crucial to your organization’s success.

If you can encourage and drive that cultural shift, there is every chance that your data will be treated as the asset it truly is—and you and your organization will be well-placed to reap the rewards that taking care of your data can bring.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

Azure Databricks: Fast analytics in the cloud with Apache Spark

We’re living in a world of big data. The current generation of line-of-business computer systems generates terabytes of data every year, tracking sales and production through CRM and ERP. It’s a flood of data that’s only going to get bigger as we add the sensors of the industrial internet of things, and the data that’s needed to deliver even the simplest predictive-maintenance systems.

Having that data is one thing; using it is another. Big data is often unstructured, spread across many servers and databases. You need something to bring it together. That’s where big data analysis tools like Apache Spark come into play; these distributed analytical tools work across clusters of computers. Building on techniques developed for the MapReduce algorithms used by tools like Hadoop, today’s big data analysis tools go further to support more database-like behavior, working with in-memory data at scale, using loops to speed up queries, and providing a foundation for machine learning systems.

Apache Spark is fast, but Databricks is faster. Founded by the Spark team, Databricks is a cloud-optimized version of Spark that takes advantage of public cloud services to scale rapidly and uses cloud storage to host its data. It also offers tools to make it easier to explore your data, using the notebook model popularized by tools like Jupyter Notebooks.

Microsoft’s new support for Databricks on Azure—called Azure Databricks—signals a new direction of its cloud services, bringing Databricks in as a partner rather than through an acquisition.

Although you’ve always been able to install Spark or Databricks on Azure, Azure Databricks makes it a one-click experience, driving the setup process from the Azure Portal. You can host multiple analytical clusters, using autoscaling to minimize the resources in use. You can clone and edit clusters, tuning them for specific jobs or running different analyses on the same underlying data.

Configuring the Azure Databricks virtual appliance

The heart of Microsoft’s new service is a managed Databricks virtual appliance built using containers running on Azure Container Services. You choose the number of VMs in each cluster that it controls, and once it’s configured and running, the service handles load automatically, launching new VMs to handle scaling.

Databricks’ tools interact directly with the Azure Resource Manager, which adds a security group and a dedicated storage account and virtual network to your Azure subscription. It lets you use any class of Azure VM for your Databricks cluster – so if you’re planning on using it to train machine learning systems, you’ll want to choose one of the latest GPU-based VMs. And of course, if one VM model isn’t right for your problem, you can switch it out for another. All you need to do is clone a cluster and change the VM definitions.

Querying in Spark brings engineering to data science

Spark has its own query language based on SQL, which works with Spark DataFrames to handle both structured and unstructured data. DataFrames are the equivalent of a relational table, constructed on top of collections of distributed data in different stores. Using named columns, you can construct and manipulate DataFrames with languages like R and Python; thus, both developers and data scientists can take advantage of them.

DataFrames essentially give you a domain-specific language for your data, one that extends the data analysis features of your chosen platform. By using familiar libraries with DataFrames, you can construct complex queries that take data from multiple sources, working across columns.

Because Azure Databricks is inherently data-parallel, and its queries are evaluated only when called to deliver actions, results can be delivered very quickly. Because Spark supports most common data sources, either natively or through extensions, you can add Azure Databricks DataFrames and queries to existing data relatively easily, reducing the need to migrate data to take advantage of its capabilities.
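
Here is a short PySpark sketch of that model, with a placeholder path and made-up column names: DataFrames with named columns, a SQL view over the same data, and no work performed until an action such as show() is called:

```python
# Short PySpark sketch of the DataFrame model described above.
# The file path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("databricks-style-demo").getOrCreate()

# DataFrames look like relational tables with named columns.
sales = spark.read.json("/data/sales/*.json")   # placeholder path
by_region = (sales
             .groupBy("region")
             .agg(F.sum("amount").alias("total_amount")))

# The same data can be queried with Spark SQL.
sales.createOrReplaceTempView("sales")
top = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

# Nothing has been computed yet: transformations are lazy, and work is only
# triggered by an action such as show(), count(), or a write.
by_region.show()
top.show()
```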

Although Azure Databricks provides a high-speed analytics layer across multiple sources, it’s also a useful tool for data scientists and developers trying to build and explore new models, turning data science into data engineering. Using Databricks Notebooks, you can develop scratchpad views of your data, with code and results in a single view.

The resulting notebooks are shared resources, so anyone can use them to explore their data and try out new queries. Once a query is tested and turned into a regular job, its output can be exposed as an element of a Power BI dashboard, making Azure Databricks part of an end-to-end data architecture that allows more complex reporting than a simple SQL or NoSQL service—or even Hadoop.

Microsoft plus Databricks: a new model for Azure Services

Microsoft hasn’t yet detailed its pricing for Azure Databricks, but it does claim that it can improve performance and reduce cost by as much as 99 percent compared to running your own unmanaged Spark installation on Azure’s infrastructure services. If Microsoft’s claim bears out, that promises to be a significant saving, especially when you factor in no longer having to run your own Spark infrastructure.

Azure’s Databricks service will connect directly to Azure storage services, including Azure Data Lake, with optimizations for queries and caching. There’s also the option of using it with Cosmos DB, so you can take advantage of global data sources and a range of NoSQL data models, including MongoDB and Cassandra compatibility—as well as Cosmos DB’s graph APIs. It should also work well with Azure’s data-streaming tools, giving you a new option for near real-time IoT analytics.

If you’re already using Databricks’ Spark tools, this new service won’t affect you or your relationship with Databricks. It’s only if you take the models and analytics you’ve developed on-premises to Azure’s cloud that you’ll get a billing relationship with Microsoft. You’ll also have fewer management tasks, leaving you more time to work with your data.

Microsoft’s decision to work with an expert partner on a new service makes a lot of sense. Databricks has the expertise, and Microsoft has the platform. If the resulting service is successful, it could set a new pattern for how Azure evolves in the future, building on what businesses are already using and making them part of the Azure hybrid cloud without absorbing those services into Microsoft.

Source: InfoWorld Big Data