IDG Contributor Network: ETL is dead

Extract, transform, and load. It doesn’t sound too complicated. But, as anyone who’s managed a data pipeline will tell you, the simple name hides a ton of complexity.

And while none of the steps are easy, the part that gives data engineers nightmares is the transform. Taking raw data, cleaning it, filtering it, reshaping it, summarizing it, and rolling it up so that it’s ready for analysis. That’s where most of your time and energy goes, and it’s where there’s the most room for mistakes.

If ETL is so hard, why do we do it this way?

The answer, in short, is because there was no other option. Data warehouses couldn’t handle the raw data as it was extracted from source systems, in all its complexity and size. So the transform step was necessary before you could load and eventually query data. The cost, however, was steep.

Rather than maintaining raw data that could be transformed into any possible end product, the transform shaped your data into an intermediate form that was less flexible. You lost some of the data’s resolution, imposed the current version of your business’ metrics on the data, and threw out useless data.

And if any of that changed—if you needed hourly data when previously you’d only processed daily data, if your metric definitions changed, or some of that “useless” data turned out to not be so useless after all—then you’d have to fix your transformation logic, reprocess your data, and reload it.

The fix might take days or weeks.

It wasn’t a great system, but it’s what we had.

So as technologies change and prior constraints fall away, it’s worth asking what we would do in an ideal world—one where data warehouses were infinitely fast and could handle data of any shape or size. In that world, there’d be no reason to transform data before loading it. You’d extract it and load it in its rawest form.

You’d still want to transform the data, because querying low-quality, dirty data isn’t likely to yield much business value. But your infinitely fast data warehouse could handle that transformation right at query time. The transformation and query would all be a single step. Think of it as just-in-time transformation. Or ELT.
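As a hedged sketch of the ELT pattern (the table and column names here are invented for illustration, and SQLite stands in for the warehouse), the raw rows are loaded untouched, and all of the cleaning and metric logic lives in the query itself:

```python
import sqlite3

# Stand-in "warehouse": load raw events as-is (the E and L steps),
# then apply business logic at query time (the just-in-time T).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("a", 500, "ok"), ("a", 250, "ok"), ("b", 900, "error"), ("b", 100, "ok")],
)

# The transformation (filtering, reshaping, metric definition) lives in the
# query, so changing a metric means editing SQL, not reprocessing and reloading.
revenue = conn.execute(
    """
    SELECT user_id, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_events
    WHERE status = 'ok'          -- cleaning happens here, not before load
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(revenue)  # [('a', 7.5), ('b', 1.0)]
```

Because the raw rows are still in the warehouse, redefining "revenue" later is just a new query, not a days-long reprocessing job.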

The advantage of this imaginary system is clear: You wouldn’t have to decide ahead of time which data to discard or which version of your metric definitions to use. You’d always use the freshest version of your transformation logic, giving you total flexibility and agility.

So, is that the world we live in? And if so, should we switch to ELT?

Not quite. Data warehouses have indeed gotten several orders of magnitude faster and cheaper. Transformations that used to take hours and cost thousands of dollars now take seconds and cost pennies. But they can still get bogged down with misshapen data or huge processes.

So there’s still some transformation that’s best accomplished outside the warehouse. Removing irrelevant or dirty data, and doing heavyweight reshaping, is still often a preloading process. But this initial transform is a much smaller step and thus much less likely to need updating down the road.

Basically, it’s gone from a big, all-encompassing ‘T’ to a much smaller ‘t’.

Once the initial transform is done, it’d be nice to move the rest of the transform to query time. But especially with larger data volumes, the data warehouses still aren’t quite fast enough to make that workable. (Plus, you still need a good way to manage the business logic and impose it as people query.)

So instead of moving all of that transformation to query time, more and more companies are doing most of it in the data warehouse—but they’re doing it immediately after loading. This gives them lots more agility than in the old system, but maintains tolerable performance. For now, at least, this is where the biggest “T” is happening.
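The transform-after-load step can be sketched the same way (again with SQLite as a stand-in warehouse and made-up table names): the big “T” runs inside the warehouse immediately after the raw load, materializing a cleaned table:

```python
import sqlite3

# Sketch of transform-after-load: raw data lands as-is, then a cleaned,
# materialized table is built inside the warehouse itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, qty INTEGER, price REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, 2, 9.99), (2, -1, 5.00), (3, 1, 19.99)])

# The big "T" runs in-warehouse, right after load: invalid rows are dropped
# and a derived column is computed once, not on every query.
conn.execute("""
    CREATE TABLE orders AS
    SELECT id, qty, price, qty * price AS total
    FROM raw_orders
    WHERE qty > 0
""")
rows = conn.execute("SELECT id, total FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, 19.98), (3, 19.99)]
```

Queries then hit the pre-built `orders` table for speed, while `raw_orders` stays available if the logic ever has to change.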

The lightest-weight transformations—the ones the warehouses can do very quickly—are happening right at query time. This represents another small “t,” but it has a very different focus than the preloading “t.” That’s because these lightweight transformations often involve prototypes of new metrics and more ad hoc exploration, so the total flexibility that query-time transformation provides is ideal.

In short, we’re seeing a huge shift that takes advantage of new technologies to make analytics more flexible, more responsive, and more performant. As a result, employees are making better decisions using data that was previously slow, inaccessible, or worst of all, wrong. And the companies that embrace this shift are outpacing rivals stuck in the old way of doing things.

ETL? ETL is dead. But long live … um … EtLTt?

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

Cisco Updates Its SDN Solution

Cisco has announced updates to its Application Centric Infrastructure (Cisco ACI™), a software-defined networking (SDN) solution designed to make it easier for customers to adopt and advance intent-based networking for their data centers. With the latest software release (ACI 3.0), more than 4,000 ACI customers can increase business agility with network automation, simplified management, and improved security for any combination of workloads in containers, virtual machines, and bare metal for private clouds and on-premises data centers.

The transitions occurring in the data center are substantial. Enterprises experience an unrelenting need to accelerate speed, flexibility, security and scale across increasingly complex data centers and multi-cloud environments.

“As our customers shift to multi-cloud strategies, they are seeking ways to simplify the management and scalability of their environments,” said Ish Limkakeng, senior vice president for data center networking at Cisco. “By automating basic IT operations with a central policy across multiple data centers and geographies, ACI’s new multi-site management capability helps network operators more easily move and manage workloads with a single pane of glass — a significant step in delivering on Cisco’s vision for enabling ACI Anywhere.”

The new ACI 3.0 software release is now available. New features include:

Multi-site Management: Customers can seamlessly connect and manage multiple ACI fabrics that are geographically distributed to improve availability by isolating fault domains, and provide a global view of network policy through a single management portal. This greatly simplifies disaster recovery and the ability to scale out applications.

Kubernetes Integration: Customers can deploy their workloads as micro-services in containers, define ACI network policy for these through Kubernetes, and get unified networking constructs for containers, virtual machines, and bare metal. This brings to containers the same level of deep integration that ACI has had with numerous hypervisors.

Improved Operational Flexibility and Visibility: The new Next Gen ACI User Interface improves usability with consistent layouts, simplified topology views, and troubleshooting wizards. In addition, ACI now includes graceful insertion and removal, support for mixed operating systems and quota management, and latency measurements across fabric endpoints for troubleshooting.

Security: ACI 3.0 delivers new capabilities to protect networks by mitigating attacks such as IP/MAC spoofing with First Hop Security integration, automatically authenticating workloads in-band and placing them in trusted security groups, and support for granular policy enforcement for end points within the same security group.

“With ‘ACI Anywhere,’ Cisco is delivering a scalable solution that will help position customers for success in multi-cloud and multi-site environments,” said Dan Conde, an analyst with Enterprise Strategy Group. “ACI’s new integration with container cluster managers and its enhancements to zero trust security make this a modern offering for the market, whether you are a large Service Provider, Enterprise, or a commercial customer.”

Source: CloudStrategyMag

UKCloud Launches Cloud GPU Services

UKCloud has announced the launch of its Cloud GPU computing service based on NVIDIA virtual GPU solutions with NVIDIA Tesla P100 and M60 GPUs (graphics processing units). The service will support computational and visualisation intensive workloads for UKCloud’s UK public sector and health care customers. UKCloud is not only the first Cloud Service Provider based in the UK or Europe to offer Cloud GPU computing services with NVIDIA GPUs, but is also the only provider specialising in public sector and health care and the specific needs of these customers.

“Building on the foundation of UKCloud’s secure, assured, UK-Sovereign platform, we are now able to offer a range of cloud-based compute, storage and GPU services to meet our customers’ complex workload requirements,” said Simon Hansford, CEO, UKCloud. “The public sector is driving more complex computational and visualisation intensive workloads than ever before, not only for CAD development packages, but also for tasks like the simulation of infrastructure changes in transport, for genetic sequencing in health or for battlefield simulation in defence. In response to this demand, we have a greater focus on emerging technologies such as deep learning, machine learning and artificial intelligence.”

Many of today’s modern applications, especially in fields such as medical imaging or graphical analytics, need an NVIDIA GPU to power them, whether they are running on a laptop or desktop, on a departmental server or on the cloud. Just as organisations are finding that their critical business applications can be run more securely and efficiently in the cloud, so too they are realising that it makes sense to host graphical and visualisation intensive workloads there as well.

Adding cloud GPU computing services utilising NVIDIA technology to support more complex computational and visualisation intensive workloads was a customer requirement captured via UKCloud Ideas, a service that was introduced as part of UKCloud’s maniacal focus on customer service excellence. UKCloud Ideas proactively polls its clients for ideas and wishes for service improvements, enabling customers to vote on ideas and drive product improvements across the service. This has facilitated more than 40 feature improvements in the last year across UKCloud’s service catalogue from changes to the customer portal to product specific improvements.

One comment came from a UKCloud partner with many clients needing GPU capability: “One of our applications includes 3D functionality which requires a graphics card. We have several customers who might be interested in a hosted solution but would require access to this functionality. To this end it would be helpful if UKCloud were able to offer us a solution which included a GPU.”

Listening to its clients in this way and acting on their suggestions to improve its service by implementing NVIDIA GPU technology was one of a number of initiatives that enabled UKCloud to win a 2017 UK Customer Experience Award for putting customers at the heart of everything, through the use of technology.

“The availability of NVIDIA GPUs in the cloud means businesses can capitalise on virtualisation without compromising the functionality and responsiveness of their critical applications,” added Bart Schneider, Senior Director of CSP EMEA at NVIDIA. “Even customers running graphically complex or compute-intensive applications can benefit from rapid turn-up, service elasticity and cloud-economics.”

UKCloud’s GPU-accelerated cloud service, branded as Cloud GPU, is available in two versions: Compute and Visualisation. Both are based on NVIDIA GPUs and initially available only on UKCloud’s Enterprise Compute Cloud platform. They will be made available on UKCloud’s other platforms at a later date. The two versions are as follows:

  • UKCloud’s Cloud GPU Compute: This is a GPU-accelerated computing service, based on the NVIDIA Tesla P100 GPU, that supports applications developed using NVIDIA CUDA and enables parallel co-processing on both the CPU and GPU. Typical use cases include looking for cures, trends, and research findings in medicine along with genomic sequencing; data mining and analytics in social engineering; and trend identification and predictive analytics in business or financial modelling and other applications of AI and deep learning. Available from today with all VM sizes, Cloud GPU Compute will represent an additional cost of £1.90 per GPU per hour on top of the cost of the VM.
  • UKCloud’s Cloud GPU Visualisation: This is a virtual GPU (vGPU) service, utilising the NVIDIA Tesla M60, that extends the power of NVIDIA GPU technology to virtual desktops and apps. In addition to powering remote workspaces, typical use cases include military training simulations and satellite image analysis in defence, medical imaging and complex image rendering. Available from the end of October with all VM sizes, Cloud GPU Visualisation will represent an additional cost of £0.38 per vGPU per hour on top of the cost of the VM.

UKCloud has also received a top accolade from NVIDIA, that of ‘2017 Best Newcomer’ in the EMEA partner awards that were announced at NVIDIA’s October GPU Technology Conference 2017 in Munich. UKCloud was commended for making GPU technology more accessible for the UK public sector. As the first European Cloud Service Provider with NVIDIA GPU Accelerated Computing, UKCloud is helping to accelerate the adoption of Artificial Intelligence across all areas of the public sector, from central and local government to defence and healthcare, by allowing its customers and partners to harness the awesome power of GPU compute, without having to build specific rigs.

Source: CloudStrategyMag

Alibaba Cloud Joins Red Hat Certified Cloud And Service Provider Program

Red Hat, Inc. and Alibaba Cloud have announced that they will join forces to bring the power and flexibility of Red Hat’s open source solutions to Alibaba Cloud’s customers around the globe.

Alibaba Cloud is now part of the Red Hat Certified Cloud and Service Provider program, joining a group of technology industry leaders who offer Red Hat-tested and validated solutions that extend the functionality of Red Hat’s broad portfolio of open source cloud solutions. The partnership extends the reach of Red Hat’s offerings across the top public clouds globally, providing a scalable destination for cloud computing and reiterating Red Hat’s commitment to providing greater choice in the cloud.

“Our customers not only want greater performance, flexibility, security and portability for their cloud initiatives; they also want the freedom of choice for their heterogeneous infrastructures. They want to be able to deploy their technologies of choice on their scalable infrastructure of choice. That is Red Hat’s vision and the focus of the Red Hat Certified Cloud and Service Provider Program. By working with Alibaba Cloud, we’re helping to bring more choice and flexibility to customers as they deploy Red Hat’s open source solutions across their cloud environments,” said Mike Ferris, vice president, technical business development and business architecture, Red Hat.

In the coming months, Red Hat solutions will be available directly to Alibaba Cloud customers, enabling them to take advantage of the full value of Red Hat’s broad portfolio of open source cloud solutions. Alibaba Cloud intends to offer Red Hat Enterprise Linux in a pay-as-you-go model in the Alibaba Cloud Marketplace.

By joining the Red Hat Certified Cloud and Service Provider program, Alibaba Cloud has signified that it is a destination for Red Hat customers, independent software vendors (ISVs) and partners to enable them to benefit from Red Hat offerings in public clouds. These will be provided under innovative consumption and service models with the greater confidence that Red Hat product experts have validated the solutions.

“As enterprises in China, and throughout the world, look to modernize application environments, a full-lifecycle solution by Red Hat on Alibaba Cloud can provide customers higher flexibility and agility. We look forward to working with Red Hat to help enterprise customers with their journey of scaling workloads to Alibaba Cloud,” said Yeming Wang, deputy general manager of Alibaba Cloud Global, Alibaba Cloud.

Launched in 2009, the Red Hat Certified Cloud and Service Provider Program is designed to assemble the solutions cloud providers need to plan, build, manage, and offer hosted cloud solutions and Red Hat technologies to customers. The Certified Cloud Provider designation is awarded to Red Hat partners following validation by Red Hat. Each provider meets testing and certification requirements to demonstrate that they can deliver a safe, scalable, supported, and consistent environment for enterprise cloud deployments.

In addition, in the coming months, Red Hat customers will be able to move eligible, unused Red Hat subscriptions from their datacenter to Alibaba Cloud, China’s largest public cloud service provider, using Red Hat Cloud Access. Red Hat Cloud Access is an innovative “bring-your-own-subscription” benefit available from select Red Hat Certified Cloud and Service Providers that enables customers to move eligible Red Hat subscriptions from on-premises to public clouds. Red Hat Cloud Access also enables customers to maintain a direct relationship with Red Hat – including the ability to receive full support from Red Hat’s award-winning Global Support Services organization, enabling customers to maintain a consistent level of service across certified hybrid deployment infrastructures.

Source: CloudStrategyMag

Edgeconnex® Enables Cloudflare Video Streaming Service

EdgeConneX® has announced a new partnership with Cloudflare to enable and deploy its new Cloudflare Stream service. The massive Edge deployment will roll out in 18 Edge Data Centers® (EDCs) across North America and Europe, enabling Cloudflare to bring data within a few milliseconds of local-market end users and providing fast and effective delivery of bandwidth-intensive content.

Cloudflare powers more than 10% of all Internet requests and ensures that web properties, APIs and applications run efficiently and stay online. On September 27, 2017, exactly seven years after the company’s launch, Cloudflare expanded its offerings with Cloudflare Stream, a new service that combines encoding and global delivery to form a solution for the technical and business issues associated with video streaming. By deploying Stream at all of Cloudflare’s edge nodes, Cloudflare is providing customers the ability to integrate high-quality, reliable streaming video into their applications.

In addition to the launch of Stream, Cloudflare is rolling out four additional new services: Unmetered Mitigation, which eliminates surge pricing for DDoS mitigation; Geo Key Manager, which provides customers with granular control over where they place their private keys; Cloudflare Warp, which eliminates the effort required to fully mask and protect an application; and Cloudflare Workers, which writes and deploys JavaScript code at the edge. As part of its ongoing global expansion, Cloudflare is launching with EdgeConneX to serve more customers with fast and reliable web services.

“We think video streaming will be a ubiquitous component within all websites and apps in the future, and it’s our immediate goal to expand the number of companies that are streaming video from 1,000 to 100,000,” explains Matthew Prince, co-founder and CEO, Cloudflare. “Combined with EdgeConneX’s portfolio of Edge Data Centers, our technology enables a global solution across all 118 of our points of presence, for the fastest and most secure delivery of video and Internet content.”

In order to effectively deploy its services, including the newly launched Stream solution, Cloudflare is allowing customers to run basic software at global facilities located at the Edge of the network. To achieve this, Cloudflare has selected EdgeConneX to provide fast and reliable content delivery to end users. When deploying Stream and other services in EDCs across North America and Europe, Cloudflare will utilize this massive Edge deployment to further enhance its service offerings.

Cloudflare’s performance gains from EdgeConneX EDCs have been verified by Cedexis, the leader in latency-based load balancing for content and cloud providers. Their panel of Real User Measurement data showed significant response time improvements immediately following the EdgeConneX EDC deployments — 33% in the Minneapolis metro area and 20% in the Portland metro area.

“When it comes to demonstrating the effectiveness of storing data at an EdgeConneX EDC, the numbers speak for themselves,” says Clint Heiden, chief commercial officer, EdgeConneX. “We look forward to continuing our work with Cloudflare to help them deliver a wide range of cutting-edge services to their customer base, including Cloudflare Stream.”

Source: CloudStrategyMag

IDG Contributor Network: AI and quantum computing: technology that's fueling innovation and solving future problems

Two weeks ago, I spent time in Orlando, Florida, attending Microsoft’s huge IT pro and developer conference known as Microsoft Ignite. Having the opportunity to attend events such as this to see the latest in technological advancements is one of the highlights of my job. Every year, I am amazed at what new technologies are being made available to us. The pace of innovation has increased exponentially over the last five years. I can only imagine what the youth of today will bring to this world as our next generation’s creators.

Microsoft’s CEO, Satya Nadella, kicked off the vision keynote on Day 1. As always, he gets the crowd pumped up with his inspirational speeches. If you saw Satya’s keynote last year, you could almost bet on what he was going to be talking about this year. His passion, and Microsoft’s mission, is to empower every person and every organization on the planet to achieve more. This is a bold statement, but one that I believe is possible. He also shared how Microsoft is combining cloud, artificial intelligence, and mixed reality across their product portfolio to help customers innovate and build the future of business. This was followed by a demonstration of how Ford Motor was able to use these technologies to improve product design and engineering and innovate at a much faster pace today. It’s clear to me that AI is going to be a core part of our lives as we continue to evolve with this technology.

The emergence of machine learning business models based on the cloud is in fact a big factor in why AI is taking off. Prior to the cloud, AI projects had high costs, but cloud economics have rendered certain machine learning capabilities relatively inexpensive and less complicated to operate. Thanks to the integration of cloud and AI, highly specialized artificial intelligence startups are exploding in growth. Besides the aforementioned Microsoft, AI projects and initiatives at tech giants such as Facebook, Google, and Apple are also exploding.

As we move forward, the potential for these technologies to help people in ways that we have never been able to before is going to become more of a reality than a dream. Technologies such as AI, serverless computing, containers, augmented reality, and, yes, quantum computing will fundamentally change how we do things and fuel innovation at a pace faster than ever before.

One of the most exciting moments that had everyone’s attention at Ignite was when Microsoft shared what it has been doing around quantum computing. We’ve heard about this, but is it real? The answer is yes. Other influential companies such as IBM and Google are investing resources in this technology as well. It’s quite complex but very exciting. To see a technology like this come to fruition and make an impact in my lifetime would be nothing short of spectacular.

Moore’s Law states the number of transistors on a microprocessor will double every 18 months. Today, traditional computers store data as binary digits represented by either a 1 or 0 to signify a state of on or off. With this model, we have come a long way from the early days of computing power, but there is still a need for even faster and more powerful processing. Intel is already working with 10-nanometer manufacturing process technology, code-named Cannon Lake, that will offer reduced power consumption, higher density, and increased performance. In the very near future, circuits will have to be measured on an atomic scale. This is where quantum computing comes in.

I’m not an expert in this field, but I have taken an interest in this technology as I have a background in electronics engineering. In simple terms, quantum computing harnesses the power of atoms and molecules to perform memory and processing tasks. Quantum computing combines the best of math, physics, and computer science using what is referred to as electron fractionalization.

Quantum computers aren’t limited to only two states. They encode information using quantum bits, otherwise known as qubits. A qubit can hold both 1 and 0 at the same time, a property known as superposition, which unlocks parallelism. That probably doesn’t tell you much, but think of it this way: This technology could enable us to solve complex problems in hours or days that would normally take billions of years with traditional computers. Think about that for a minute and you will realize just how significant this could be. It could enable researchers to develop and simulate new materials, improve medicines, accelerate AI, and tackle world hunger and global warming. Quantum computing will help us solve the impossible.
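The superposition idea can be sketched with a toy statevector in plain Python. This is an illustration only, not any vendor’s API: a single qubit is just a pair of complex amplitudes over the basis states |0⟩ and |1⟩.

```python
import math

# Toy single-qubit statevector (illustrative only): a pair of amplitudes
# over the basis states |0> and |1>.
def hadamard(state):
    """Apply the Hadamard gate, which puts |0> into an equal superposition."""
    a, b = state
    s = 1 / math.sqrt(2)
    return (s * (a + b), s * (a - b))

qubit = (1.0, 0.0)          # starts definitely in |0>
qubit = hadamard(qubit)

# Measurement probabilities are squared amplitude magnitudes: 50/50 here,
# i.e. the qubit is "both 1 and 0" until it is observed.
probs = [abs(amp) ** 2 for amp in qubit]
print(probs)  # [0.4999..., 0.4999...] -- effectively [0.5, 0.5]
```

Note that n qubits require 2ⁿ amplitudes, which is why local simulators top out around 30 qubits while larger simulations move to the cloud.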

There are some inherent challenges with quantum computing. If you try to look at a qubit, you risk bumping it and causing its value to change. Scientists have devised ways to observe these quantum superpositions without destroying them. This is done by using cryogenics to cool the quantum chips down to temperatures in the range of 0.01 K (–459.65°F), where there are no vibrations to interfere with measurements.

Soon, developers will be able to test algorithms by running them in a local simulator on their own computers, simulating around 30 qubits, or in Azure, simulating around 40 qubits. As companies such as Microsoft, Google, and IBM continue to develop technologies such as this, dreams of quantum computing are becoming a reality. This technological innovation is not about who is first to prove the value of quantum computing. It is about solving real-world problems for future generations in hopes of a better world.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

IDG Contributor Network: The rise and predominance of Apache Spark

Initially open-sourced in 2012, with its first stable release following two years later, Apache Spark quickly became a prominent player in the big data space. Since then, its adoption by big data companies has risen at an eye-catching rate.

In-memory processing

In-memory processing is undoubtedly a key feature of Spark, and it is what lets the technology deliver speed that dwarfs the performance of conventional big data processing. But in-memory processing isn’t a new computing concept, and there is a long list of database and data-processing products built on an underlying design of in-memory processing; Redis and VoltDB are a couple of examples. Another is Apache Ignite, which pairs in-memory processing with a WAL (write-ahead log) to address the performance of big data queries and ACID (atomicity, consistency, isolation, durability) transactions.

Evidently, the functionality of in-memory processing alone isn’t quite sufficient to differentiate a product from others. So, what makes Spark stand out from the rest in the highly competitive big data processing arena?

BI/OLAP at scale with speed

For starters, I believe Spark successfully captures a sweet spot that few other products do. The need for ever more demanding high-speed BI (business intelligence) analytics has, in a sense, started to blur the boundary between the OLAP (online analytical processing) and OLTP (online transaction processing) worlds.

On one hand, we have distributed computing platforms such as Hadoop providing a MapReduce programming model, in addition to its popular distributed file system (HDFS). While MapReduce is a great data processing methodology, it’s a batch process that doesn’t deliver results in a timely manner.

On the other hand, there are big data processing products addressing the needs of OLTP. Examples in this category include Phoenix on HBase, Apache Drill, and Ignite. Some of these products provide a query engine that emulates standard SQL’s transactional processing functionality to varying extents, applied to key-value or column-oriented databases.

What was missing, but in high demand, in the big data space was a product that does batch OLAP at scale with speed. There is indeed a handful of BI analytics/OLAP products, such as Apache Kylin and Presto, that manage to fill the gap with some success. But it’s Spark that has demonstrated success in simultaneously addressing both speed and scale.

Nevertheless, Spark isn’t the only winner in the “speed + scale” battle. Having emerged around the same time as Apache Spark, Impala (now an Apache incubator project) has also demonstrated remarkable performance in both speed and scale in its recent releases. Yet it has never achieved the same level of popularity as Spark. So something else in Spark must have made it more appealing to contemporary software engineers.

Immutable data with functional programming

Apache Spark provides APIs for three types of dataset: RDDs (resilient distributed datasets) are immutable distributed collections of data that you manipulate using functional transformations (map, reduce, filter, etc.); DataFrames are immutable distributed collections of data in a table-like form with named columns, where each row is a generic untyped JVM object called Row; and Datasets are collections of strongly typed JVM objects.

Regardless of the API you elect to use, data in Spark is immutable, and changes are applied to the data via compositional functional transformations. In a distributed computing environment, data immutability is highly desirable for concurrent access and performance at scale. In addition, this approach of formulating and solving data processing problems in a functional programming style has been favored by many software engineers and data scientists these days.

For MapReduce, Spark provides an API with implementations of map(), flatMap(), groupBy(), and reduce(), as found in classic functional programming languages such as Scala. These methods can be applied to datasets in a compositional fashion as a sequence of data transformations, bypassing the need to code separate mapper and reducer modules as in conventional MapReduce.
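The compositional style can be illustrated with a plain-Python analogy (this is not the Spark API itself, just the same flatMap/map/reduce shape applied to an in-memory word count):

```python
from functools import reduce

# Plain-Python analogy of Spark's compositional transformations:
# flatMap -> map -> reduce, chained instead of split into mapper/reducer modules.
lines = ["to be or", "not to be"]

words = [w for line in lines for w in line.split()]   # flatMap: lines -> words
pairs = map(lambda w: (w, 1), words)                  # map: word -> (word, 1)

def merge(counts, pair):
    """Reduce step: fold (word, 1) pairs into a running count."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same chain would run distributed over partitions of an RDD, but the shape of the code is the point: one composed pipeline rather than separate mapper and reducer programs.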

Spark is “lazy”

An underlying design principle that plays a pivotal role in the operational performance of Spark is “laziness.” Spark is lazy in the sense that it holds off actual execution of transformations until it receives requests for resultant data to be returned to the driver program (i.e., the submitted application that is being serviced in an active execution context).

This execution strategy can significantly reduce disk and network I/O, enabling Spark to perform well at scale. For example, in a MapReduce job, rather than shipping back the high volume of intermediate data generated by map that is to be consumed by reduce, Spark can return only the much smaller result of reduce to the driver program.
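Python generators offer a rough, single-machine analogue of this laziness. In the sketch below (not Spark code), the “transformations” merely build a pipeline; nothing is computed until a result is actually requested, and then only as much as needed:

```python
# Generator pipeline: a toy analogue of Spark's lazy transformations.
data = range(1_000_000)

mapped = (x * 2 for x in data)                 # "transformation": nothing runs yet
filtered = (x for x in mapped if x % 3 == 0)   # still nothing runs

# Only this "action" pulls data through the pipeline, and it stops
# as soon as it has the five values it asked for.
first_five = [next(filtered) for _ in range(5)]
print(first_five)  # [0, 6, 12, 18, 24]
```

Spark takes the same idea further: the deferred plan is optimized as a whole before anything executes, which is where much of the I/O saving comes from.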

Cluster and programming language support

As a distributed computing framework, robust cluster management functionality is essential for scaling out horizontally. Spark is known for its effective use of available CPU cores across thousands of server nodes. Besides the default standalone cluster mode, Spark also supports other cluster managers, including Hadoop YARN and Apache Mesos.

As for programming languages, Spark supports Scala, Java, Python, and R. Both Scala and R are functional programming languages at heart and have been increasingly adopted across the technology industry. Programming in Scala on Spark feels like home, given that Spark itself is written in Scala, whereas R is tailored primarily for data science analytics.

Python, with its popular data science libraries like NumPy, is perhaps one of the fastest-growing programming languages, partly due to the increasing demand for data science work. Evidently, Spark’s Python API (PySpark) has been quickly adopted in volume by the big data community. Interoperable with NumPy, Spark’s machine learning library MLlib, built on top of its core engine, has helped fuel enthusiasm from the data science community.

On the other hand, Java hasn’t achieved the kind of success on Spark that Python enjoys. The Java API on Spark can feel like an afterthought: on a few occasions I’ve seen something rather straightforward in Scala require a lengthy workaround in Java.

Power of SQL and user-defined functions

SQL-compliant query capability is a significant part of Spark’s strength. Recent releases of the Spark API support the SQL 2003 standard. One of the most sought-after query features is window functions, which are not even available in some major SQL-based RDBMSs such as MySQL. Window functions let you rank or aggregate rows of data over a sliding window of rows, which helps avoid expensive operations such as joining DataFrames.
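As an illustration of what a ranking window function computes, the plain-Python sketch below mimics ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) on a small hypothetical table; in Spark SQL you would express this directly in the query rather than by hand:

```python
from itertools import groupby

# Hypothetical rows: (dept, name, salary)
rows = [
    ("eng", "ann", 120), ("eng", "bob", 100),
    ("ops", "cid", 90),  ("ops", "dee", 95),
]

# Sort by partition key (dept), then by salary descending within it.
rows_sorted = sorted(rows, key=lambda r: (r[0], -r[2]))

# Number the rows within each partition: the window-function result.
ranked = []
for dept, group in groupby(rows_sorted, key=lambda r: r[0]):
    for rank, (d, name, salary) in enumerate(group, start=1):
        ranked.append((d, name, salary, rank))

print(ranked)
# [('eng', 'ann', 120, 1), ('eng', 'bob', 100, 2),
#  ('ops', 'dee', 95, 1), ('ops', 'cid', 90, 2)]
```

The point is that the rank is computed over a window of neighboring rows in one pass, with no self-join of the table against itself.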

Another important feature of the Spark API is user-defined functions (UDFs), which let you create custom functions that leverage the vast range of general-purpose functions available in the host programming language and apply them to data columns. While the DataFrame API ships with a set of built-in column functions, with UDFs you can use virtually any method available in, say, Scala to assemble custom column logic.
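The essence of a UDF is simple enough to sketch in plain Python: wrap an arbitrary host-language function so it can be applied to every value in a column. This is a conceptual analogue only; in PySpark you would register the function with the API’s udf() mechanism and apply it to a DataFrame column.

```python
# A toy "UDF" wrapper: lift any ordinary function to operate on a column.
def make_udf(fn):
    return lambda column: [fn(v) for v in column]

# Any general-purpose function becomes column logic, e.g. str.title.
title_case = make_udf(str.title)

names = ["ada lovelace", "grace hopper"]
print(title_case(names))  # ['Ada Lovelace', 'Grace Hopper']
```

The value of the real thing is the same: you are not limited to the built-in column functions, because anything expressible in the host language can be pushed down onto the data.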

Spark Streaming

When data streaming is a requirement on top of building an OLAP system, the necessary integration effort can be challenging. Such integration generally requires not only pulling in a third-party streaming library, but also ensuring that the two disparate APIs cooperatively and reliably bridge the vast difference in latency between near-real-time and batch processing.

Spark provides a streaming library that offers fault-tolerant distributed streaming functionality. It performs streaming by treating small contiguous chunks of data as a sequence of RDDs, Spark’s core data structure. This built-in streaming capability alleviates the burden of integrating high-latency batch processing tasks with low-latency streaming routines.
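The micro-batch model is easy to sketch: slice an unbounded stream into small contiguous chunks and run the same batch logic on each chunk. The toy Python below illustrates the idea (it is not Spark Streaming code, and the batch logic here is just a sum):

```python
import itertools

# Slice a stream into fixed-size micro-batches, the way Spark Streaming
# turns an incoming stream into a sequence of small RDDs.
def micro_batches(stream, batch_size):
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch

stream = iter(range(10))   # stand-in for an incoming event stream
results = [sum(b) for b in micro_batches(stream, 4)]
print(results)  # [6, 22, 17]
```

Because each micro-batch is an ordinary batch dataset, the same code path serves both streaming and batch work, which is precisely the integration burden the article says Spark removes.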

Visualization, and beyond

Last but not least, Spark’s web-based visual tools reveal detailed information about how a data processing job is performed. Not only do the tools show you the breakdown of tasks on individual worker nodes of the cluster, they also give details down to the life cycle of the individual execution processes (i.e., executors) allocated to the job. In addition, Spark’s visualization of complex job flows as DAGs (directed acyclic graphs) offers in-depth insight into how a job is executed. It’s especially useful for troubleshooting or performance-tuning an application.

So it isn’t just one or two items from the long list — in-memory processing speed, scalability, a fit for the BI/OLAP niche, functional programming style, data immutability, lazy execution, appeal to the rising data science community, robust SQL capability, and task visualization — that propel Apache Spark to the front of the big data space. It’s the collective strength of these complementary features that truly makes Spark stand out from the rest.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

General Electric Names AWS Its Preferred Cloud Provider

Amazon Web Services, Inc. has announced that General Electric has selected AWS as its preferred cloud provider. GE continues to migrate thousands of core applications to AWS. GE began an enterprise-wide migration in 2014, and today many GE businesses, including GE Power, GE Aviation, GE Healthcare, GE Transportation, and GE Digital, run many of their cloud applications on AWS. Over the past few years, GE migrated more than 2,000 applications, several of which leverage AWS’s analytics and machine learning services.

“Adopting a cloud-first strategy with AWS is helping our IT teams get out of the business of building and running data centers and refocus our resources on innovation as we undergo one of the largest and most important transformations in GE’s history,” said Chris Drumgoole, chief technology officer and corporate vice president, General Electric. “We chose AWS as the preferred cloud provider for GE because AWS’s industry leading cloud services have allowed us to push the boundaries, think big, and deliver better outcomes for GE.”

“Enterprises across industries are migrating to AWS in droves, and in the process are discovering the wealth of new opportunities that open up when they have the most comprehensive menu of cloud capabilities — which is growing daily — at their fingertips,” said Mike Clayville, vice president, worldwide commercial sales, AWS. “GE has been at the forefront of cloud adoption, and we’ve been impressed with the pace, scope, and innovative approach they’ve taken in their journey to AWS. We are honored that GE has chosen AWS as their preferred cloud provider, and we’re looking forward to helping them as they continue their digital industrial transformation.”

Source: CloudStrategyMag