What is Apache Spark? The big data analytics platform explained

From its humble beginnings in the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing. You’ll find it used by banks, telecommunications companies, games companies, governments, and all of the major tech giants such as Apple, Facebook, IBM, and Microsoft.

Out of the box, Spark can run in a standalone cluster mode that simply requires the Apache Spark framework and a JVM on each machine in your cluster. However, it’s more likely you’ll want to take advantage of a resource or cluster management system to take care of allocating workers on demand for you. In the enterprise, this will normally mean running on Hadoop YARN (this is how the Cloudera and Hortonworks distributions run Spark jobs), but Apache Spark can also run on Apache Mesos, and work is progressing on adding native support for Kubernetes.

If you’re after a managed solution, then Apache Spark can be found as part of Amazon EMR, Google Cloud Dataproc, and Microsoft Azure HDInsight. Databricks, the company that employs the founders of Apache Spark, also offers the Databricks Unified Analytics Platform, which is a comprehensive managed service that offers Apache Spark clusters, streaming support, integrated web-based notebook development, and optimized cloud I/O performance over a standard Apache Spark distribution.

Spark vs. Hadoop

It’s worth pointing out that Apache Spark vs. Apache Hadoop is a bit of a misnomer. You’ll find Spark included in most Hadoop distributions these days. But due to two big advantages, Spark has become the framework of choice when processing big data, overtaking the old MapReduce paradigm that brought Hadoop to prominence.

The first advantage is speed. Spark’s in-memory data engine means that it can perform tasks up to one hundred times faster than MapReduce in certain situations, particularly when compared with multi-stage MapReduce jobs that require writing state back out to disk between stages. Even Apache Spark jobs where the data cannot be completely contained within memory tend to be around 10 times faster than their MapReduce counterparts.

The second advantage is the developer-friendly Spark API. As important as Spark’s speed-up is, one could argue that the friendliness of the Spark API is even more important.

Spark Core

In comparison to MapReduce and other Apache Hadoop components, the Apache Spark API is very friendly to developers, hiding much of the complexity of a distributed processing engine behind simple method calls. The canonical example of this is how almost 50 lines of MapReduce code to count words in a document can be reduced to just a few lines of Apache Spark (here shown in Scala):

val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///tmp/words_agg")

By providing bindings to popular languages for data analysis like Python and R, as well as the more enterprise-friendly Java and Scala, Apache Spark allows everybody from application developers to data scientists to harness its scalability and speed in an accessible manner.

Spark RDD

At the heart of Apache Spark is the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. Operations on the RDDs can also be split across the cluster and executed in a parallel batch process, leading to fast and scalable parallel processing.

RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more besides. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.
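As a hedged sketch of those operations (reusing the sparkSession from the word-count example, with a hypothetical log file whose lines start with a date field), an RDD built from a text file can be filtered, sampled, and aggregated in the same concise style:

// Build an RDD from a text file; the path is hypothetical
val logLines = sparkSession.sparkContext.textFile("hdfs:///tmp/server.log")

// Keep only error lines, take a 10 percent sample, and count errors per day
val errors = logLines.filter(line => line.contains("ERROR"))
val sampled = errors.sample(withReplacement = false, fraction = 0.1)
val errorsPerDay = errors
  .map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _)

errorsPerDay.take(10).foreach(println)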

Spark runs in a distributed fashion by pairing a driver core process, which splits a Spark application into tasks and distributes them, with many executor processes that do the work. These executors can be scaled up and down as the application requires.
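As a sketch only (the exact settings depend on your cluster manager, and the values here are illustrative), executor count and size can be requested through standard configuration properties when the session used in these snippets is built:

import org.apache.spark.sql.SparkSession

// Explicit executor sizing; alternatively, set
// "spark.dynamicAllocation.enabled" to "true" and let the cluster
// manager scale executors up and down for you
val sparkSession = SparkSession.builder()
  .appName("ExampleApp")
  .config("spark.executor.instances", "4")   // four executors...
  .config("spark.executor.memory", "4g")     // ...with 4 GB of memory each...
  .config("spark.executor.cores", "2")       // ...and two cores each
  .getOrCreate()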

Spark SQL

Originally known as Shark, Spark SQL has become more and more important to the Apache Spark project. It is likely the interface most commonly used by today’s developers when creating applications. Spark SQL is focused on the processing of structured data, using a dataframe approach borrowed from R and Python (Pandas). But as the name suggests, Spark SQL also provides a SQL2003-compliant interface for querying data, bringing the power of Apache Spark to analysts as well as developers.

Alongside standard SQL support, Spark SQL provides a standard interface for reading from and writing to other data sources, including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular stores—Apache Cassandra, MongoDB, Apache HBase, and many others—can be used by pulling in separate connectors from the Spark Packages ecosystem.
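For example (a minimal sketch with hypothetical file paths), reading a JSON file into the citiesDF dataframe used below and writing it back out as Parquet takes only a couple of calls:

// Read JSON into a dataframe and write it back out as Parquet;
// both formats are supported out of the box
val citiesDF = sparkSession.read.json("hdfs:///tmp/cities.json")

citiesDF.write
  .mode("overwrite")
  .parquet("hdfs:///tmp/cities_parquet")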

Selecting some columns from a dataframe is as simple as this line:

citiesDF.select("name", "pop")

Using the SQL interface, we register the dataframe as a temporary table, after which we can issue SQL queries against it:

citiesDF.createOrReplaceTempView("cities")
sparkSession.sql("SELECT name, pop FROM cities")

Behind the scenes, Apache Spark uses a query optimizer called Catalyst that examines data and queries in order to produce an efficient query plan for data locality and computation that will perform the required calculations across the cluster. In the Apache Spark 2.x era, the Spark SQL interface of dataframes and datasets (essentially a typed dataframe that can be checked at compile time for correctness and take advantage of further memory and compute optimizations at run time) is the recommended approach for development. The RDD interface is still available, but is recommended only if you have needs that cannot be encapsulated within the Spark SQL paradigm.
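A minimal sketch of the dataset side of that interface, assuming the hypothetical cities.json file above contains records with name and pop fields: a Scala case class gives the data a compile-time schema, so typos in field names or types are caught before the job ever runs:

import sparkSession.implicits._

// A case class describes the schema; .as[City] turns the dataframe
// into a typed dataset checked at compile time
case class City(name: String, pop: Long)

val citiesDS = sparkSession.read.json("hdfs:///tmp/cities.json").as[City]
val bigCities = citiesDS.filter(city => city.pop > 1000000)
bigCities.show()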

Spark MLlib

Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale. Spark MLlib includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selection, and transformation on any structured dataset. MLlib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. Models can be trained by data scientists in Apache Spark using R or Python, saved using MLlib, and then imported into a Java-based or Scala-based pipeline for production use.
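As a hedged sketch of the pipeline API, assuming an input dataframe named featuresDF with two numeric columns, "x" and "y": a feature transformer and a clustering algorithm are chained into a single pipeline, trained, and persisted for later reuse:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Combine the raw numeric columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")

// Cluster the feature vectors into three groups with k-means
val kmeans = new KMeans().setK(3).setFeaturesCol("features")

// Chain the stages into a pipeline, train it, and save the fitted model
val pipeline = new Pipeline().setStages(Array(assembler, kmeans))
val model = pipeline.fit(featuresDF)
model.write.overwrite().save("hdfs:///tmp/kmeans_pipeline")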

Note that while Spark MLlib covers basic machine learning including classification, regression, clustering, and collaborative filtering, it does not include facilities for modeling and training deep neural networks (for details see InfoWorld’s Spark MLlib review). However, Deep Learning Pipelines are in the works.

Spark GraphX

Spark GraphX comes with a selection of distributed algorithms for processing graph structures including an implementation of Google’s PageRank. These algorithms use Spark Core’s RDD approach to modeling data; the GraphFrames package allows you to do graph operations on dataframes, including taking advantage of the Catalyst optimizer for graph queries.
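A minimal sketch of GraphX’s built-in PageRank, assuming a whitespace-delimited edge list file of source and destination vertex IDs at the hypothetical path shown:

import org.apache.spark.graphx.GraphLoader

// Load a graph from a "srcId dstId" edge list file
val graph = GraphLoader.edgeListFile(sparkSession.sparkContext, "hdfs:///tmp/edges.txt")

// Run PageRank until the ranks converge, then print the ten highest-ranked vertices
val ranks = graph.pageRank(0.0001).vertices
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)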

Spark Streaming

Spark Streaming was an early addition to Apache Spark that helped it gain traction in environments that required real-time or near real-time processing. Previously, batch and stream processing in the world of Apache Hadoop were separate things. You would write MapReduce code for your batch processing needs and use something like Apache Storm for your real-time streaming requirements. This led to disparate codebases that had to be kept in sync for the same application domain despite being based on completely different frameworks, requiring different resources, and involving different operational concerns.

Spark Streaming extended the Apache Spark concept of batch processing into streaming by breaking the stream down into a continuous series of microbatches, which could then be manipulated using the Apache Spark API. In this way, batch and streaming operations share (mostly) the same code and run on the same framework, reducing both developer and operator overhead. Everybody wins.
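A minimal sketch of that model, assuming a text source on a local socket (the host and port are illustrative): the stream is sliced into one-second microbatches, and each batch is processed with the same operations used in the batch word count above:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Slice the incoming stream into one-second microbatches
val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// The familiar word count, now applied to every microbatch
val counts = lines
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()
ssc.start()
ssc.awaitTermination()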

A criticism of the Spark Streaming approach is that microbatching, in scenarios where a low-latency response to incoming data is required, may not be able to match the performance of other streaming-capable frameworks like Apache Storm, Apache Flink, and Apache Apex, all of which use a pure streaming method rather than microbatches.

Structured Streaming

Structured Streaming (added in Spark 2.x) is to Spark Streaming what Spark SQL was to the Spark Core APIs: a higher-level API and easier abstraction for writing applications. In the case of Structured Streaming, the higher-level API essentially allows developers to create infinite streaming dataframes and datasets. It also solves some very real pain points that users have struggled with in the earlier framework, especially concerning event-time aggregations and the late delivery of messages. All queries on structured streams go through the Catalyst query optimizer, and they can even be run in an interactive manner, allowing users to perform SQL queries against live streaming data.
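A minimal sketch, again assuming a text source on a local socket (host and port are illustrative): the stream is treated as an unbounded dataframe, and the running word counts are continuously written to the console:

import sparkSession.implicits._

// Treat the socket source as an unbounded dataframe of lines
val lines = sparkSession.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// A running word count expressed with ordinary dataframe operations
val wordCounts = lines.as[String]
  .flatMap(line => line.split(" "))
  .groupBy("value")
  .count()

// Continuously write the full set of counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()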

Structured Streaming is still a rather new part of Apache Spark, having been marked as production-ready in the Spark 2.2 release. However, Structured Streaming is the future of streaming applications with the platform, so if you’re building a new streaming application, you should use Structured Streaming. The legacy Spark Streaming APIs will continue to be supported, but the project recommends porting over to Structured Streaming, as the new method makes writing and maintaining streaming code a lot more bearable.

What’s next for Apache Spark?

While Structured Streaming provides high-level improvements to Spark Streaming, it currently relies on the same microbatching scheme of handling streaming data. However, the Apache Spark team is working to bring continuous streaming without microbatching to the platform, which should solve many of the problems with handling low-latency responses (they’re claiming ~1ms, which would be very impressive). Even better, because Structured Streaming is built on top of the Spark SQL engine, taking advantage of this new streaming technique will require no code changes.

In addition to improving streaming performance, Apache Spark will be adding support for deep learning via Deep Learning Pipelines. Using the existing pipeline structure of MLlib, you will be able to construct classifiers in just a few lines of code, as well as apply custom TensorFlow graphs or Keras models to incoming data. These graphs and models can even be registered as custom Spark SQL UDFs (user-defined functions) so that the deep learning models can be applied to data as part of SQL statements.

Neither of these features is anywhere near production-ready at the moment, but given the rapid pace of development we’ve seen in Apache Spark in the past, they should be ready for prime time in 2018.

Source: InfoWorld Big Data

DDN Partners With Hewlett Packard Enterprise

DataDirect Networks (DDN®) has entered into a partnership with global high-performance computing (HPC) leader Hewlett Packard Enterprise (HPE) to integrate DDN’s parallel file system storage and flash storage cache technology with HPE’s HPC platforms. The focus of the partnership is to accelerate and simplify customers’ workflows in technical computing, artificial intelligence and machine learning environments.

Enhanced versions of the EXAScaler® and GRIDScaler® parallel file system solutions, and the latest release of IME®, a software-defined, scale-out NVMe data caching solution, will be tightly integrated with HPE servers and the HPE Data Management Framework (DMF) software, enabling optimized workflow solutions with large-scale data management, protection, and disaster recovery capabilities. They will also provide an ideal complement to HPE’s Apollo portfolio, aimed at high-performance computing workloads in complex computing environments.

“Congratulations to DDN and HPE on this expanded collaboration, which seeks to maximize data throughput and data protection,” said Rich Loft, director of the technology development division at the National Center for Atmospheric Research (NCAR) — and a user of an integrated DDN and HPE solution as the foundation of the Cheyenne supercomputer.  “Both of these characteristics are important to the data workflows of the atmospheric and related sciences.”

“With this partnership, two trusted leaders in the high-performance computing market have come together to deliver high-value solutions as well as a wealth of technical field expertise for customers with data-intensive needs,” said Paul Bloch, president, DDN. “In this hybrid age of hard drive-based parallel file systems, web/cloud, and flash solutions, customers are demanding truly scalable storage systems that deliver deeper insight into their datasets. They want smarter productivity, better TCO, and best-in-class data management and protection. DDN’s and HPE’s integrated solutions will provide them with just that.”

DDN has been a trusted market leader for storage and parallel file system implementations at scale for nearly twenty years. The integrated offerings from DDN and HPE combine compute and storage in the fastest, most scalable and most reliable way possible.

“At HPE we’re committed to providing best practice options for our customers in the rapidly growing markets for high-performance computing, artificial intelligence and machine learning,” said Bill Mannel, vice president and general manager for HPC and AI Segment Solutions, HPE. “HPE and DDN have collaborated on many successful deployments in a variety of leading-edge HPC environments. Bringing these capabilities to the broader community of HPC users based on this partnership will accelerate the time to results and value that our customers see from their compute and storage investment.” 

Source: CloudStrategyMag

Review: H2O.ai automates machine learning


Machine learning, and especially deep learning, has turned out to be incredibly useful in the right hands, as well as incredibly demanding of computer hardware. The boom in availability of high-end GPGPUs (general purpose graphics processing units), FPGAs (field-programmable gate arrays), and custom chips such as Google’s Tensor Processing Unit (TPU) isn’t an accident, nor is their appearance on cloud services.

But finding the right hands? There’s the rub—or is it? There is certainly a perceived dearth of qualified data scientists and machine learning programmers. Whether there’s a real lack or not depends on whether the typical corporate hiring process for data scientists and developers makes sense. I would argue that the hiring process is deeply flawed in most organizations.

If companies teamed up domain experts, statistics-literate analysts, SQL programmers, and machine learning programmers, rather than trying to find data scientists with Ph.D.s plus 20 years of experience who were under 39, they would be able to staff up. Further, if they made use of a tool such as H2O.ai’s Driverless AI, which automates a significant portion of the machine learning process, they could make these teams dramatically more efficient.

As we’ll see, Driverless AI is an automatically driven machine learning system that is able to create and train surprisingly good models in a surprisingly short time, without requiring data science expertise. However, while Driverless AI reduces the level of machine learning, feature engineering, and statistical expertise required, it doesn’t eliminate the need to understand your data and the statistical and machine learning algorithms you’re applying to it.  

Source: InfoWorld Big Data