Spark 2.0 prepares to catch fire

Apache Spark 2.0 is almost upon us. If you have an account on Databricks’ cloud offering, you can get access to a technical preview today; for the rest of us, it may be a week or two, but by Spark Summit next month, I expect Apache Spark 2.0 to be out in the wild. What should you look forward to?

During the 1.x series, Apache Spark development often moved at a breakneck pace, with all sorts of features (ML pipelines, Tungsten, the Catalyst query planner) added along the way in minor version bumps. Given that pace, and given that Apache Spark follows semantic versioning rules, you can expect 2.0 to bring breaking changes along with major new features.

Unifying DataFrames and Datasets

One of the main reasons for the new version number won’t be noticed by many users: In Spark 1.6, DataFrames and Datasets are separate classes; in Spark 2.0, DataFrame is simply a type alias for Dataset[Row].
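
To make that concrete, here’s a minimal Scala sketch (the people.json file is a hypothetical input) showing that what you read back in Spark 2.0 is literally a Dataset[Row]:

    import org.apache.spark.sql.{Dataset, Row, SparkSession}

    // In Spark 2.0, DataFrame is no longer its own class; the Scala API
    // defines it as a type alias in the org.apache.spark.sql package object:
    //   type DataFrame = Dataset[Row]

    val spark = SparkSession.builder()
      .appName("DataFrameIsDataset")
      .master("local[*]")  // local mode, just for this sketch
      .getOrCreate()

    // Reading JSON yields a DataFrame -- which is exactly a Dataset[Row]
    val people: Dataset[Row] = spark.read.json("people.json")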

This may mean little to most of us, but such a big change in the class hierarchy means we’re looking at Spark 2.0 instead of Spark 1.7. In Java and Scala applications you can now get compile-time type safety by working with typed Datasets, and you can mix the typed methods (map, filter) with the untyped methods (select, groupBy) on both DataFrames and Datasets.
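
Here’s a rough sketch of how the two styles mix in Scala; the Person case class and the people.json file are assumptions made for illustration:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder()
      .appName("TypedAndUntyped")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._  // supplies the encoders that .as[Person] and map need

    // Untyped: a DataFrame, i.e., Dataset[Row]
    val df = spark.read.json("people.json")

    // Typed: converting to Dataset[Person] brings in compile-time checking
    val ds = df.as[Person]

    // Typed methods -- the compiler knows each element is a Person
    val adultNames = ds.filter(_.age >= 18).map(_.name)

    // Untyped methods -- select and groupBy work on Datasets and DataFrames alike
    val countsByAge = ds.groupBy("age").count()

The payoff: misspell a field in the typed code and compilation fails; misspell a column name in groupBy and you won’t find out until runtime.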