Apache Spark 2.0.0 Release Doubles Down On Big Data

Friday, July 29, 2016

Apache Spark 2.0.0 is the first release on the 2.x line and the first major release of open source Spark since Spark 1.6 in January 2016. Major updates include improved API usability, SQL:2003 support, performance improvements, Structured Streaming, R UDF support, and operational improvements. The release includes over 2500 patches from over 300 contributors.

Databricks, a company founded by the team that created Apache Spark, quickly followed the announcement of the general availability of Apache Spark 2.0 with its own announcement that its platform is compatible with 2.0.

As noted by the Databricks team, some of the leading features of the Apache Spark 2.0 release include:

- Speed: Running some Spark operators 5 to 10 times faster than Spark 1.6, thanks to Tungsten Phase 2 whole-stage code generation and Catalyst's code optimization.

- Simplicity: Unifying the DataFrame and Dataset developer APIs across Spark's libraries.

- Structured Streaming: Laying the foundation for continuous applications by providing high-level declarative streaming APIs, based on DataFrames and Datasets and built atop the Spark SQL engine, that work on real-time data.

- Machine Learning Model Persistence: Saving and loading pipelines and models across all programming languages supported by Spark.

- DataFrame-based Machine Learning APIs: Emerging as the primary MLlib package with its "pipeline" APIs, with future development focused on the DataFrame-based API.

- Standard SQL Support: Expanding Spark's SQL capabilities to cover SQL:2003 features, introducing a new ANSI SQL parser, and supporting scalar and predicate subqueries.
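
To make the "Speed" item above more concrete, here is a minimal conceptual sketch of what whole-stage code generation means: a chain of separate query operators is collapsed into one generated function with a single loop. This is plain Python, not Spark code. Spark itself generates JVM code at runtime, and the sample rows, the function names (`run_volcano`, `generate_fused`), and the query are all hypothetical, chosen only for illustration.

```python
# Hypothetical sample data: (user_id, amount) pairs. Not taken from the release.
ROWS = [(1, 10.0), (2, 25.5), (3, 7.25), (4, 40.0), (5, 3.0)]


# Operator-at-a-time style: every operator is its own iterator, and each row
# is handed from one operator to the next through a generator call.
def scan(rows):
    for row in rows:
        yield row


def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row


def project_op(child, fn):
    for row in child:
        yield fn(row)


def sum_op(child):
    total = 0.0
    for value in child:
        total += value
    return total


def run_volcano(rows):
    # Roughly: SELECT SUM(amount * 2) FROM rows WHERE amount > 5
    plan = project_op(
        filter_op(scan(rows), lambda row: row[1] > 5),
        lambda row: row[1] * 2,
    )
    return sum_op(plan)


# Fused style: generate the source of ONE function that performs the scan,
# filter, projection, and aggregation in a single loop, then compile it.
def generate_fused(predicate_src, projection_src):
    source = (
        "def fused(rows):\n"
        "    total = 0.0\n"
        "    for row in rows:\n"
        f"        if {predicate_src}:\n"
        f"            total += {projection_src}\n"
        "    return total\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the generated source into a function
    return namespace["fused"]


fused = generate_fused("row[1] > 5", "row[1] * 2")

print(run_volcano(ROWS))  # 165.5
print(fused(ROWS))        # 165.5
```

Both versions return the same result. The fused version avoids per-row calls between operators, which is the kind of overhead that whole-stage code generation is designed to remove. Actual speedups depend on the operator and workload, as the "for some Spark operators" qualifier above indicates.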