Spark Summit 2017 Recap

Spark Summit 2017, promoted by Databricks, Inc. and held in San Francisco, CA on June 5-7, was an interesting conference that drew approximately 3,000 attendees and 46 sponsors and exhibitors.

The conference focused on Apache Spark, an open source, fast and general engine for large-scale data processing and interactive analytics, as described on the Apache project page. According to the project, it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk, which helps explain its rapid adoption.

The project was originally developed at UC Berkeley's AMPLab (Algorithms, Machines and People Lab) in 2009 by Matei Zaharia, also chief technologist and co-founder of Databricks. It was open sourced in 2010 under the BSD license, then donated to the Apache Software Foundation in 2013, when it switched to the Apache 2.0 license. In February 2014, Spark became a Top-Level Apache Project. Applications are written in Java, Scala, Python and R.

Spark is storage independent and doesn’t require any particular shared storage. Data must first be ‘ETL-ed’, that is, loaded from HDFS or another disk storage into the engine for processing. For storage, it’s pretty common to see Hadoop HDFS, MapR File System, Cassandra, OpenStack Swift, Amazon S3 and Apache Kudu, but also classic file storage such as a local file system or NFS-based data services.
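
To illustrate what storage independence means in practice, here is a minimal PySpark sketch. It is not taken from the conference material: the paths, bucket name and helper names (`storage_scheme`, `load_events`) are hypothetical, and it assumes a working Spark installation with the relevant connectors (for example the `hadoop-aws` module for `s3a://` paths). The point is only that the read call stays the same while the URI scheme selects the backing store.

```python
from urllib.parse import urlparse

# Hypothetical example locations; none of these paths come from the article.
EXAMPLE_PATHS = [
    "hdfs://namenode:8020/data/events.parquet",  # Hadoop HDFS
    "s3a://example-bucket/data/events.parquet",  # Amazon S3 via the s3a connector
    "file:///tmp/data/events.parquet",           # local file system
]


def storage_scheme(path: str) -> str:
    """Return the URI scheme of a path, or "default" when none is given.

    A path without a scheme is resolved against whatever default
    file system the cluster is configured with.
    """
    return urlparse(path).scheme or "default"


def load_events(spark, path: str):
    """Read a Parquet dataset into a DataFrame.

    `spark` is an existing pyspark.sql.SparkSession. The same call is
    used whichever storage system the path points to.
    """
    return spark.read.parquet(path)
```

With a session created through `SparkSession.builder.getOrCreate()`, calling `load_events(spark, path)` for each entry of `EXAMPLE_PATHS` would go through the same code path; only the connector selected by the scheme differs.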

During this conference, a few storage vendors were present, as Spark represents an obvious opportunity given the massive volume of data manipulated. Among them were NetApp, promoting its classic E-Series and FAS lines, and IBM. Iguazio, MapR and Alluxio also exhibited, to name a few. It was a bit surprising not to see other storage vendors, since big data means big storage capabilities, but we have to recognize that Apache projects, open source by nature, are more often than not associated with open source storage solutions.

Several leaders of Apache projects organize and drive dedicated conferences that contribute to educating the market. All these players understand that this effort is a key element of their success and market adoption.

For instance, Cloudera has Wrangle and supports the Strata+Hadoop conferences; Hortonworks had Hortonworks Summit, now renamed DataWorks Summit; GridGain, the leader behind Apache Ignite, drives the In-Memory Computing Summit; and Confluent, associated with Apache Kafka, promotes Kafka Summit.