MemVerge Splash: Open Source Solution to Improve Spark Shuffle Processes

MemVerge, Inc. announced Splash, an open source solution that allows shuffle data to be stored in an external storage system.

It is designed for Apache Spark users looking to improve the performance, flexibility and resiliency of the shuffle manager.

Traditionally, when shuffle data is stored remotely, network and storage bottlenecks can degrade system performance and stability. Splash, working together with the company’s distributed system software, Distributed Memory Objects (DMO), addresses these issues with a high-performance in-memory storage and networking stack.

Splash allows shuffle data to be stored reliably in a dedicated storage cluster through pluggable storage and network backends. It also improves elasticity by letting users resize the computing cluster without interrupting their shuffle computation. This is particularly important when Kubernetes is used for scheduling Spark tasks.

“We engaged with the Spark community to identify their pain points and built MemVerge Splash with these in mind,” said Charles Fan, founder and CEO, MemVerge. “There is no other solution currently on the market that can provide a complete solution to tackle the shuffle elasticity and performance problems like Splash. We welcome all users and developers to try and contribute to this new open source solution.”

With Splash, users can:

Use any external storage system as a remote shuffle service
Decouple the storage and network implementations from the shuffle procedure, so that different plug-ins can be applied for different storage systems and networks
Separate storage and compute
Tolerate node failure
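
As a rough illustration of the plug-in idea in the list above, the following Python sketch separates a shuffle procedure from its storage backend. It is not Splash's actual API: the names ShuffleStorage, InMemoryStorage, SharedFolderStorage, shuffle_write and shuffle_read are hypothetical, and a local directory stands in for an external storage system.

```python
import os
import tempfile
import zlib
from abc import ABC, abstractmethod


class ShuffleStorage(ABC):
    """Hypothetical plug-in interface; NOT Splash's real API."""

    @abstractmethod
    def write(self, block_id: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, block_id: str) -> bytes: ...


class InMemoryStorage(ShuffleStorage):
    """Keeps shuffle blocks in a dict (stand-in for an in-memory store)."""

    def __init__(self):
        self._blocks = {}

    def write(self, block_id, data):
        self._blocks[block_id] = data

    def read(self, block_id):
        return self._blocks[block_id]


class SharedFolderStorage(ShuffleStorage):
    """Keeps each block as a file under one directory (stand-in for a shared mount)."""

    def __init__(self, root):
        self._root = root
        os.makedirs(root, exist_ok=True)

    def write(self, block_id, data):
        with open(os.path.join(self._root, block_id), "wb") as f:
            f.write(data)

    def read(self, block_id):
        with open(os.path.join(self._root, block_id), "rb") as f:
            return f.read()


def partition_of(key, num_partitions):
    # Deterministic hash so every map task routes a key to the same partition.
    return zlib.crc32(str(key).encode()) % num_partitions


def shuffle_write(storage, map_id, records, num_partitions):
    """Map side: bucket (key, value) records and write one block per partition."""
    buckets = {p: [] for p in range(num_partitions)}
    for key, value in records:
        buckets[partition_of(key, num_partitions)].append(f"{key}={value}")
    for p, lines in buckets.items():
        storage.write(f"shuffle_{map_id}_{p}", "\n".join(lines).encode())


def shuffle_read(storage, map_ids, partition):
    """Reduce side: collect one partition's records from every map task's block."""
    out = []
    for map_id in map_ids:
        text = storage.read(f"shuffle_{map_id}_{partition}").decode()
        out.extend(line for line in text.split("\n") if line)
    return out


if __name__ == "__main__":
    # The shuffle code is identical for both backends; only the plug-in changes.
    for storage in (InMemoryStorage(), SharedFolderStorage(tempfile.mkdtemp())):
        shuffle_write(storage, 0, [("a", 1), ("b", 2)], 2)
        shuffle_write(storage, 1, [("c", 3), ("a", 4)], 2)
        print(type(storage).__name__, [shuffle_read(storage, [0, 1], p) for p in range(2)])
```

Because the blocks live in the storage object rather than in the process that produced them, a reader can still fetch them after the writer is gone, which is the intuition behind separating storage from compute and tolerating node failure.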

“We chose to work with MemVerge because of the company’s deep understanding of big data applications and their ability to extract the most performance from the data,” said Zhen Fan, senior technologist, JD.com. “Splash is an optimized shuffle manager for a large scale Spark cluster. This solution improves shuffle performance and enables better tolerance of Spark node failures. With Splash, users can direct shuffle data to higher performance external storage to avoid data loss when Spark nodes fail. This is especially useful for users who manage Spark clusters of thousands of nodes, such as JD.com.”

Splash works with the company’s DMO and is also compatible with any third-party distributed storage system (e.g., HDFS, CephFS) and network stack. Additionally, it works with both on-premises and cloud deployments.

The firm’s proprietary DMO technology provides a logical memory-storage convergence layer that leverages Intel Corp.’s latest persistent memory technology, allowing data-intensive workloads to run at memory speed and to analyze and process large volumes of data in real time.

Splash is available now on GitHub.

Additionally, the company’s DMO software is available through its beta program.