Hadoop on Scality Ring
Product brief from Evaluator Group
This is a press release edited by StorageNewsletter.com on March 18, 2014. Evaluator Group, Inc. has published a report, Hadoop on Scality RING – Product Brief, dated February 28, 2014.
Highlights
- Distributed node-based software to build large scale object and file storage that replaces HDFS
- Forward Error Correction supports geographic dispersal
- Running Hadoop on Scality RING eliminates Hadoop NameNode vulnerability
Introduction
In previous reports we have outlined the issues we believe enterprise IT administrators face when implementing Hadoop within production application environments, issues that are slowing the pace of Hadoop adoption within enterprise IT. Overcoming these issues is a critical qualifier for Hadoop adoption in the enterprise, where business users may come to depend on Hadoop-based application availability. In addition, administrators know that, at some point, they will have to fit Hadoop into their data governance policies and practices. This will happen if and when IT administrators feel comfortable standing behind Hadoop application availability and integrating it into existing management processes and IT infrastructure.
There is little doubt that those behind the major enterprise Hadoop distributions (Cloudera, Hortonworks, and MapR) are aware of these issues to one degree or another and are working to address them. However, they are predisposed not to use enterprise storage arrays with Hadoop clusters for performance and cost reasons. Consequently, they tend to ignore the mature storage and data management services already available in these enterprise-grade storage platforms (i.e. snapshot copy, automated tiering, and remote replication as examples) that could be used to address the issues. Rather than leverage these pre-existing services, they labor to recreate them as additional Hadoop features and make them part of present and future Hadoop releases.
An alternative approach is now emerging that takes advantage of a distributed, clustered storage system and its built-in functionality and applies this functionality directly to Hadoop. Here we outline the approach Scality has taken that integrates its RING object storage technology to Hadoop by running Hadoop within the Scality RING distributed storage cluster itself.
This approach targets major sticking points from the standpoint of enterprise IT administrators who see great potential value in Hadoop, but struggle to get it through the pilot project or proof of concept phase and into production:
- Hadoop security, data integrity, and system availability require increasing administrative attention as the cluster grows.
- Data loading and unloading for a Hadoop batch process typically takes more time than actually running the process.
Scality RING
Scality RING is implemented as software installed on commodity x86 servers to create a distributed, shared-nothing object storage system. RING is Scality's name for its distributed node architecture, which is based on the Chord protocol for peer-to-peer connections. RING uses a distributed hash table of key-value pairs spread across nodes. The key comes from a SHA-1 hash and is 160 bits long: 128 bits for the object ID, 24 bits for the dispersion location (which node in which location), and 8 bits for the class of service, a selectable protection policy of either data replication or Forward Error Correction information dispersal. The RING algorithm maps stored objects to virtual key (node) spaces and routes access requests to the node owning the data.
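The 160-bit key layout described above can be sketched in a few lines. The field order and the derivation of the object ID below are assumptions for illustration, not Scality's actual encoding:

```python
import hashlib

# Illustrative field widths: 128 + 24 + 8 = 160 bits, the width of a SHA-1 digest.
OBJECT_ID_BITS = 128
DISPERSION_BITS = 24
COS_BITS = 8

def make_key(object_name: str, dispersion: int, cos: int) -> int:
    """Compose a 160-bit RING-style key (hypothetical layout)."""
    # Derive a 128-bit object ID from the first 16 bytes of a SHA-1 hash.
    object_id = int.from_bytes(hashlib.sha1(object_name.encode()).digest()[:16], "big")
    return (object_id << (DISPERSION_BITS + COS_BITS)) | (dispersion << COS_BITS) | cos

def split_key(key: int):
    """Unpack a key into (object_id, dispersion, class_of_service)."""
    cos = key & ((1 << COS_BITS) - 1)
    dispersion = (key >> COS_BITS) & ((1 << DISPERSION_BITS) - 1)
    object_id = key >> (DISPERSION_BITS + COS_BITS)
    return object_id, dispersion, cos
```

Packing all three fields into one fixed-width key lets a single hash-table lookup resolve both the object and its protection policy.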
Architecture
RING is composed of logical storage nodes (network) and IO daemons, or 'IODs' (disk) – the basic building blocks. IODs are created on commodity x86 servers with storage (rotating disk and optional flash). A physical machine typically runs 6 nodes (participants in the Chord P2P system) and one IOD per physical drive; up to 255 IODs can be configured per server, depending on the number of disks it contains. Objects and metadata are stored in containers in a local file system. As an object storage system, RING distributes data and metadata across IODs according to the RING algorithm, and both are automatically rebalanced when an IOD is removed or added. The architecture is symmetric, with all nodes having the same role and running the same code. SSDs can be used for fast metadata lookups.
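The successor-routing idea behind a Chord-style distributed hash table can be sketched as follows; this is a toy model of the concept, not Scality's implementation:

```python
import bisect
import hashlib

KEYSPACE = 2 ** 160  # matches the 160-bit SHA-1 key space

def position(node_id: str) -> int:
    """Place a node on the ring by hashing its identifier."""
    return int.from_bytes(hashlib.sha1(node_id.encode()).digest(), "big")

class Ring:
    """Toy Chord-style ring: each key is owned by its successor node."""

    def __init__(self, node_ids):
        self.by_position = {position(n): n for n in node_ids}
        self.positions = sorted(self.by_position)

    def owner(self, key: int) -> str:
        # First node at or after the key, wrapping around the ring.
        i = bisect.bisect_left(self.positions, key % KEYSPACE)
        return self.by_position[self.positions[i % len(self.positions)]]
```

Because node positions are hashed onto the same ring as object keys, adding or removing a node changes ownership only for keys adjacent to it, which is why rebalancing touches a small fraction of the data.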
Scality’s proprietary implementation of Forward Error Correction called ARC uses erasure codes with selectable protection and geographical distribution. In addition to information dispersal, Scality offers synchronous and asynchronous replication via their Multi-Geo feature. Another add-on feature is Sync-n-Share which provides file sharing and synchronization for files on the Scality system.
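ARC itself is proprietary, but the principle of Forward Error Correction can be illustrated with the simplest possible erasure code: a single XOR parity fragment that lets any one lost fragment be rebuilt from the survivors. This is a sketch only; real k+m codes such as ARC's tolerate multiple simultaneous losses:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    """Split data into k equal fragments and append one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    frags = [data[i * frag_len:(i + 1) * frag_len].ljust(frag_len, b"\0")
             for i in range(k)]
    return frags + [reduce(xor, frags)]

def rebuild(frags, lost_index):
    """Recover one missing fragment by XOR-ing all surviving fragments."""
    survivors = [f for i, f in enumerate(frags) if i != lost_index]
    return reduce(xor, survivors)
```

Dispersing the k+m fragments across sites is what allows geographic distribution: any k surviving fragments suffice to reconstruct the object.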
Unlike many object storage systems targeted at archive applications, RING is designed for high performance, allowing Scality to target the replacement of traditional NAS and block storage for unstructured and semi-structured data. For file access, RING supports its own scale-out file system, CIFS (Samba), FTP, AppleTalk, and NFS. Object access is over HTTP/REST via the S3 and CDMI APIs.
Hadoop on Scality RING
Highlighted use cases for RING include performance-demanding content repositories, big data analytics storage, and digital media. Hadoop is one of those use cases: users can opt to run Hadoop on top of the Scality RING storage platform using processing capacity available in Scality's IOD storage nodes.
The Hadoop MapReduce processing layer is implemented on the Scality architecture, enabling the IOD nodes to operate as the Hadoop compute nodes without changing the RING configuration or restructuring data in Hadoop. Scality adds a CDMI connector between Hadoop's MapReduce functions and the RING storage platform that essentially replaces the Hadoop Distributed File System (HDFS) as Hadoop's data store. This is done by configuring Hadoop to use Scality's file system rather than HDFS in a MapReduce job description. Scality nodes run the Hadoop TaskTracker daemons while the Hadoop JobTracker remains unmodified. As a result, this implementation replaces the asymmetric NameNode/DataNode model with a symmetric model that eliminates the Hadoop NameNode as a well-known single point of failure.
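Pointing Hadoop at an alternative FileSystem implementation is conventionally done in core-site.xml. A hypothetical sketch of what such a configuration looks like follows; the scheme, endpoint, and class name are illustrative placeholders, not Scality's actual settings:

```xml
<!-- Hypothetical core-site.xml fragment; all values are placeholders. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>scality://ring-endpoint/</value>
  </property>
  <property>
    <!-- Maps the scality:// scheme to a FileSystem implementation class. -->
    <name>fs.scality.impl</name>
    <value>com.example.hadoop.ScalityFileSystem</value>
  </property>
</configuration>
```

With the default file system redirected this way, MapReduce jobs read and write RING objects through the connector without any change to job code.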
From the standpoint of enterprise IT administrators evaluating Hadoop for production application environments, additional benefits could include:
Enhanced Hadoop Data Protection and System Availability
By default, HDFS creates three complete copies (i.e. not snapshot copies) of data. RING’s ARC for data protection at large scale could eliminate the need to do this for data protection purposes. However, if users want to maintain multiple data copies within the Hadoop cluster, that option is supported as well.
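The raw-capacity argument above is simple arithmetic: three full copies cost 3.0x raw storage per byte of data, while a k+m erasure-coded layout costs (k+m)/k. The k and m values below are illustrative choices, not Scality defaults:

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data with full replication."""
    return float(copies)

def erasure_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data with k data + m parity fragments."""
    return (k + m) / k

# HDFS default (3 copies) vs. an illustrative 9+3 erasure-coded layout:
# 3.0x raw storage vs. ~1.33x, both tolerating multiple component failures.
```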
Scality's Multi-Geo feature, which supports synchronous and asynchronous data replication over metro-area and WAN distances, could be used as a far more effective alternative to Hadoop's distributed copy function (distCp) for disaster recovery. DistCp is prone to conditions in which primary and secondary copies lose synchronization.
Finally, because server nodes can be added and removed non-disruptively, the cluster can be scaled and system maintenance performed at will, without impacting Hadoop application availability. Data is automatically rebuilt onto other nodes when a node is removed and rebalanced when nodes are added.
Reduction or Elimination of Data Load and Unload Wait Time
Because Hadoop MapReduce is layered on top of the RING storage platform, users can reduce or eliminate the need for Hadoop data transfers. When RING is used as a storage platform by other applications that generate the data to be analyzed, Hadoop processes that data in place. The results of analysis can also be stored and made available to future analytics processes. The Hadoop CDMI connector eliminates the need to load files through HDFS by utilizing Scality's Open Cloud Access (OCA), which supports FUSE, NFS, CIFS, and CDMI transparently and simultaneously.
Evaluator Group Assessment
While running Hadoop essentially on top of a distributed, object-based storage architecture is an emerging model, we believe it is one worth enterprise IT's close consideration. The storage architecture for Apache Hadoop is the Hadoop Distributed File System (HDFS), and Hadoop's data protection and fault recovery capabilities are built into HDFS. In order to deliver performance at relatively low cost, the developers of Hadoop and HDFS specifically avoided the use of vendor-developed storage systems and therefore could not leverage the built-in, mature data services those systems offer (multiple data replication modes, data tiering, etc.). Instead, they recreated them as functions within HDFS. Unfortunately, these functions are not as robustly implemented as enterprise IT administrators would like. In addition, it is difficult to apply an enterprise's regulatory compliance and data governance processes to Apache Hadoop.
Using Scality, Hadoop inherits the data protection and system availability characteristics inherent in the RING architecture, including its ARC-based data protection, without having to create three full copies of data as is the default for Apache Hadoop. And, as Scality matures and adds more functionality, added features can also be applied to the Scality Hadoop cluster without modification to Hadoop. Finally, when applying Hadoop analytics to data already stored in RING, data does not have to be moved into or out of Hadoop, greatly accelerating Hadoop job turnaround.