Preview of Alluxio 2.0 to Simplify Data Access for Cloud Workloads

Alluxio, Inc. announced the preview release of Alluxio 2.0, the largest platform upgrade since the inception of the company’s solution.

This preview release is open source and available as a free download. It adds more features than any release since the project was created, and is designed to let the community experiment with new capabilities and explore new use cases, such as simplifying data engineering and data access for AI model training.

“Today, our users already deploy Alluxio at very large scale with many thousand node single cluster production deployments across telecommunications, retail and internet companies,” said Haoyuan Li, CEO and co-founder, Alluxio. “This release allows our users to take Alluxio deployments to the next level of scale with support for extreme data requirements. Our users as well as the data engineering community will find a much more intuitive interface with greatly expanded capabilities to help them run analytics and AI workloads on private, public or hybrid cloud infrastructures leveraging valuable data wherever it might be stored.”

“At China Unicom, we use Alluxio at scale as a core component of our modern data stack along with Apache Spark, HDFS, Hive and Apache Kafka. We are excited about Alluxio 2.0, particularly the new metadata management and scale out capabilities that will allow us to continue elastically scaling our deployment for the explosive data growth we see coming,” said Ce Zhang, senior software engineer, China Unicom Software Research Institute.

“AVA – our cloud-native deep learning platform – is built on TensorFlow, Caffe, Alluxio and KODO (a customized object store and Ceph). Alluxio orchestrates data movement from storage systems to data science environments, eliminating complex data engineering tasks and speeding up model training. Alluxio 2.0’s improved file system API to access data stored in any storage system will allow for accelerating machine learning training even further for faster innovation,” said Chaoguang Li, technical director, Atlab, Qiniu Cloud.

The Alluxio 2.0 preview release provides features across three key areas:

Support for hyperscale data workloads:
- Support for more than one billion files – An option for tiered metadata storage for files and objects lets the unified namespace scale to more than a billion files: metadata for hot data is kept in process memory, while the rest is managed by Alluxio outside process memory.
- Highly distributed data services – Version 2.0 introduces the Job Service, a distributed clustered service now used by data operations such as replication, persistence, cross-storage move and distributed load, enabling high performance at massive scale.
- Adaptive replication for increased data locality – This feature lets users configure a range for the number of copies of data stored in the solution; the copies are then managed automatically within that range.
- HA with embedded journal – The embedded journal is a fault-tolerant, high-availability (HA) mode for file and object metadata that uses the Raft consensus algorithm and does not depend on any external storage system. This is helpful for abstracting object storage.
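
As a rough illustration of the adaptive replication idea above, the Python sketch below decides how many copies to add or evict so that a copy count stays within a configured range. It is not Alluxio's implementation or API: the names `ReplicationRange` and `copies_to_adjust` are hypothetical, and the real system may weigh other factors such as locality and available capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationRange:
    """Hypothetical config: the allowed number of copies for a piece of data."""
    min_copies: int
    max_copies: int


def copies_to_adjust(current: int, allowed: ReplicationRange) -> int:
    """Return how many copies to add (positive) or evict (negative).

    Zero means the current count already lies inside the configured range.
    """
    if allowed.min_copies > allowed.max_copies:
        raise ValueError("min_copies must not exceed max_copies")
    if current < allowed.min_copies:
        return allowed.min_copies - current
    if current > allowed.max_copies:
        return allowed.max_copies - current
    return 0
```

With a range of 2 to 4 copies, a file that has a single copy would get one more, while a file with six copies would have two evicted.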

Enabling machine learning and deep learning workloads on any storage:
Machine learning and deep learning frameworks need to extract data from Hadoop and object stores, which is typically a manual, time-consuming process.
- POSIX API – The company’s FUSE feature exposes a POSIX-compatible API so that frameworks such as TensorFlow and Caffe, as well as other Python-based models, can directly access data from any storage system via the solution using traditional file system access.
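
To make the POSIX point concrete, the Python sketch below reads training data with nothing but standard file operations. It assumes the data is already visible under a local FUSE mount directory; the mount location is an assumption supplied by the caller, and the helper names `iter_training_files` and `read_bytes` are hypothetical rather than part of any Alluxio client.

```python
import os
from pathlib import Path
from typing import Iterator


def iter_training_files(mount_root: str, suffix: str = ".csv") -> Iterator[Path]:
    """Yield data files under a mounted directory using ordinary directory walking."""
    for dirpath, _dirnames, filenames in os.walk(mount_root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield Path(dirpath) / name


def read_bytes(path: Path, chunk_size: int = 1 << 20) -> bytes:
    """Read a file with plain file I/O; no storage-specific client is involved."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Because the code only uses `os.walk` and `open`, the same loop works whether the directory is a local disk or a FUSE mount backed by remote storage.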

Better storage abstraction for completely independent and elastic compute:
- Support for HDFS clusters across different versions – Data growth has left enterprises with many data silos, including multiple Hadoop clusters running many different versions, and unified access across these clusters is currently difficult. With version 2.0, users can connect multiple HDFS clusters of any version to the solution and unify data access across them.
- Active sync with Hadoop – This capability integrates with HDFS inotify to pick up data and metadata changes to files stored in Hadoop, so that applications accessing data via the firm’s solution proactively receive the latest updates.
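
One way to picture unified access across several HDFS clusters is a mount table that maps paths in a single namespace to locations in different clusters. The Python sketch below shows that idea using longest-prefix matching. It is an illustration only, not Alluxio's code: the `MOUNTS` table, the cluster host names and the `resolve` function are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical mount table: unified-namespace path -> under-storage URI.
MOUNTS: Dict[str, str] = {
    "/warehouse": "hdfs://cluster-a:8020/data",
    "/logs": "hdfs://cluster-b:9000/logs",
}


def resolve(mount_table: Dict[str, str], path: str) -> Tuple[str, str]:
    """Map a unified-namespace path to (mount point, under-storage URI).

    Picks the longest mount point that is a path prefix of `path`.
    """
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/")
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best.rstrip("/")):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount covers {path}")
    relative = path[len(best.rstrip("/")):]
    return best, mount_table[best].rstrip("/") + relative
```

Here an application asks for `/logs/2019/01.gz` and never needs to know which cluster, or which HDFS version, actually holds the file.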

Additional resources:
Blog: Preview – enabling hyper-scale data workloads in the cloud
2.0 preview release documentation
Community Slack Channel