AWS Summit: Alluxio Launches Data Orchestration Platform Powering Multi-cloud Analytics and AI

At AWS Summit New York, Alluxio, Inc. announced the availability of Alluxio 2.0 with innovations for data engineers managing and deploying analytical and AI workloads in the cloud, particularly for hybrid and multi-cloud environments.

The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads – one in which compute scales independently from storage.

While this enables elastic scaling, it introduces new data engineering problems, and this is where an abstraction layer is needed. Just as compute and containers need Kubernetes, data also needs orchestration as data silos multiply: a tier that brings data locality, data accessibility and data elasticity to compute across data silos, zones, regions and even clouds.

“With a data orchestration platform in place, a data analyst or scientist can work under the assumption that the data will be readily accessible regardless of where the data resides or the characteristics of the storage. They can focus on building data driven analytical and AI applications to create values, without worrying about the environment and vendor lock-in,” said Haoyuan Li, founder and CTO, Alluxio. “These new advancements to Alluxio’s data orchestration platform further cement our commitment to a cloud-native, open source approach to enabling applications to be compute, storage and cloud agnostic.”

Alluxio 2.0 Community and Enterprise Editions include capabilities across critical areas that are gaps in the cloud data engineering market.

Breakthrough data orchestration innovation for multi-cloud:

Policy-driven data management:

- Alluxio 2.0 includes a capability that allows data engineers to automate data movement across storage systems on an ongoing basis, according to pre-defined policies. As data is created and ages from hot to warm to cold, the company’s solution can automatically tier it across any number of storage systems, on-premises and in any cloud.
- Data platform teams can reduce storage costs by automatically keeping only the most important data in expensive storage systems and moving other data to cheaper storage alternatives.
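
As a rough illustration of what a policy-driven tiering rule might look like, the following Python sketch classifies a file as hot, warm or cold by the time since its last access and maps each class to a storage tier. The thresholds, the tier names and the `choose_tier` function are hypothetical and are not part of Alluxio's API; the announcement does not describe the actual policy syntax.

```python
from datetime import datetime, timedelta

# Hypothetical policy: (maximum idle time, target tier), checked in order.
# Tier names and thresholds are illustrative assumptions, not Alluxio settings.
POLICY = [
    (timedelta(days=7), "memory"),         # hot data
    (timedelta(days=30), "on_prem_hdfs"),  # warm data
]
COLD_TIER = "cloud_object_store"           # everything older than the last threshold


def choose_tier(last_access: datetime, now: datetime) -> str:
    """Return the tier a file should live in, based on how long it has been idle."""
    idle = now - last_access
    for max_idle, tier in POLICY:
        if idle <= max_idle:
            return tier
    return COLD_TIER


if __name__ == "__main__":
    now = datetime(2019, 7, 1)
    for name, days_idle in [("clicks.parquet", 1),
                            ("q1_report.parquet", 20),
                            ("logs_2017.parquet", 400)]:
        print(name, "->", choose_tier(now - timedelta(days=days_idle), now))
```

A scheduler that re-evaluates such a rule periodically and moves files whose tier has changed would give the ongoing, automated behavior described above.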
Improved administration of data access policies
In addition to fine-grained policies at the file level, users can now configure policies at any directory level to streamline data access and workload performance. These include defining behaviors for individual datasets for core functions such as writing data or syncing data with the storage systems under the firm’s solution.

Efficient cross-cloud storage data movement via data service

Data service allows for data movement including across cloud stores like AWS S3 and Google GCS, making expensive operations on object storage seamless to the compute framework.

Compute optimized data access for cloud analytics:

Compute-focused cluster partitioning
Users can partition a single Alluxio cluster along any dimension, so that the datasets for each framework or workload aren’t contaminated by the others. The most common usage is partitioning the cluster by framework, such as Spark or Presto. In addition, partitioning can reduce data transfer costs by constraining data to stay within a specific zone or region.
Integration with external data sources over REST
Users can bring in data even from web-based data sources and aggregate it in the company’s solution to perform their analytics. Any web location with files can simply be pointed to Alluxio and pulled in as needed based on the query or model run.

Amazon AWS support:

AWS Elastic MapReduce (EMR) service integration

As users move to cloud services to deploy analytical and AI workloads, services like AWS EMR are increasingly used. The firm’s solution can be bootstrapped into an AWS EMR cluster, making it available as a data layer within EMR for the Spark, Presto and Hive frameworks. Users now have a high-performance alternative for caching data from S3 or remote sources while also reducing the number of data copies maintained in EMR.

Architectural foundations using open source:

Many core foundational elements have been re-architected using the best open source technologies with a vision of hyperscale.
RocksDB is used for tiering the metadata of files and objects that the company’s solution manages, to enable hyperscale.
gRPC, Google’s open source RPC framework, is now the core transport protocol used for communication within the cluster as well as between the Alluxio client and master, making communication more efficient.

“Data is only as useful as the insights derived from it and with organizations trying to analyze as much data as possible to gain a competitive edge, it’s challenging to find useful data that’s spread across globally-distributed silos. This data is being requested by various compute frameworks, as well as different types of users hoping to gain actionable insight,” said Mike Leone, analyst, ESG. “These multiple layers of complexity are driving the need for a solution to improve on the process of making the most valuable data accessible to compute at the speed of innovation. Alluxio has identified an important missing piece that makes data more local and easily accessible to data-powered compute frameworks regardless of where the data resides or the characteristics of the underlying storage systems and clouds.”

“Whether by design or by departmental necessity, companies are facing an explosion of data that is spread across hybrid and multi-cloud environments. To maintain a competitive advantage, speed and depth of insight have become the requirement,” said Steven Mih, CEO, Alluxio. “Data-driven analytics that were once run over many hours, now need to be done in seconds. AI/ML models need to be trained against larger-and-larger datasets. This all points to the necessity of a data tier which orchestrates the movement and policy-driven access of a company’s data, wherever it may be stored. Alluxio abstracts the storage and enables a self-service culture within today’s data-driven company.”

Other features include:

Distributed data services – 2.0 introduces the Alluxio Data Service, a distributed clustered service that performs data operations such as replication and persistence, enabling high performance and massive scale.
Adaptive replication for increased data locality – this feature lets users configure a range for the number of copies of data stored in the solution, which are then managed automatically.
HA with embedded journal – a fault-tolerant, high-availability mode for file and object metadata, called the embedded journal, that uses the Raft consensus algorithm and is independent of any external storage system. This is helpful for abstracting object storage.
POSIX API – the company’s FUSE feature provides a POSIX-compatible API so that frameworks like TensorFlow and Caffe, and other Python-based models, can directly access data from any storage system via the firm’s solution using traditional file system access.
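
To illustrate the POSIX API point, the sketch below reads a file with ordinary Python file calls, as an application would against a FUSE mount. The mount point `/mnt/alluxio-fuse`, the file name and the `read_training_file` helper are assumptions made for illustration only; so that the example runs without a cluster, a temporary directory stands in for the mount.

```python
import os
import tempfile


def read_training_file(mount_point: str, relative_path: str) -> bytes:
    """Read a file through a mounted file system using plain file I/O."""
    with open(os.path.join(mount_point, relative_path), "rb") as f:
        return f.read()


if __name__ == "__main__":
    # With a real FUSE mount this would be a path such as /mnt/alluxio-fuse
    # (hypothetical); here a temporary directory stands in for it.
    with tempfile.TemporaryDirectory() as mount_point:
        with open(os.path.join(mount_point, "train.csv"), "wb") as f:
            f.write(b"label,feature\n1,0.5\n")
        data = read_training_file(mount_point, "train.csv")
        print(len(data), "bytes read")
```

The application code contains no storage-specific client calls, which is the practical benefit of exposing data through a file system interface.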

Alluxio 2.0 Community and Enterprise Editions are available for download via tarball, Docker, brew, etc.

Additional resources:
Alluxio 2.0 release page
Download Alluxio 2.0
Founder blog
Product blog
