Volumez Shatters AI/ML Training Performance Barriers as Leader in MLCommons MLPerf Storage Benchmark
Solution achieved 1.079TB/s throughput with 92.21% GPU utilization and 9.9 million IO/s on AWS, establishing a new industry standard for AI/ML training performance.
This is a Press Release edited by StorageNewsletter.com on October 10, 2024 at 2:03 pm.
Volumez Technologies Ltd. announced performance results from the latest MLCommons MLPerf Storage 1.0 AI/ML Training Benchmark.
These results highlight the firm’s commitment to delivering next-generation, cloud-aware data infrastructure for AI/ML workloads, pushing the boundaries of performance, scalability, and efficiency. The firm also announced its unique capabilities as the Data Infrastructure as a Service (DIaaS) company.
AI/ML training remains one of the most demanding workloads in modern data infrastructure. Maximizing throughput to drive optimal GPU utilization is critical for accelerating model training, improving accuracy, and reducing operational costs. In the MLPerf Storage 1.0 Open Division benchmark, which focused on the storage-intensive 3D-UNet model, Volumez DIaaS for AI/ML demonstrated linear scaling. The solution achieved 1.079TB/s throughput with 92.21% GPU utilization and 9.9 million IO/s (1) on AWS, establishing a new industry standard for AI/ML training performance.
Benchmark overview and industry impact
Volumez deployed 137 application nodes (c5n.18xlarge), each simulating 3 H100 GPUs, streaming data from 128 media nodes (i3en.24xlarge) equipped with 60TB of storage per node. Unlike traditional architectures, the Volumez DIaaS solution introduces no additional layers into the Linux data path and leverages cloud-aware intelligence to optimize the infrastructure for the 3D-UNet workload. This approach delivered a level of speed and efficiency previously unseen in the benchmark, transforming both the economics and scalability of AI/ML training environments.
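The deployment figures above allow a quick back-of-the-envelope check of the cluster scale. The node counts, per-node storage, and aggregate throughput are from the press release; the per-GPU average is illustrative, assuming an even load distribution across the simulated GPUs.

```python
# Sizing arithmetic from the quoted benchmark configuration.
app_nodes = 137           # c5n.18xlarge application nodes
gpus_per_node = 3         # simulated H100 GPUs per application node
media_nodes = 128         # i3en.24xlarge media nodes
storage_per_node_tb = 60  # TB of storage per media node
throughput_tb_s = 1.079   # aggregate throughput, TB/s

simulated_gpus = app_nodes * gpus_per_node                    # 411 simulated GPUs
total_storage_pb = media_nodes * storage_per_node_tb / 1000   # 7.68 PB raw media storage
gb_s_per_gpu = throughput_tb_s * 1000 / simulated_gpus        # ~2.63 GB/s per simulated GPU

print(simulated_gpus, total_storage_pb, round(gb_s_per_gpu, 2))
```

So the benchmark run sustained roughly 2.6 GB/s of streaming data per simulated GPU across 411 simulated GPUs, which is the scale at which the linear-scaling claim applies.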
“These results mark a significant achievement for Volumez,” said John Blumenthal, chief product and business officer, Volumez. “The performance and scalability achieved during testing are unprecedented and highlight the critical role Volumez plays in the AI/ML ecosystem, providing solutions that meet the growing demands of AI/ML workloads on cloud infrastructure – to maximize the yield on our industry’s scarcest resource, GPUs.”
MLPerf benchmark achievements include:
- 1.079TB/s peak throughput, setting a new benchmark for AI/ML storage performance.
- 92.21% GPU utilization, driving efficiency in AI model training.
- 9.9 million IO/s, highlighting unparalleled data handling capabilities for large-scale workloads.
- Proven scalability for massive datasets, empowering businesses to tackle increasingly complex AI/ML models with ease.
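The headline throughput and IO-rate figures above also imply an average IO size, which serves as a rough consistency check on the numbers. Both inputs are from the benchmark results; the derived value is illustrative, not a reported metric.

```python
# Average IO size implied by the headline numbers: throughput / IO rate.
throughput_bytes_per_s = 1.079e12  # 1.079 TB/s aggregate throughput
ios_per_s = 9.9e6                  # 9.9 million IO/s

avg_io_kb = throughput_bytes_per_s / ios_per_s / 1000  # ~109 KB per IO
print(round(avg_io_kb))
```

An average transfer of roughly 109 KB per IO is consistent with the large, streaming reads characteristic of the 3D-UNet training workload.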
“We are excited to have Volumez participate in their first MLPerf Storage benchmark. The importance that storage plays in the AI technology stack and the innovations happening in this space are critical to the successful deployment of ML training systems. MLPerf benchmark results are important measures for storage consumers to analyze as they seek to procure and tune ML systems to maximize their utilization – and ultimately their return on investment,” said David Kanter, head, MLPerf, MLCommons.
Industry perspective
According to Gartner “from a feature and functionality perspective, storage for GenAI is not too different from storage for any other analytics applications. The exception is that the performance capabilities required to feed the compute farm become even more relevant for GenAI and can be amplified at a larger scale. The training stage of GenAI workflow can be very demanding from a performance point of view, depending on the model size. Not only must the storage layer support high throughput to feed the CPU or GPU farm, but it also must have the right performance to support model checkpoint and recovery fast enough to keep the computer farm running.” (2)
Solutions like Volumez DIaaS are essential for enabling the next generation of AI infrastructure that balances performance, scalability, and cost.
Innovative results in real-world environments
As an active member of the MLCommons community, Volumez took an additional step by submitting a second benchmark run in the Open Division. This submission focused on addressing real-world trade-offs faced by ML engineers and MLOps teams – optimizing throughput and utilization without sacrificing model accuracy. Specifically, Volumez modified the benchmark’s weight exchange frequency, a common practice in high-scale environments. This adjustment reduces network overhead to achieve increased throughput and GPU utilization.
“We delivered 1.140TB/s throughput and 97.82% GPU utilization (1), a 5.43% improvement over our first submission,” the company said.
A white paper offers a deeper dive into the architecture that powers the company’s DIaaS for AI/ML. The document provides insights into how the firm’s cloud-aware control plane drives transformative results for AI/ML workloads at scale.
Revolutionizing AI/ML infrastructure
The company has redefined the standards for AI/ML training infrastructure. By eliminating traditional bottlenecks and delivering industry-leading performance, the firm’s DIaaS platform empowers organizations to accelerate their AI/ML initiatives and gain a competitive edge in a rapidly evolving market.
(1) Results verified by MLCommons Association.
(2) Gartner, 2024 Strategic Roadmap for Storage, by Jeff Vogel, Julia Palmer, Michael Hoeck, Chandra Mukhyala, February 23, 2024.
Resource:
Blog : Introducing Volumez: Bringing the Best Out of the Public Cloud with Data Infrastructure as a Service