Western Digital Uses Univa HPC Cloud Solutions on AWS to Build Million Core Cluster

Collaborative project demonstrates extreme-scale HPC and the configuration flexibility of running HPC workloads in the cloud using Univa Grid Engine and Navops Launch.

Univa demonstrated extreme-scale HPC by working with Amazon Web Services (AWS) customer Western Digital Corp., a data infrastructure company, using Univa’s scalable cluster management and scheduling solutions, Navops Launch and Grid Engine.

Figure: AWS, Univa and Western Digital

The purpose of this collaborative project was to build a cloud-scale HPC cluster on AWS to simulate key elements of upcoming designs for Western Digital’s next-generation HDDs.

Figure: Univa Navops scheme

Western Digital turned to the cloud to determine how virtually unlimited scale could help it solve R&D and engineering challenges faster. With this in mind, the company teamed up with Univa and AWS to evaluate the impact of running its electromagnetic engineering simulations on a massive HPC cluster built on AWS using Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances.

The goal was to complete the job in the shortest possible time and at the lowest cost. As part of this record-setting collaborative effort, Western Digital ran approximately 2.5 million simulation tasks on a Spot-based cluster of a little over one million vCPUs to determine optimal device characteristics that would help improve product quality, performance, reliability and durability for next-generation HDDs. The project required complex multi-physics simulations, which need enough capacity to run deeper simulations for increasingly complex product designs. To put this in perspective, running 2.5 million tasks of this kind in an on-premises environment would take twenty days to complete.

Figure: Univa Grid Engine

“Storage technology is amazingly complex, and we’re constantly pushing the limits of physics and engineering to deliver next-generation capacities and technical innovation,” said Steve Phillpott, CIO, Western Digital. “This successful collaboration with Univa and AWS shows the extreme scale, power and agility of cloud-based HPC to help us run complex simulations for future storage architecture analysis and materials science explorations. Using AWS to easily shrink simulation time from 20 days to 8 hours allows Western Digital R&D teams to explore new designs and innovations at a pace unimaginable just a short time ago.”

Figure: AWS, Univa and Western Digital (screen)

The electromagnetic simulations used AWS Spot Fleet, and the cluster included roughly 40,000 Spot Instances and more than one million vCPUs. Univa’s scalable cluster management and scheduling solutions, Navops Launch and Grid Engine, coordinated cluster management and workload execution across the full capacity of Western Digital’s infrastructure and kept the cluster fully utilized even under such a heavy workload. The result was a 60x reduction in simulation time, from twenty days to eight hours.
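For readers who want to check the arithmetic behind the 60x figure, here is a minimal sketch in Python. It uses only the rounded numbers quoted in this article (the text says "roughly 40,000" instances and "a little over one million" vCPUs), so the derived ratios are approximations. The constant and function names are illustrative and do not come from Univa or AWS tooling.

```python
# Rounded figures as quoted in the article; the real values are approximate.
ON_PREM_DAYS = 20          # estimated on-premises run time
CLOUD_HOURS = 8            # reported run time on the AWS cluster
TASKS = 2_500_000          # "approximately 2.5 million simulation tasks"
VCPUS = 1_000_000          # "a little over one million vCPUs"
SPOT_INSTANCES = 40_000    # "roughly 40,000 Spot instances"


def speedup(on_prem_days: float, cloud_hours: float) -> float:
    """Ratio of on-premises run time to cloud run time, both in hours."""
    return (on_prem_days * 24) / cloud_hours


def summarize() -> dict:
    """Derive a few back-of-the-envelope ratios from the quoted figures."""
    return {
        "speedup": speedup(ON_PREM_DAYS, CLOUD_HOURS),
        "vcpus_per_instance": VCPUS / SPOT_INSTANCES,
        "tasks_per_vcpu": TASKS / VCPUS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:g}")
```

Twenty days is 480 hours, and 480 divided by 8 gives the 60x reduction. The same rounded figures imply an average of about 25 vCPUs per instance and about 2.5 tasks per vCPU.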

“We are honored to have participated in such a unique project alongside Western Digital, who is a storage infrastructure leader,” said Gary Tyreman, president and CEO, Univa. “Univa works with hundreds of enterprise organizations who are often challenged with migrating HPC applications to the cloud, as this can typically be considerably more expensive than on-premises if not properly managed. Our Navops Launch solution gives HPC administrators the ability to control which applications are placed in the cloud, while also being able to control and monitor HPC cloud consumption and spend. I am proud of the work that the Univa team did alongside AWS, as we successfully demonstrated extreme scale HPC cloud.”

Additional resources:
AWS blog post: Western Digital HDD Simulation at Cloud Scale – 2.5 Million HPC Tasks, 40K EC2 Spot Instances
Univa Blog: Mission Is Possible: Tips on Building a Million Core Cluster
