Hidden Costs of HPC Storage - Hyperion Research/Panasas

This is an extended abstract of a report sponsored by Panasas, Inc. and written in 1Q20 by Steve Conway and Earl Joseph of Hyperion Research LLC.

Technology Highlight
New Study Details Importance of TCO for HPC Storage Buyers

Summary
In HPC storage purchasing, some things are easily measured, such as I/O performance and cost of acquisition. Often overlooked are the ongoing cost of operations and the negative impact that inconsistent and complex storage solutions can have on productivity and on the time needed to reach quality outcomes. As HPC adoption spreads into the enterprise market, the need grows for a better understanding of how HPC storage affects organizations, in terms of both productivity and infrastructure investment.

To identify the key challenges in HPC storage deployments today, particularly in the areas of TCO vs. initial acquisition costs, Hyperion Research conducted a site survey that solicited feedback from data center planners and managers, HPC storage system managers, purchasing decision-makers and key influencers, and HPC storage system users, based in North America, EMEA and Asia.

While performance was the number one criterion for purchasing HPC storage, TCO was tied with the purchase price for second place.

HPC Storage Buyers’ Most Important Purchase Criterion
Hyperion Research Panasas F1

Current HPC Storage Capacity
Hyperion Research Panasas F6

Annual Growth in HPC Storage Capacity
Hyperion Research Panasas F7

The most frequently named driver of the sites’ HPC storage growth (figure below) is the increase in iterative (multiple-run) simulations. Today’s more powerful HPC systems allow many more attempts at a solution within an allotted time frame. In the past, car designers might have been able to try out 3-4 designs in their part of the development cycle. Today, that process may involve hundreds or thousands of runs, and for regulatory and liability reasons, all that data may need to be stored for the market lifetime of the vehicle in question. Iterative methods are especially common in the manufacturing industry (parametric modeling), the financial services industry (stochastic modeling) and the weather/climate sector (ensemble modeling).

Most Important Driver of HPC Storage Growth
Hyperion Research Panasas F8

People, Productivity and Operations: Three Aspects of TCO
Although HPC storage buyers don’t agree on a single TCO definition, three areas are generally accepted as contributing to TCO: people (staffing), productivity and operations. This section presents survey findings in these areas.

People
In the average HPC storage deployment, skilled technical staff are required to manage day-to-day operations, tune and retune to sustain performance, and manage changing workloads. The complexity of open source file systems typically requires a larger and more skilled staff to keep the storage system operating at peak level. Recruiting, training and retaining skilled staff at an affordable cost is getting more difficult, as demand for these technicians grows with the expansion of HPC into the enterprise and the emergence of new technologies and use cases.

Storage staffing among the respondents ranged from one to more than five FTEs per site (figure below).

Number of FTE Staff Managing HPC Storage
Hyperion Research Panasas F9

The estimated annual costs for the sites’ HPC storage staff vary considerably and sometimes top $500,000, presumably at sites employing multiple people to manage the storage operations (figure below).

Annual Cost for Staff Managing HPC Storage
Hyperion Research Panasas F10

The figure below shows that a people issue, recruiting and training storage staff, is also the most frequently cited challenge the sites associate with HPC storage operations.

Operational Challenges in Storage Infrastructure
Hyperion Research Panasas F11

Productivity
Supporting high productivity for users of HPC servers (scientists, researchers, analysts and engineering staff) is of paramount importance to data center managers and other senior officials at HPC sites. In some industries, a day of downtime can cost the organization more than $1 million in lost revenue. Lack of storage system resiliency in the face of failures and changing requirements has been an ongoing issue for some file systems. Resolving customer problems quickly is particularly challenging when there are multiple layers in the customer support chain.

As the figure below indicates, more than three-quarters of the surveyed sites had episodes in the past year when storage issues reduced productivity. For 1 in 5 sites, this occurred more than 10 times in the past 12 months.

Frequency of Tuning and Re-Tuning
Hyperion Research Panasas F12

Storage Issues Reducing Productivity
Hyperion Research Panasas F13

A substantial minority of sites said they experience storage system failures, defined as very significant episodes, on a monthly or weekly basis.

Hyperion Research Panasas F14

In most cases, recovery from a storage system failure took less than 24 hours, but in some cases this process took a week or more.

Hyperion Research Panasas F15

But even in cases where recovery from a storage system failure takes only one day, the cost can exceed $100,000.

Hyperion Research Panasas F16
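
The failure frequency, recovery time and cost findings above can be combined into a rough annual estimate. The sketch below is illustrative only: the function name and the multiplication model are assumptions, not part of the Hyperion Research study, and the inputs simply reuse figures quoted in this article (monthly failures, one-day recovery, $100,000 per day) for a hypothetical site.

```python
def annual_downtime_cost(failures_per_year: int,
                         recovery_days: float,
                         cost_per_day: float) -> float:
    """Rough yearly cost of storage failures: count x duration x daily cost.

    Assumption: every failure costs the same per day and recovery time is
    constant. Real sites will see a spread of both.
    """
    return failures_per_year * recovery_days * cost_per_day


# Hypothetical site: one failure per month, one-day recovery, $100,000 per day.
monthly_site = annual_downtime_cost(failures_per_year=12,
                                    recovery_days=1,
                                    cost_per_day=100_000)
print(f"${monthly_site:,.0f} per year")
```

Under these assumed inputs the estimate comes to $1.2 million per year, which shows how short but frequent failures can add up to a large recurring cost.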

Operations
Serving the HPC infrastructure needs of an organization is important and challenging work. Keeping the system performing at peak levels as workloads, applications and users are changed and added causes disruption and stress under the pressure of pending deadlines. Installing an HPC storage system can take from under one day to more than one week, the respondents reported.

Hyperion Research Panasas F17

Future outlook
TCO is a term variously used in the HPC community and therefore deliberately presented to respondents of this study without a definition. This had the advantage of enabling the respondents to apply their own definitions. When they did, TCO emerged as one of the top purchasing criteria of the surveyed sites – tied in importance with “price” and second only to the “performance” of HPC storage systems under consideration.

As a category, the study shows, HPC storage systems have become more important in the current era of digital transformation and high-performance data analysis, including AI methods such as machine and deep learning.

To meet emerging requirements for what the US Department of Energy calls “extreme heterogeneity” – the convergence of simulation and analytics, traditional and enterprise environments, and interoperation with cloud infrastructures – HPC storage systems, like other parts of the HPC ecosystem, have become more complex and more challenging to manage in many cases. As the study shows, HPC storage systems are subject to downtimes that can increase costs while lowering productivity, and finding qualified job candidates to help manage HPC storage systems can be a major challenge. These trends are likely to continue.

With these factors in mind, Hyperion Research advises HPC sites to evaluate a range of HPC storage vendors before making a purchase decision. There are important differences in the vendors’ products, strategies and support. A wider search could pay large TCO dividends.
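
To make the point about TCO dividends concrete, here is a minimal sketch of how a site might compare two storage options over a planning horizon. Everything in it is an assumption for illustration: the `StorageOption` class, the simple additive TCO model (purchase price plus yearly people, productivity and operations costs) and all dollar amounts are hypothetical and do not come from the study, which deliberately left TCO undefined.

```python
from dataclasses import dataclass


@dataclass
class StorageOption:
    """Hypothetical cost profile for one HPC storage system."""
    name: str
    purchase_price: float
    annual_staff_cost: float       # people
    annual_downtime_cost: float    # productivity
    annual_operations_cost: float  # operations

    def tco(self, years: int) -> float:
        """Purchase price plus recurring costs over the given number of years."""
        recurring = (self.annual_staff_cost
                     + self.annual_downtime_cost
                     + self.annual_operations_cost)
        return self.purchase_price + years * recurring


# All figures below are invented for illustration only.
option_a = StorageOption("A (cheaper to buy)", 800_000, 400_000, 300_000, 100_000)
option_b = StorageOption("B (cheaper to run)", 1_000_000, 200_000, 100_000, 100_000)

for option in (option_a, option_b):
    print(f"{option.name}: ${option.tco(years=5):,.0f} over 5 years")
```

With these made-up inputs, the option with the lower purchase price ends up costing more over five years, which is the kind of outcome a price-only comparison would miss.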


Half of respondents experience storage system failures once a month or more.
