SC19: Science and Technology Facilities Council Deploys Elastic NVMe Storage to Power GPU Servers

Excelero, Ltd. announced that the Science and Technology Facilities Council (STFC) has deployed a new HPC architecture, built on the NVMesh elastic NVMe block storage solution, to support computationally intensive analysis including ML and AI-based workloads.

The deployment, done in partnership with Boston Limited, is enabling researchers from STFC and The Alan Turing Institute to complete ML training tasks that formerly took 3 to 4 days in 1 hour, and to run other foundational scientific computations that they formerly could not perform.

STFC’s cloud masking project uses machine learning to classify satellite imagery


STFC is part of UK Research and Innovation (UKRI) and supports pioneering scientific and engineering research by over 1,700 academic researchers worldwide on space, materials and life sciences, nuclear physics and much more.

Research involves a variety of data-rich analyses of data generated by large-scale experimental facilities and observatories. These include cryo-electron microscopy, synchrotron light, and other techniques. Workloads are massive, often hundreds of terabytes, and require fast compute and storage. For example, a typical workload involves a large amount of computation on multimodal data, such as data obtained from X-ray and neutron sources.

STFC’s Scientific ML (SciML) Group was established with the aim of enabling scientists to analyse large amounts of data, with the group bringing ML and AI expertise. The group routinely utilises deep neural networks running on NVIDIA DGX-2 GPU computing systems located at the Scientific Data Centre at its Rutherford Appleton Laboratory site near Oxford.

As the need for image processing expanded, the use of GPU-based workstations needed to be extended to support the high throughput and low latency required for end-user response times. Adding DGX-2 servers offered more computational power, yet the servers lacked the enterprise-level storage functionality required to scale out the resource across hundreds of researchers.

HPC solutions provider Boston Ltd. worked with STFC to evaluate all-flash arrays and open systems-based storage options, and commissioned a benchmark of Excelero’s NVMesh for shared NVMe flash storage at local performance.

Boston’s benchmark results showed that, when running NVIDIA validation tests on each DGX-2 system, the proposed STFC architecture delivered an average latency of 70μs, just over one-quarter of the typical 250μs latency of traditional controller-based enterprise storage. The combined NVMesh and BeeGFS deployment therefore showed potential for meeting STFC’s high throughput, low latency demands.
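The latency comparison is simple arithmetic on the two figures quoted in the benchmark (70μs and 250μs). A minimal Python sketch, with illustrative variable and function names that do not come from Boston or Excelero, shows how the ratio works out:

```python
# Figures quoted in the benchmark results (microseconds).
NVMESH_LATENCY_US = 70
CONTROLLER_LATENCY_US = 250


def latency_ratio(new_us: float, baseline_us: float) -> float:
    """Return the new latency as a fraction of the baseline latency."""
    return new_us / baseline_us


ratio = latency_ratio(NVMESH_LATENCY_US, CONTROLLER_LATENCY_US)
reduction_factor = CONTROLLER_LATENCY_US / NVMESH_LATENCY_US

print(f"Fraction of baseline latency: {ratio:.2f}")        # 0.28
print(f"Latency reduction factor: {reduction_factor:.2f}x")  # 3.57x
```

On these numbers the measured latency is 28 percent of the controller-based figure, or roughly a 3.6-fold reduction.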

STFC’s storage architecture now includes two Boston Flash-IO Talyn systems built on Supermicro building blocks, networked via a Mellanox 100Gb/s InfiniBand (IB) network to two DGX-2 computing systems, each with 16 NVIDIA 32GB V100 SXM modules.
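To put the GPU side of this architecture in aggregate terms, the following Python sketch totals the GPU count and GPU memory from the figures stated (two DGX-2 systems, 16 V100 modules each, 32GB per module). The class name and the host labels "dgx2-a" and "dgx2-b" are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSystem:
    """One GPU server, described only by the figures given in the article."""
    name: str            # hypothetical label, not an actual STFC hostname
    gpus: int            # number of GPU modules in the system
    gpu_memory_gb: int   # memory per GPU module, in GB

    @property
    def total_gpu_memory_gb(self) -> int:
        return self.gpus * self.gpu_memory_gb


systems = [
    GpuSystem("dgx2-a", gpus=16, gpu_memory_gb=32),
    GpuSystem("dgx2-b", gpus=16, gpu_memory_gb=32),
]

total_gpus = sum(s.gpus for s in systems)
total_memory_gb = sum(s.total_gpu_memory_gb for s in systems)

print(f"GPUs: {total_gpus}, aggregate GPU memory: {total_memory_gb} GB")
```

Taken together, the two systems provide 32 GPUs and 1,024GB of GPU memory.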

Boston Flash-IO Talyn


Operational since July 2019, STFC’s storage architecture has made it possible to run training sets that formerly took 3 to 4 days in under an hour. With the BeeGFS file system providing a single namespace to simplify management and virtualisation, and with the low latency and high throughput of its NVMesh system, STFC now has a GPU computing architecture in which storage no longer presents a bottleneck, even with its complex research needs.
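The training-time improvement can be expressed as a speedup factor. The Python sketch below assumes the new runtime is exactly one hour, which is the upper bound implied by "under an hour", so the results are lower bounds. The function name is illustrative.

```python
HOURS_PER_DAY = 24


def min_speedup(old_days: float, new_hours: float = 1.0) -> float:
    """Speedup implied when a job that took old_days now finishes in new_hours."""
    return old_days * HOURS_PER_DAY / new_hours


low = min_speedup(3)   # 3 days reduced to 1 hour
high = min_speedup(4)  # 4 days reduced to 1 hour

print(f"Implied speedup: at least {low:.0f}x to {high:.0f}x")
```

Under that assumption, the reported change corresponds to a speedup of at least 72 to 96 times.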

With the new deployment in place, the user communities around the system, including users from The Alan Turing Institute, can now carry out ML research projects across a number of disciplines, including environment, life sciences, materials, space sciences and astronomy.

“In benchmark testing we quickly saw that our Flash-IO Talyn systems with the Excelero NVMesh software delivered a significant performance enhancement over traditional controller-based architectures found in AFAs – and the ease of installation of a packaged solution,” said Matthew Parfitt, HPC commercial manager, Boston, who oversaw the deployment.

“Fundamental scientific research needs a clear computational path to completion, without the storage bottleneck that is endemic when NVMe resources are not virtualised,” said Lior Gal, CEO and co-founder, Excelero. “We’re proud that our NVMesh software helped our partner Boston put an essential building block in place in the STFC architecture to support STFC’s vital initiatives.”

Excelero and Boston are showcasing their deployment at STFC along with deployments at other major HPC research facilities at the SC19 event in Denver, CO.