
cnvrg.io and NetApp Together to Deliver MLOps Dataset Caching

The partnership enables MLOps teams to deliver high performance and one-click attachment of models to datasets.

cnvrg.io, the data science platform that simplifies model management and brings MLOps to the industry, announced a partnership with NetApp, Inc., the first to leverage the cnvrg.io dataset caching tool, a set of capabilities for immediately pulling datasets from cache for any ML job.


cnvrg.io is the first ML platform to use dataset caching for end-to-end ML development.

Caching makes datasets ready to use in seconds rather than hours, and cached datasets can be authorized and used by multiple teams on the same compute cluster connected to the cached data. Dataset caching is already used by the firm’s customers in production.

It’s not uncommon to have hundreds of datasets feeding models.

However, those datasets may live far away from the compute that is training the models, such as in the public cloud or in a data lake.

Figure: cnvrg.io and NetApp dataset caching scheme

With NetApp and the company’s dataset caching capability, users can cache the needed datasets (and/or their versions) and ensure they are located in the ONTAP AI storage attached to the GPU or CPU compute cluster performing the training. Once the needed datasets are cached, they can be used repeatedly by different team members.

Any cnvrg.io user with an ONTAP AI storage server can use the firm’s dataset caching feature. Once the server is connected to an organization, data scientists can cache commits of their datasets on that Network File System (NFS) share. When a commit is cached, users can attach it to jobs for immediate high-throughput access to the data, and the job no longer needs to clone the dataset on start-up.
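The cache-then-attach flow described above can be sketched as a simple check-before-clone pattern (a minimal illustration only; the cache path and `download` callback here are hypothetical and are not the actual cnvrg.io SDK, which manages this automatically on ONTAP AI storage):

```python
from pathlib import Path

# Assumed NFS mount point for the shared dataset cache (hypothetical).
CACHE_ROOT = Path("/mnt/nfs_cache/datasets")

def fetch_dataset(name, commit, download, cache_root=CACHE_ROOT):
    """Return a local path for a dataset commit, serving it from the
    NFS cache when present so the job skips a full clone on start-up."""
    cached = Path(cache_root) / name / commit
    if cached.exists():
        return cached                # cache hit: ready immediately
    cached.mkdir(parents=True)       # cache miss: one-time clone
    download(name, commit, cached)   # user-supplied fetch from the source
    return cached
```

On the first request the dataset commit is cloned once into the cache; every later job attached to the same share reuses the cached copy instead of re-downloading it.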

The dataset caching feature creates the following business advantages:

  • Increased productivity – Datasets are ready to be used in seconds rather than hours.

  • Improved sharing and collaboration – Cached datasets can be authorized and used by multiple teams in the same compute cluster connected to the cached data.

  • Reduced cost – Models pull datasets from the cache, reducing per-download charges.

  • Operationalizing hybrid cloud – The dataset cache provides on-premises, high-performance mirror storage.

  • Multi-cloud dataset mobility – The on-premises cache acts as a control point for the data.

Figure: cnvrg.io and NetApp architecture

“Deep Learning workloads are unique in that they need access to random data samples from a large dataset that may be sourced from diverse data sources and dispersed locations,” said Santosh Rao, senior technical director, NetApp AI and data engineering. “Further, deep learning requires high performance data close to the GPU compute clusters, and this requires the combination of high performance flash storage systems, connectors into edge, core and cloud for dispersed data location access, and the support of widely used data source formats across NFS or other filesystems on a unified data platform. NetApp and cnvrg.io form a first of its kind partnership to bring these capabilities to customers worldwide adopting deep learning to transform their business.”

“Our partnership with NetApp drives productivity and efficiency for data teams,” said Yochay Ettun, CEO and co-founder, cnvrg.io. “We’re excited to launch our dataset caching for ML, to offer NetApp users and cnvrg.io users faster and simplified access to their datasets, with tools for advanced data management and data versioning that will allow data teams to focus on data science over technical complexity.”

About cnvrg.io:
cnvrg.io is an AI OS, transforming the way enterprises manage, scale, and accelerate AI and data science development from research to production. The code-first platform is built by data scientists, for data scientists, and offers the flexibility to run on-premises or in the cloud.

Resource:
About the partnership and the dataset caching
