Next-Gen De-Dupe and Compression: Object-Based De-Dupe

This article has been written by John Everett, storage business manager, Dell EMEA.

The Next Generation of Deduplication and Compression

With data growing at an unprecedented rate, organisations of all sizes
are looking to maximise the efficiency of how they store and manage data
throughout its entire lifecycle. This ongoing challenge has led to the
proliferation of technologies such as thin provisioning, automated
tiering and scale-out storage, which can deliver both Capex and Opex
savings through smart resource management for better utilisation rates,
increased energy efficiency and simplified administration.

Now, advances in deduplication and compression technologies are allowing
organisations to push utilisation rates even higher through what Dell
calls ‘content-aware storage optimisation’ – also known as object-based
deduplication – shrinking meaningful amounts of data for significant
cost and management savings.

At a basic level, deduplication is the process of eliminating duplicate
copies of data and replacing them with pointers to a single copy. Its
function helps organisations reach two primary goals: to reduce the
amount of storage capacity needed to store a myriad of data, and to
decrease the amount of data in flight during backup or replication
processes. As it stands, the dominant use case for deduplication is
backup storage, because of the amount of static data that organisations
have to back up. Nevertheless, deduplication technology has spread to
other data centre storage platforms such as NAS.

Some deduplication processes examine files in their entirety to
determine whether they are duplicates, which is referred to as
file-level deduplication, or ‘Single Instance Storage,’ while others
break the data into blocks and try to find duplicates among the blocks,
which is referred to as block-level deduplication. Block-level
deduplication typically provides more granularity and a greater
reduction in the amount of utilised storage capacity compared with
file-level deduplication. This is particularly appealing from a backup
perspective. Both types of deduplication are commonly used and offered
today; however, there is a growing appreciation that these approaches
may not be sufficient to handle the growth of big data in verticals such
as oil and gas, life sciences, media and entertainment.
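The difference between the two approaches can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the 4-byte fixed block size, the use of SHA-256 as the fingerprint and all function names are assumptions chosen to keep the example small.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small block size, chosen only for readability


def file_level_dedupe(files):
    """Keep one copy of each distinct file; each name points at a content hash."""
    store = {}     # content hash -> file bytes (the single stored copy)
    pointers = {}  # file name -> content hash
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)
        pointers[name] = digest
    return store, pointers


def block_level_dedupe(files, block_size=BLOCK_SIZE):
    """Split files into fixed-size blocks and keep one copy of each distinct block."""
    store = {}     # block hash -> block bytes
    pointers = {}  # file name -> ordered list of block hashes
    for name, data in files.items():
        refs = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)
            refs.append(digest)
        pointers[name] = refs
    return store, pointers


def stored_bytes(store):
    """Total capacity consumed by the unique copies."""
    return sum(len(chunk) for chunk in store.values())


def rebuild(store, refs):
    """Follow the pointers to reassemble a file from its blocks."""
    return b"".join(store[ref] for ref in refs)


files = {
    "a.txt": b"AAAABBBBCCCC",
    "b.txt": b"AAAABBBBCCCC",  # exact duplicate of a.txt
    "c.txt": b"AAAABBBBDDDD",  # differs from a.txt only in its last block
}
file_store, file_ptrs = file_level_dedupe(files)
block_store, block_ptrs = block_level_dedupe(files)
print(stored_bytes(file_store))   # 24: a.txt and b.txt share one copy
print(stored_bytes(block_store))  # 16: only four distinct blocks are kept
```

The three files hold 36 bytes in total. File-level deduplication only catches the exact duplicate, while block-level deduplication also reuses the blocks that c.txt has in common with the other two files.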

A more intelligent form of deduplication has emerged in the form of
object-based deduplication. Now, organisations can take advantage of
next-generation technology that is tailored to their particular
vertical. This can be achieved with a solution that bridges the gap
between applications and native storage platforms to optimise the way
data is stored. This optimisation technology identifies how a given
file is structured, breaks it down into component sub-files and then
selects, from a library of more than 100 different compression
algorithms, the most effective one for the targeted file. Even if the
file has never before been identified, and there is no content-specific
compressor, the technology will infer information about the structure
and nature of the contents to select the most effective data-reduction
algorithm. By understanding the layout of specific application files –
like an email programme or a digital image – IT can make intelligent
decisions about how to de-dupe and compress that data for optimal
storage.
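The idea of choosing a compressor per file can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it tries three general-purpose codecs from the Python standard library and keeps the smallest result, whereas the technology described above draws on a far larger library and infers structure rather than trying everything. The names COMPRESSORS, pick_best and restore are hypothetical.

```python
import bz2
import lzma
import zlib

# Stand-in "library" of compressors: name -> (compress, decompress).
COMPRESSORS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def pick_best(data):
    """Try every compressor and return (name, blob) for the smallest output.

    If no compressor beats the original size, the data is kept as-is
    under the name "raw".
    """
    best_name, best_blob = "raw", data
    for name, (compress, _) in COMPRESSORS.items():
        blob = compress(data)
        if len(blob) < len(best_blob):
            best_name, best_blob = name, blob
    return best_name, best_blob


def restore(name, blob):
    """Reverse pick_best using the recorded compressor name."""
    if name == "raw":
        return blob
    return COMPRESSORS[name][1](blob)


sample = b"timestamp,sensor,value\n" * 500  # highly repetitive, CSV-like content
name, blob = pick_best(sample)
print(name, len(sample), len(blob))
```

Recording the chosen compressor alongside the stored data is what makes retrieval possible later; a production system would also weigh compression and decompression speed, not just output size.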

The central components of Dell’s data-processing system include two
types of content-aware algorithms and a neural net framework for testing
and selecting different compressors for best run-time efficiency. The
two types of content-aware algorithms are de-layering algorithms, which
dissect files to identify the contiguous sub-objects, and data-shrinking
algorithms, which include deduplication and compression. These custom
compressors are better suited to achieving meaningful reductions in the
data types that burden specific verticals.
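To illustrate why de-layering matters, the following Python sketch uses ZIP archives as a stand-in for a structured application file. It is an assumption-laden illustration, not a description of Dell's algorithms: the standard zipfile module plays the role of the de-layering step, SHA-256 is the fingerprint, and the helper names are hypothetical.

```python
import hashlib
import io
import zipfile


def make_zip(members):
    """Build an in-memory ZIP archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def delayer(archive_bytes):
    """De-layering step: dissect a container into its uncompressed sub-objects."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def dedupe_sub_objects(archives):
    """Data-shrinking step: keep one copy of each distinct sub-object."""
    store = {}    # sub-object hash -> sub-object bytes
    recipes = {}  # archive label -> {member name: sub-object hash}
    for label, raw in archives.items():
        recipe = {}
        for name, data in delayer(raw).items():
            digest = hashlib.sha256(data).hexdigest()
            store.setdefault(digest, data)
            recipe[name] = digest
        recipes[label] = recipe
    return store, recipes


logo = b"placeholder logo bytes " * 50  # shared embedded object
archives = {
    "report_v1": make_zip({"logo.bin": logo, "body.txt": b"Draft text"}),
    "report_v2": make_zip({"logo.bin": logo, "body.txt": b"Final text"}),
}
store, recipes = dedupe_sub_objects(archives)
print(len(store))  # 3: one shared logo plus two different bodies
```

The two archives differ as whole files, so file-level deduplication would store both in full. Once they are dissected, the shared logo is recognised as a single sub-object and stored once.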

To reap the full benefits of deduplication, this technology should
apply seamlessly across the entire IT infrastructure. To
this end, Dell is rolling out storage optimisation technology across a
variety of solutions for primary storage, archive, and backup.
Deduplication and compression will be integrated in the Dell Scalable
File System and Dell Object storage; once data is deduplicated, it can
move in a deduplicated state from one storage system to another. For
example, data that is deduplicated on Dell primary storage solutions can
be backed up to Dell backup storage without rehydration (that is,
without first being expanded back to its full size), and then
replicated in a deduped state over a LAN/WAN to a Dell backup storage
replica. It is this end-to-end optimisation of data from the server to
storage to the cloud that brings the most value to an end-user
organisation in a data-heavy world.
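The saving from moving data in a deduplicated state can be sketched as follows. This Python example is a simplified model under stated assumptions, not Dell's replication protocol: both systems are represented as dictionaries keyed by SHA-256 block hashes, blocks are a fixed 4 bytes, and the function names are hypothetical.

```python
import hashlib


def chunk(data, size=4):
    """Split data into fixed-size blocks (size is illustrative only)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def ingest(store, data):
    """Deduplicate data into a store and return its recipe (ordered block hashes)."""
    recipe = []
    for block in chunk(data):
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return recipe


def replicate(source, target, recipe):
    """Send only the blocks the target lacks; return the number of bytes sent."""
    sent = 0
    for digest in dict.fromkeys(recipe):  # unique hashes, original order
        if digest not in target:
            target[digest] = source[digest]
            sent += len(source[digest])
    return sent


primary, backup = {}, {}
monday = ingest(primary, b"AAAABBBBCCCC")
tuesday = ingest(primary, b"AAAABBBBDDDD")  # one block changed since Monday

sent_monday = replicate(primary, backup, monday)
sent_tuesday = replicate(primary, backup, tuesday)
print(sent_monday, sent_tuesday)  # 12 4

# The backup can rebuild Tuesday's data from its own blocks plus the recipe.
restored = b"".join(backup[d] for d in tuesday)
```

Because the target already holds most of Tuesday's blocks, only the changed block crosses the wire, and the data is never expanded to full size along the way.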

Although deduplication and compression technology has been around for a
few years, it is still evolving rapidly and is here to stay. To be truly
effective in today’s business world as well as tomorrow’s,
organisations should look to a solution that adheres to three main
tenets:

to be transparent to the end user and applications, meaning that there should not be any performance delays upon retrieval;
to be customised to specific verticals with more and better algorithms and logic; and
to be utilised end-to-end across the entire workflow to ensure optimisation of the overall IT environment.