All Firms in De-Dupe
But today's main question is: "Who is not involved?"
By Jean-Jacques Maleval on 2012.02.10
In data reduction, the first technology was, and continues to be, lossless compression, used at least since 1990 for tape, HDD or LAN/WAN transmission, with software or chips reducing the size of files. De-dupe followed from 2000.
We do not cover here the algorithms used to compress sound, images and video.
According to wolframscience.com, modern work on data compression began in the late 1940s with the development of information theory. In 1949 Claude Shannon and Robert Fano devised a systematic way to assign codewords based on probabilities of blocks. An optimal method for doing this was then found by David Huffman in 1951. In the mid-1970s, the idea emerged of dynamically updating codewords for Huffman encoding, based on the actual data encountered. And in the late 1970s, with online storage of text files becoming common, software compression programs began to be developed, almost all based on adaptive Huffman coding.
In 1977 Abraham Lempel and Jacob Ziv suggested the basic idea of pointer-based encoding LZ (Lempel–Ziv). In the mid-1980s, following work by Terry Welch, the so-called LZW (Lempel–Ziv–Welch) algorithm rapidly became the method of choice for most general-purpose compression systems. It was used in programs such as PKZIP, as well as in hardware devices such as modems. Also noteworthy are the LZR (LZ–Renau) methods, which serve as the basis of the standard Zip method.
Among the first companies involved, we find in 1990 InfoChip Systems in Santa Clara, CA and Hardware Architecture in Moscow, ID. One of the leaders at that time was Stac Electronics in Carlsbad, CA. There were also some proprietary methods to reduce data on tapes (HP DCLZ for QIC and DAT, IBM IDRC for 3480 cartridges, etc.).
With compression, the average reduction is no more than 2X. De-dupe has completely changed the storage world with 10X to 100X ratios depending on the data. Note that de-dupe and compression can be used together.
Who invented de-dupe?
That's a difficult question. We have never heard of a company claiming to be the first one.
The pioneers seem to be Data Domain, Diligent, ExaGrid, FilePool, Permabit, Riverbed and Rocksoft at the beginning of the century.
Data Domain was born in 2001 and conceived a D2D de-dupe appliance. After getting $41 million in funding, it raised $111 million in an IPO in 2007 and was then acquired by EMC for a huge $2.2 billion in 2009.
Israeli start-up Diligent, in secondary de-dupe, was acquired by IBM in May 2008 for $200 million.
ExaGrid Systems in Westborough, MA, was born in 2002. Formerly Inspection Systems, it was created by former employees of HighGround Systems and now has 1,200 customers and 4,000 installed systems.
Belgian firm FilePool (formerly Wave Research), co-founded by Paul Carpentier, now CTO of Caringo, was without question the pioneer in CAS software. The start-up was taken over in May 2001 for $50 million by EMC to build the Centera, with content-derived addresses that permit only one protected copy of content to be stored no matter how many times it is used. We found a patent filed by Carpentier and others as early as 1998.
Permabit (Cambridge, MA) was created in 2000 and continues to exist, with OEMs such as HDS, LSI, Overland, StoneFly and Violin Memory.
Riverbed was founded in May 2002 in order to design an appliance for WAN optimization.
Born in 2002 in Adelaide, Australia, Rocksoft, a small start-up in de-dupe software, was bought by ADIC in 2006 for $63 million. Quantum then got the technology following its acquisition of ADIC. In fact, Quantum made this acquisition mainly to gain a tape business. But de-dupe is now a flagship technology for the company, which was one of the first powers in D2D backup subsystems. Quantum says it has been issued 9 U.S. patents on de-dupe, with 42 more pending.
In the de-dupe process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. This analysis may occur in-line, as data is flowing, or post-process, after the data has been written to disk. The operation can be done on blocks or files, through software or, faster, through a dedicated hardware appliance.
The basic idea is simple: when you transfer data between two points, check which chunks have already been transmitted and replace them with a small index. But in practice, it's more complicated. Each firm has its own algorithm. There is no standardization, so de-dupe is perfect for backup but risky for archiving.
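To make the basic idea concrete, here is a minimal Python sketch of block-level de-dupe. It is only an illustration under stated assumptions, not any vendor's algorithm: the fixed 4,096-byte chunk size, the use of SHA-256 as the chunk index, the in-memory dictionary standing in for the chunk store, and the function names `dedupe` and `restore` are all hypothetical choices made for this example.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def dedupe(data: bytes, store: Dict[bytes, bytes],
           chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns a "recipe": the ordered list of chunk digests (the small
    indexes) needed to rebuild the original data from the store.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest not in store:
            # First time this byte pattern is seen: keep the actual data.
            store[digest] = chunk
        # In every case, only the digest is recorded in the recipe.
        recipe.append(digest)
    return recipe


def restore(recipe: List[bytes], store: Dict[bytes, bytes]) -> bytes:
    """Rebuild the original data by looking up each digest in the store."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    store: Dict[bytes, bytes] = {}
    monday = b"A" * 8192 + b"B" * 4096   # first backup
    tuesday = b"A" * 8192 + b"C" * 4096  # second backup, mostly unchanged
    recipe_mon = dedupe(monday, store)
    recipe_tue = dedupe(tuesday, store)
    stored = sum(len(chunk) for chunk in store.values())
    print("logical bytes:", len(monday) + len(tuesday))
    print("stored bytes: ", stored)
    assert restore(recipe_mon, store) == monday
    assert restore(recipe_tue, store) == tuesday
```

In this sketch the check happens as each chunk arrives, which corresponds to the in-line case; a post-process variant would run the same loop over data already written to disk. Commercial products typically differ in how they cut chunks and index them, which is one reason the results are not interchangeable between vendors.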
In the list below, we cannot guarantee that all companies are using their own algorithms, and some have only patents and no products.
Today the question is more "Which storage companies do not have de-dupe?" than "Which companies are involved?". All the latter sign OEM contracts with others to implement de-dupe, a technology absolutely necessary today to sell backup or VTL and even WAN solutions, and probably in the future on primary storage systems, so that users can reduce their number of HDDs and more costly SSDs.
Note: after the name of a firm, a "/" precedes the company(ies) acquired for de-dupe.
(ABOUT) ALL COMPANIES IN DE-DUPE
American Megatrends India
Code 24 Software
Data Storage Group
Dynamic Solutions International
Opendedup (open source)
Pixel8 Networks (patent)
Sterling Data Storage
Tandberg Data (dataStor)