History (1993): Data Compression Into HDDs

Why not?

Most magnetic tape drives have built-in data compressors. There are none inside disk drives, where they would be greatly appreciated. Why?

Data compression, and conversely decompression, on HDDs (or diskettes) can be done with one of the numerous software packages offered on the market, often as shareware, or with a dedicated card (Stac, Infochip, etc.) installed in the computer.

The operation is rather slow in the first case and rather expensive in the second. The perfect solution would be to integrate into the disk drive a specialized VLSI chip that would, very quickly and transparently, compress all incoming data and decompress all outgoing data, without slowing down the main computer’s processor.

For instance, if you buy a 100MB unit with a data compressor, you can store close to 200MB on it and, additionally, enjoy a doubled effective transfer rate. This is the approach most magnetic cartridge drive manufacturers have chosen.

But it would be even more worthwhile on disk drives, where the price per megabyte is higher.

Fixed-length sectors
Curiously, no disk drive supplier offers this type of solution. Of course, it is hardly attractive to sell, for just a few more cents, a disk with doubled capacity. But the real problem is not there. On a tape, you can store variable-length blocks of data; on a disk, by contrast, the sector has a fixed size, usually 512B, because that is easier to control. When the drive has to write a sector and compresses it, it will reduce it to, say, 380B, but it will still occupy the entire sector, since it cannot do otherwise. There is no gain in space.
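
To make the arithmetic concrete, here is a minimal sketch in Python, with zlib standing in for a hypothetical in-drive compressor (not any real drive firmware), of why per-sector compression saves nothing when the drive can only allocate whole fixed-size sectors.

```python
# Illustrative sketch only: zlib stands in for a hypothetical in-drive compressor.
import zlib

SECTOR = 512  # fixed sector size in bytes

def sectors_used(payload: bytes, compress: bool) -> int:
    """Whole 512B sectors the drive would have to allocate for this payload."""
    data = zlib.compress(payload) if compress else payload
    return -(-len(data) // SECTOR)  # ceiling division: only whole sectors exist

payload = b"fairly repetitive user data " * 18      # ~504B, one sector's worth
print(len(zlib.compress(payload)))                  # well under 512B after compression
print(sectors_used(payload, compress=False))        # 1 sector uncompressed
print(sectors_used(payload, compress=True))         # still 1 sector: no space gained
```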

You could imagine that the drive could wait for two sectors and, if by chance the two together compressed to less than 512B, say 500B, the gain would be obvious, even if 12B were still lost.

But, unfortunately, it is impossible to know the compression ratio in advance, since it depends on the type of file: it is, for instance, much higher for text than for machine code.
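
As an illustration (a Python sketch using zlib, which is only a stand-in for whatever algorithm a drive might embed), the same amount of data can compress far below one sector or not at all, depending on its nature:

```python
# Illustrative sketch: the compression ratio depends entirely on the data, so a
# drive cannot predict whether two 512B sectors will fit into one.
import os
import zlib

text_block = (b"The quick brown fox jumps over the lazy dog. " * 23)[:1024]
binary_block = os.urandom(1024)  # stands in for machine code or already-dense data

for name, block in (("text", text_block), ("binary", binary_block)):
    compressed = zlib.compress(block, level=9)
    verdict = "fits in one 512B sector" if len(compressed) <= 512 else "does not fit"
    print(f"{name}: 1024B -> {len(compressed)}B compressed, {verdict}")
```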

The culprit in all this is the structure of Microsoft’s DOS. As Seagate puts it: “The DOS along with the system BIOS have the task of controlling all file I/O operations between the system and peripherals, including storage devices. DOS also keeps track of the size of the files that it is managing and, therefore, knows the amount of disk space (sectors) that is required to store the files. If the files are compressed at the system level with software such as SuperStor or Stacker, the compression software masks the actual drive and file size information and gives DOS the information that it is expecting.”

Causing a DOS error
“If the files are compressed at the drive level,” adds Seagate, “DOS will save files of a known (uncompressed) size. But when DOS goes to read the file back, it will find it occupying a different, smaller amount of space on the drive, causing a DOS error. The way that the current system architecture exists, with DOS and BIOS at the system level and the storage controller at the device (drive) level, insufficient intelligence and methods of communication exist between the system and the drive to implement drive-level compression.”
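
A toy model of the bookkeeping conflict Seagate describes might look like the following (this is purely illustrative Python, not real DOS or BIOS internals, with zlib as a placeholder compressor):

```python
# Toy illustration of the size-bookkeeping conflict; none of this is real DOS code.
import zlib

SECTOR = 512

def ceil_sectors(nbytes: int) -> int:
    return -(-nbytes // SECTOR)

file_data = b"a batch file full of compressible text lines\r\n" * 55   # ~2.5KB
dos_expected = ceil_sectors(len(file_data))                 # what DOS records for the file
drive_actual = ceil_sectors(len(zlib.compress(file_data)))  # what a compressing drive would use

print(f"DOS expects {dos_expected} sectors, the drive actually used {drive_actual}")
if drive_actual != dos_expected:
    print("mismatch: nothing in the DOS/BIOS/drive interface of the time reconciles the two views")
```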

RLL (1,7) or (2,7) already performs a small compression
Additional elements should also be counted as disadvantages for integrating compressors into disks. When it uses RLL (1,7) or (2,7) encoding (run-length-limited coding), the disk already performs some kind of compression, even if it is rather rough (coding run lengths of 0s and 1s), which means that an additional compression won’t bring the expected space gain.

Additionally, some compression or defragmenting utility programs (which reorganize data to remove all the unused space on the disk) sometimes stumble over compressed files.

The size of the controller cards in today’s small disks doesn’t leave much room to add the electronics (chip, buffer, etc.) needed for this compression, which would also slow down the native data transfer rate of a peripheral that already takes its time answering the CPU.

Compression is risky
Finally, compression is not without risk. It is inconceivable to do it anywhere other than in a cache memory, which means that if there is a power failure, the data is definitively lost.

And that’s not all: if the compressor, which compares sequences of character strings, makes the slightest mistake, it will affect several writes. And there is no ECC to verify the integrity of the compressed data (except by decompressing it and comparing it with the input data, which would make the process even longer).
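
The verification alluded to here would amount to something like the following sketch (again with zlib as a placeholder compressor): decompress what was just compressed and compare it with the input before committing it, at the cost of an extra pass over the data.

```python
# Sketch of a write-path integrity check by round-tripping; zlib is a placeholder.
import zlib

def compress_with_verify(block: bytes) -> bytes:
    compressed = zlib.compress(block)
    # The extra, slower step: undo the compression and check it matches bit for bit.
    if zlib.decompress(compressed) != block:
        raise IOError("compressed data failed round-trip verification")
    return compressed

payload = b"512 bytes of user data " * 20
stored = compress_with_verify(payload)
print(len(payload), "->", len(stored), "bytes, verified by round-trip")
```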

The almost complete lack of standard compression algorithms also holds users back, and not only for data exchange. When a disk is seriously damaged, a specialized company can usually attempt to recover the data, but this becomes almost impossible when the files are compressed.

All this doesn’t mean that data compression itself is questionable; what is questionable is its integration into HDDs rather than into computers.

In fact, Microsoft’s upcoming DOS 6.0 and Windows NT should integrate a compression function, just as Unix does, which will improve matters on disk drives but will nevertheless slow down I/Os.

The best solution would be to include on the motherboard a special data compression chip that would compress and decompress data on the fly for all the storage peripherals, as well as for scanners, printers and fax/modems. It would also be useful if the compression algorithm were standardized, to allow exchanges of compressed data.

This article is an abstract of news published in the former paper version of Computer Data Storage Newsletter, issue #61, February 1993.
