Taming the Big Data Beast
By Tom Leyden, Amplidata
This is a Press Release edited by StorageNewsletter.com on August 7, 2012, at 3:07 pm. Here is an article written by Tom Leyden, director of Alliances & Marketing, Amplidata:
The cloud industry has been on a complete high in the past few years, witnessing the arrival of new technologies and architectures including virtualisation, flash storage and cloud computing, the latter widely accepted as one of the main drivers behind the massive growth of storage requirements.
Another major newcomer on the IT stage is Big Data, which has moved beyond its original form, i.e. Big Data analytics, to now also include Big Unstructured Data, such as that found in the media and entertainment, medical/scientific and defence industries. Managing Big Data effectively requires a new approach and new tools; organisations need to move away from traditional, often still RAID-based, storage approaches. This is because Big Data applications such as storage clouds require architectures that can scale beyond hundreds of petabytes, that deliver the highest efficiency in the form of low-power, low-overhead infrastructure, and that provide high availability and durability into the ten nines.
The cause-and-effect relationship between online applications and data growth is not entirely clear: are online applications designed to help us manage our ever larger data sets or, conversely, do we simply store more data because it is now easier than before? Either way, the amount of data stored in the cloud continues to increase. Historically, data was mostly stored locally and we relied on various, usually manually assisted, backup processes to protect it. However, as datasets grow, these procedures are becoming less and less efficient. Searching for a specific file or document has also become more difficult, in certain cases leading to compliance issues. While file systems were designed to keep data organised, the explosion of unstructured data makes them complex and disorganised when it comes to finding a specific document. This is where cloud computing and Big Data meet. Backup and recovery processes are increasingly being moved to the cloud; document-sharing archives are moved off tape and onto disk storage systems because they are much more valuable when the data is easily accessible. And the social-local-mobile hype stimulates us to generate data everywhere and to demand accessibility 24/7.
Currently, a number of object storage solutions are available on the market that can cope with data volumes such as those of Facebook or Amazon. The backend consists of a scalable storage pool that is typically built out of commodity storage nodes with very high density and low power consumption, while at the front several controller nodes provide performance. Access to the data comes through a REST interface, so applications can read and write data without a file system in between. Files (objects) are dumped into the pool and an identifier is kept to locate each object when it is needed. Applications that are designed to run on top of object storage use these identifiers through the REST protocol. A good analogy is self vs. valet parking: when you self-park, you have to remember the lot, the floor, the aisle, etc. (the file system); with valet parking, you hand over your keys and get a receipt that you later use to retrieve your car.
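To make the valet-parking pattern concrete, the minimal sketch below shows how an application might store and retrieve an object through a generic REST interface. The endpoint URL and object names are hypothetical placeholders rather than any particular vendor's API; most object stores follow a broadly similar PUT/GET scheme.

import requests

# Hypothetical endpoint of an object storage namespace; no directories exist
# behind it, only a flat pool of objects addressed by identifier.
ENDPOINT = "http://storage.example.com/namespace/my-archive"

def put_object(name, data):
    """Store a blob; the returned URL is the 'valet parking receipt'."""
    url = f"{ENDPOINT}/{name}"
    resp = requests.put(url, data=data)
    resp.raise_for_status()
    return url  # keep this identifier; no file-system path is involved

def get_object(url):
    """Retrieve the blob later using only the identifier."""
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.content

# Usage: the application never browses folders, it only keeps identifiers.
receipt = put_object("report-2012.pdf", b"...binary content...")
document = get_object(receipt)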
For data protection of Big Data, RAID is out of the question for a number of reasons. First, petabyte-scale systems in which all storage nodes need high-end processors (for rebuild purposes) would not be cost-effective, and RAID does not allow you to build a true single storage pool. In addition, RAID requires large amounts of overhead to provide acceptable availability; the more data we store, the more painful it is to need 200 percent overhead, as some RAID-based systems do. A more recent and much more reliable alternative that provides the highest level of protection (ten nines and beyond) is erasure coding. Erasure coding stores objects as equations, which are spread over the entire storage pool: data objects (files) are split up into sub-blocks, from which equations are calculated. According to the availability policy, a surplus of equations is calculated and these equations are spread over as many disks as possible.
As a result, when a disk fails, the system always has sufficient equations to restore the original data block and can re-calculate equations as a background task to bring the number of available equations back to a healthy level.
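The sketch below illustrates the principle on a toy scale: an object is split into k sub-blocks and a redundant "equation" is added, so that any single lost block can be rebuilt from the survivors. It uses a single XOR parity block purely for clarity; a production system would use Reed-Solomon-style codes with many more check blocks, spread across nodes and disks, to reach ten-nines durability.

from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(data: bytes, k: int = 4) -> list:
    """Split data into k sub-blocks and append one parity 'equation'."""
    data += b"\0" * ((-len(data)) % k)          # pad to a multiple of k
    size = len(data) // k
    blocks = [data[i * size:(i + 1) * size] for i in range(k)]
    return blocks + [xor_blocks(blocks)]        # k data blocks + 1 check block

def rebuild(blocks: list, lost: int) -> bytes:
    """Reconstruct a single lost block (a failed disk) from the survivors."""
    return xor_blocks([b for i, b in enumerate(blocks) if i != lost])

# Usage: lose any one of the five pieces and the data is still recoverable.
pieces = encode(b"a big unstructured media object", k=4)
assert rebuild(pieces, lost=2) == pieces[2]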
Apart from providing a more efficient and more scalable way to store data, erasure coding-based object storage can save up to 70 percent on overall TCO thanks to reduced raw storage and power needs.
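As a rough illustration of where the capacity part of that saving comes from, the back-of-the-envelope calculation below compares the raw capacity needed for 1 PB of usable data under 3-way replication (200 percent overhead) with a hypothetical 16+4 erasure-coding policy (25 percent overhead). The policy and the resulting figure are illustrative assumptions only; the 70 percent figure quoted above covers total TCO, including power, not just raw disks.

# Back-of-the-envelope comparison for 1 PB of usable data. The 3-way
# replication and 16+4 erasure-coding policy are illustrative assumptions.
usable_pb = 1.0

replication_raw = usable_pb * 3                    # 200% overhead (3 copies)
ec_k, ec_m = 16, 4                                 # 16 data + 4 check blocks
erasure_raw = usable_pb * (ec_k + ec_m) / ec_k     # 25% overhead

savings = 1 - erasure_raw / replication_raw
print(f"Replication:    {replication_raw:.2f} PB raw")
print(f"Erasure coding: {erasure_raw:.2f} PB raw")
print(f"Raw-capacity reduction: {savings:.0%}")    # ~58% in this example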
It is therefore plain to see that, of all the technologies in use today to tame the Big Data beast, object storage is the only one developed from the ground up specifically to address the challenges posed by unstructured Big Data applications.