Three Reasons Object Storage Analytics Is Complicated and Distant
By Chaos Sumo
This is a Press Release edited by StorageNewsletter.com on April 20, 2018. The article was originally published on December 14, 2017, on the blog of Chaos Sumo, Inc., by the company's CTO and founder, Thomas Hazel.
There is huge growth in the amount of data that needs to be stored.
By 2025, we will generate 163ZB of data, an increase of 10x over current levels. On average, less than 1% of that data is stored, and even less is used for business value.
Data needs to be saved and examined in order to create value and enable businesses to make decisions, compete, and innovate. This has led to huge growth in object storage, but it has also created a dilemma: it is easy and cheap to persist data in object storage, but historically it has not been easy to identify, standardize, and analyze that data to draw meaningful conclusions.
So, why is object storage analytics still so complicated and distant?
1. Object Storage Is Not a Database
Although the concept of object storage has been around since the mid-1990s, it only began to gain true popularity after its adoption by Amazon Web Services (AWS) in 2006. AWS called it Simple Storage Service (S3), and it is now so widely used that when a storage region goes down, the internet virtually stops. Today, anything and everything is being sent to S3, and this deluge of data increases hourly.
Some of the key factors that have contributed to growth and popularity of object storage include:
- Semi/Unstructured: Object storage can store data that wouldn’t necessarily fit into the rows and columns of a traditional relational database, such as emails, photos, videos, logs, and social media.
- Backup/Compliance: Object storage attaches metadata to each object that acts as an instruction manual for the data. For compliance regimes that demand strict constraints on data use, this metadata represents an essential guardrail.
- Cloud Computing: The killer app for object storage is unquestionably cloud computing. Each object is given an address that lets systems find the data no matter where it's stored, without knowing its physical location in a server, rack, or data center (see the sketch after this list).
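To make the last two points concrete, here is a minimal sketch using the AWS boto3 SDK: an object is written with user-defined compliance metadata attached, then retrieved by nothing more than its logical address. The bucket, key, and metadata values are hypothetical illustrations, not a prescribed scheme.

```python
import boto3

s3 = boto3.client("s3")

# Each object carries user-defined metadata alongside its payload,
# e.g. compliance tags that travel with the data itself.
# (Bucket, key, and tag names here are hypothetical.)
s3.put_object(
    Bucket="example-archive",
    Key="2017/12/14/app.log",
    Body=b'{"event": "login", "user_id": 42}',
    Metadata={"retention": "7y", "classification": "pii"},
)

# Retrieval needs only the logical address (bucket + key); S3 resolves
# where the bytes physically live across servers, racks, and data centers.
obj = s3.get_object(Bucket="example-archive", Key="2017/12/14/app.log")
print(obj["Metadata"], len(obj["Body"].read()))
```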
Since traditional databases were built on yesterday's structured requirements, they are not designed for these diverse, disjointed, and disparate data sources, particularly at scale. In a typical database, storage is limited by the cost and time it takes to manually transform data into a well-formed relational format; that relational format, in turn, is exactly what today's analytic tooling is designed to examine. Object storage, on the other hand, is not a database: the data is not relationally formatted, and therefore can be difficult to analyze.
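A toy illustration of that mismatch, with made-up field names: a nested, schema-free record as it lands in object storage cannot be queried by a SQL engine until someone imposes rows and columns on it.

```python
import json
import sqlite3

# A record as it might land in object storage: nested and schema-free.
raw = '{"ts": "2017-12-14T10:02:11Z", "user": {"id": 42, "plan": "pro"}, "event": "login"}'
record = json.loads(raw)

# A SQL engine cannot query the blob directly; a schema must be defined
# and the nesting flattened into columns first.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (ts TEXT, user_id INTEGER, plan TEXT, event TEXT)")
db.execute(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    (record["ts"], record["user"]["id"], record["user"]["plan"], record["event"]),
)
print(db.execute("SELECT event, COUNT(*) FROM events GROUP BY event").fetchall())
```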
2. It’s Difficult to Prepare Object Storage for Traditional Analytics
There’s a reason why one of the main uses for object storage is currently active archiving. This consists of data that businesses probably won’t need to access for a while, but will have to find quickly when the need arises. It’s about as far from analyzable data as you can get.
In order to perform analytics on object storage, you need to process it … a lot. Most analytics tools are set up to use relational databases. To get these applications to run analytics on object storage, users need to manually transform the data into a format these tools can understand, and then load it into the tool to be analyzed. Ooph …
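Here is a minimal sketch of that manual transform-and-load cycle, assuming newline-delimited JSON logs in a hypothetical S3 bucket, with SQLite standing in for the analytics tool's relational store (bucket, prefix, and field names are all illustrative):

```python
import json
import sqlite3

import boto3

s3 = boto3.client("s3")
db = sqlite3.connect("analytics.db")
db.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, level TEXT, msg TEXT)")

# Extract: list and fetch the raw objects (bucket/prefix are hypothetical).
pages = s3.get_paginator("list_objects_v2").paginate(
    Bucket="example-logs", Prefix="2017/12/"
)
for page in pages:
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket="example-logs", Key=obj["Key"])["Body"]
        # Transform: parse each semi-structured line into a fixed schema.
        for line in body.read().decode("utf-8").splitlines():
            event = json.loads(line)
            # Load: insert the now-relational row into the database.
            db.execute(
                "INSERT INTO events VALUES (?, ?, ?)",
                (event.get("ts"), event.get("level"), event.get("msg")),
            )
db.commit()
```

Every step in this loop (listing, fetching, parsing, schema mapping, loading) has to be built, scheduled, and maintained before a single query can run.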
This is expensive in terms of time, money, and resources. Businesses are growing their data assets at 40% per year, but Extract, Transform, Load (ETL) processes are not getting 40% faster, and they are certainly not getting easier. What's more, the big data frameworks used to prepare object storage for analytics are littered with war stories.
3. Traditional Big Data Platforms Can Be Complex and Expensive to Work With
Once your object storage requires ETL processing, analytics can feel miles away. Admittedly, small or structured data sources require little or no preparation for examination. Primary data is happy data, but happy data is not what is exploding. Big data by its nature is not happy and certainly not clean; it is diverse, disjointed, and disparate. As a result, cleaning and preparing big data takes up to 80% of the time spent on data analytics.
Most likely, you'll try to use tools like Hadoop, which is one of the most popular analytics solutions for the enterprise. While the platform is popular for a reason, it has complexities and drawbacks that make it difficult to use, especially given the reevaluation Hadoop has been going through.
Chaos Sumo was created to solve this object storage analytics problem. Its service streamlines all phases of object storage analytics, enabling users to query data without ever moving it or writing a line of code. It turns any object storage into an intelligent data lake.