Hadoop Coming to Enterprise IT in a Big Way – Taneja Group

What are infrastructure solution vendors offering?

A study titled Enterprise Hadoop Infrastructure Market Landscape for big data IT was conducted by Mike Matchett, senior analyst at Taneja Group.

The report focuses on why enterprise IT management must pay attention to Hadoop and how they should go about deciding which products and services to consider. This 20-page report also includes a snapshot of the vendors that make up this market and how they fit into the Hadoop ecosystem. The report is designed to be useful to enterprise IT, to vendors catering to the big data market, and to VCs investing in such vendors.

Hadoop is coming to enterprise IT in a big way. The competitive advantage that can be gained from analyzing big data is just too ‘big’ to ignore. And the amount of data available to crunch is only growing, whether from new sensors, the capture of ‘data exhaust’ from people, systems, and processes, or simply longer retention of raw or low-level detail. It’s clear that enterprise IT practitioners everywhere will soon have to operate scale-out computing platforms in the production data center, and as the first and most mature solution on the scene, Hadoop is the likely target. The good news is that there is now a plethora of Hadoop infrastructure options to fit almost every practical big data need; the challenge for IT is to implement the best solutions for their business clients’ needs.

While Apache Hadoop as originally designed had a relatively narrow application, running only certain kinds of batch-mode parallel algorithms over unstructured (or semi-structured, depending on your definition) data, its widely available open source nature, commodity architecture approach, and ability to extract new kinds of value from previously discarded or ignored data sets have driven the Hadoop ecosystem to evolve and expand rapidly.
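To make that batch-mode model concrete, here is a minimal sketch of the canonical word-count job written against the standard Hadoop MapReduce Java API. It is not drawn from the report; the class name is arbitrary and the input/output paths are supplied as command-line arguments for illustration only.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the counts gathered for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Jobs of exactly this shape are the workload Hadoop was built for; the capabilities discussed next, YARN in particular, open the same cluster up to other execution models.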

With recent capabilities like YARN, which opens up the main execution platform to applications beyond batch MapReduce; the integration of structured data analysis, real-time streaming, and query support; and the rollout of virtualized enterprise hosting options, Hadoop is quickly becoming a mainstream data processing platform.

There has been much talk that deriving top value from big data efforts requires rare and potentially expensive data scientists to drive them. On the other hand, an abundance of higher-level analytical tools and pre-packaged applications is emerging to support existing business analysts and users with familiar tools and interfaces. While completely new companies have been founded on the exciting information and operational intelligence gained from exploiting big data, we expect wider adoption by existing organizations based on augmenting traditional lines of business with new insight and revenue-enhancing opportunities.

In addition, a Hadoop infrastructure serves as a great data capture and ETL base for extracting more structured data to feed downstream workflows, including traditional BI/DW solutions. No matter how you want to slice it, big data is becoming a common enterprise workload, and enterprise IT infrastructure folks will need to deploy, manage, and provide Hadoop services to their businesses.
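As a rough illustration of that ETL role (again using the standard Hadoop MapReduce Java API, and not taken from the report), the sketch below runs a map-only job that projects a few columns out of raw tab-separated logs into comma-separated records a downstream BI/DW load could consume. The assumed input layout and field positions are hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogExtract {

      // Map-only job: keep just the fields a downstream warehouse load needs,
      // emitted as comma-separated records (no shuffle, no reduce phase).
      public static class ExtractMapper
          extends Mapper<Object, Text, NullWritable, Text> {
        private final Text out = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          // Assumed raw layout: timestamp<TAB>userId<TAB>action<TAB>payload...
          String[] fields = value.toString().split("\t");
          if (fields.length >= 3) {
            out.set(fields[0] + "," + fields[1] + "," + fields[2]);
            context.write(NullWritable.get(), out);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log extract");
        job.setJarByClass(LogExtract.class);
        job.setMapperClass(ExtractMapper.class);
        job.setNumReduceTasks(0);                  // map-only: mappers write output directly
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw capture area
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // staging area for BI/DW load
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }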

This landscape report is intended to help enterprise IT explore some of the first questions about adopting Hadoop:

  • Which Hadoop distribution makes the most sense?
  • What is the right infrastructure/deployment model, given that Hadoop is available in physical, cloud, and virtual forms, with appliance, converged, and external storage options?

The landscape of big data vendors and solutions is broad, so we first examine Hadoop, its use cases, and its continuing evolution to put potential solutions in context. We focus on the Hadoop infrastructure options and solutions that enterprise IT can evaluate and adopt. After briefly reviewing the big data challenge and opportunity that IT can help their organizations tackle, we consider what IT needs from big data solutions in order to support them as a production enterprise workload and deliver them as a service.

Next, we introduce a skeleton framework to help categorize solutions, and discuss where new solutions are emerging even as some vendors converge functionality. We then dive into a number of select vendor spotlights to illustrate the infrastructure market landscape, highlighting what each vendor brings to the table for enterprise IT while showing the breadth of options available. We include a brief look at some existing vendor solutions that are adding big data functionality to create broad data analytical platforms. Finally, we summarize our view of the Hadoop infrastructure market and make some predictions as to what might happen next; change is definitely inevitable.

Hadoop Infrastructure Solution Vendors

Dell is well known as a source of endless racks of commodity servers, the cost-efficient platform of choice for most Hadoop infrastructure designs. In addition, Dell offers a pre-tested, combined Hadoop ‘solution’ that includes Cloudera’s Enterprise edition on a pre-spec’d set of Dell hardware components. We also note that Dell worked with Intel on distribution-specific optimizations for Intel chip-powered Dell servers.

HP is also a well-known source of servers, and offers pre-tested configurations for all the major Hadoop distributions. In addition, it offers a turnkey HP Converged AppSystem for Hadoop that deploys HP hardware, Cloudera Enterprise, and the community edition of HP’s Vertica analytical database. HP also recently announced project HAVEn, which promises to synergistically combine HP’s proprietary analytical solutions Autonomy, Vertica, and ArcSight with Hadoop on an optimized hardware infrastructure, creating a unified platform for HP’s channel to use in building vertically integrated big data analytics solutions.

In 2012 VMware sponsored the open source Project Serengeti to help deploy Hadoop clusters on virtual infrastructure. That has proved popular despite some concern that virtualized hosting of Hadoop (which was originally designed to take advantage of MapReduce-style distributed processing on a scale-out cluster of commodity servers) would be either inefficient or costly. VMware worked hard to show that an optimized deployment of Hadoop applications in a virtual environment can provide comparable performance and efficient use of resources, while bringing other benefits like ease of management, elasticity, and effective multi-tenant/mixed-workload use of infrastructure. Based on Serengeti, VMware recently announced vSphere Big Data Extensions (BDE), providing built-in optimized support and operational management for Hadoop in vSphere and vCenter. BDE has internal algorithms to monitor a defined QoS across Hadoop clusters (e.g., a production cluster vs. development/test clusters) and enforce automated ‘elasticity’ actions to maintain balance (and in the process optimize utilization over time). BDE also helps recover from disk failures.

Similar to Project Serengeti, the open source Project Savanna aims to host virtualized Hadoop on OpenStack-based clouds, taking full advantage of native OpenStack management and cloud service delivery capabilities. Mirantis and Red Hat are key contributors, with Red Hat Storage providing an interesting enterprise-featured alternative to the native OpenStack Swift storage services.

While VMware is closer to GA, the service provider opportunity for Savanna, helping providers better compete with Amazon, could be quite significant. Since OpenStack supports all of the major hypervisors, this project is potentially far-reaching and could become the key future platform for any kind of distributed, scale-out, cloud-based computing.

DDN entered the Hadoop market last fall with a Hadoop appliance based on its HPC-class SFA12K storage solution. DDN’s hScaler is intended for clients looking to build clusters of 100 nodes or more and to host very high-performing applications. The highly available, high-performance storage is denser and more efficient, which means less rack space, less total hardware, and less power and cooling are required. Combined with high-speed networking (an InfiniBand option), an integrated ETL engine, SSD for metadata, I/O node pipelining, and end-to-end RDMA, the performance of this appliance is unmatched.

NetApp provides a pre-tested reference architecture for partners to build, combining Hadoop with NetApp storage solutions instead of local-disk HDFS: NetApp FAS series with clustered ONTAP for the metadata and management nodes, and E-Series data stores with SAS-connected storage partitions assigned directly to each compute node. There are a couple of variants, including a joint solution with Cisco UCS servers and Cloudera, and a Hadoop Rack version with Cisco Nexus networking and HP servers.

Teradata, best known for its analytical database, now also offers a unified big analytics appliance that combines the Aster database, Aster SQL-MapReduce, and the Hortonworks distribution of Hadoop. This combination supports both MapReduce and SQL access to both Aster database and Hadoop data. Teradata solutions also come with enterprise management and a variety of data adapters and ETL tools.

Oracle has engineered a system that features Cloudera’s distribution on an Oracle/Sun platform, with the addition of the Oracle NoSQL database and connectors to its Exadata analytical appliances and other Oracle databases. It also includes a version of Oracle R (a statistical analysis environment that competes with SAS) thrown in for good measure. This is a big system sold in racks, with 40Gb InfiniBand networking. At scale, the economics should work out to reasonable per-terabyte numbers and provide very competitive performance, but initial adoption seems aimed at those with deep pockets and perhaps big investments in other Oracle solutions.

EMC Isilon is a scale-out clustered NAS that, in addition to common file protocols, also supports HDFS as a ‘remote’ protocol. Isilon can be used in a Hadoop environment as a replacement for running HDFS within the cluster; the compute nodes simply use the HDFS API against Isilon instead. This places some constraints on local I/O throughput, as Isilon is perhaps better aligned to serving lots of files in parallel than to streaming a single huge file, but for datasets and workloads that match its capabilities it adds enterprise storage features for data protection, reduces total storage (RAID rather than replication), and enhances sharing and integration with multiple types of access.
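To show what ‘HDFS as a remote protocol’ looks like from the client side, here is a minimal sketch using the standard Hadoop FileSystem API; the hostname isilon.example.com, the port, and the file path are placeholders rather than details from the report. The point is simply that the application only depends on the HDFS URI, so the same code can target in-cluster HDFS or an external HDFS-compatible store by changing configuration.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemoteHdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at an HDFS-compatible endpoint instead of a local
        // NameNode; "isilon.example.com" is a placeholder hostname.
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

        // Open a file over the HDFS protocol and print it line by line.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                 fs.open(new Path("/data/events/part-00000"))))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
      }
    }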
