Cloudera Debuts Beta Program of Hadoop and Apache Search Engine

From The Economist Information Forum in San Francisco, CA, Cloudera Inc.
announced the public beta of Cloudera Search, an integrated search engine
for interactive exploration of data stored in the Hadoop Distributed File
System and Apache HBase.

Cloudera Search is designed to make Hadoop simpler to use and accessible
to more departments of an organization. Powered by the open source search
engine Apache Solr, it enables anyone within an organization to
perform interactive, natural language keyword searches and faceted
navigation on data stored in Hadoop, without additional training or
advanced programming knowledge.
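
As a rough illustration of what a keyword search with faceted navigation looks like at the Solr level, the sketch below builds a query URL from Solr's standard request parameters (`q`, `rows`, `wt`, `facet`, `facet.field`). The host, port, collection name, and field names are hypothetical, and the `/solr/<collection>/select` path is assumed from stock Solr rather than taken from this announcement.

```python
from urllib.parse import urlencode


def build_faceted_query_url(base_url, collection, keywords, facet_fields, rows=10):
    """Build a Solr select URL for a keyword search with facet counts.

    base_url, collection, and facet_fields are illustrative placeholders;
    only the parameter names are standard Solr.
    """
    params = [
        ("q", keywords),      # free-text keyword query
        ("rows", str(rows)),  # number of matching documents to return
        ("wt", "json"),       # ask for a JSON response
        ("facet", "true"),    # turn on faceting
    ]
    # One facet.field entry per field whose value counts should be returned.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Hypothetical usage: search a "logs" collection and facet on two fields.
url = build_faceted_query_url(
    "http://localhost:8983/solr", "logs", "disk failure", ["host", "severity"]
)
print(url)
```

The response to such a request would carry both the matching documents and per-field value counts, which is what a front end uses to offer drill-down links.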

The solution was developed to address an emerging need, as
enterprises’ Hadoop deployments mature and advance to become the
primary repositories for more and more kinds of data: how to better
combine and refine data into a single, integrated platform. At its
core, it incorporates Apache Solr and other search-related open
source projects to support a big data infrastructure, and to
alleviate the costs of maintaining the disparate systems that many
enterprises currently depend on to execute search queries.

The arrival of such a system gives the enterprise simplicity
and exploration capabilities, so users can drill down deeper into
data using full-text and faceted search to solve business problems in
real time. Cloudera Search combines the established,
feature-rich, open source search platform of Solr, and its extensible
APIs for integration with production legacy systems, with CDH
integration that addresses many of the pain points of
standalone search solutions for Hadoop. Through the robust failover
features available in SolrCloud (Solr4), it delivers the same feature
set as the search platform with more scalable indexing and query
serving than was previously possible.

Like Cloudera Impala, the industry’s first open source,
interactive SQL query engine for Hadoop, this extends the reach and
capability of Cloudera Enterprise, the platform for big data. The
company is making it possible for enterprises to ‘unaccept
the status quo’ imposed
by closed source solutions vendors and benefit from the economics and
opportunity of Hadoop as a central, enterprise data platform that
addresses the challenges and opportunities presented by big data.

Beyond SQL: Now Everyone Can Benefit from Hadoop
As enterprises look for ways to derive value from all their data, a
pervasive challenge has emerged: how to make all data available and
consumable beyond IT departments, so it can be more widely leveraged across
an entire organization. Cloudera Search expands the data exploration
capabilities of Hadoop with faceted navigation and full-text search
to find data for processing and analysis. It puts the power of data
discovery into the hands of non-technical teams, enabling line of
business and everyday users to interact with and uncover relevant
correlations from data in a familiar interface. Companies can provide
access to a centralized data repository, make it accessible to
anyone who wants to derive insight, and consolidate search and Hadoop
cluster investments into one complete solution with unified
management and control through Cloudera Manager.

"Data is one of the most valuable assets we have when it
comes to preventative mental and physical healthcare," said
Chris Poulin, managing partner, Patterns and Predictions. "With
next generation predictive analytics tools powered by Hadoop,
healthcare providers can now address healthcare issues proactively
and hope to solve even the most intractable challenges, like suicide
prevention for military veterans. With the power to correlate medical
reports, patient records, care provider notes, and social media data
along with other relevant data sources, we can cultivate a deeper,
more holistic understanding of patients and disease to support better
treatment plans and optimize patient care. By giving non-technical
individuals the power to perform real-time search and queries on data
stored in Hadoop, Cloudera is providing critical tools to advance
healthcare innovation and discovery."

Beyond Batch: Real-Time Interaction with Data in Hadoop

Cloudera Search provides enterprises with scalable indexing options for
big data and extends the Solr project to offer near real-time document
processing and indexing of data in transit to Hadoop and other storage
endpoints. Data is available to Search and other Hadoop computing
frameworks, like Apache Hive and Cloudera Impala. It
also provides linearly scalable, on-demand batch indexing for large data
stores within Hadoop and, with the introduction of a GoLive
feature, can incorporate incremental index changes while avoiding
downtime.

"We have been leveraging Cloudera Search for OpenStack log
exploration with great success. It delivers an open source solution
for near real-time operational insights stored in Hadoop, and
supports faster analytics and time to insight through applications
like Cloudera Impala and other workloads," said Joseph
George, director of product strategy in Dell Inc.’s revolutionary
solutions team. "With Cloudera Search, Hadoop has become the
master data hub, where search indexes can be easily built on demand,
executed, stored and easily managed."

"It’s exciting to see Lucene, a project I started 15 years
ago, be included in CDH," said Doug Cutting, chief
architect, Cloudera. "Search is an incredibly powerful tool –
now it’s scalable and integrated with the Hadoop platform."

Highlights:
The product is designed to help business users locate data quickly in
Hadoop for further processing and analysis. It is integrated with the CDH
platform.

Scalable Index Storage in HDFS: integrates index
storage and serving into HDFS

Batch Indexing via MapReduce: creates indexes over data
stored in HDFS and HBase, with indexing that scales as MapReduce does

Real-time Indexing at Collection: makes an event
searchable as it is stored in Hadoop through near real-time
indexing features powered by Apache Flume

Interaction and Data Exploration via Cloudera Hue:
provides a plug-in application for Hue and capabilities for
standard Hue servers to query data and view result files, and enables
faceted exploration

Field Extraction and Cross-Platform Data Processing:
extracts fields from any data stored in HDFS using Hadoop file
formats, such as Apache Avro, avoiding the pain that many standalone
search solutions might impose, and promotes reusable configurations and
processing activities through the Cloudera Morphlines processing
framework

Unified Management and Monitoring with Cloudera Manager:
provides a centralized management and monitoring experience
that makes it as easy to deploy, configure, and monitor search
services as it is to manage CDH deployments and other services on the
Hadoop cluster

"We’re bringing the band back together with Cloudera
Search," said Mike Olson, CEO, Cloudera.
"Based on 100% open source Apache Solr, a Lucene project and
another Doug Cutting original, Cloudera Search is now fully
integrated into our industry leading CDH big data platform. After a
successful private beta, it’s the latest in a series of major
innovations that we’ve brought to market designed to speed up and
simplify an organization’s ability to get the most out of their data.
We are further democratizing access to mission-critical information
stored in Hadoop by ensuring those without programming expertise can
gain insight, find patterns and derive true value from their
information assets. Year after year we continue to push the
boundaries of what is possible with Hadoop; we have the best minds in
data management focused on advancing business transformation."

The first in the market to ship code, Cloudera Search is available as a
supplemental module for Cloudera Enterprise subscribers.