Big Data must become a first class citizen in the enterprise
An Ovum white paper for Cloudera
Publication Date: 14 January 2014
Author: Tony Baer
SUMMARY
Catalyst
Big Data analytics have caught the imagination of enterprises because of the opportunities for
discovering new insights from data beyond the reach of enterprise data warehouses, using a
variety of approaches, some of which were not previously feasible using relational databases.
Created by a community of developers from the Internet world, Hadoop has emerged as the
leading new platform for Big Data analytics because of its scalability, flexibility, and reliance on
low-cost commodity infrastructure. Not surprisingly, given that Hadoop is an emerging platform, early adopters typically deployed it on dedicated infrastructure because of its unique resource consumption characteristics, and with dedicated teams because of the need for highly specialized skills. Clearly, this
implementation pattern will not be sustainable for enterprises, which need to
accommodate Hadoop and Big Data analytics largely with the teams and IT infrastructure they already have.
Ovum view
Big Data -- and Hadoop -- must become first class citizens in the enterprise. The technology must
become accessible to the people and skills that already form the IT organization. Big Data
platforms cannot exist as islands of their own. Instead, they must map to existing data center
infrastructure, policies, and practices for managing resources and capacity; meeting service level
requirements; and governing and securing data. In turn, Big Data projects must address
competitive or operational issues that already face the organization. An "embrace and extend"
strategy is essential, as Big Data will require new skillsets, adaptations to running the data center,
and new approaches to analyzing data. With Hadoop rapidly evolving from a raw open source
framework to an enterprise data platform, enterprises should evaluate the vendor's roadmap for
promoting accessibility, along with integrating its offering within the data center and the existing data
warehousing environment.
Key messages
- To address the needs of enterprises, Big Data must become a first class citizen with the IT organization, the data center, and the business.
- Due to its scalability, flexibility, and economics, Hadoop has emerged as the leading analytic platform for Big Data.
- The most direct path to making Big Data -- and Hadoop -- a first-class citizen will be through an "embrace and extend" approach that not only maps to existing skill sets, data center policies and practices, and business use cases, but also extends them.
- Big Data platform vendors must design their offerings to deliver the same degree of manageability, security, and integration as established data warehousing systems.
FROM SWAT TEAM TO ENTERPRISE MAINSTREAM
Hadoop's emergence
Modern Big Data implementations originated with Internet companies whose analytic
compute needs overwhelmed the carrying capacity of established SQL relational database
technology in several ways. The sheer volume of data overwhelmed existing relational data
warehouses, with daily refreshes exceeding their batch windows, while the sheer variety of data
was difficult to model because of volatility, not only in data structure but also in analytic needs.
Furthermore, as data volumes surged into the petabytes, the costs of licensing and maintaining
traditional relational platforms grew unaffordable. Not surprisingly, the relational data
warehousing model broke down for Internet companies seeking to build search indexes, optimize
ad placement, or enhance online gamer experiences.
As a result, Internet firms created their own Big Data technology (primarily, but not exclusively,
Hadoop) and open sourced it; running it required special expertise and dedicated infrastructure.
The stakes for market dominance were high and resources were deep. Hadoop emerged as a
data processing framework designed to solve unique, Internet-scale operational problems such
as optimizing ad placement or building search indexes.
In its early days, Hadoop lacked tooling, and its performance management and resource
consumption characteristics were not well understood. Consequently, there were few practitioners available, with deployments typically managed as separate projects tended by small, elite groups
of programmers on clusters apart from the data center. As such, at the time there were few
concerns over security, capacity utilization, data stewardship, or information lifecycle management.
Significantly, the primary security concern with early installations was authenticating users to gain
access to remote clusters to provision additional compute capacity.
Making the transition to the enterprise
With early successes, enterprises grew interested in applying the scalability and power of Hadoop
to address issues such as optimizing the customer experience; increasing operational efficiency;
or improving risk mitigation, fraud detection, and compliance. Hadoop also started maturing as
vendors began offering commercial support with value-added features such as simplified
deployment; integrated monitoring; enhanced data ingestion and integration; authentication,
authorization, and access control; data security; and support for new processing frameworks
providing alternatives to MapReduce.
The "SWAT team" model used by early adopters for implementing Hadoop is clearly
unsustainable for mainstream enterprises, which cannot afford to replace their SQL developers with
new talent, run Hadoop clusters as separate islands, or treat every question as a unique data
science exercise. Furthermore, as enterprises implement Hadoop, they must deal with the same
constraints and requirements that are customary for any major business application or data
platform, because nobody has unlimited capital budgets to keep opening or expanding data
centers dedicated to Big Data and Hadoop. That entails policies regarding data access and
utilization, protection of customer privacy, and the need to manage compute capacity and maintain
service levels in data centers with finite capacity.
BECOMING A FIRST CLASS CITIZEN
The goals are the same, but the means are different
Enterprise interest in Big Data, and in using the Hadoop platform, is evolution, not revolution: it is
about gaining insight to address competitive, strategic, or operational issues facing the
organization. With Big Data, the difference is that there is now more data -- and more kinds of it --
that can be used to derive that insight.
The goal remains the same; however, with Big Data, the means may be different. For instance,
queries can evolve with the organization's needs, because data does not have to be formed into a
schema until run time. They can be run using SQL, or using other approaches such as MapReduce,
for large-scale processing; streaming, for real-time operational decisions; search, which adds another
technique for ad hoc analytics that is useful when starting with variably structured data; and so on.
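The schema-on-read idea can be illustrated outside of Hadoop with a few lines of plain Python: raw records are stored as-is, and each query imposes only the structure it needs at run time. The records, field names, and helper function below are hypothetical, a sketch of the concept rather than any Hadoop API:

```python
import json

# Raw, variably structured event records stored as-is (no upfront schema).
raw_events = [
    '{"user": "a1", "action": "click", "ad_id": 17}',
    '{"user": "b2", "action": "search", "terms": "flights"}',
    '{"user": "a1", "action": "click", "ad_id": 42, "referrer": "mail"}',
]

def query(records, fields):
    """Apply a 'schema' (a field list) at read time, skipping records
    that lack the requested fields rather than rejecting them at load."""
    out = []
    for line in records:
        rec = json.loads(line)
        if all(f in rec for f in fields):
            out.append(tuple(rec[f] for f in fields))
    return out

# Two different questions, two different run-time schemas, same raw data.
clicks = query(raw_events, ["user", "ad_id"])    # ad-placement view
searches = query(raw_events, ["user", "terms"])  # search view
print(clicks)    # [('a1', 17), ('a1', 42)]
print(searches)  # [('b2', 'flights')]
```

Because the raw data is preserved, a new question later simply means a new field list; no table has to be redesigned and reloaded.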
Big Data may involve new platforms in addition to relational systems; Hadoop has emerged as the
leading alternative to relational platforms for Big Data analytics on the strength of its low costs,
flexibility, and scalability.
Supporting the analytic value chain
As Hadoop becomes more enterprise-ready, its role is evolving from offline data storage and
exploratory processing platform to one that could supplement, or claim outright, the role of supporting the
analytic value chain from end to end. Hadoop's strength is not only its economics and scalability,
but also its flexibility for managing data and its growing capability to execute multiple types of
analytic and operational workloads.
That dictates that Hadoop become an intrinsic part of the analytic value chain, not a separate
island: it must become a first class citizen with IT, the data center, and the enterprise, as shown in Table 1.
Table 1. Making Hadoop a first-class citizen

IT organization
- Customer: Hadoop implementation becomes accessible to existing skillsets.
- Vendor and/or open source community: Extend Hadoop platform features, making the platform accessible to developers skilled in SQL, Java, and popular scripting languages.

Data center
- Customer: Hadoop must be managed to support existing data center policies, practices, and constraints.
- Vendor and/or open source community: Develop/improve data management and governance capabilities: tracking data consumption and lineage; security, including access control, authorization, and authentication; the ability to deliver predictable service levels, availability, and reliability; and full backup and disaster recovery capabilities.

Enterprise
- Customer: Hadoop and Big Data analytics are performed to address familiar enterprise business issues.
- Vendor and/or open source community: Support integration with existing and emerging Big Data analytic tools and applications.

Source: Ovum
Embrace and Extend
Based on experiences of Ovum enterprises clients, we have found that the most effective strategy for implementing Hadoop and Big Data analytics will involve an "embrace and extend" strategy
that builds off existing competencies, policies, and analytics, and extends them to leverage the
unique benefits that Big Data analytics and knowledge of the Hadoop platform provides (see
Figure 1). Therefore, beyond mapping Hadoop implementation to existing IT organization skills
base, data center policies and practices, and enterprise business cases, it will require adaptation
that:
?Extends platform and analytics know-how;
?Modifies data center operation to account for new forms and volumes of data; and
Extends the reach of analytics to address existing issues with new approaches or forms of querying.
Figure 1. Embrace and Extend
Source: Ovum
For the IT organization
Embrace existing SQL, Java, Python, and similar programming language skills bases. While
Hadoop has long offered features such as Hive (Hadoop's SQL-like implementation of a data warehouse) and Pig (a data flow language familiar to programmers), new capabilities are emerging to support interactive SQL. Likewise, Hadoop programming
frameworks such as MapReduce and Spark were designed for Java, and can accommodate
analytic programs written in other popular languages such as Python or R. In many cases,
organizations adopting Hadoop can utilize many of their existing tools, as most BI,
analytics, and data transformation tool providers have already extended support for the platform.
To take maximum advantage of the power of Hadoop, these skills should be extended for working with larger, more variable, and changing sets of data. For instance, while schema remains
essential, developers should take advantage of Hadoop's support for building schema at run time.
Additionally, new techniques, such as search, graph, and stream processing, can add context to analytics, probe relationships between groups of people or things, and open a window to closed-loop, real-time operational insight. In some cases, roles may be extended; power users could
assume data curation roles, where they not only generate queries, but also help identify potentially relevant sets of data from internal and external sources for analysis.
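As a sketch of how an existing Python skills base maps onto the MapReduce model, the classic word count can be written as a plain mapper and reducer. On a real cluster, Hadoop Streaming would run these as separate processes over HDFS data; the run_job helper simulating the shuffle phase here is illustrative, not part of any Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line of input."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: sum all the counts emitted for a single word."""
    return (word, sum(counts))

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce locally, in one process."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(word, (c for _, c in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(run_job(["big data big deal", "data center"]))
# [('big', 2), ('center', 1), ('data', 2), ('deal', 1)]
```

The same two-function shape carries over to real deployments: the framework handles distribution and fault tolerance, while the developer writes ordinary Python.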
For the Data Center operation
Few enterprises have unlimited budgets when it comes to building and running their data centers.
Likewise, many organizations may be subject to regulatory scrutiny regarding access to and usage of sensitive data. As such, Hadoop installations must embrace the rules, policies, and practices that are expected of any data platform -- especially since, in many cases, Hadoop may store the same types of structured data that have been stored in relational data warehouses (this is especially common with active archiving use cases). But Hadoop must also extend those rules to account for the unique demands of ingesting, storing, and consuming new types of data sets. This impacts the conduct of security, resource management, and data governance and stewardship, as described below.
Security
This encompasses managing access and authorization for different classes of end users, and strong measures for authenticating them. Depending on the sensitivity of the data, security may also involve protecting the sanctity of the data, safeguarding the privacy of customer records, and closely monitoring (and managing) how the data is used or transformed.
Resource management and service level management
While a key benefit of Hadoop is its reliance on inexpensive commodity infrastructure, at some point there are limits to how much compute or storage can be allocated. Hadoop platforms (and/or third-party tools) must support resource management policies, rules, and practices that prioritize workloads, and provide capabilities for managing service levels (encompassing monitoring performance, balancing load, and ensuring availability and reliability). On the horizon, there will be demand for managing the full lifecycle of data, from optimizing the tiering of hot data into memory to archival or disposal.
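As one concrete (and purely illustrative) example of such workload prioritization, the YARN Capacity Scheduler in Hadoop 2.x lets operators carve cluster capacity into queues via capacity-scheduler.xml; the queue names and percentages in this sketch are hypothetical:

```xml
<!-- capacity-scheduler.xml (illustrative fragment) -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>production,adhoc</value>
  </property>
  <property>
    <!-- Guarantee 70% of cluster resources to production workloads -->
    <name>yarn.scheduler.capacity.root.production.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
  <property>
    <!-- Let ad hoc jobs burst above their share when the cluster is idle -->
    <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
```

Policies like this let exploratory Big Data work share a cluster with production jobs rather than demanding dedicated hardware.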
Data governance and stewardship
Big Data does not change the need for data quality, but it may demand different approaches
based on the nature and sensitivity of the data, and on the types of queries that will be run against it (will the queries be exploratory in nature, or will they require precise answers?). For instance, some data types, such as machine data or log files, will not necessarily be cleansed, while other data types (e.g., social network or mobile device geolocation data) may become more valuable when correlated with
existing customer master identities. Compared to traditional data warehousing practices, there will be a broader range of approaches to managing the quality of Big Data, from record-by-record
cleansing to alternatives that utilize probabilistic matching, machine learning, crowdsourcing, and
other approaches. Additionally, data lineage solutions, which track data by source, will become
useful tools for assessing the quality of data by how it is used and by the reliability of the source.
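The probabilistic matching mentioned above can be sketched in a few lines of standard-library Python; the records and threshold are hypothetical, and production systems would use far more sophisticated scoring:

```python
from difflib import SequenceMatcher

# Hypothetical master identity and incoming records from a new data source.
master = "Jane Q. Public, 10 Main St, Springfield"
candidates = [
    "Jane Public, 10 Main Street, Springfield",  # likely the same person
    "John Doe, 99 Elm Ave, Shelbyville",         # likely a different person
]

def match_score(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # tuning this trades false matches against missed matches
for record in candidates:
    score = match_score(master, record)
    verdict = "MATCH" if score >= THRESHOLD else "no match"
    print(f"{score:.2f}  {verdict}  {record}")
```

The key design point is that records are scored rather than tested for exact equality, so name and address variants can still be linked to a master identity.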
For the Enterprise
One of the most frequent questions that Ovum receives from clients is how to get started with Big
Data. We believe that is the wrong question to ask. The purpose is not necessarily to work
with Big Data for its own sake, but to identify use cases where Big Data can pick up where
conventional analytics leave off, providing better answers to existing competitive, operational, or
compliance-related issues facing the enterprise.
Making Big Data a first-class citizen in the enterprise means embracing the business cases that
are already important to the enterprise, while having the ability to re-imagine analytics without the
constraints imposed by relational systems, to reveal new answers. For instance, Hadoop's support
for schema on read allows organizations to preserve the original raw data, enabling them to ask
new questions of different pieces of data that become more relevant as conditions in the
marketplace change. Hadoop's scalability and flexibility enable organizations to extend their
analytics across diverse sets of data that were traditionally not stored inside enterprise data
warehouses, and to run different types of queries (e.g., streaming or graph analytics) that were not
feasible with SQL.
CLOUDERA'S STRATEGY FOR ENTERPRISE HADOOP
From offline data store to enterprise data hub
Cloudera was the first vendor to deliver commercial support for Hadoop, and its strategy has been
consistent with Ovum's vision for making the platform a first class citizen of the data center. Its
positioning of Cloudera Distribution including Hadoop (CDH) as an enterprise data hub is a clear
acknowledgement that Hadoop must become sufficiently robust to provide the platform for
managing multiple forms of data, with the capability for running multiple types of workloads.
Admittedly, the quest for furnishing the logical and physical hub for enterprise data is, and will
continue to be, a hotly contested one. The takeaway is that delivering such a hub will not be
possible unless the platform can reside as a first-class citizen in the data center, providing full
manageability and support for enterprise policies regarding data access, protection, utilization,
stewardship, and governance.
Adding capabilities for data management, access, and query
Cloudera has been building towards this strategy by supporting (and contributing to) the relevant
Apache open source projects, and by delivering value-added features of its own to make Hadoop more
manageable. For instance, Cloudera Manager automates deployment and configuration of Hadoop
platform components; manages rolling updates, restarts, and rollbacks; and provides features for
monitoring system health and diagnostics. Recent enhancements include an automated backup and
recovery feature that not only replicates data, but preserves all the metadata to ensure that data
remains in sync even after restoration. Cloudera Navigator, another recently added capability,
addresses data lineage by tracking the origin and use of data, and by selectively enforcing access
to specific sets of data.
Cloudera is also making Hadoop more accessible to the large professional skills base of SQL
developers. Having long partnered with leading ETL, BI, and data warehousing platform and tool
providers to provide connectivity between Hadoop and relational platforms, Cloudera has taken the
next step with Impala, which supports interactive SQL queries through a high-performance, parallel
processing framework that works against any Hadoop file format. Impala is intended to
supplement, not replace, the enterprise data warehouse, providing an interface that can be
utilized not only by SQL developers, but also by familiar SQL-based query and BI tools from
providers such as Tableau, QlikView, and MicroStrategy.
Cloudera is also working on other initiatives designed to make Hadoop more versatile and accessible.
Cloudera Search optimizes Apache Solr for the Hadoop platform, enabling users to query Hadoop
data through a Google-like search experience. Additionally, Cloudera's support of the Apache Spark project will provide a complementary in-memory programming model for analytics.
RECOMMENDATIONS FOR ENTERPRISES
Big Data and Hadoop should be evolutionary moves for expanding the scope of analytics.
Ultimately, Ovum believes that most enterprises will implement Big Data analytics as part of an
analytics ecosystem where queries are directed at the right data sets, on the right platform, at the
right time, based on parameters such as cost, priority, required service levels, and location of the
data. Such federated analytics will provide enterprises the flexibility they need -- and they are only
possible if Hadoop is integrated with the rest of the analytic data platform environment.
When evaluating Hadoop platforms, examine the vendor's roadmap for supporting data integration,
along with the core management, security, and data management capabilities that are deemed
essential for any data warehousing platform. Admittedly, Hadoop is a rapidly evolving target;
while the platform may not currently have parity with established relational data warehousing
systems, new capabilities are emerging rapidly from open source and vendor-specific technologies
and innovations.
Nonetheless, as the natural path for most organizations is to pilot, it is not essential that all
capabilities be available on day one. However, in the long run, your enterprise should plan on
Hadoop as an addition that will function inside your data center. Adopting an "embrace and
extend" strategy, your Hadoop implementation should be compliant with your existing policies
regarding data access, security, data quality, and lifecycle management; but at the same time,
those policies and practices will have to be extended because of the unique characteristics (and
benefits) of managing Big Data.
APPENDIX
Author
Tony Baer, Principal Analyst, Ovum IT Information Management
tony.baer@https://www.wendangku.net/doc/e117441773.html,
Ovum Consulting
We hope that this analysis will help you make informed and imaginative business decisions. If you have further requirements, Ovum's consulting team may be able to help you. For more information about Ovum's consulting capabilities, please contact us directly at consulting@https://www.wendangku.net/doc/e117441773.html.
Disclaimer
All Rights Reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form by any means, electronic, mechanical, photocopying, recording, or otherwise, without the
prior permission of the publisher, Ovum (an Informa business).
The facts of this report are believed to be correct at the time of publication but cannot be
guaranteed. Please note that the findings, conclusions, and recommendations that Ovum delivers
will be based on information gathered in good faith from both primary and secondary sources,
whose accuracy we are not always in a position to guarantee. As such Ovum can accept no
liability whatever for actions taken based on any information that may subsequently prove to be
incorrect.