Big Data must become a first class citizen in the enterprise
An Ovum white paper for Cloudera
Publication Date: 14 January 2014
Author: Tony Baer
SUMMARY
Catalyst
Big Data analytics have caught the imagination of enterprises because of the opportunities for
discovering new insights from data beyond the reach of enterprise data warehouses, using a
variety of approaches, some of which were not previously feasible using relational databases.
Created by a community of developers from the Internet world, Hadoop has emerged as the
leading new platform for Big Data analytics because of its scalability, flexibility, and reliance on
low-cost commodity infrastructure. Not surprisingly, given that Hadoop is an emerging platform, early adopters typically deployed it on dedicated infrastructure because of its unique resource consumption characteristics, and with dedicated teams because of the need for highly specialized skills. Clearly, this
implementation pattern will not be sustainable for enterprises, which need to
accommodate Hadoop and Big Data analytics largely with the teams and IT infrastructure they already have.
Ovum view
Big Data -- and Hadoop -- must become first class citizens in the enterprise. The technology must
become accessible to the people and skills that already form the IT organization. Big Data
platforms cannot exist as islands of their own. Instead, they must map to existing data center
infrastructure, policies, and practices for managing resources and capacity; meeting service level
requirements; and governing and securing data. In turn, Big Data projects must address
competitive or operational issues that already face the organization. An "embrace and extend"
strategy is essential, as Big Data will require new skillsets, adaptations to running the data center,
and new approaches to analyzing data. With Hadoop rapidly evolving from a raw open source
framework to an enterprise data platform, enterprises should evaluate the vendor's roadmap for
promoting accessibility, along with integrating its offering within the data center and the existing data
warehousing environment.
Key messages
- To address the needs of enterprises, Big Data must become a first class citizen with the IT organization, the data center, and the business.
- Due to its scalability, flexibility, and economics, Hadoop has emerged as the leading analytic platform for Big Data.
- The most direct path to making Big Data -- and Hadoop -- a first-class citizen will be through an "embrace and extend" approach that not only maps to existing skill sets, data center policies and practices, and business use cases, but also extends them.
- Big Data platform vendors must design their offerings to deliver the same degree of manageability, security, and integration as established data warehousing systems.
FROM SWAT TEAM TO ENTERPRISE MAINSTREAM
Hadoop's emergence
Modern Big Data implementations originated with Internet companies whose analytic
compute needs overwhelmed the carrying capacity of established SQL relational database
technology in several ways. The sheer volume of data overwhelmed existing relational data
warehouses, with daily refreshes exceeding their batch windows, while the sheer variety of data
was difficult to model because of volatility, not only in data structure but also in analytic needs.
Furthermore, as data volumes surged into the petabytes, the costs of licensing and maintaining
traditional relational platforms grew unaffordable. Not surprisingly, the relational data
warehousing model broke down for Internet companies seeking to build search indexes, optimize
ad placement, or enhance online gamer experiences.
As a result, Internet firms created their own Big Data technology (primarily, but not exclusively,
Hadoop) and open sourced it; running it required special expertise and dedicated infrastructure.
The stakes for market dominance were high and resources were deep. Hadoop emerged as a
data processing framework designed to solve unique, Internet-scale operational problems such
as optimizing ad placement or building search indexes.
In its early days, Hadoop lacked tooling, and its performance management and resource
consumption characteristics were not well understood. Consequently, there were few practitioners available, with deployments typically managed as separate projects tended by small, elite groups
of programmers on clusters apart from the data center. As such, at the time there were few
concerns over security, capacity utilization, data stewardship, or information lifecycle management.
Significantly, the primary security concern with early installations was authenticating users to gain
access to remote clusters to provision additional compute capacity.
Making the transition to the enterprise
With early successes, enterprises grew interested in applying the scalability and power of Hadoop
to address issues such as optimizing the customer experience; increasing operational efficiency;
or improving risk mitigation, fraud detection, and compliance. Hadoop also started maturing as
vendors began offering commercial support with value-added features such as simplified
deployment; integrated monitoring; enhanced data ingestion and integration; authentication,
authorization, and access control; data security; and support for new processing frameworks
providing alternatives to MapReduce.
The "SWAT team" model used by early adopters for implementing Hadoop is clearly
unsustainable for mainstream enterprises, which cannot afford to replace their SQL developers with
new talent, run Hadoop clusters as separate islands, or treat every question as a unique data
science exercise. Furthermore, as enterprises implement Hadoop, they must deal with the same
constraints and requirements that are customary for any major business application or data
platform, because nobody has unlimited capital budgets to keep opening or expanding data
centers dedicated to Big Data and Hadoop. That entails policies regarding data access and
utilization, protection of customer privacy, and the need to manage compute capacity and maintain
service levels in data centers with finite capacity.
BECOMING A FIRST CLASS CITIZEN
The goals are the same, but the means are different
Enterprise interest in Big Data, and in using the Hadoop platform, is evolution, not revolution: it is
about gaining insight to address competitive, strategic, or operational issues facing the
organization. With Big Data, the difference is that there is now more data -- and more kinds of it --
that can be used to derive that insight.
The goal remains the same; however, with Big Data, the means may be different. For instance,
queries can evolve with the organization's needs, because data does not have to be formed into a
schema until run time. They can be run using SQL, or using other approaches such as MapReduce,
for large-scale processing; streaming, for real-time operational decisions; search, which adds another
technique for ad hoc analytics that is useful when starting with variably structured data; and so on.
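The schema-on-read idea can be illustrated outside of Hadoop with a few lines of plain Python: raw records are stored as-is, and each query imposes only the structure it needs at run time. The records, field names, and helper function below are hypothetical, a sketch of the concept rather than any Hadoop API:

```python
import json

# Raw, variably structured event records stored as-is (no upfront schema).
raw_events = [
    '{"user": "a1", "action": "click", "ad_id": 17}',
    '{"user": "b2", "action": "search", "terms": "flights"}',
    '{"user": "a1", "action": "click", "ad_id": 42, "referrer": "mail"}',
]

def query(records, fields):
    """Apply a 'schema' (a field list) at read time, skipping records
    that lack the requested fields rather than rejecting them at load."""
    out = []
    for line in records:
        rec = json.loads(line)
        if all(f in rec for f in fields):
            out.append(tuple(rec[f] for f in fields))
    return out

# Two different questions, two different run-time schemas, same raw data.
clicks = query(raw_events, ["user", "ad_id"])    # ad-placement view
searches = query(raw_events, ["user", "terms"])  # search view
print(clicks)    # [('a1', 17), ('a1', 42)]
print(searches)  # [('b2', 'flights')]
```

Because the raw data is preserved, a new question later simply means a new field list; no table has to be redesigned and reloaded.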
Big Data may involve new platforms in addition to relational systems; Hadoop has emerged as the
leading alternative to relational platforms for Big Data analytics on the strength of its low costs,
flexibility, and scalability.
Supporting the analytic value chain
As Hadoop becomes more enterprise-ready, its role is evolving from offline data storage and
exploratory processing platform to one that could supplement, or claim outright, the role of supporting the
analytic value chain from end to end. Hadoop's strength is not only its economics and scalability,
but also its flexibility for managing data and its growing capability to execute multiple types of
analytic and operational workloads.
That dictates that Hadoop become an intrinsic part of the analytic value chain, not a separate
island: it must become a first class citizen with IT, the data center, and the enterprise, as shown in Table 1.
Table 1. Making Hadoop a first-class citizen

IT organization
- Customer: Hadoop implementation becomes accessible to existing skillsets.
- Vendor and/or open source community: Extend Hadoop platform features, making the platform accessible to developers skilled in SQL, Java, and popular scripting languages.

Data center
- Customer: Hadoop must be managed to support existing data center policies, practices, and constraints.
- Vendor and/or open source community: Develop/improve data management and governance capabilities: tracking data consumption and lineage; security, including access control, authorization, and authentication; the ability to deliver predictable service levels, availability, and reliability; and full backup and disaster recovery capabilities.

Enterprise
- Customer: Hadoop and Big Data analytics are performed to address familiar enterprise business issues.
- Vendor and/or open source community: Support integration with existing and emerging Big Data analytic tools and applications.

Source: Ovum
Embrace and Extend
Based on experiences of Ovum enterprises clients, we have found that the most effective strategy for implementing Hadoop and Big Data analytics will involve an "embrace and extend" strategy
that builds off existing competencies, policies, and analytics, and extends them to leverage the
unique benefits that Big Data analytics and knowledge of the Hadoop platform provides (see
Figure 1). Therefore, beyond mapping Hadoop implementation to existing IT organization skills
base, data center policies and practices, and enterprise business cases, it will require adaptation
that:
?Extends platform and analytics know-how;
?Modifies data center operation to account for new forms and volumes of data; and
Extends the reach of analytics to address existing issues with new approaches or forms of querying.
Figure 1. Embrace and Extend
Source: Ovum
For the IT organization
Embrace existing SQL, Java, Python, and similar programming language skills bases. While
Hadoop has long offered features such as Hive (Hadoop's SQL-like implementation of a data warehouse) and Pig (a data flow language familiar to programmers), new capabilities are emerging to support interactive SQL. Likewise, Hadoop programming
frameworks such as MapReduce and Spark were designed for Java, and can accommodate
analytic programs written in other popular languages such as Python or R. In many cases,
organizations adopting Hadoop can utilize many of their existing tools, as most BI,
analytics, and data transformation tool providers have already extended support for the platform.
To take maximum advantage of the power of Hadoop, these skills should be extended for working with larger, more variable, and changing sets of data. For instance, while schema remains
essential, developers should take advantage of Hadoop's support for building schema at run time.
Additionally, new techniques, such as search, graph, and stream processing, can add context to analytics, probe relationships between groups of people or things, and open a window to closed-loop, real-time operational insight. In some cases, roles may be extended; power users could
assume data curation roles, where they not only generate queries, but also help identify potentially relevant sets of data from internal and external sources for analysis.
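As a sketch of how an existing Python skills base maps onto the MapReduce model, the classic word count can be written as a plain mapper and reducer. On a real cluster, Hadoop Streaming would run these as separate processes over HDFS data; the run_job helper simulating the shuffle phase here is illustrative, not part of any Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line of input."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: sum all the counts emitted for a single word."""
    return (word, sum(counts))

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce locally, in one process."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(word, (c for _, c in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(run_job(["big data big deal", "data center"]))
# [('big', 2), ('center', 1), ('data', 2), ('deal', 1)]
```

The same two-function shape carries over to real deployments: the framework handles distribution and fault tolerance, while the developer writes ordinary Python.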
For the Data Center operation
Few enterprises have unlimited budgets when it comes to building and running their data centers.
Likewise, many organizations may be subject to regulatory scrutiny regarding access to and usage of sensitive data. As such, Hadoop installations must embrace the rules, policies, and practices that are expected of any data platform -- especially since, in many cases, Hadoop may store the same types of structured data that have been stored in relational data warehouses (this is especially common with active archiving use cases). But Hadoop must also extend those rules to account for the unique demands of ingesting, storing, and consuming new types of data sets. This impacts the conduct of security, resource management, and data governance and stewardship, as described below.
Security
This encompasses managing access and authorization for different classes of end users, and strong measures for authenticating them. Depending on the sensitivity of the data, security may also involve protecting the sanctity of the data, safeguarding the privacy of customer records, and closely monitoring (and managing) how the data is used or transformed.
Resource management and service level management
While a key benefit of Hadoop is its reliance on inexpensive commodity infrastructure, at some point there are limits to how much compute or storage can be allocated. Hadoop platforms (and/or third-party tools) must support resource management policies, rules, and practices that prioritize workloads, and provide capabilities for managing service levels (encompassing monitoring performance, balancing load, and ensuring availability and reliability). On the horizon, there will be demand for managing the full lifecycle of data, from optimizing the tiering of hot data into memory to archival or disposal.
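As one concrete (and purely illustrative) example of such workload prioritization, the YARN Capacity Scheduler in Hadoop 2.x lets operators carve cluster capacity into queues via capacity-scheduler.xml; the queue names and percentages in this sketch are hypothetical:

```xml
<!-- capacity-scheduler.xml (illustrative fragment) -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>production,adhoc</value>
  </property>
  <property>
    <!-- Guarantee 70% of cluster resources to production workloads -->
    <name>yarn.scheduler.capacity.root.production.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
  <property>
    <!-- Let ad hoc jobs burst above their share when the cluster is idle -->
    <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
```

Policies like this let exploratory Big Data work share a cluster with production jobs rather than demanding dedicated hardware.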
Data governance and stewardship
Big Data does not change the need for data quality, but it may demand different approaches
based on the nature and sensitivity of the data, and on the types of queries that will be run against it (will the queries be exploratory in nature, or will they require precise answers?). For instance, some data types, such as machine data or log files, will not necessarily be cleansed, while other data types (e.g., social network or mobile device geolocation data) may become more valuable when correlated with
existing customer master identities. Compared to traditional data warehousing practices, there will be a broader range of approaches to managing the quality of Big Data, from record-by-record
cleansing to alternatives that utilize probabilistic matching, machine learning, crowdsourcing, and
other approaches. Additionally, data lineage solutions, which track data by source, will become
useful tools for assessing the quality of data by how it is used and by the reliability of the source.
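The probabilistic matching mentioned above can be sketched in a few lines of standard-library Python; the records and threshold are hypothetical, and production systems would use far more sophisticated scoring:

```python
from difflib import SequenceMatcher

# Hypothetical master identity and incoming records from a new data source.
master = "Jane Q. Public, 10 Main St, Springfield"
candidates = [
    "Jane Public, 10 Main Street, Springfield",  # likely the same person
    "John Doe, 99 Elm Ave, Shelbyville",         # likely a different person
]

def match_score(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # tuning this trades false matches against missed matches
for record in candidates:
    score = match_score(master, record)
    verdict = "MATCH" if score >= THRESHOLD else "no match"
    print(f"{score:.2f}  {verdict}  {record}")
```

The key design point is that records are scored rather than tested for exact equality, so name and address variants can still be linked to a master identity.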
For the Enterprise
One of the most frequent questions that Ovum receives from clients is how to get started with Big
Data. We believe that is the wrong question to ask. The purpose is not necessarily to work
with Big Data for its own sake, but to identify use cases where Big Data can pick up where
conventional analytics leave off, providing better answers to existing competitive, operational, or
compliance-related issues facing the enterprise.
Making Big Data a first-class citizen in the enterprise means embracing the business cases that
are already important to the enterprise, while having the ability to re-imagine analytics without the
constraints imposed by relational systems, to reveal new answers. For instance, Hadoop's support
for schema on read allows organizations to preserve the original raw data, enabling them to ask
new questions of different pieces of data that become more relevant as conditions in the
marketplace change. Hadoop's scalability and flexibility enable organizations to extend their
analytics across diverse sets of data that were traditionally not stored inside enterprise data
warehouses, and to run different types of queries (e.g., streaming or graph analytics) that were not
feasible with SQL.
CLOUDERA'S STRATEGY FOR ENTERPRISE HADOOP
From offline data store to enterprise data hub
Cloudera was the first vendor to deliver commercial support for Hadoop, and its strategy has been
consistent with Ovum's vision for making the platform a first class citizen of the data center. Its
positioning of Cloudera Distribution including Hadoop (CDH) as an enterprise data hub is a clear
acknowledgement that Hadoop must become sufficiently robust to provide the platform for
managing multiple forms of data, with the capability for running multiple types of workloads.
Admittedly, the quest for furnishing the logical and physical hub for enterprise data is, and will
continue to be, a hotly contested one. The takeaway is that delivering such a hub will not be
possible unless the platform can reside as a first-class citizen in the data center, providing full
manageability and support for enterprise policies regarding data access, protection, utilization,
stewardship, and governance.
Adding capabilities for data management, access, and query
Cloudera has been building towards this strategy by supporting (and contributing to) the relevant
Apache open source projects, and by delivering value-added features of its own to make Hadoop more
manageable. For instance, Cloudera Manager automates deployment and configuration of Hadoop
platform components; manages rolling updates, restarts, and rollbacks; and provides features for
monitoring system health and diagnostics. Recent enhancements include an automated backup and
recovery feature that not only replicates data, but preserves all the metadata to ensure that data
remains in sync even after restoration. Cloudera Navigator, another recently added capability,
addresses data lineage by tracking the origin and use of data, and by selectively enforcing access
to specific sets of data.
Cloudera is also making Hadoop more accessible to the large professional skills base of SQL
developers. Having long partnered with leading ETL, BI, and data warehousing platform and tool
providers to provide connectivity between Hadoop and relational platforms, Cloudera has taken the
next step with Impala, which supports interactive SQL queries through a high-performance, parallel
processing framework that works against any Hadoop file format. Impala is intended to
supplement, not replace, the enterprise data warehouse, providing an interface that can be
utilized not only by SQL developers, but also by familiar SQL-based query and BI tools from
providers such as Tableau, QlikView, and MicroStrategy.
Cloudera is also working on other initiatives designed to make Hadoop more versatile and accessible.
Cloudera Search optimizes Apache Solr for the Hadoop platform, enabling users to query Hadoop
data through a Google-like search experience. Additionally, Cloudera's support of the Apache Spark project will provide a complementary in-memory programming model for analytics.
RECOMMENDATIONS FOR ENTERPRISES
Big Data and Hadoop should be evolutionary moves for expanding the scope of analytics.
Ultimately, Ovum believes that most enterprises will implement Big Data analytics as part of an
analytics ecosystem where queries are directed at the right data sets, on the right platform, at the
right time, based on parameters such as cost, priority, required service levels, and location of the
data. Such federated analytics will provide enterprises the flexibility they need -- and they are only
possible if Hadoop is integrated with the rest of the analytic data platform environment.
When evaluating Hadoop platforms, examine the vendor's roadmap for supporting data integration,
along with the core management, security, and data management capabilities that are deemed
essential for any data warehousing platform. Admittedly, Hadoop is a rapidly evolving target;
while the platform may not currently have parity with established relational data warehousing
systems, new capabilities are emerging rapidly from open source and vendor-specific technologies
and innovations.
Nonetheless, as the natural path for most organizations is to pilot, it is not essential that all
capabilities be available on day one. However, in the long run, your enterprise should plan on
Hadoop as an addition that will function inside your data center. Adopting an "embrace and
extend" strategy, your Hadoop implementation should be compliant with your existing policies
regarding data access, security, data quality, and lifecycle management; but at the same time,
those policies and practices will have to be extended because of the unique characteristics (and
benefits) of managing Big Data.
APPENDIX
Author
Tony Baer, Principal Analyst, Ovum IT Information Management
tony.baer@https://www.wendangku.net/doc/e117441773.html,
Ovum Consulting
We hope that this analysis will help you make informed and imaginative business decisions. If you have further requirements, Ovum's consulting team may be able to help you. For more information about Ovum's consulting capabilities, please contact us directly at consulting@https://www.wendangku.net/doc/e117441773.html.
Disclaimer
All Rights Reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form by any means, electronic, mechanical, photocopying, recording, or otherwise, without the
prior permission of the publisher, Ovum (an Informa business).
The facts of this report are believed to be correct at the time of publication but cannot be
guaranteed. Please note that the findings, conclusions, and recommendations that Ovum delivers
will be based on information gathered in good faith from both primary and secondary sources,
whose accuracy we are not always in a position to guarantee. As such Ovum can accept no
liability whatever for actions taken based on any information that may subsequently prove to be
incorrect.