Saturday, February 28, 2015

Data Lake: Storage for Hadoop & Big Data Analytics

Companies have been tackling data challenges at scale for many years now. The vast majority of data produced within the enterprise comes from ERP, CRM, and other systems supporting a given enterprise function. This data was stored in an Enterprise Data Warehouse (EDW) for Business Intelligence analytics. Naturally, many organizations tried to use their existing data warehouses as the model for Big Data analytics.

As companies adopt big data analytics in a big way, existing enterprise data warehouse (EDW) systems built on Network File System (NFS) storage are not scaling up.

Big data analytics on primary storage was good enough for proofs of concept, but for production workloads, traditional EDW systems on NFS arrays are too expensive and inadequate.

Trying to use an EDW for big data makes no sense!

Storing Big Data on a Data Lake

Apache Hadoop includes a distributed file system, HDFS, with the built-in availability, replication, and protection mechanisms needed to store huge amounts of data safely and, above all, inexpensively. HDFS storage can be created simply by adding disks to cluster nodes, and all the analytics and data management tools run on those same servers.

Despite its advantage of being local to the cluster, HDFS has its challenges. Large chunks of data must be moved from various systems into the cluster. Data in an HDFS cluster also poses a security risk, and there is an inherent data loss problem with HDFS. Companies are also bound by various regulations and rules to protect and retain data. All this increases the total cost of acquisition and the total cost of big data analytics.

An enterprise data lake, such as EMC Data Lake Foundation, complements the existing EDW and provides additional core benefits:

New efficiencies

EMC Data Lake Foundation increases the efficiency of the data architecture. ECS provides lower-cost storage than a traditional Hadoop cluster, while Isilon optimizes data processing workloads such as data transformation and integration.

New opportunities 

With EMC Data Lake Foundation, all data for analytics can be accessed without having to copy it over to a Hadoop cluster. This allows flexible 'schema-on-read' access to all enterprise data, and multi-use, multi-workload data processing on the same sets of data, from batch to real-time analytics.
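To make 'schema-on-read' concrete, here is a minimal Python sketch. It is not EMC's or Hadoop's implementation. The sample events, the field names, and the read_with_schema helper are all hypothetical, chosen only to show that one raw data set can be read through different schemas without being rewritten.

```python
import json

# Raw events are stored exactly as they arrived; no schema is enforced on write.
# (Hypothetical sample data for illustration only.)
RAW_EVENTS = [
    '{"ts": "2015-02-01T10:00:00", "user": "u1", "amount": "19.99", "region": "EU"}',
    '{"ts": "2015-02-01T10:05:00", "user": "u2", "amount": "5.00"}',
    '{"ts": "2015-02-02T09:30:00", "user": "u1", "amount": "7.50", "region": "US"}',
]

def read_with_schema(lines, schema):
    """Apply a schema while reading. schema maps field -> (converter, default)."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }

# Two consumers project the same raw data through different schemas.
SALES_SCHEMA = {"user": (str, None), "amount": (float, 0.0)}
GEO_SCHEMA = {"region": (str, "UNKNOWN"), "day": (str, None), "ts": (lambda s: s[:10], None)}

sales = list(read_with_schema(RAW_EVENTS, SALES_SCHEMA))
geo = list(read_with_schema(RAW_EVENTS, GEO_SCHEMA))

print(sales[0])
print(geo[1])
```

The raw lines are never modified. A field that a record lacks, or that was never in the data at all (such as "day" here), simply takes the schema's default at read time.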

EMC Data Lake Foundation is designed to support Apache Hadoop: HDFS for data storage & Hadoop YARN.

With ECS, HDFS data is stored on commodity drives, providing scalable and reliable data storage that is designed to span multiple data centers and even continents. As a result, companies can build stable, reliable, and highly scalable data lakes that span multiple data centers.

Apache Hadoop YARN. YARN provides a pluggable architecture and resource management for data processing engines to interact with data stored in HDFS.

Hadoop YARN allows multi-use, multi-workload data processing. By supporting multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set, Hadoop enables analysts to transform and view data in multiple ways (across various schemas) and to obtain closed-loop analytics, bringing time-to-insight closer to real time than ever before.

Enterprise Scale Data Lake

EMC Data Lake Foundation is designed to provide enterprise-class data management while keeping costs low and without disrupting existing EDW workflows. ECS is designed to use commodity drives, which allows a dramatically lower overall cost of storage, particularly when compared to standard enterprise NAS systems. Scale-out commodity storage with ECS provides a compelling alternative to Hadoop storage in commodity servers. ECS allows users to scale out their storage as their data needs grow, and completely decouples the growth of compute from storage. This cost dynamic makes it possible to store, process, analyze, and access more data than ever before.
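The effect of decoupling compute from storage can be illustrated with a rough sizing sketch in Python. All numbers and both function names below are hypothetical and do not describe any particular EMC or Hadoop configuration. The sketch only contrasts a cluster where every node carries both disks and CPUs with a design where the two tiers grow independently.

```python
import math

def coupled_nodes(data_tb, cores_needed, tb_per_node, cores_per_node):
    """Nodes needed when every node brings both disks and CPUs."""
    by_storage = math.ceil(data_tb / tb_per_node)
    by_compute = math.ceil(cores_needed / cores_per_node)
    return max(by_storage, by_compute)

def decoupled_nodes(data_tb, cores_needed, tb_per_storage_node, cores_per_compute_node):
    """(storage nodes, compute nodes) when each tier is sized on its own."""
    return (
        math.ceil(data_tb / tb_per_storage_node),
        math.ceil(cores_needed / cores_per_compute_node),
    )

# Hypothetical scenario: data grows from 100 TB to 400 TB, compute demand stays at 160 cores.
before = coupled_nodes(100, 160, tb_per_node=20, cores_per_node=16)
after = coupled_nodes(400, 160, tb_per_node=20, cores_per_node=16)
print(before, after)  # the coupled cluster doubles, adding cores nobody asked for

print(decoupled_nodes(100, 160, 20, 16))
print(decoupled_nodes(400, 160, 20, 16))  # only the storage tier grows
```

In the coupled case the node count is driven by whichever resource is scarcer, so storage growth also buys unused compute. In the decoupled case the compute tier stays the same size.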

EMC Data Lake Foundation is designed to cost less than a traditional Hadoop cluster.

EMC Data Lake Foundation would augment existing ETL systems with Hadoop. For example, with traditional EDW-BI applications, companies would store only one year of raw data and keep the BI reports on NAS. With Hadoop, it is possible to store 10 years of raw data plus all the BI ETL results. This allows companies to keep all source data and ETL results for future analytics, and results in much richer applications with far greater historical context.
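A back-of-the-envelope sketch in Python shows what moving from one year to 10 years of retention means for capacity. The yearly volume and the ETL ratio are made-up inputs, and capacity_tb is a hypothetical helper. The copies parameter models storage-level redundancy; a value of 3 mirrors the default HDFS replication factor.

```python
def capacity_tb(raw_tb_per_year, years, etl_ratio=0.0, copies=1):
    """Physical capacity for `years` of raw data plus ETL results.

    etl_ratio: size of retained ETL results as a fraction of the raw data.
    copies:    number of physical copies kept by the storage layer.
    """
    logical_tb = raw_tb_per_year * years * (1 + etl_ratio)
    return logical_tb * copies

# Hypothetical inputs: 20 TB of raw data per year, ETL results at 25% of raw size.
one_year_raw_only = capacity_tb(20, 1)
ten_years_with_etl = capacity_tb(20, 10, etl_ratio=0.25)
ten_years_replicated = capacity_tb(20, 10, etl_ratio=0.25, copies=3)

print(one_year_raw_only, ten_years_with_etl, ten_years_replicated)
```

Even with modest yearly volumes, long retention multiplied by the redundancy factor dominates the bill, which is why the cost per stored terabyte matters so much here.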

Benefits of Unified Enterprise Data with EMC Data Lake Foundation

  • Store & Process all corporate data
  • Access all data simultaneously in multiple ways: Batch, Interactive, Real-Time
  • Automate all data management based on Policy
  • Provide enterprise-grade security for all data: access control, authentication, data protection
  • Use existing data management & security tools to manage Hadoop data
  • Enable both existing & new analytics applications to provide value to the organization
  • Provide a geographically distributed, scale-out data lake with a single plane of management
  • Provide a choice of data storage systems including traditional SAN array, NAS array, scale-out NAS, Object Storage and Hadoop
  • Run data management operations efficiently: provision, manage, monitor & operate big data at a global scale
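As an illustration of policy-based data management across a choice of storage systems, the Python sketch below places data on a tier according to its age. The policy table, the age thresholds, and the tier_for function are hypothetical and are not part of any EMC product API. The tier names are borrowed from the list above.

```python
from datetime import date

# Hypothetical policy: (maximum age in days, storage tier), checked in order.
TIER_POLICY = [
    (30, "SAN array"),
    (365, "scale-out NAS"),
]
DEFAULT_TIER = "object storage"  # anything older than the last threshold

def tier_for(created, today, policy=TIER_POLICY, default=DEFAULT_TIER):
    """Return the storage tier a data set belongs on, based on its age."""
    age_days = (today - created).days
    for max_age_days, tier in policy:
        if age_days <= max_age_days:
            return tier
    return default

today = date(2015, 2, 28)
for created in (date(2015, 2, 1), date(2014, 6, 1), date(2010, 1, 1)):
    print(created, "->", tier_for(created, today))
```

Because the rules live in a table rather than in code paths, changing the policy means editing data, not logic.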

In a nutshell, EMC Data Lake Foundation allows companies to:

  1. Collect everything: Store all data, both raw data and results, for extended periods of time.
  2. Dive in anywhere: Enable users across multiple business units to refine, explore, and enrich data on their terms.
  3. Flexible access: Access data in multiple ways across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.

The result: EMC Data Lake Foundation delivers maximum scale and insight with the lowest possible friction and cost.

