Wednesday, June 14, 2017

How to Design a Successful Data Lake

Today, business leaders are continuously envisioning new and innovative ways to use data for operational reporting and advanced data analytics. Data Lake is a next-generation data storage and management solution, was developed to meet the ever increasing demands of business & data analytics.

In this article I will explore some of the existing challenges with the traditional enterprise data warehouse and other existing data management and analytic solutions. I will describe the necessary features of the Data Lake architecture and the capabilities required to leverage a Data and Analytics as a Service (DAaaS) model, characteristics of a successful Data Lake implementation and critical considerations for designing a Data Lake.

Current challenges with Enterprise Data Warehouse 

Business leaders are continuously demanding new and innovative ways to use data analysis to gain competitive advantages.

With the development of new data storage and data analytic tools, the traditional enterprise data warehouse solutions have become inadequate and are impeding maximum usage of data analytics and even prevent users from maximizing their analytic capabilities.

Traditional data warehouse tools has the following shortcomings:

Introducing new data types and content to an existing data warehouse is usually a time consuming and cumbersome process.

When users want quick access to data,  processing delays can be frustrating and cause users to ignore/stop using data warehouse tools, and instead develop an alternate ad-hoc systems  which costs more, waste valuable resources and bypasses proper security systems.

If users do not know the origin or source of data  - currently stored in the data warehouse, users view such data with suspicion and may not trust the data. Current data warehousing solutions often store processed data - in which source information is often lost.

Historical data often have some parts missing or inaccurate, the source of data is usually not captured. All this leads to situations where analysis results provide wrong or conflicting results.

Today's on-demand world needs data to be accessed on-demand and results available in near real time. If users are not able to access this data in time, they lose the ability to analyze the data and derive critical insights when needed.

Traditional data warehouses "pull" data from different sources - based on a pre-defined business needs. This implies that users will have to wait till the data is brought into the data warehouse. This seriously impacts the on-demand capability of business data analysis.

In the world of Google, users demand a rapid and easy search for all their enterprise data. Many of the traditional data warehousing solutions - do not support an easy search tools. Thus customers cannot find the required data and it limits users' ability to make best use of data warehouses for rapid on-demand data analysis.

Today's Need

Modern data analytics - be it Big Data or BI or BW require:

  1. Support multiple types (structured/unstructured) of data to be stored in its raw form - along with source details.
  2. Allow rapid ingestion of data - to support real time or near real time analysis
  3. Handle & manage very large data sets - both in terms of data streams and data sizes.
  4. Allow multiple users to search, access and use this data simultaneously from a well known secure place.

Looking at all the demands of modern business, the solution that fits all of the above criteria is the Data lake.

What is a Data Lake? 

A Data Lake is a data storing solution featuring a scalable data stores - to store vast amounts of data in various formats. Data from multiple sources: Databases, Web server logs, Point-of-sale devices, IoT sensors, ERP/business systems, Social media, third party information sources etc are all collected, curated into this data lake via an ingestion process. Data can flow into the Data Lake by either batch processing or real-time processing of streaming data.

Data lake holds both raw & processed data along with all the metadata and lineage of the data which is available in a common searchable data catalog. Data is no longer restrained by initial schema decisions, and can be used more freely across the enterprise.

Data Lake is an architected data solution - on which all the common compliance & security policies also applied.

Businesses can now use this data on demand to provide Data and Analytics as a Service  (DAaaS) model to various consumers. ( Business users, data scientists, business analysts)

Note: Data Lakes are often built around a strong scalable, globally distributed storage systems. Please refer my other articles regarding storage for Data lake

Data Lake: Storage for Hadoop & Big Data Analytics

Understanding Data in Big Data

Uses of Data Lake

Data Lake is the place were raw data is ingested, curated and used for modification via ETL tools. Existing data warehouse tools can use this data for analysis along with newer big data, AI tools.

Once a data lake is created, users can use a wide range of analytics tools of their choice to develop reports, develop insights and act on it. The data lake holds both raw data & transformed data along with all the metadata associated with the data.

DAaaS model enables users to self-serve their data and analytic needs. Users browse the data lake's catalog to find and select the available data and fill a metaphorical "shopping cart" with data to work with.

Broadly speaking, there are six main uses of data lake:

  1. Discover: Automatically and incrementally "fingerprint" data at scale by analyzing source data.
  2. Organize: Use machine learning to automatically tag and match data fingerprints to glossary terms. Match the unmatched terms through crowd sourcing
  3. Curate: Human review accepts or rejects tags and automates data access control via tag based security
  4. Search: Search for data through the Waterline GUI or through integration via 3rd party applications
  5. Rate: Use objective profiling information along with subjective crowdsourced input to rate data quality
  6. Collaborate: Crowdsource annotations and ratings to collaborate and share "tribal knowledge" about your data

Characteristics of a Successful Data Lake Implementation

Data Lake enables users to analyze the full variety and volume of data stored in the lake. This necessitates features and functionalities to secure and curate the data, and then to run analytics, visualization, and reporting on it. The characteristics of a successful Data Lake include:

  1. Use of multiple tools and products. Extracting maximum value out of the Data Lake requires customized management and integration that are currently unavailable from any single open-source platform or commercial product vendor. The cross-engine integration necessary for a successful Data Lake requires multiple technology stacks that natively support structured, semi-structured, and unstructured data types.
  2. Domain specification. The Data Lake must be tailored to the specific industry. A Data Lake customized for biomedical research would be significantly different from one tailored to financial services. The Data Lake requires a business-aware data-locating capability that enables business users to find, explore, understand, and trust the data. This search capability needs to provide an intuitive means for navigation, including key word, faceted, and graphical search. Under the covers, such a capability requires sophisticated business processes, within which business terminology can be mapped to the physical data. The tools used should enable independence from IT so that business users can obtain the data they need when they need it and can analyze it as necessary, without IT intervention.
  3. Automated metadata management. The Data Lake concept relies on capturing a robust set of attributes for every piece of content within the lake. Attributes like data lineage, data quality, and usage history are vital to usability. Maintaining this metadata requires a highly-automated metadata extraction, capture, and tracking facility. Without a high-degree of automated and mandatory metadata management, a Data Lake will rapidly become a Data Swamp.
  4. Configurable ingestion workflows. In a thriving Data Lake, new sources of external information will be continually discovered by business users. These new sources need to be rapidly on-boarded to avoid frustration and to realize immediate opportunities. A configuration-driven, ingestion workflow mechanism can provide a high level of reuse, enabling easy, secure, and trackable content ingestion from new sources.
  5. Integrate with the existing environment. The Data Lake needs to meld into and support the existing enterprise data management paradigms, tools, and methods. It needs a supervisor that integrates and manages, when required, existing data management tools, such as data profiling, data mastering and cleansing, and data masking technologies.

Keeping all of these elements in mind is critical for the design of a successful Data Lake.

Designing the Data Lake

Designing a successful Data Lake is an intensive endeavor, requiring a comprehensive understanding of the technical requirements and the business acumen to fully customize and integrate the architecture for the organization's specific needs. Data Scientists and Engineers provide the expertise necessary to evolve the Data Lake to a successful Data and Analytics as a Service solution, including:

DAaaS Strategy Service Definition. Data users can leverage define the catalog of services to be provided by the DAaaS platform, including data onboarding, data cleansing, data transformation, data catalogs, analytic tool libraries, and others.

DAaaS Architecture. Datalake help data users create a right DAaaS architecture, including architecting the environment, selecting components, defining engineering processes, and designing user interfaces.

DAaaS PoC. Rapidly design and execute Proofs-of-Concept (PoC) to demonstrate the viability of the DAaaS approach. Key capabilities of the DAaaS platform are built/demonstrated using leading-edge bases and other selected tools.

DAaaS Operating Model Design and Rollout. Customize DAaaS operating models to meet the individual business users' processes, organizational structure, rules, and governance. This includes establishing DAaaS chargeback models, consumption tracking, and reporting mechanisms.

DAaaS Platform Capability Build-Out. Provide an iterative build-out of all data analytics platform capabilities, including design, development and integration, testing, data loading, metadata and catalog population, and rollout.

Closing Thoughts  

Data Lake can be an effective data management solution for advanced analytics experts and business users alike. A Data Lake allows users to analyze a large variety and volume when and how they want. DAaaS model provides users with on-demand, self-serve data for all their analysis needs

However, to be successful, a Data Lake needs to leverage a multitude of products while being tailored to the industry and providing users with extensive, scalable customization- In short, it takes a blend of technical expertise and business acumen to help organizations design and implement their perfect Data Lake. 

1 comment:

Mahalya sree said...
This comment has been removed by a blog administrator.