Currently, I am planning a system to handle streaming data. Today, in any data-driven organization, there are many new streaming data sources: mobile apps, IoT devices, sensors, IP cameras, websites, point-of-sale devices, and more.
Before designing such a system, we first need to understand the attributes necessary to implement an integrated streaming data platform.
As a data-driven organization, the first core consideration is understanding what it takes to acquire all the streaming data: the variety, velocity, and volume of the data. Once these three parameters are known, it is time to plan and design an integrated streaming platform that allows for both the acquisition of streaming data and the analytics that make streaming applications valuable.
For the system design, there are five core attributes that need to be considered:
1. System Latency
2. System Scalability
3. System Diversity
4. System Durability
5. Centralized Data Management
System Latency
Streaming data platforms need to match the pace of the incoming data from the various sources that make up the stream. One of the keys to a streaming data platform is the ability to match the speed of data acquisition with the speed of data ingestion into the data lake.
This implies designing the data pipelines required to transfer data into the data lake and running the necessary processes to parse and prepare the data for analytics on Hadoop clusters. Data quality is a key parameter here: it takes a definite amount of time and compute resources to sanitize the data and avoid "garbage in, garbage out" situations.
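A minimal sketch of such a sanitization step might look like the following; the field names and validation rules here are illustrative assumptions, not part of any particular pipeline.

```python
# Illustrative sanitization step: drop malformed records before they
# reach the data lake, to avoid "garbage in, garbage out".
# The field names and rules below are assumptions for this sketch.

def sanitize(records):
    """Yield only records that parse cleanly and pass basic checks."""
    for rec in records:
        try:
            reading = float(rec["value"])       # must parse as a number
        except (KeyError, TypeError, ValueError):
            continue                            # discard garbage record
        if not rec.get("sensor_id"):            # must identify its source
            continue
        yield {"sensor_id": rec["sensor_id"], "value": reading}

raw = [
    {"sensor_id": "s-101", "value": "42.5"},
    {"sensor_id": "s-102", "value": "not-a-number"},   # dropped
    {"value": "7.0"},                                  # dropped: no source
]
clean = list(sanitize(raw))
```

In a real pipeline this logic would run as part of the ingestion stage, and the time and compute it consumes is exactly the latency cost discussed above.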
Data security and authenticity have to be established before data gets ingested into the data lake. As data is collected from disparate sources, a basic level of data authentication and validation must be carried out so the core data lake is not corrupted. For example, in the case of real-time traffic analysis for Bangalore, one needs to check that the data is indeed coming from sensors in Bangalore; otherwise the analysis will be inaccurate.
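One simple form such a check could take for the Bangalore example is a geographic plausibility test: accept a reading only if its coordinates fall inside a rough bounding box around the city. The box values and record shape below are assumptions for illustration.

```python
# Illustrative authenticity check for the Bangalore traffic example:
# accept a reading only if its coordinates fall inside a rough
# bounding box around Bangalore. Box values are approximations.
BLR_LAT = (12.8, 13.2)
BLR_LON = (77.4, 77.8)

def from_bangalore(reading):
    """Return True if the reading's coordinates lie inside the box."""
    lat, lon = reading["lat"], reading["lon"]
    return BLR_LAT[0] <= lat <= BLR_LAT[1] and BLR_LON[0] <= lon <= BLR_LON[1]

readings = [
    {"sensor": "blr-17", "lat": 12.97, "lon": 77.59},   # central Bangalore
    {"sensor": "unknown", "lat": 19.07, "lon": 72.88},  # Mumbai: rejected
]
valid = [r for r in readings if from_bangalore(r)]
```

A production system would combine such checks with cryptographic device authentication, but even this coarse filter keeps obviously misattributed data out of the lake.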
Data security is one of the most critical components to consider when designing a data lake. The subject of security can be a lengthy one and has to be customized for each use case. There are many open-source tools available that address data governance and data security: Apache Atlas (the Data Governance Initiative by Hortonworks), Apache Falcon (automates data pipelines and data replication for recovery and other emergencies), Apache Knox Gateway (provides edge protection), and Apache Ranger (an authentication and authorization platform for Hadoop), all provided in the Hortonworks Data Platform, as well as Cloudera Navigator, provided in the Cloudera enterprise edition of the Cloudera Data Hub.
Depending on the data producers, data ingestion can be "pushed" or "pulled". The choice of push or pull defines the choice of tools and strategies for data ingestion. If there is a need for real-time analytics, the system requirements need to be taken into consideration: the streaming data has to be fed into the compute farm without delays. This implies planning out network capacities, compute memory sizing, and compute capacity.
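The push/pull distinction can be sketched with an in-memory queue standing in for the transport layer; a real system would use a message broker or collection agent, but the control flow, and who initiates the transfer, is the same idea.

```python
# Sketch contrasting "push" and "pull" ingestion using an in-memory
# queue as a stand-in for the transport layer. All names here are
# illustrative; real systems would use a broker or collection agent.
import queue

buffer = queue.Queue()

def producer_push(events):
    """Push: the data producer initiates the transfer."""
    for e in events:
        buffer.put(e)            # producer writes as data arrives

def ingest_pull(max_batch):
    """Pull: the ingestion side initiates, polling at its own pace."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break                # nothing left to pull right now
    return batch

producer_push(["evt-1", "evt-2", "evt-3"])
batch = ingest_pull(max_batch=10)
```

With push, latency is governed by how fast the platform can absorb writes; with pull, it is governed by the polling interval, which is why the choice shapes the tooling and capacity planning.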
In the case of on-demand or near-real-time analytics, there will be additional latency while the data lands in the data warehouse or gets ingested into the data lake. Systems are then needed to feed the ingested data to a BI system or a Hadoop cluster for analysis. If there are location-based dependencies, one needs to build distributed Hadoop clusters.
System Scalability
The size of the data streaming from devices is not constant. As more data collection devices are added, or when existing devices are upgraded, the size of the incoming data stream increases. For example, in the case of IP cameras, the data stream size will increase when the number of cameras grows, when the cameras are upgraded to capture higher-resolution images, or when more data is collected, in the form of infrared images, audio, and so on.
Streaming data platforms need to be able to match the projected growth of data sizes. This means that streaming data platforms will need to be able to stream data from a larger number of sources and/or handle bigger data sizes. As data sizes change, all the connected infrastructure must be capable of scaling up (or scaling out) to meet the new demands.
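A back-of-the-envelope projection makes this concrete for the IP-camera example; all the numbers below are illustrative assumptions, not vendor figures.

```python
# Back-of-the-envelope capacity projection for the IP-camera example:
# estimate the aggregate stream size as cameras are added or upgraded.
# Camera counts and per-camera bitrates are illustrative assumptions.

def stream_mbps(cameras, mbps_per_camera):
    """Aggregate stream bandwidth in Mbps."""
    return cameras * mbps_per_camera

today = stream_mbps(cameras=200, mbps_per_camera=2.0)      # 400 Mbps
# Doubling the fleet and upgrading to higher-resolution streams:
projected = stream_mbps(cameras=400, mbps_per_camera=6.0)  # 2400 Mbps
growth = projected / today                                  # 6x
```

Even this crude arithmetic shows why upgrades multiply, rather than add, to the load: doubling the fleet while tripling per-camera bitrate yields a sixfold increase, and network, ingestion, and storage must all be sized for it.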
System Diversity
The system must be designed to handle a diverse set of data sources. Streaming data platforms will need to support not just "new era" data sources such as mobile devices, cloud sources, and the Internet of Things. They will also be required to support "legacy" platforms such as relational databases, data warehouses, and operational applications like ERP, CRM, and SCM. These are the platforms with the information needed to place streaming devices, mobile apps, and browser-click data into context and provide value-added insights.
System Durability
Once the data is captured by the system and registered in the data lake, the value of the historical data depends on the value of historical analysis. Many of the data sets at the source could have changed, and the data needs to be constantly updated (or purged) for the analysis to remain meaningful. This refresh must be policy- or rule-based.
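One way such a policy could be expressed is as per-dataset rules evaluated against the data's age; the rule names and thresholds below are assumptions for illustration.

```python
# Sketch of a policy/rule-based refresh: each data set carries rules
# that decide whether it is refreshed, kept, or purged based on age.
# Data set names and thresholds are illustrative assumptions.
from datetime import date

def apply_retention(datasets, today):
    """Map each data set name to the action its policy dictates."""
    actions = {}
    for name, meta in datasets.items():
        age_days = (today - meta["last_updated"]).days
        if age_days > meta["purge_after_days"]:
            actions[name] = "purge"          # too old to keep at all
        elif age_days > meta["refresh_after_days"]:
            actions[name] = "refresh"        # stale; pull fresh data
        else:
            actions[name] = "keep"           # still current
    return actions

datasets = {
    "traffic_raw": {"last_updated": date(2024, 1, 1),
                    "refresh_after_days": 7, "purge_after_days": 365},
    "traffic_agg": {"last_updated": date(2024, 6, 1),
                    "refresh_after_days": 30, "purge_after_days": 730},
}
plan = apply_retention(datasets, today=date(2024, 6, 10))
```

Keeping the thresholds in metadata rather than in code is what makes the refresh policy-driven: operators can tune retention per data set without touching the pipeline.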
Centralized Data Management
One of the core tenets of a streaming data platform is to make the entire streaming system easy to monitor and manage. This makes the system easier to maintain and sustain. Using a centralized architecture, streaming data platforms can not only reduce the number of potential connections between streaming data sources and streaming data destinations, but also provide a centralized repository of technical and business metadata to enable common data formats and transformations.
A data streaming system can easily contain hundreds of nodes, which makes it important to use tools that monitor, control, and manage the data lake. Currently, there are two main options for this: Cloudera Manager and Hortonworks' open-source tool Ambari. With such tools, you can easily grow or shrink your data lake while finding and addressing issues such as bad hard drives, node failures, and stopped services. Such tools also make it easy to add new services, ensure compatible tool versions, and upgrade the data lake.
For a complete list of streaming data tools see: https://hadoopecosystemtable.github.io/
Closing Thoughts
Designing a streaming data platform for data analysis is not easy. But with these five core attributes as the foundation of the design, one can ensure the streaming data platform is built on a robust footing and becomes a complete solution that meets the needs of the data-driven organization.