Tuesday, August 16, 2016

Five core attributes of a streaming data platform

Currently, I am planning a system to handle streaming data. Today, in any data-driven organization, there are several new streaming data sources such as mobile apps, IoT devices, sensors, IP cameras, websites, point-of-sale devices, etc.

Before designing such a system, we first need to understand the attributes necessary to implement an integrated streaming data platform.

As a data-driven organization, the first core consideration is to understand what it takes to acquire all the streaming data: the variety, velocity and volume of data. Once these three parameters are known, it is time to plan and design an integrated streaming platform that allows for both the acquisition of streaming data and the analytics that make up streaming applications.

For the system design, there are five core attributes that need to be considered:

1. System Latency
2. System Scalability
3. System Diversity
4. System Durability
5. Centralized Data Management

System Latency

Streaming data platforms need to match the pace of the incoming data from the various sources that make up the stream. One of the keys to a streaming data platform is the ability to match the speed of data acquisition with the speed of data ingestion into the data lake.

This implies designing the data pipelines required to transfer data into the data lake and running the necessary processes to parse and prepare the data for analytics on Hadoop clusters. Data quality is a key parameter here: it takes a definite amount of time and compute resources for data to be sanitized, in order to avoid "garbage in, garbage out" situations.

Data security and authenticity have to be established before data gets ingested into the data lake. As data is collected from disparate sources, a basic level of data authentication and validation must be carried out so that the core data lake is not corrupted. For example, in the case of real-time traffic analysis for Bangalore, one needs to check whether the data is indeed coming from sensors in Bangalore; otherwise the analysis will not be accurate.
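
To illustrate this kind of gate-keeping, here is a minimal Python sketch that checks each incoming reading against a registered-device list and a rough Bangalore bounding box before accepting it. The field names, device registry and coordinate bounds are hypothetical, not from any particular product.

    import json

    # Hypothetical registry of sensor IDs known to be deployed in Bangalore
    REGISTERED_SENSORS = {"blr-cam-001", "blr-loop-017", "blr-gps-104"}

    # Rough bounding box around Bangalore (illustrative values)
    LAT_RANGE = (12.80, 13.20)
    LON_RANGE = (77.40, 77.80)

    def is_valid_reading(raw_message: str) -> bool:
        """Basic authentication/validation before a record enters the data lake."""
        try:
            record = json.loads(raw_message)
        except ValueError:
            return False                 # malformed payload: reject
        if record.get("sensor_id") not in REGISTERED_SENSORS:
            return False                 # unknown device: reject
        lat, lon = record.get("lat"), record.get("lon")
        if lat is None or lon is None:
            return False                 # missing location: reject
        return LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LON_RANGE[0] <= lon <= LON_RANGE[1]

    # A reading claiming to come from an unregistered device is dropped
    print(is_valid_reading('{"sensor_id": "blr-cam-001", "lat": 12.97, "lon": 77.59}'))  # True
    print(is_valid_reading('{"sensor_id": "unknown-99", "lat": 12.97, "lon": 77.59}'))   # False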

Data security is one of the most critical components to consider when designing a data lake. The subject of security can be a lengthy one and has to be customized for each use case. There are many open source tools available that address data governance and data security: Apache Atlas (the Data Governance Initiative by Hortonworks), Apache Falcon (automates data pipelines and data replication for recovery and other emergencies), Apache Knox Gateway (provides edge protection) and Apache Ranger (an authentication and authorization platform for Hadoop), all provided in the Hortonworks Data Platform, and Cloudera Navigator (Cloudera enterprise edition), provided in the Cloudera Data Hub.

Depending on the data producers, data ingestion can be "pushed" or "pulled", and the choice of pull or push defines the choice of tools and strategies for data ingestion. If there is a need for real-time analytics, the corresponding system requirements need to be taken into consideration: the streaming data has to be fed into the compute farm without delays, which implies planning out network capacity, compute memory sizing and compute capacity.
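
To make the pull model concrete, here is a minimal Python sketch of a poller that pulls batches from a producer's HTTP endpoint at a fixed interval; the endpoint URL and response shape are purely illustrative assumptions. In a push model, the roles are reversed: the producer sends data to an endpoint or message broker that the platform exposes.

    import time
    import requests

    # Hypothetical producer endpoint that returns a JSON array of new readings
    PRODUCER_URL = "http://sensor-gateway.example.com/readings"
    POLL_INTERVAL_SECONDS = 5

    def pull_forever(handle_batch):
        """Pull-based ingestion: the platform asks the producer for new data."""
        while True:
            try:
                response = requests.get(PRODUCER_URL, timeout=10)
                response.raise_for_status()
                handle_batch(response.json())   # hand the batch to the ingestion pipeline
            except requests.RequestException as err:
                print(f"poll failed, will retry: {err}")
            time.sleep(POLL_INTERVAL_SECONDS)

    # Example: pull_forever(lambda batch: print(f"got {len(batch)} readings"))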

In the case of on-demand or near real-time analytics, there will be additional latency while the data lands in the data warehouse or gets ingested into the data lake. The system then needs to feed the ingested data to a BI system or a Hadoop cluster for analysis. If there are location-based dependencies, one needs to build distributed Hadoop clusters.

System Scalability

The size of data streaming from devices is not constant. As more data collection devices are added, or when existing devices are upgraded, the size of the incoming data stream increases. For example, in the case of IP cameras, the data stream size will increase when the number of cameras increases, when the cameras are upgraded to capture higher-resolution images, or when more data is collected in the form of infrared images, audio, etc.

Streaming data platforms need to be able to match the projected growth of data sizes. This means that streaming data platforms will need to be able to stream data from a larger number of sources and/or larger data sizes. As data sizes change, all the connected infrastructure must be capable of scaling up (or scaling out) to meet the new demands.
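
A rough capacity calculation helps make this concrete. The short Python sketch below estimates aggregate ingest bandwidth and daily storage for a camera fleet as it grows and as resolution (and therefore bitrate) increases; the camera counts and bitrates are illustrative assumptions, not measurements.

    def ingest_requirements(num_cameras: int, bitrate_mbps: float) -> tuple:
        """Return (aggregate ingest in Mbps, storage per day in TB) for a camera fleet."""
        total_mbps = num_cameras * bitrate_mbps
        tb_per_day = total_mbps / 8 / 1e6 * 86_400   # Mbps -> MB/s -> TB over 24 hours
        return total_mbps, tb_per_day

    # Illustrative scenarios: today, after adding cameras, and after a resolution upgrade
    for cameras, mbps in [(200, 2.0), (500, 2.0), (500, 8.0)]:
        bw, storage = ingest_requirements(cameras, mbps)
        print(f"{cameras} cameras @ {mbps} Mbps -> {bw:,.0f} Mbps ingest, {storage:.1f} TB/day")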

System Diversity

The system must be designed to handle a diverse set of data sources. Streaming data platforms will need to support not just "new era" data sources such as mobile devices, cloud sources and the Internet of Things; they will also be required to support "legacy" platforms such as relational databases, data warehouses, and operational applications like ERP, CRM and SCM. These are the platforms with the information needed to place streaming devices, mobile apps and browser click information into context and provide value-added insights.

System Durability 

Once data is captured by the system and registered in the data lake, the value of that historical data depends on the value of historical analysis. Many of the data sets at the source could have changed, so data needs to be constantly updated (or purged) for the analysis to stay meaningful. This refresh must be policy- or rule-based.
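
A minimal sketch of such policy-driven refresh might look like the following, where each data set in the lake carries a rule that says how long it stays useful and whether it should be re-pulled from the source or purged. The policy table and data set names are hypothetical.

    from datetime import datetime, timedelta

    # Hypothetical retention/refresh policies per data set in the lake
    POLICIES = {
        "traffic_sensor_readings": {"max_age_days": 30,  "action": "refresh"},
        "pos_transactions":        {"max_age_days": 365, "action": "refresh"},
        "raw_camera_frames":       {"max_age_days": 7,   "action": "purge"},
    }

    def due_actions(last_updated: dict, now: datetime = None) -> list:
        """Return (data_set, action) pairs for data sets whose policy window has expired."""
        now = now or datetime.utcnow()
        actions = []
        for data_set, policy in POLICIES.items():
            age_limit = timedelta(days=policy["max_age_days"])
            if now - last_updated.get(data_set, datetime.min) > age_limit:
                actions.append((data_set, policy["action"]))
        return actions

    # Example run with made-up last-update timestamps
    print(due_actions({"traffic_sensor_readings": datetime(2016, 6, 1),
                       "pos_transactions": datetime(2016, 8, 1),
                       "raw_camera_frames": datetime(2016, 8, 14)}))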

Centralized Data Management 

One of the core tenets of a streaming data platform is to make the entire streaming system easy to monitor and manage, which makes the system easier to maintain and sustain. Using a centralized architecture, streaming data platforms can not only reduce the number of potential connections between streaming data sources and streaming data destinations, but also provide a centralized repository of technical and business metadata to enable common data formats and transformations.

A data streaming system can easily contain hundreds of nodes, which makes it important to use tools that monitor, control and manage the data lake. Currently, there are two main options for this: Cloudera Manager and Hortonworks' open source tool Ambari. With such tools, you can easily decrease or increase the size of your data lake while finding and addressing issues such as bad hard drives, node failures and stopped services. Such tools also make it easy to add new services, ensure compatible tool versions, and upgrade the data lake.
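
Both tools also expose REST APIs, so routine health checks can be scripted. The sketch below polls Ambari for host status from Python; the server address, cluster name and credentials are placeholders, and the endpoint and field names should be verified against the Ambari REST API documentation for your version.

    import requests

    AMBARI = "http://ambari.example.com:8080"     # placeholder Ambari server
    CLUSTER = "mycluster"                         # placeholder cluster name
    AUTH = ("admin", "admin")                     # replace with real credentials

    def unhealthy_hosts():
        """List hosts whose status Ambari does not report as HEALTHY."""
        url = f"{AMBARI}/api/v1/clusters/{CLUSTER}/hosts?fields=Hosts/host_status"
        response = requests.get(url, auth=AUTH, timeout=10)
        response.raise_for_status()
        return [item["Hosts"]["host_name"]
                for item in response.json().get("items", [])
                if item["Hosts"].get("host_status") != "HEALTHY"]

    # print(unhealthy_hosts())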

For a complete list of streaming data tools see: https://hadoopecosystemtable.github.io/ 

Closing Thoughts 

Designing a streaming data platform for data analysis is not easy. But with these five core attributes as the foundation of the design, one can ensure the streaming data platform is built as a robust and complete solution that will meet the needs of a data-driven organization.

Understanding the Business Benefits of Colocation

Digital transformation and the move towards private cloud are really shaking up the design and implementation of data centers. As companies start their journey to the cloud, they realize that having sets of dedicated servers for each application will not serve them well and that they need to change their data centers.

Historically, companies started out with a small server room to host the few servers that ran their business applications. The server room was located in their office space and it was a small setup. As business became more compute-centric, the small server room became unviable. This led to data centers, which were often still located in their office buildings.

These office buildings were not designed for hosting data centers and had to be modified to bring more air-conditioning, networking and power into the data center. The limitations of the existing buildings constrained efficient cooling and power management.

Now, as companies plan their move to private cloud, they are seeing huge benefits in having a dedicated data center: a purpose-built facility for hosting large numbers of computers, switches, storage and power systems. These purpose-built data centers have better power supply solutions and more efficient liquid cooling solutions, and, more importantly, offer a wide range of network connectivity and network services.

As a result, these dedicated data centers save money on IT operations and also provide greater reliability and resilience.

But not all companies need a data center large enough to benefit from economies of scale; very few enterprises really need such large, dedicated, purpose-built facilities. Hence a newer solution: data center colocation.

For CIOs, colocation provides the perfect win-win scenario, delivering both cost savings and state-of-the-art infrastructure. When comparing the capabilities of a standard server room to a colocated data center solution, the savings on power bills alone are often enough to justify the project.

These dedicated data centers are built at large scale in industrial zones, with dedicated power lines and backup power systems, so the cost of power is much lower than before. Moreover, they can employ newer and more efficient cooling systems that reduce the overall power consumption of the data center.
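
A back-of-the-envelope comparison shows why. The Python sketch below compares annual energy cost for the same IT load at a typical office server room PUE versus a purpose-built facility; all the numbers (load, PUE values, tariff) are illustrative assumptions, not quoted figures.

    def annual_energy_cost(it_load_kw: float, pue: float, price_per_kwh: float) -> float:
        """Total facility energy cost per year for a given IT load and PUE."""
        return it_load_kw * pue * 24 * 365 * price_per_kwh

    IT_LOAD_KW = 100          # illustrative IT load
    PRICE_PER_KWH = 0.10      # illustrative power tariff

    office_room = annual_energy_cost(IT_LOAD_KW, pue=1.9, price_per_kwh=PRICE_PER_KWH)
    purpose_built = annual_energy_cost(IT_LOAD_KW, pue=1.3, price_per_kwh=PRICE_PER_KWH)
    print(f"Office server room: ${office_room:,.0f}/year")
    print(f"Purpose-built DC:   ${purpose_built:,.0f}/year")
    print(f"Savings:            ${office_room - purpose_built:,.0f}/year")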

Business Benefits of Colocation

Apart from reductions in operational expenditure, there are several other benefits from colocation. Having a dedicated team available 24/7/365 to monitor and manage the IT infrastructure is a huge benefit.


  1. Cost Savings on Power & Taxes

    Dedicated data centers are built in locations that offer cheap power. Companies can also negotiate tax breaks for building in remote or industrial areas. In addition to a lower price of power, these data centers are designed with diverse power feeds and efficient power distribution paths. They have dual generator systems that can be refueled while in operation, on-site fuel reserves, and multiple UPS systems in place.

    In addition to power costs, dedicated data centers have engineers and technicians who monitor power and battery levels 24/7 so that the center can maintain 100% uptime.

    Additionally, data centers have the time, resources and impetus to continually invest in and research green technologies. This means that businesses can reduce their carbon footprint at their office locations and benefit from continual efficiency-saving research. Companies that move their servers out of in-house server rooms typically save 90 percent on their own carbon emissions.

  2. Network Connected Globally, Securely and Quickly

    Today, high-speed network connectivity is key to business, and it is a lot more difficult to get big, fat pipes of network connectivity into central office locations than into a centralized data center. A dedicated data center will have many network service providers offering connectivity, often at a lower price than at an office location.

    Dedicated data centers also provide resilient connectivity at a fairly low price. Delivering 100 Mbps of bandwidth might be hard at an office location, and trying to create a redundant solution there is often financially unviable. Data centers are connected to multiple transit providers and have large bandwidth pipes, meaning that businesses often get better service at lower cost.

    Colocation enables organizations to benefit from faster networking and cheaper network connections.

  3. Monitoring IT Infrastructure

    Building a dedicated data center makes it easier to monitor the health of the IT infrastructure. The economies of scale that come with colocation help to build a robust monitoring solution that covers the entire IT infrastructure and ensures SLAs are being met.

  4. Better Security

    A dedicated, colocated data center will have better physical security than a data center in an office location. The physical isolation of the data center enables the service provider to offer security measures that include biometric scanners, closed-circuit cameras, on-site security, coded access, alarm systems, ISO 27001-accredited processes, on-site security teams and more. With colocation, all these service costs are shared, bringing down costs while improving the level of security.

  5. Scalability

    Platform 3 paradigms such as digital transformation, IoT, Big Data, etc. are driving up the scale of IT infrastructure. As the demand for computing grows over time, the data center must be able to cope with it. With colocation, scale-up requirements can be negotiated ahead of time, and with just one call to the colocation provider the scale of the IT infrastructure can be increased as needed.

    Data centers and colocation providers can have businesses up and running within hours, and provide the flexibility to grow alongside the organization. Colocation space, power, bandwidth and connection speeds can all be increased when required.

    The complexity of rack space management, power management, etc. is outsourced to the colocation service provider.

  6. Environment Friendly & Green IT

    A large-scale data center has more incentive to run greener IT operations, as doing so lowers energy costs. Often these data centers are located in industrial areas where better cooling technologies can be safely deployed, which makes it possible to improve overall operational efficiency. A colocated data center typically adheres to global green standards such as ASHRAE 90.1-2013, ASHRAE 62.1, LEED certifications, etc.

    Bigger data centers also enable better IT e-waste management and recycling. Old computers, UPS batteries and other equipment can be safely and securely disposed of.

  7. Additional Services from Colocation Partners

    Colocation service providers may also host other cloud services such as:

    a. Elastic scalability to ramp up IT resources when there is seasonal demand and scale down when demand falls.

    b. Data backups and data archiving to tape, stored in a secure location.

    c. Disaster recovery across multiple data centers and data redundancy to protect against data loss in case of natural disasters.

    d. Network security and monitoring against malicious network attacks, usually in the form of a "Critical Incident Center." Critical Incident Centers are like Network Operations Centers, but they monitor data security and active network threats.

Closing Thoughts

In conclusion, a dedicated data center offers tremendous cost advantages. But if the company's IT scale does not warrant a dedicated data center, the best option is to move to a colocated data center. Colocation providers are able to meet business requirements at a lower cost than if the service were kept in-house.

A colocation solution provides companies with a variety of opportunities: exceptional SLAs and data secured off-site give organizations added levels of risk management, along with the chance to invest in better equipment and state-of-the-art servers.

Why Workload Automation is critical in Modern Data Centers


Today, CIOs demand that their data centers be able to manage dynamic workloads and scale up or down dynamically based on demand. The digital transformation of various business processes implies that the entire IT services stack must be built for "Everything-as-a-Service," and this requires an intelligent, self-learning and self-healing IT system.

The always-on paradigm of the digital era brings its own set of IT challenges. Today, IT systems are expected to:

  • Manage all incoming data. Data from a wide variety of sources and formats must be imported, normalized, sorted and managed for business applications. The volume of incoming data can vary greatly, but the systems must be able to handle it. IT systems must accommodate a wide variety of large-scale data sources and formats, including Big Data technologies and integration with legacy in-house and third-party applications.
  • Enable business apps to sequence data, run data analysis and generate reports on demand. On-demand workloads make it difficult to predict the future load on IT systems, but the systems must be able to handle it.
  • Ensure all business apps adhere to their published SLAs.


These expectations place tremendous pressure on IT organizations to automate the management of all IT resources. As a result, CIOs want to:

  1. Effectively spread workloads across n-tier architectures, heterogeneous compute resources and global data center networks.
  2. Predictively detect IT infrastructure failures and automatically remediate failed or disrupted processes and workflows in near real time.
  3. Intelligently predict future workloads and automatically apply policies about when and where data can reside and how processes can be executed.


Traditional IT workload management solutions relied on time-based scheduling to move data and integrate workloads. This is no longer sustainable, as it takes too much time and delays responses to modern business needs.

As a result, we need an intelligent workload automation system that can not only automate workload management, but also make intelligent, policy-based decisions about how to manage business workloads.
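
A toy sketch of what such a policy-based decision might look like is shown below: given current utilization and an SLA target, a rule decides whether a workload should scale out, scale in, or stay as it is. The thresholds, workload attributes and actions are hypothetical simplifications of what a real workload automation product would do.

    from dataclasses import dataclass

    @dataclass
    class Workload:
        name: str
        cpu_utilization: float   # 0.0 - 1.0 across its current nodes
        latency_ms: float        # observed 95th-percentile latency
        sla_latency_ms: float    # latency promised in the SLA

    def decide(workload: Workload) -> str:
        """Very small policy engine: pick an action from utilization and SLA headroom."""
        if workload.latency_ms > workload.sla_latency_ms:
            return "scale_out"        # SLA already breached: add capacity now
        if workload.cpu_utilization > 0.80:
            return "scale_out"        # approaching saturation: scale pre-emptively
        if workload.cpu_utilization < 0.20:
            return "scale_in"         # over-provisioned: release resources
        return "no_action"

    print(decide(Workload("billing-batch", cpu_utilization=0.85, latency_ms=180, sla_latency_ms=250)))
    print(decide(Workload("web-frontend", cpu_utilization=0.40, latency_ms=320, sla_latency_ms=250)))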

Today, the IT industry has responded by developing a plethora of automation tools such as Puppet, Chef, Ansible, etc. Initially these were designed to simplify and automate IT operations, mainly to support the rapid application upgrades and deployments driven by the adoption of DevOps strategies.

However, these tools have to be integrated with deep learning or machine learning systems to:

  1. Respond dynamically to unpredictable, on-demand changes in human-to-machine and machine-to-machine interactions.
  2. Anticipate, predict, and accommodate support for fluctuating workload requirements while maintaining business service-level agreements (SLAs)
  3. Reduce overall cost of operations by enabling IT generalists to take full advantage of sophisticated workload management capabilities via easy-to-use self-service interfaces, templates, and design tools.


Over the next few years, we will see a large number of automation tools that can collectively address the needs of legacy IT systems (such as ERP, databases, business collaboration tools like email, file shares and unified communications, eCommerce and web services) and 3rd-platform business apps, mainly centered around IoT, Big Data, media streaming and mobile platforms.

The current digital transformation of the global economy is driving all business operations and services to become more digitized, mobile and interactive. This leads to increasing complexity in everyday transactions, which translates into complex workflows and application architectures. For example, a single online taxi booking involves multiple queries to GIS systems, plus transactions and data exchanges across several legacy and modern systems.

To succeed in this demanding environment, one needs an intelligent, scalable and flexible IT workload automation solution.