Tuesday, August 16, 2016

Five core attributes of a streaming data platform

Currently, I am planning a system to handle streaming data. Today, in any data-driven organization, there are several new streaming data sources such as mobile apps, IoT devices, sensors, IP cameras, websites, point-of-sale devices, etc.

Before designing a system, first we need to understand the attributes that are necessary to implement an integrated streaming data platform.

As a data-driven organization, the first core consideration is to understand what it takes to acquire all the streaming data: the variety, velocity and volume of data. Once these three parameters are known, it is time to plan and design an integrated streaming platform that allows for both the acquisition of streaming data and the analytics that make streaming applications valuable.

For the system design, there are five core attributes that need to be considered:

1. System Latency
2. System Scalability
3. System Diversity
4. Durability
5. Centralized Data Management

System Latency

Streaming data platforms need to match the pace of the incoming data from the various sources that make up the stream. One of the keys to a streaming data platform is the ability to match the speed of data acquisition with the speed of data ingestion into the data lake.

This implies designing the data pipelines required to transfer data into the data lake and running the necessary processes to parse & prepare data for analytics on Hadoop clusters. Data quality is a key parameter here. It takes a definite amount of time and compute resources for data to be sanitized - to avoid "garbage in, garbage out" situations.

Data security and authenticity have to be established before data gets ingested into the data lake. As data is collected from disparate sources, a basic level of data authentication and validation must be carried out so that the core data lake is not corrupted. For example, in the case of real-time traffic analysis for Bangalore, one needs to check that the data is indeed coming from sensors in Bangalore; otherwise the analysis will not be accurate.
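
As an illustration, such validation can be a small filter at the front of the ingestion pipeline. The sketch below (Python; the sensor IDs, field names and bounding box are made-up assumptions) drops records that do not come from registered Bangalore sensors or that fail basic sanity checks.

# Minimal sketch of ingest-time validation; field names and the
# registered-device whitelist are illustrative assumptions.
from datetime import datetime, timedelta

REGISTERED_SENSORS = {"BLR-TRAFFIC-001", "BLR-TRAFFIC-002"}   # hypothetical IDs
BANGALORE_BBOX = (12.8, 13.2, 77.4, 77.8)                     # approximate lat/lon bounds

def is_valid(record):
    """Reject records that are not from known Bangalore sensors or look corrupt."""
    if record.get("sensor_id") not in REGISTERED_SENSORS:
        return False
    lat, lon = record.get("lat"), record.get("lon")
    min_lat, max_lat, min_lon, max_lon = BANGALORE_BBOX
    if lat is None or lon is None:
        return False
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        return False
    # Drop events with timestamps too far in the future (clock skew / corruption).
    ts = datetime.utcfromtimestamp(record.get("epoch_seconds", 0))
    return ts <= datetime.utcnow() + timedelta(minutes=5)

def sanitize(stream):
    """Yield only records that pass basic authentication and sanity checks."""
    for record in stream:
        if is_valid(record):
            yield record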

Data security is one of the most critical components to consider when designing a data lake. The subject of security can be a lengthy one and has to be customized for each use case. There are many open source tools available that address data governance and data security: Apache Atlas (Data Governance Initiative by Hortonworks), Apache Falcon (automates data pipelines and data replication for recovery and other emergencies), Apache Knox Gateway (provides edge protection) and Apache Ranger (authentication and authorization platform for Hadoop) provided in the Hortonworks Data Platform, and Cloudera Navigator (Cloudera enterprise edition) provided in the Cloudera Data Hub.

Depending on the data producers, data ingestion can be "pushed" or "pulled". The choice of push or pull defines the choice of tools and strategies for data ingestion. If there is a need for real-time analytics, the system requirements must be taken into consideration: the streaming data has to be fed into the compute farm without delays. This implies planning out network capacity, compute memory sizing and compute capacity.
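
To make the push/pull distinction concrete, here is a minimal standard-library sketch: a pull-style collector polls a source on a schedule, while a push-style collector exposes an endpoint that producers post to. The URL, port and payload format are placeholders, not a prescription.

# Illustrative pull vs. push ingestion using only the Python standard library.
import json, time, urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def ingest(records):
    """Placeholder for the downstream pipeline (parse, validate, land in the lake)."""
    for r in records:
        print("ingested:", r)

def pull_loop(source_url, interval_seconds=10):
    """Pull model: the platform polls the producer at a fixed interval."""
    while True:
        with urllib.request.urlopen(source_url) as resp:
            batch = json.loads(resp.read())
        ingest(batch)
        time.sleep(interval_seconds)

class PushHandler(BaseHTTPRequestHandler):
    """Push model: producers POST events to the platform as they occur."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        ingest([event])
        self.send_response(202)     # accepted for asynchronous processing
        self.end_headers()

# Example: HTTPServer(("0.0.0.0", 8080), PushHandler).serve_forever()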

In the case of on-demand or near real-time analytics, there will be additional latency as the data lands in the data warehouse or gets ingested into the data lake. Systems are then needed to feed the ingested data to a BI system or a Hadoop cluster for analysis. If there are location-based dependencies, one needs to build distributed Hadoop clusters.

System Scalability

The size of the data streaming from devices is not constant. As more data collection devices are added, or when existing devices are upgraded, the size of the incoming data stream increases. For example, in the case of IP cameras, the data stream size will increase when the number of cameras increases, when the cameras are upgraded to capture higher resolution images, or when more data is collected - in the form of infrared images, voice, etc.

Streaming data platforms need to be able to match the projected growth of data sizes. This means that streaming data platforms will need to stream data from a larger number of sources and/or larger data volumes. As the data size changes, all the connected infrastructure must be capable of scaling up (or scaling out) to meet the new demands.
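
A back-of-the-envelope sizing calculation is usually enough to start this planning. The numbers below are purely illustrative assumptions (camera count, bitrate, upgrade factor), but the arithmetic is the kind a capacity plan needs.

# Illustrative capacity estimate for the IP camera example; all inputs are assumptions.
cameras         = 500        # current number of cameras
bitrate_mbps    = 4          # per camera at current resolution
upgrade_factor  = 2.5        # e.g. HD -> 4K plus infrared/audio channels
planned_cameras = 800        # after expansion

current_ingest_gbps = cameras * bitrate_mbps / 1000
future_ingest_gbps  = planned_cameras * bitrate_mbps * upgrade_factor / 1000

seconds_per_day   = 24 * 3600
future_tb_per_day = future_ingest_gbps * seconds_per_day / 8 / 1000   # Gbps -> TB/day

print(f"ingest today : {current_ingest_gbps:.1f} Gbps")
print(f"ingest later : {future_ingest_gbps:.1f} Gbps")
print(f"daily volume : {future_tb_per_day:.1f} TB/day to land in the data lake")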

System Diversity

The system must be designed to handle a diverse set of data sources. Streaming data platforms will need to support not just "new era" data sources from mobile devices, cloud sources, or the Internet of Things. Streaming data platforms will also be required to support "legacy" platforms such as relational databases, data warehouses, and operational applications like ERP, CRM, and SCM. These are the platforms with the information to place streaming devices, mobile apps, and browser click information into context to provide value-added insights.

System Durability 

Once the data is captured by the system and registered in the data lake, the value of the historical data depends on the value of historical analysis. Many of the data sets at the source could have changed, so data needs to be constantly refreshed (or purged) for meaningful analysis. This refresh must be policy/rule based.
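
A minimal sketch of what a policy/rule based refresh could look like, with hypothetical dataset names and retention rules:

# Illustrative retention/refresh policy applied to data sets in the lake.
# Dataset names and rules are hypothetical.
from datetime import datetime, timedelta

POLICIES = {
    "traffic_sensor_raw": {"refresh_every": timedelta(hours=1), "purge_after": timedelta(days=30)},
    "pos_transactions":   {"refresh_every": timedelta(days=1),  "purge_after": timedelta(days=365)},
}

def apply_policy(name, last_refreshed, created, now=None):
    """Return the action the platform should take for a dataset."""
    now = now or datetime.utcnow()
    rule = POLICIES[name]
    if now - created > rule["purge_after"]:
        return "purge"
    if now - last_refreshed > rule["refresh_every"]:
        return "refresh"
    return "keep"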

Centralized Data Management 

One of the core tenets of a streaming data platform is to make the entire streaming system easy to monitor and manage. This makes the system easier to maintain and sustain. Using a centralized architecture, streaming data platforms can not only reduce the number of potential connections between streaming data sources and streaming data destinations, but also provide a centralized repository of technical and business metadata to enable common data formats and transformations.

A data streaming system can easily contain hundreds of nodes. This makes it important to use tools that monitor, control, and process the data lake. Currently, there are two main options for this: Cloudera Manager and Hortonworks' open source tool Ambari. With such tools, you can easily decrease or increase the size of your data lake while finding and addressing issues such as bad hard drives, node failure, and stopped services. Such tools also make it easy to add new services, ensure compatible tool versions, and upgrade the data lake.
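
Service health can also be polled programmatically. The sketch below assumes Ambari's REST API (cluster name, credentials and the exact response fields are assumptions and may differ by version); Cloudera Manager exposes a comparable API.

# Hedged sketch: polling service state via Ambari's REST API.
# Cluster name, credentials and response layout are assumptions.
import json, urllib.request, base64

AMBARI = "http://ambari-host:8080/api/v1/clusters/datalake"
AUTH   = base64.b64encode(b"admin:admin").decode()

def service_state(service_name):
    req = urllib.request.Request(f"{AMBARI}/services/{service_name}",
                                 headers={"Authorization": f"Basic {AUTH}"})
    with urllib.request.urlopen(req) as resp:
        info = json.loads(resp.read())
    return info["ServiceInfo"]["state"]      # e.g. "STARTED", "INSTALLED"

for svc in ("HDFS", "KAFKA", "YARN"):
    print(svc, service_state(svc))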

For a complete list of streaming data tools see: https://hadoopecosystemtable.github.io/ 

Closing Thoughts 

Designing a streaming data platform for data analysis is not easy. But with these five core attributes as the foundation of the design, one can ensure the streaming data platform is built on a robust base and becomes a complete solution that meets the needs of the data-driven organization.

Understanding the Business Benefits of Colocation

Digital transformation and the move towards private cloud is really shaking up the design and implementation of data centers. As companies start their journey to the cloud, they realize that having sets of dedicated servers for each application will not help them and they need to change their data centers.

Historically, companies started out with a small server room to host the few servers that ran their business applications. The server room was located in their office space and it was a small setup. As business became more compute-centric, the small server room became unviable. This led to data centers - which were often still located in their office buildings.

The office buildings were not designed for hosting data centers and had to be modified to get more air-conditioning, networking and power into the data center. The limitations of existing buildings constrained efficient cooling and power management.

But now, as companies plan their move to private cloud, they are seeing huge benefits in having a dedicated data center - a purpose-built facility for hosting a large number of computers, switches, storage and power systems. These purpose-built data centers have better power supply solutions and more efficient liquid cooling solutions, and more importantly offer a wide range of network connectivity and network services.

As a result, these dedicated data centers can save money on IT operations and also provide greater reliability & resilience.

But not all companies need a data center large enough to benefit from economies of scale. Very few enterprises really need such large, dedicated, purpose-built data centers. So there is a new solution - colocation data centers.

For CIOs, colocation provides the perfect win-win scenario, delivering both cost savings and state-of-the-art infrastructure. When comparing the capabilities of a standard server room to a colocated data center solution, the savings on power bills alone are often enough to justify the project.

These dedicated data centers are built at large scale in industrial zones with dedicated power lines and backup power systems, so the power cost is much lower than before. Moreover, these dedicated data centers can employ newer & more efficient cooling systems that reduce the overall power consumption of the data center.
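
A simple comparison illustrates the point. The figures below are assumptions chosen only for illustration, but comparing the PUE (Power Usage Effectiveness) and power tariff of an in-office server room against a purpose-built facility is how the business case is typically framed.

# Illustrative annual power-cost comparison; all inputs are assumed figures.
it_load_kw     = 100                    # IT equipment draw
hours_per_year = 24 * 365

office_pue, office_tariff = 2.0, 0.12   # inefficient cooling, office power rates ($/kWh)
colo_pue,   colo_tariff   = 1.4, 0.08   # purpose-built facility, industrial tariff

office_cost = it_load_kw * office_pue * hours_per_year * office_tariff
colo_cost   = it_load_kw * colo_pue   * hours_per_year * colo_tariff

print(f"office server room : ${office_cost:,.0f} / year")
print(f"colocated facility : ${colo_cost:,.0f} / year")
print(f"saving             : ${office_cost - colo_cost:,.0f} / year")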

Business Benefits of Colocation

Apart from reductions in operational expenditure, there are several other benefits of colocation. Having a dedicated team available 24/7/365 to monitor and manage the IT infrastructure is a huge benefit.


  1. Cost Savings on Power & Taxes

    Dedicated data centers are built in locations that offer cheap power. Companies can also negotiate tax breaks for building in remote or industrial areas. In addition to a lower price of power, these data centers are designed with diverse power feeds and efficient distribution paths. They have dual generator systems that can be refueled while in operation, on-site fuel reserves, and multiple UPS systems in place.

    In addition to lower power costs, dedicated data centers have engineers and technicians who monitor power levels and battery levels 24/7 so that the center maintains 100% uptime.

    Additionally, data centers have the time, resources and impetus to continually invest in and research green technologies. This means that businesses can reduce their carbon footprint at their office locations and benefit from continual efficiency-saving research. Companies that move their servers out of in-house server rooms typically cut their own carbon emissions by about 90 percent.

  2. Network Connected Globally, Securely and Quickly

    Today, high-speed network connectivity is key to business. And it is a lot more difficult to get big, fat pipes of network connectivity into central office locations than into a centralized data center. A dedicated data center will have many network service providers offering connectivity, often at a lower price than at an office location.

    Dedicated data centers also provide resilient connectivity at a fairly low price – delivering 100 Mbps of bandwidth might be hard at an office location, and trying to create a redundant solution is often financially unviable. Data centers are connected to multiple transit providers and have large bandwidth pipes, meaning that businesses often get a better service for less cost.

    Colocation enables organizations to benefit from faster networking and cheaper network connections.

  3. Monitoring IT Infrastructure

    Building a dedicated data center makes it easier to monitor the health of the IT infrastructure. The economies of scale that come from colocation help to build a robust monitoring solution that can watch the entire IT infrastructure and ensure SLAs are being met.

  4. Better Security

    A dedicated or colocated data center will have better physical security than a data center in an office location. The physical isolation of the data center enables the service provider to offer better security measures, including biometric scanners, closed-circuit cameras, on-site security, coded access, alarm systems, ISO 27001 accredited processes, on-site security teams and more. With colocation, all these service costs are shared - bringing down the costs while improving the level of security.

  5. Scalability

    Platform 3 paradigms such as digital transformation, IoT, Big Data, etc. are driving up the scale of IT infrastructure. As the demand for computing shoots up over time, the data center must be able to cope with it. With colocation, the scale-up requirements can be negotiated ahead of time, and with just one call to the colocation provider the scale of the IT infrastructure can be increased as needed.

    Data centers and colocation providers can have businesses up and running within hours, and provide the flexibility to grow alongside your organization. Colocation space, power, bandwidth and connection speeds can all be increased when required.

    The complexity of rack space management, power management, etc. is outsourced to the colocation service provider.

  6. Environment Friendly & Green IT

    A large-scale data center has more incentive to run greener IT operations, as doing so results in lower energy costs. Often these data centers are located in industrial areas where better cooling technologies can be safely deployed, which makes it possible to improve overall operational efficiency. Typically, a colocated data center adheres to global green standards such as ASHRAE 90.1-2013, ASHRAE 62.1, LEED certifications, etc.

    Bigger data centers also enable better IT e-waste management and recycling. Old computers, UPS batteries and other equipment can be safely & securely disposed of.

  7. Additional Services from Colocation Partners

    Colocation service providers may also host other cloud services such as:

    a. Elastic scalability to ramp up IT resources when there is seasonal demand and scale down when demand falls.

    b. Data backups and data archiving to tapes, stored in a secure location.

    c. Disaster recovery across multiple data centers & data redundancy to protect from data loss in case of natural disasters.

    d. Network security and monitoring against malicious network attacks - usually in the form of a "Critical Incident Center." Critical Incident Centers are like a Network Operations Center, but they monitor data security and provide active network security.

Closing Thoughts

In conclusion, a dedicated data center offers tremendous cost advantages. But if the company's IT scale does not warrant a dedicated data center, then the best option is to move to a colocated data center. Colocation providers are able to meet business requirements at a lower cost than if the service was kept in-house.

A colocation solution provides companies with a variety of opportunities: exceptional SLAs and data secured off-site give organizations added levels of risk management, along with the chance to invest in better equipment and state-of-the-art servers.

Why Workload Automation is critical in Modern Data Centers


Today, all CIOs demand that their data centers be able to manage dynamic workloads and have the ability to scale up or scale down dynamically based on demand. The digital transformation of various business processes implies that the entire IT services stack must be built for "Everything-as-a-Service." And this needs to be an intelligent, self-learning and self-healing IT system.

The always ON paradigm of the digital era brings in its own set of IT challenges. Today, IT systems are expected to:

  • Manage all incoming data. Data from a wide variety of sources and formats must be imported, normalized, sorted and managed for business applications. The volume of incoming data can vary greatly - but the systems must be able to handle it. IT systems must accommodate a wide variety of large-scale data sources and formats, including Big Data technologies and integration with legacy in-house and third-party applications.
  • Enable Business Apps to sequence data, run data analysis and generate reports on demand. On-demand workloads make it difficult to predict future loads on IT systems - but the systems must be able to handle them.
  • Ensure all Business Apps adhere to the published SLA.


These expectations place tremendous pressure on IT to automate the management of all IT resources. As a result, CIOs want:

  1. All workloads to be effectively spread across n-tier architectures, across heterogeneous compute resources and across global data center networks.
  2. IT infrastructure failures to be predictively detected, and failed or disrupted processes and workflows to be automatically remediated in near real time.
  3. Future workloads to be intelligently predicted, so that policies about when and where data can reside and how processes are executed can be applied automatically.


Traditional IT workload management solutions relied on time-based scheduling to move data and integrate workloads. This is no longer sustainable, as it takes too much time and delays the response to modern business needs.

As a result, we need an intelligent workload automation system which can not only automate workload management, but also make intelligent, policy-based decisions on how to manage business workloads.
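
As a rough sketch of what such policy-based decisions could look like, the function below routes a workload to a target site using a few explicit rules; the thresholds, site names and attributes are hypothetical.

# Illustrative policy-based workload placement; sites and policies are assumptions.
def place_workload(job, sites):
    """Pick a target site for a job based on simple, explicit policies."""
    candidates = [s for s in sites
                  if s["free_cpu"] >= job["cpu"] and s["free_mem_gb"] >= job["mem_gb"]]
    # Policy 1: data residency - jobs tagged with a region must stay in that region.
    if job.get("region"):
        candidates = [s for s in candidates if s["region"] == job["region"]]
    # Policy 2: SLA - latency-sensitive jobs go to the site closest to the data source.
    if job.get("latency_sensitive"):
        candidates.sort(key=lambda s: s["ms_to_source"])
    else:
        # Policy 3: otherwise prefer the cheapest site.
        candidates.sort(key=lambda s: s["cost_per_hour"])
    return candidates[0]["name"] if candidates else None

sites = [
    {"name": "blr-dc1", "region": "IN", "free_cpu": 64,  "free_mem_gb": 256, "ms_to_source": 5,   "cost_per_hour": 1.4},
    {"name": "us-dc2",  "region": "US", "free_cpu": 128, "free_mem_gb": 512, "ms_to_source": 180, "cost_per_hour": 0.9},
]
print(place_workload({"cpu": 16, "mem_gb": 64, "region": "IN", "latency_sensitive": True}, sites))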

Today, the IT industry has responded by developing a plethora of automation tools such as Puppet, Chef, Ansible, etc. Initially these were designed to simplify and automate IT operations - mainly to support rapid application upgrades and deployments driven by the adoption of DevOps strategies.

However, these tools have to be integrated with deep learning or machine learning systems to:

  1. Respond dynamically to unpredictable, on-demand changes in human-to-machine and machine-to-machine interactions.
  2. Anticipate, predict, and accommodate fluctuating workload requirements while maintaining business service-level agreements (SLAs).
  3. Reduce overall cost of operations by enabling IT generalists to take full advantage of sophisticated workload management capabilities via easy-to-use self-service interfaces, templates, and design tools.


Over the next few years, we will see a large number of automation tools that can collectively address the needs of legacy IT systems (such as ERP, databases, business collaboration tools - email, file shares, unified communications - eCommerce and web services) and 3rd platform Business Apps - mainly centered around IoT, Big Data, media streaming & mobile platforms.

The current digital transformation of the global economy is driving all business operations and services to become more digitized, mobile and interactive. This leads to increasing complexity in everyday transactions - which translates to complex workflows and application architectures. For example, a single online taxi booking transaction will involve multiple queries to GIS systems, plus transactions and data exchanges across several legacy & modern systems.

To succeed in this demanding environment, one needs an intelligent, scalable and flexible IT workload automation solution.
  

Tuesday, July 19, 2016

Why Project Managers need People Skills


Recently, I was having a discussion on why project managers need to excel at people skills. On the surface, most project managers do not have a people management function and tend to discount people skills.

Project managers are often individual contributors who have no people management functions. Moreover, project managers are often overloaded with overseeing several projects, which places a heavy load on their time. I have seen project managers handling 6-8 projects simultaneously.

As a result, project managers tend to discount people skills and rely on technical skills alone. This may work in certain conditions - but in the long run, to be a really successful project manager one needs good people skills.

Benefits of having good people skills:

  1. We live in a global world where projects are executed by globally dispersed teams. In such an environment, it is a rare luxury to meet people face-to-face and build rapport with various stakeholders. Many project managers may never meet some of the stakeholders and customers face-to-face.
  2. In today's hyper competitive world, there is a constant pressure to complete projects faster than before. Reduction of project cycle times has become crucial for success.
  3. Projects and programs are becoming more complex. In the software world, new projects are being launched on unproven technologies, so it becomes very difficult for a project manager to identify project risks. The only way to correctly identify project risks is to connect with the people working on the project and get first-hand information about all the risks.
  4. Engineers often multitask. Today it is common for engineers to be working on more than one project. So unless project managers can build a good personal rapport with the team, the project can suffer from unexpected delays and defects.
  5. Project managers work in matrix organizations, often reporting to several stakeholders. Good people skills make it easier to manage and interact with these stakeholders and get more positive outcomes.
  6. Many organizations do not have clearly demarcated management roles. Engineering execution managers often step into the project manager's space and vice versa. In such cases, good people skills help smooth interactions and reduce friction.

Wednesday, July 13, 2016

IoT needs Artificial Intelligence



IoT - the Internet of Things - has been the biggest buzzword of 2016. Yet, people fail to derive value from IoT devices, and IoT has not yet hit the mainstream; it continues to sit at the periphery - with a lot of hype surrounding it.

Personally, I have tested several of these IoT devices - wearables such as Fitbit, Google Glass and smart helmets - but failed to derive value from them. The main reason: I had to do extra work to make sense of these devices. In short, I had to work to make IoT work for me!

There are two main problems plaguing IoT.

1. Lack of inbuilt intelligence to derive value out of IoT
2. Power supply for IoT

In this article, I will concentrate on the first problem plaguing IoT.

Essentially, IoT produces raw data, and lots of it. The type of data depends on the type of device: sensor data from cars, heartbeat information from pacemakers, etc. Collectively, all this sensor data usually falls under the Big Data category. The sheer volume of data being created will increase to a mind-boggling level. This data holds extremely valuable insight into what's working well and what's not - pointing out conflicts as they arise and providing high-value insight into new business risks and opportunities as correlations and associations are made.

The problem is that it takes a huge amount of work to analyze this deluge of raw data and build valuable information out of it.

For example, with a health wearable, I can get to know my heartbeat pattern, heart rate, how many steps I walked, how many calories I burnt, how much rest and sleep I got, etc. But this information is useless to me unless I can analyze it and develop a plan to change my activities. The wearable does not tell me what I need to do to reach my health goals, nor can it handle any anomalies.
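
What I wanted from the device is something closer to the sketch below: raw daily readings turned into a concrete suggestion. The goals and thresholds here are made-up illustrations.

# Toy sketch of turning raw wearable data into an actionable suggestion.
# Goals and thresholds are illustrative assumptions.
def daily_advice(day):
    advice = []
    if day["steps"] < 8000:
        advice.append(f"walk {8000 - day['steps']} more steps")
    if day["sleep_hours"] < 7:
        advice.append("go to bed earlier tonight")
    if day["resting_hr"] > 80:
        advice.append("resting heart rate is elevated; plan a lighter workout")
    return advice or ["on track - keep the same routine"]

print(daily_advice({"steps": 5400, "sleep_hours": 6.2, "resting_hr": 84, "calories": 2100}))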

Corporates using Big Data analytics tools (which are expensive) can get really valuable insights. But it takes a large team of experts to develop the big data analytics tools & platform that produce those insights. Organizational leaders must then understand the insights and ACT on them. All this means - LOTS OF WORK!

The only way to keep up with this IoT-generated data and make sense of it is with machine learning or Artificial Intelligence.

As the rapid expansion of devices and sensors connected to the IoT continues, it will produce a huge volume of data: Big Data that can be used to develop self-driving cars, save fuel in airplanes, improve public health, etc. This treasure trove of big data is valuable only when machine intelligence is built on top of it - intelligence that can take autonomous decisions.
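
A small example of the kind of machine intelligence meant here: an unsupervised model that learns what "normal" sensor readings look like and flags the rest. This sketch assumes scikit-learn and NumPy are available, and the readings are simulated.

# Hedged sketch: unsupervised anomaly detection on simulated IoT sensor readings.
# Assumes scikit-learn and numpy are installed; all data here is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal    = rng.normal(loc=[70, 36.6], scale=[5, 0.3], size=(1000, 2))   # heart rate, body temp
anomalies = np.array([[140, 39.5], [35, 34.0]])                          # readings worth flagging

model = IsolationForest(contamination=0.01, random_state=0).fit(normal)
print(model.predict(anomalies))    # -1 means the model flags the reading as anomalous
print(model.predict(normal[:5]))   # mostly +1 (normal)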

The idea of AI sounds great, and its benefits are limitless - it will eliminate the need for humans to intervene in mundane daily tasks. However, the big problem will be to improve the speed and accuracy of AI.

An IoT-based AI system must be able to regulate its actions without errors. If IoT & AI do not live up to their promise, the consequences should not be disastrous: a minor glitch, like home appliances that don't work together as advertised, is tolerable, but a life-threatening malfunction like the Tesla car crash would keep people from embracing IoT & AI.

AI is already in use in many ways; for example, Netflix, Amazon and Pandora use it to recommend products/movies/songs that you may like.

Today, there are lots of opportunities in the IoT-AI solution space: for example, an intelligent insulin pump that can regulate insulin delivery based on the person's activity and blood sugar levels, or driverless vehicles - trains, planes, cars, etc. - so that better decisions can be made without human intervention.
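
The decision logic of such an insulin pump is essentially a control loop. The sketch below is a deliberately simplified toy; the thresholds, dose steps and activity adjustment are invented purely for illustration.

# Toy control-loop sketch for an "intelligent" insulin pump; all numbers are invented.
def next_dose(glucose_mg_dl, activity_level, current_basal_units):
    """Suggest a basal-rate adjustment from the latest sensor readings."""
    if glucose_mg_dl < 80:                       # trending low: back off
        return max(current_basal_units - 0.1, 0.0)
    if glucose_mg_dl > 180:                      # trending high: step up
        dose = current_basal_units + 0.1
    else:
        dose = current_basal_units
    # Higher activity burns glucose faster, so reduce the dose proportionally.
    return round(dose * (1.0 - 0.2 * activity_level), 2)   # activity_level in [0, 1]

print(next_dose(glucose_mg_dl=195, activity_level=0.5, current_basal_units=1.0))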