
Friday, August 24, 2018

Four Key Aspects of API Management

Today, APIs are transforming businesses. They are at the core of creating new apps, customer-centric development, and new business models.

APIs are at the core of the drive towards digitization, IoT, mobile-first, fintech and hybrid cloud. This focus on APIs implies having a solid API management system in place.

API management is based on four rock-solid aspects:

1. API Portal
Online portal to promote APIs.
This is essentially the first place users come to register, get all API documentation, and enroll in online communities and support groups.
In addition, it is good practice to provide an online API testing platform to help customers build/test their API ecosystems.

2. API Gateway
API Gateway – Securely open access for your API
Use policy-driven security to secure and monitor API access, protecting your APIs from unregistered usage and malicious attacks. Enable DMZ-strength security between the consumer apps using your APIs and your internal servers.
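
To make this concrete, here is a minimal, product-agnostic sketch (in Python) of the kind of policy check a gateway applies before forwarding a request; the key registry and rate limits shown are hypothetical:

```python
import time
from collections import defaultdict

# Hypothetical registry of API keys issued through the API portal
REGISTERED_KEYS = {"key-abc123": {"plan": "gold", "rate_limit_per_min": 60}}

_request_log = defaultdict(list)  # api_key -> timestamps of recent requests

def gateway_policy_check(api_key: str) -> tuple[bool, str]:
    """Return (allowed, reason) for an incoming API request."""
    policy = REGISTERED_KEYS.get(api_key)
    if policy is None:
        return False, "unregistered key"          # block unregistered usage

    now = time.time()
    window = [t for t in _request_log[api_key] if now - t < 60]
    if len(window) >= policy["rate_limit_per_min"]:
        return False, "rate limit exceeded"       # throttle abusive traffic

    window.append(now)
    _request_log[api_key] = window
    return True, "ok"

print(gateway_policy_check("key-abc123"))   # (True, 'ok')
print(gateway_policy_check("unknown-key"))  # (False, 'unregistered key')
```

Real gateways apply many more policies (authentication, payload inspection, quotas), but they all follow this same check-then-forward pattern.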

3. API Catalog
API Lifecycle Management: manage the entire process of designing, developing, deploying, versioning, and retiring APIs.
Build & maintain the right APIs for your business.  Track complex interdependencies of APIs on various services and applications.
Design and configure policies to be applied to your APIs at runtime.

4. API Monitoring
API Consumption Management
Track consumption of APIs for governance, performance, and compliance.
Monitor customer experience and develop a comprehensive API monetization plan.
Define, publish, and track usage of API subscriptions and charge-back services.
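
As a rough illustration of consumption metering for charge-back, the short Python sketch below tallies hypothetical gateway usage events per subscriber; the API names and per-call prices are made up:

```python
from collections import Counter

# Hypothetical usage records emitted by the gateway: (subscriber, api_name)
usage_events = [
    ("acme-corp", "payments-api"),
    ("acme-corp", "payments-api"),
    ("globex", "accounts-api"),
]

# Hypothetical price per call, used for charge-back
PRICE_PER_CALL = {"payments-api": 0.002, "accounts-api": 0.001}

calls = Counter(usage_events)
for (subscriber, api), count in calls.items():
    cost = count * PRICE_PER_CALL[api]
    print(f"{subscriber:10s} {api:14s} calls={count:4d} charge=${cost:.4f}")
```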

Friday, August 17, 2018

4 Types of Data Analytics


Data analytics can be classified into 4 types based on complexity and value. In general, the most valuable analytics are also the most complex.

1. Descriptive analytics

Descriptive analytics answers the question:  What is happening now?

For example, in IT management, it tells how many applications are running at that instant and how well those applications are working. Tools such as Cisco AppDynamics and SolarWinds NPM collect huge volumes of data, analyze it, and present it in an easy-to-read format.

Descriptive analytics compiles raw data from multiple data sources to give valuable insights into what is happening now and what happened in the past. However, it does not tell what is going wrong or explain why; it simply helps trained managers and engineers understand the current situation.
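
A minimal sketch of what descriptive analytics does under the hood, assuming a handful of hypothetical raw monitoring samples: it simply aggregates them into readable per-application summaries.

```python
from statistics import mean

# Hypothetical raw samples pulled from multiple monitoring sources
samples = [
    {"app": "billing", "status": "up",   "response_ms": 120},
    {"app": "billing", "status": "up",   "response_ms": 180},
    {"app": "portal",  "status": "down", "response_ms": None},
    {"app": "portal",  "status": "up",   "response_ms": 95},
]

apps = {s["app"] for s in samples}
for app in sorted(apps):
    rows = [s for s in samples if s["app"] == app]
    up = sum(1 for s in rows if s["status"] == "up")
    latencies = [s["response_ms"] for s in rows if s["response_ms"] is not None]
    avg = mean(latencies) if latencies else float("nan")
    print(f"{app}: {up}/{len(rows)} samples up, avg response {avg:.0f} ms")
```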

2. Diagnostic analytics

Diagnostic analytics uses real-time and historical data to automatically deduce what has gone wrong and why. Typically, diagnostic analytics is used for root cause analysis, to understand why things have gone wrong.

Large amounts of data are used to find dependencies and relationships and to identify patterns that give deep insight into a particular problem. For example, the Dell EMC Service Assurance Suite can provide fully automated root cause analysis of IT infrastructure. This helps IT organizations rapidly troubleshoot issues and minimize downtime.
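
One simple way to automate root cause analysis is to walk a service dependency graph and keep only the failed components whose own dependencies are healthy. The Python sketch below illustrates the idea with a hypothetical dependency map (it is not how any particular product implements it):

```python
# Hypothetical dependency map: service -> services it depends on
DEPENDS_ON = {
    "web-frontend": ["order-service"],
    "order-service": ["database", "payment-service"],
    "payment-service": [],
    "database": [],
}

def probable_root_causes(failed: set[str]) -> set[str]:
    """A failed service whose dependencies are all healthy is a likely root cause."""
    return {
        svc for svc in failed
        if not any(dep in failed for dep in DEPENDS_ON.get(svc, []))
    }

# Example: the frontend and order service alert, but the database is the real culprit
print(probable_root_causes({"web-frontend", "order-service", "database"}))
# -> {'database'}
```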

3. Predictive analytics

Predictive analytics tells what is likely to happen next.

It uses historical data to identify definite patterns of events and predict what will happen next. Descriptive and diagnostic analytics are used to detect tendencies, clusters, and exceptions, and predictive analytics is built on top of them to predict future trends.

Advanced algorithms such as forecasting models are used to make these predictions. It is essential to understand that a forecast is just an estimate whose accuracy depends heavily on data quality and the stability of the situation, so it requires careful treatment and continuous optimization.

For example, HPE InfoSight can predict what may happen to IT systems based on current and historical data. This helps IT organizations manage their infrastructure and prevent future disruptions.
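
As an illustration of the simplest kind of forecasting model, the sketch below applies exponential smoothing to a hypothetical series of daily storage utilisation figures; real predictive engines use far more sophisticated algorithms:

```python
def exponential_smoothing_forecast(history, alpha=0.5):
    """One-step-ahead forecast using simple exponential smoothing."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical daily storage utilisation (%) for the past week
utilisation = [61, 63, 62, 66, 69, 71, 74]
print(f"forecast for tomorrow: {exponential_smoothing_forecast(utilisation):.1f}%")
```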



4. Prescriptive analytics

Prescriptive analytics is used to literally prescribe what action to take when a problem occurs.

It uses vast data sets and intelligence to analyze the outcomes of possible actions and then select the best option. This state-of-the-art type of data analytics requires not only historical data but also external information from human experts (expert systems) in its algorithms to choose the best possible decision.

Prescriptive analytics uses sophisticated tools and technologies, like machine learning, business rules, and algorithms, which makes it complex to implement and manage.

For example, IBM Runbook Automation tools help IT operations teams simplify and automate repetitive tasks. Runbooks are typically created by technical writers working for top-tier managed service providers. They include procedures for every anticipated scenario, and generally use step-by-step decision trees to determine the effective response to a particular scenario.
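
A toy illustration of the rule-based core of prescriptive analytics: a hypothetical runbook rule table maps an observed condition to a prescribed action, falling back to a human expert when no rule matches. This is only a sketch, not how IBM Runbook Automation itself is implemented:

```python
# Hypothetical rule table mapping an observed condition to a prescribed action
RUNBOOK_RULES = [
    (lambda e: e["metric"] == "disk_used_pct" and e["value"] > 90, "expand volume and purge old logs"),
    (lambda e: e["metric"] == "cpu_pct" and e["value"] > 85,       "scale out one more instance"),
    (lambda e: e["metric"] == "error_rate" and e["value"] > 0.05,  "roll back last deployment"),
]

def prescribe(event: dict) -> str:
    for condition, action in RUNBOOK_RULES:
        if condition(event):
            return action
    return "escalate to on-call engineer"   # fall back to a human expert

print(prescribe({"metric": "disk_used_pct", "value": 93}))  # expand volume and purge old logs
print(prescribe({"metric": "latency_ms", "value": 400}))    # escalate to on-call engineer
```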

Wednesday, May 23, 2018

Build Highly Resilient Web Services


Digitization has led to new business models that rely on web services. Digital banks, payment gateways, and other fintech services are now available only on the web. These web services need to be highly resilient, with uptime greater than 99.9999%.

Building such highly resilient web services essentially boils down to seven key components:

Highly Resilient IT Infrastructure: 
All underlying IT infrastructure (compute, network, and storage) must run in HA mode. High availability implies node-level and site-level resilience. This ensures that a node failure or even a site failure does not bring down the web services.

Data Resilience:
All app-related data is backed up in timely snapshots and also replicated in real time across multiple sites, so that data is never lost and RPO and RTO are maintained at "zero".
This ensures that the disaster recovery site is always maintained in an active state.

Application Resilience:
Web applications have to be designed for high resilience. SOA-based web apps and containerized apps are preferred over large monolithic applications.

Multiple instances of the application should be run behind a load balancer - so that workload gets evenly distributed. Load balancing can also be done across multiple sites or even multiple cloud deployments to ensure web apps are always up and running.

Application performance monitoring plays an important role in ensuring apps are available and performing to the required SLA. Active application performance management is needed to ensure customers have a good web experience.
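
One common building block here is a health-check endpoint exposed by every application instance, which the load balancer polls so it can stop routing traffic to unhealthy instances. Below is a minimal sketch using only Python's standard library; the /healthz path and port are arbitrary choices:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # In a real app this would also verify DB connections, queues, etc.
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Each application instance exposes this endpoint; the load balancer polls it
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```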

Security Plan: 
Security planning implies building security features into the underlying infrastructure, applications, and data. The security plan is mandatory and must be detailed enough to pass security audits and all regulatory compliance requirements.
Software-defined security is developed from this security plan, which helps avoid several security issues later in operations.
The security plan includes policies such as encryption standards, access control, and DMZ design.

Security operations: 
Once the application is in production, the entire IT infrastructure stack must be monitored for security. There are several security tools for this: autonomous watchdogs, web policing, web intelligence, continuous authentication, traffic monitoring, endpoint security, and user training against phishing.
IT security is always an ongoing operation, and one must remain fully vigilant against security attacks, threats, and weaknesses.

IT Operations Management:
All web services need constant monitoring for availability and performance. Every IT system used to provide a service must be monitored, and corrective and proactive actions must be taken to keep the web applications running.

DevOps & Automation:
DevOps and automation are the lifeline of web apps. DevOps is used for all system updates to provide seamless, non-disruptive upgrades to web apps. DevOps also allows new features to be tested in a controlled way - for example, exposing new versions or capabilities to a select group of customers and then using that data to harden the apps.
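
A minimal sketch of such controlled exposure: users are deterministically bucketed per feature, so roughly the chosen percentage sees the new version and the assignment stays stable across requests. The feature name and percentage are hypothetical:

```python
import hashlib

def in_canary_group(user_id: str, feature: str, rollout_pct: int) -> bool:
    """Deterministically place a user in the canary group for a feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket 0-99 per user/feature
    return bucket < rollout_pct

# Expose the new checkout flow to roughly 5% of users first
for uid in ["u100", "u101", "u102"]:
    print(uid, in_canary_group(uid, "new-checkout", rollout_pct=5))
```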

Closing Thoughts

Highly resilient apps are not created by accident. It takes a lot of work and effort to keep web applications up and running at all times. In this article, I have mentioned only the seven main steps needed to build highly resilient web applications. There are more, depending on the nature of the application and the business use case, but these seven are common to all types of applications.

Tuesday, May 22, 2018

5 Aspects of Cloud Management


If you have to migrate an application to a public cloud, then there are five aspects you need to consider before migrating.



1. Cost Management
The cost of the public cloud service must be clearly understood, and chargeback to each application must be accurate. Look out for hidden costs and demand-based costs, as these can burn a serious hole in your budget.

2. Governance & Compliance
Compliance with regulatory standards is mandatory. In addition, you may have further compliance requirements of your own. Service providers must proactively adhere to these standards.

3. Performance & Availability
Application performance is key. The availability/uptime and performance of the underlying IT infrastructure must be monitored continuously. In addition, application performance monitoring, both through direct methods and via synthetic transactions, is critical to knowing what customers are experiencing (see the sketch after this list).

4. Data & Application Security
Data security is a must. Data must be protected against theft, loss, and unavailability. Applications must also be secured against unauthorized access and DDoS attacks. Having an active security system is a must for apps running in the cloud.

5. Automation & Orchestration
Automation for rapid application deployment via DevOps, rapid configuration changes, and new application rollout is a must. Offering IT infrastructure as code enables flexibility for automation and DevOps. Orchestration of various third-party cloud services and the ability to use multiple cloud services together is mandatory.
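
The sketch below shows the idea behind the synthetic transaction monitoring mentioned in point 3: periodically exercise key user journeys and alert when they fail or breach a latency budget. The endpoints and SLA threshold are hypothetical:

```python
import time
import urllib.request
import urllib.error

# Hypothetical endpoints that represent key user transactions
SYNTHETIC_CHECKS = {
    "login page":  "https://app.example.com/login",
    "pricing api": "https://api.example.com/v1/prices",
}
LATENCY_SLA_MS = 500

def run_synthetic_checks():
    for name, url in SYNTHETIC_CHECKS.items():
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                ok = resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            ok = False
        elapsed_ms = (time.monotonic() - start) * 1000
        if not ok or elapsed_ms > LATENCY_SLA_MS:
            print(f"ALERT: {name} unhealthy ({elapsed_ms:.0f} ms)")
        else:
            print(f"OK: {name} responded in {elapsed_ms:.0f} ms")

if __name__ == "__main__":
    run_synthetic_checks()   # in practice, scheduled every few minutes from several regions
```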

Monday, May 21, 2018

AI for IT Infrastructure Management



AI is being used today for IT infrastructure management. IT infrastructure generates lots of telemetry data from sensors and software that can be used to observe and automate. As IT infrastructure grows in size and complexity, standard monitoring tools do not work well. That's when AI tools are needed to manage the infrastructure.

As in any classical AI system, AI-based IT infrastructure management has five standard steps:

1. Observe: 
Typical IT systems collect billions of data points from thousands of sensors, sampling every 4-5 minutes. I/O pattern data is also collected in parallel and parsed for analysis.

2. Learn:
Telemetry data from each device is modeled along with its global connections, and the system learns the stable and active states of each device and application, as well as its unstable states. Abnormal behavior is identified by learning from the I/O patterns and configurations of each device and application.

3. Predict: 
AI engines learn to predict an issue based on pattern-matching algorithms. Even application performance can be modeled and predicted based on historical workload patterns and configurations (a minimal anomaly-detection sketch follows after this list).

4. Recommend: 
Based on the predictive analytics, recommendations are developed using expert systems. Recommendations describe what constitutes an ideal environment, or what is needed to improve the current condition.

5. Automate: 
IT automation is done via runbook automation tools, which run on behalf of IT administrators, and all details of the event and the automation results are entered into an IT ticketing system.
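
As a minimal illustration of the Learn/Predict steps, the sketch below flags a telemetry reading that deviates strongly (by z-score) from a learned baseline; real AI engines use far richer models, and the sample latencies here are made up:

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag the latest telemetry reading if it deviates strongly from the learned baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return False
    return abs(latest - mu) / sigma > threshold

# Hypothetical I/O latency samples (ms) learned during normal operation
baseline = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1]
print(is_anomalous(baseline, 4.4))   # False - within the normal band
print(is_anomalous(baseline, 9.7))   # True  - predicted issue, raise a ticket
```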

Monday, May 07, 2018

Product Management - Managing SaaS Offerings



If you are the product manager of a SaaS product, then there are additional things you need to do to ensure a successful customer experience: manage the cloud deployment.

Guidelines for choosing the best data center or cloud-based platform for a SaaS offering

1. Run the latest software. 

In the data center or in the IaaS cloud, run the latest versions of all supporting software: OS, hypervisors, security tools, core libraries, etc. Having the latest software stack will help build the most secure ecosystem for your SaaS offerings.

2. Run on the latest hardware. 

Assuming you're running in your own data center, run the SaaS application on the latest servers - such as HPE ProLiant Gen10 servers - to take advantage of the latest Intel Xeon processors. As of mid-2018, use servers running the Xeon E5 v3 or later, or E7 v4 or later. If you use anything older than that, you're not getting the most out of the applications or taking advantage of the hardware chipset.

3. Optimize your infrastructure for best performance.

Choose the VM sizing (vCPU & Memory) for the best software performance. More memory almost always helps. Yes, memory is the lowest hanging of all the low-hanging fruit. You could start out with less memory and add more later with a mouse click. However, the maximum memory available to a virtual server is limited to whatever is in the physical server.

4. Build Application performance monitoring into your SaaS platform

In a cloud, application performance monitoring is vital in determining customer experience. Application performance monitoring has to be from a customer perspective - i.e., how customers experience the software.

This implies constant server, network, and storage performance monitoring, VM monitoring, and application performance monitoring via synthetic transactions.

Application performance also determines the location of cloud services. If customers are on the East Coast, then the servers/data centers should be on the East Coast. Identify where customers are using the software and locate the data centers closer to them to maximize user experience.
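
A rough sketch of how such a placement decision can be made data-driven: given measured (here, made-up) latencies from each customer geography to candidate regions, and the share of users in each geography, pick the region with the lowest user-weighted latency:

```python
# Hypothetical median round-trip latency (ms) from each customer geography
# to each candidate hosting region
LATENCY_MS = {
    "us-east":  {"new-york": 18, "chicago": 32, "london": 80},
    "us-west":  {"new-york": 72, "chicago": 55, "london": 140},
    "eu-west":  {"new-york": 85, "chicago": 98, "london": 12},
}
# Hypothetical share of active users in each geography
USER_SHARE = {"new-york": 0.5, "chicago": 0.3, "london": 0.2}

def best_region():
    def weighted(region):
        return sum(LATENCY_MS[region][geo] * share for geo, share in USER_SHARE.items())
    return min(LATENCY_MS, key=weighted)

print(best_region())  # 'us-east' for this (made-up) customer distribution
```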

5. Build for DR and Redundancy

A SaaS operation must be available 24x7x365, so every component of the SaaS platform must be designed for high availability (multiple redundancy) and active DR. If the SaaS application is hosted on a big name-brand hosting service (AWS, Azure, Google Cloud, etc.), then opt for multi-site resilience with automatic failover.

6. Cloud security

Regardless of your application, you'll need to decide if you'll use your cloud vendor's native security tools or leverage your own for deterrent, preventative, detective and corrective controls. Many, though not all, concerns about security in the cloud are overblown. At the infrastructure level, the cloud is often more secure than private data centers. And because managing security services is complex and error-prone, relying on pre-configured, tested security services available from your cloud vendor may make sense. That said, some applications and their associated data have security requirements that cannot be met exclusively in the cloud. Plus, for applications that need to remain portable between environments, it makes sense to build a portable security stack that provides consistent protection across environments.

Hybrid SaaS Offering

Not all parts of your SaaS application can reside in one cloud. There may be cases where your SaaS app runs on one cloud but pulls data from another. This calls for interconnects between cloud services from various providers.

In such a hybrid environment, one needs to know how the apps communicate and how to optimize that data communication. Latency will be a critical concern, and in such cases one needs to build cloud interconnect services into the solution.

Cloud Interconnect: Speed and security for critical apps

If the SaaS App needs to access multiple cloud locations, you might consider using a cloud interconnect service. This typically offers lower latency and when security is a top priority, cloud interconnect services offer an additional security advantage.

Closing Thoughts

SaaS offerings have several unique requirements and need continuous improvement. Product managers need to make important decisions about how the applications are hosted in the cloud environment and how customers experience them. Making the right decisions is what delivers a successful SaaS offering.

Finally, measure continuously. Measure real-time performance after deployment, examining all relevant factors, such as end-to-end performance, user response time, and individual components. Be ready to make changes if performance drops unexpectedly or if things change. Operating system patches, updates to core applications, workload from other tenants, and even malware infections can suddenly slow down server applications.

Friday, June 02, 2017

Managing Big data with Intelligent Edge



The Internet of Things (IoT) is nothing short of a revolution. Suddenly, vast numbers of intelligent sensors and devices are generating vast amounts of data that contain potentially game-changing information.

In traditional data analytics, all the data is shipped to a central data warehouse for processing in order to derive strategic insights - like other big data projects, tossing large amounts of data of varying types into a data lake to be used later.

Today, most companies are collecting data at the edge of their network: PoS terminals, CCTV, RFID scanners, and so on, with IoT data being churned out in bulk by sensors in factories, warehouses, and other facilities. The volume of data generated at the edge is huge, and transmitting it to a central data center for processing turns out to be very expensive.

The big challenge for IT leaders is to gather insights from this data rapidly, while keeping costs under control and maintaining all security & compliance mandates.

The best way to deal with this huge volume of data is to process it right at the edge - near the point where the data is generated.
 

Advantages of analyzing data at the edge  


To understand this, let's consider a factory. Sensors on a drilling machine that makes engine parts generate hundreds of data points each second. Over time, set patterns emerge in this data. Data showing unusual vibrations, for example, could be an early sign of a manufacturing defect about to happen.

Sending the data across a network to a central data warehouse to be analyzed there is costly and time consuming. By the time the analysis is completed and plant engineers are alerted, several defective engines may already have been manufactured.

In contrast, if this analysis is done right at the site, plant managers can take corrective action before the defect occurs. Thus, processing the data locally at the edge lowers costs while increasing productivity.
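
A minimal sketch of what that edge-side check could look like: a rolling average of vibration readings is compared against a threshold right next to the machine, so engineers are alerted within seconds. The window size and limit are hypothetical:

```python
from collections import deque

WINDOW = 50            # number of recent samples kept at the edge
VIBRATION_LIMIT = 1.8  # hypothetical threshold learned from normal operation

recent = deque(maxlen=WINDOW)

def process_sample(vibration_mm_s: float) -> bool:
    """Runs at the edge, next to the drilling machine; returns True if an alert is raised."""
    recent.append(vibration_mm_s)
    rolling_avg = sum(recent) / len(recent)
    if rolling_avg > VIBRATION_LIMIT:
        # Alert plant engineers immediately instead of waiting for central analysis
        print(f"ALERT: rolling vibration {rolling_avg:.2f} mm/s exceeds limit")
        return True
    return False

# Simulated stream: normal readings followed by an upward drift
for reading in [1.1, 1.2, 1.0, 1.3, 2.6, 2.9, 3.1]:
    process_sample(reading)
```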

Keeping data local also improves security and compliance, as any IoT sensor could potentially be hacked and compromised. If data from a compromised sensor makes its way to the central data warehouse, the entire warehouse could be at risk. Preventing data from traveling across the network keeps malware from wrecking the main data warehouse. If all sensor data is analyzed locally, then only the key results need to be stored in a central warehouse - this reduces the cost of data management and avoids storing useless data.

In the case of banks, the data at the edge could be Personally Identifiable Information (PII), which is bound by several privacy and data compliance laws, particularly in Europe.

In short, analyzing data on the edge - near the point where data is generated is beneficial in many ways:

  • Analysis can be acted on instantly as needed.
  • Security & compliance is enhanced.
  • Costs of data analysis are lowered.


Apart from these obvious advantages, there are several others:

1. Manageability:

It is easy to manage IoT sensors when they are connected to an edge analysis system. The local server that runs data analysis can also be used to keep track of all the sensors, monitor sensor health, and alert administrators if any sensors fail. This helps in handling the wide variety of IoT devices used at the edge.

2. Data governance: 

It is important to know what data is collected, where it is stored, and where it is sent. Sensors also generate lots of useless data that can be compressed or discarded. Having an intelligent analytics system at the edge allows easy data management via data governance policies.

3. Change management: 

IoT sensors and devices also need strong change management (firmware, software, configurations, etc.). Having an intelligent analytics system at the edge enables all change management functions to be offloaded to the edge servers. This frees up central IT systems to do more valuable work.

Closing Thoughts


IoT presents a huge upside in terms of rapid data collection. Having an intelligent analytics system at the edge gives companies a huge advantage - the ability to process this data in real time and take meaningful action.

Particularly in the case of smart manufacturing, smart cities, security-sensitive installations, offices, and branch offices, there is huge value in investing in an intelligent analytics system at the edge.

Conventional business models are being disrupted. Change is spreading across nearly all industries, and organizations must move quickly or risk being left behind by their faster-moving peers. IT leaders should go into the new world of IoT with their eyes open to both the inherent challenges they face and the new horizons that are opening up.

It's no wonder that a large number of companies are already looking at data at the edge.

Hewlett Packard Enterprise makes specialized servers called Edgeline Systems - designed to analyze data at the edge.  

Tuesday, April 18, 2017

20 Basic ITIL Metrics


ITIL breaks major IT functions down into nice bite-sized processes — ripe to be measured with metrics. Here are 20 of our favorite metrics for ITIL processes:


Incident and Problem Management

1. Percentage of Incidents Resolved by First Level Support 
Support costs can be dramatically reduced when first-line support resolves basic issues such as user training, password problems, menu navigation issues, and so on. The target for this metric is often set above 80%.

2. Mean Time to Repair (MTTR) 
Average time to fix an incident. Often the most closely watched ITIL-related metric. It is not unusual for MTTR reporting to go to CxO-level executives (a minimal calculation sketch follows after this group of metrics).

3. Percentage of Incidents Categorized as Problems 
The percentage of incidents that are deemed to be the result of problems.

4. Problems Outstanding 
The total number of problems that are unresolved.
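
As referenced above, here is a minimal sketch showing how the first two metrics can be computed from closed incident records; the ticket data is made up:

```python
from datetime import datetime, timedelta

# Hypothetical closed incident records
incidents = [
    {"opened": datetime(2017, 4, 1, 9, 0),  "resolved": datetime(2017, 4, 1, 10, 30), "level": 1},
    {"opened": datetime(2017, 4, 2, 14, 0), "resolved": datetime(2017, 4, 2, 14, 45), "level": 1},
    {"opened": datetime(2017, 4, 3, 8, 0),  "resolved": datetime(2017, 4, 3, 16, 0),  "level": 2},
]

first_level_pct = 100 * sum(1 for i in incidents if i["level"] == 1) / len(incidents)
mttr = sum((i["resolved"] - i["opened"] for i in incidents), timedelta()) / len(incidents)

print(f"Resolved by first level support: {first_level_pct:.0f}%")   # 67%
print(f"MTTR: {mttr}")                                              # 3:25:00
```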


Service Desk

5. Missed Calls
The number of times someone called the help desk, was put on hold, and eventually hung up. May include the number of times someone called when the help desk was closed. Impacts customer service and core metrics such as MTTR.

6. Customer Satisfaction 
Usually captured in a survey. Try not to go overboard: asking for feedback in an inappropriate way can irritate customers.

7. Staff Turnover 
Service Desk jobs can be stressful — retaining experienced staff is critical to optimizing core ITIL metrics.


Change Management

8. Number of Successful Changes (change throughput) 
Change throughput is a good measure of change management productivity.

9. Percentage of Failed Changes 
A change management quality metric — can impact customer satisfaction and availability management.

10. Change Backlog 
Total number of changes waiting in the queue.

11. Mean RFC (Request for Change) Turnaround Time (MRTT) 
The average time it takes to implement a change after it is requested.


Release Management

12. Percentage of Failed Releases 
The percentage of releases that fail — a key Release Management quality metric.

13. Total Release Downtime (TRD) 
Total downtime due to release activity.

Availability Management

14. Total Downtime
Total downtime, broken down by service.

15. Total SLA Violations 
Number of times that the availability terms laid out in SLAs were violated.

IT Financial Management

16. Percentage of Projects Within Budget 
The percentage of projects that did not exceed their prescribed budget.

17. Total Actual vs Budgeted Costs 
Total actual project costs as a percentage of budgeted project costs, calculated for an entire project portfolio. A number over 100% indicates overspending.


Service Level Management

18. Total SLA violations 
The number of SLA violations in a given period.

19. Mean Time to Resolve SLA Violations 
The average time it takes to restore SLA compliance when a violation occurs.


Configuration Management


20. CI Data Quality
Percentage of CIs with data issues. Can be determined by sampling methods.