Saturday, February 11, 2012

Application Performance Issues in the Cloud



Cloud computing is changing the way IT services are delivered to a global, mobile workforce. The initial promise of cloud computing was ease of service delivery over the Internet and simplified service management, and it was sold to enterprise IT departments as a cost saving.

Companies went ahead with cloud-based service delivery over a federated cloud computing environment, where some applications were hosted on a public cloud, some on a private cloud, and some legacy applications remained accessible over VPN. As companies roll out their cloud strategy, they are seeing new challenges in application management.

IT departments have started to see intermittent performance problems on certain applications, reported by a few employees while others face no problems. IT administrators are facing a myriad of problems, and trouble ticket volumes are increasing with bigger deployments of cloud services. On many occasions this has led to a temporary pause in cloud deployments. IT departments are now finding it hard to meet their internal SLAs, or even to ensure a minimum acceptable quality of service.

The only real way to manage this is to have a performance management system that monitors the performance of all the applications: legacy, private cloud and public cloud.

The only way to effectively meet these customer requirements is to provide a single integrated dashboard that monitors and manages application performance from the user's perspective.

The SLA Challenge - Availability

According to an IDC survey done in 2009, the major concerns for cloud adoption were: Security, Availability & Performance.

The main challenge for cloud adoption is that cloud service providers are not willing to sign up for a standard SLA. Each cloud service provider has its own interpretation of an SLA, and in a federated environment it becomes impossible to collate every SLA and create a baseline SLA for the end users. Even in 2012, these challenges remain.

For example, Amazon's EC2 defines an outage as downtime greater than 5 minutes, and promises 99.95% availability. This could also mean that 99.95% of the time, the outage was less than 5 minutes. This SLA does not give a clear picture of the total outage time per year.
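To make the ambiguity concrete, here is a minimal Python sketch of the arithmetic. It assumes the 99.95% figure is measured over a 365-day year and that only outages longer than the 5-minute threshold count against it, as described above. The function names and the sample outage durations are hypothetical and are not taken from any provider's SLA document.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_budget_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime that an availability percentage still permits."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1.0 - availability_pct / 100.0)


def counted_downtime_minutes(outages_min, threshold_min=5.0):
    """Sum only the outages longer than the threshold.

    This mirrors an SLA that ignores short interruptions.
    """
    return sum(d for d in outages_min if d > threshold_min)


# Downtime allowed per year at 99.95% availability.
budget = downtime_budget_hours(99.95)
print(f"Allowed downtime: {budget:.2f} hours per year")

# Hypothetical outage durations in minutes, for illustration only.
outages = [2.0, 4.0, 4.5, 12.0, 3.0]
print("Experienced by users:", sum(outages), "minutes")
print("Counted by the SLA:", counted_downtime_minutes(outages), "minutes")
```

With these made-up numbers, users experience more than twice the downtime that the SLA would record, which is why the headline percentage alone says little about the total outage time.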

The problem gets compounded when more than one service provider is involved. For example, the Internet Service Provider for the enterprise could have one SLA, and the ISP for the user could have a different SLA. This effectively translates to a total outage time of:

Total Outage = Outage of Public Cloud provider + Outage of ISP1 + Outage of ISP2 + ...

If there are multiple cloud services with interdependencies, the outage time will be the sum of all their outages.
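The sum above can be sketched in a few lines of Python. Adding the outages is a worst case that assumes no two providers are ever down at the same moment. If outages are instead assumed to be independent, the combined availability is the product of the individual availabilities. The provider names and percentages below are hypothetical, chosen only to illustrate the calculation.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def total_outage_hours(outage_hours):
    """Worst-case total outage: the sum of each provider's outage hours.

    Assumes the individual outages never overlap.
    """
    return sum(outage_hours)


def serial_availability_pct(availabilities_pct):
    """Combined availability of a chain of providers that must all be up.

    Assumes the providers fail independently of each other.
    """
    return 100.0 * prod(a / 100.0 for a in availabilities_pct)


# Hypothetical availability figures, for illustration only.
providers = {"public cloud": 99.95, "enterprise ISP": 99.9, "user ISP": 99.5}

# Convert each availability into yearly outage hours, then add them up.
outages = [HOURS_PER_YEAR * (1.0 - a / 100.0) for a in providers.values()]
print(f"Worst-case outage: {total_outage_hours(outages):.2f} hours per year")

combined = serial_availability_pct(providers.values())
print(f"Combined availability: {combined:.4f}%")
```

Either way, the end-to-end figure the user sees is worse than any single provider's promise, and every additional dependency in the chain lowers it further.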

The biggest problem in such an environment is that the user and the IT administrator have no visibility into these outages.

The SLA Challenge - Performance

Performance bottlenecks in the system can effectively stop service delivery. From the user's perspective, poor performance and non-availability of a service are the same. Users will not tolerate slow performance, irrespective of where the problem is. The net impact on the business is a service outage leading to lost productivity.

Ensuring minimum performance standards all across the cloud is almost impossible. The end user experience will vary based on geographic location. Not all Internet connections are equal: sometimes a few zones have network congestion leading to service outages, even when all the applications in the data center and public clouds are up.
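One way to surface such location-specific problems is to compare response times measured from each user location against a threshold. The following Python sketch assumes that response-time samples (in seconds) are already being collected per region. The region names, sample values, threshold and the nearest-rank percentile method are all illustrative assumptions, not part of any particular monitoring product.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def degraded_regions(samples_by_region, threshold_s, pct=95.0):
    """Return the regions whose chosen percentile exceeds the threshold."""
    return sorted(
        region
        for region, samples in samples_by_region.items()
        if samples and percentile(samples, pct) > threshold_s
    )


# Hypothetical response times in seconds, measured from each user region.
samples_by_region = {
    "us-east": [0.4, 0.5, 0.6, 0.5],
    "apac": [0.5, 3.2, 4.1, 0.6],
    "eu": [0.7, 0.8, 0.6, 0.9],
}

print(degraded_regions(samples_by_region, threshold_s=2.0))
```

In this made-up data set the application itself is up everywhere, yet one region would be flagged as degraded, which matches the situation where only a few employees report a problem.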

The unpredictability of service outages and the total lack of insight into the performance of cloud service delivery is the biggest pain for corporate IT services, as illustrated below (an imaginary conversation):

Business user to IT help desk: "I cannot access my corporate inventory reporting system. I see that nothing is wrong with my Internet connection and I can access all other systems. I want to know if the inventory management system is down."

IT help desk to user: "Thanks for calling IT help desk. At the moment, all our indicators show that the inventory management system is working well. Let me log a ticket for this case and we will investigate."

IT help desk to user by email: "We are not able to determine the reason for this problem. We will continue to investigate."

User to IT help desk: "The problem is now resolved. I can access the corporate inventory reporting system. Please close the ticket."

The above illustration depicts the typical problem when mission-critical services are moved to the cloud: users cannot access them, and IT cannot troubleshoot the problem.

Closing Thoughts

Availability, Performance and Security are the three biggest problem areas in the cloud. The threat of data loss and legal challenges due to a security breach is very high, and most users and IT administrators are well aware of it.

Moving to cloud computing will also create additional management challenges, and even with best-of-breed solutions, IT departments will not be able to deal with the availability and performance issues that emerge when cloud services are accessed by a global workforce. Companies may have to buy multi-site cloud services for redundancy as a possible solution, but that increases costs and management overhead.

At the moment, service availability and performance assurance over the cloud is still a challenge, and it will remain one for some time until tools and methods are developed to solve it.
