Tuesday, June 27, 2017

Key Metrics to Measure Cloud Services

As a business user, if you are planning to host your IT workloads on a public cloud and you want to know how to measure the performance of the cloud service, here are seven important metrics you should consider.

1. System Availability

A cloud service must be available 24x7x365. However, there can be downtime for various reasons. System availability is defined as the percentage of time that a service or system is available. For example, 99.9% availability allows about 8.8 hours of downtime per year. Even a downtime of a few hours can potentially cause millions of dollars in losses.

Two nines (99%) allows 3.65 days of downtime per year, which is typical for non-redundant hardware once you include the time to reload the operating system and restore backups (if you have them) after a failure. Three nines (99.9%) is about 8.8 hours of downtime, four nines (99.99%) is about 52 minutes, and the holy grail of five nines (99.999%) is about 5 minutes.
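The "nines" arithmetic above is simple enough to sketch directly - allowed downtime is just the unavailable fraction of a year:

```python
# Downtime allowed per year for a given availability percentage.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_per_year(availability_pct):
    """Return (hours, minutes) of allowed downtime per year."""
    hours = HOURS_PER_YEAR * (1 - availability_pct / 100)
    return hours, hours * 60

for pct in (99.0, 99.9, 99.99, 99.999):
    hours, minutes = downtime_per_year(pct)
    print(f"{pct}% availability -> {hours:.2f} h/year ({minutes:.1f} min/year)")
```

Running this reproduces the figures quoted above: 99% allows 87.6 hours (3.65 days), 99.9% about 8.8 hours, 99.99% about 52.6 minutes, and 99.999% about 5.3 minutes per year.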

2. Reliability - Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR)

Reliability is a function of two components: Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) - i.e., the time taken to fix a problem. In the world of cloud services, MTTR is often defined as the average time required to bring a failed service back into production.

Hardware failure of IT equipment can degrade performance for end users and result in losses to the business. For example, a failed hard drive in a storage system can slow down read speeds - which in turn delays customer response times.

Today, most cloud systems are built with high levels of hardware redundancy - but this increases the cost of the cloud service.
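MTBF and MTTR combine into a standard steady-state availability formula, Availability = MTBF / (MTBF + MTTR). A minimal sketch, using hypothetical failure and repair times:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical example: a component that fails on average every
# 10,000 hours and takes 4 hours to repair.
print(f"{availability(10_000, 4):.4%}")  # 99.9600%
```

Note how strongly MTTR drives the result: halving repair time to 2 hours lifts availability to 99.98%, which is why cloud providers invest in fast automated recovery as much as in failure-resistant hardware.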

3. Response Time

Response time is defined as the time between a workload placing a request on the cloud system and the cloud system completing that request. Response time is heavily dependent on network latency.

Today, if the user and the data center are located in the same region, the average overall response time is 50.35 milliseconds. When the user base and data center are located in different regions, the response time increases significantly, to an average of 401.72 milliseconds.

Response time gives a clear picture of the overall performance of the cloud. It is therefore important to measure response times and understand their impact on application performance and availability - which in turn impacts customer experience.
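From the client side, response time can be measured by timing the round trip of a request. A minimal sketch using a monotonic clock, with any callable standing in for the cloud request:

```python
import time

def timed(request_fn, *args):
    """Run a single request and return (result, elapsed milliseconds)."""
    start = time.perf_counter()   # monotonic, suitable for intervals
    result = request_fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Usage: here a local computation stands in for a remote call.
result, ms = timed(sum, range(1_000_000))
print(f"request completed in {ms:.2f} ms")
```

In practice one would repeat the measurement many times and look at percentiles (p95, p99) rather than a single sample, since network latency varies from request to request.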

4. Throughput or Bandwidth

The performance of cloud services is also measured by throughput, i.e., the number of tasks completed by the cloud service over a specific period. For transaction processing systems, it is normally measured in transactions per second. For systems processing bulk data, such as audio or video servers, it is measured as a data rate (e.g., megabytes per second).

Web server throughput is often expressed as the number of supported users - though clearly this depends on the level of user activity, which is difficult to measure consistently. Alternatively, cloud service providers publish their throughput in terms of bandwidth - e.g., 300 MB/s, 1 GB/s, etc. These bandwidth figures most often exceed the rate of data transfer required by the software application.

In the case of mobile apps or IoT, a very large number of apps or devices may stream data to or from the cloud system. It is therefore important to ensure that there is sufficient bandwidth to support the current user base.
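A quick sanity check on whether bandwidth is sufficient is to multiply the device count by the per-device data rate. A sketch with hypothetical numbers (50,000 sensors, 64 kb/s each):

```python
def required_bandwidth_mbps(devices, kbps_per_device):
    """Aggregate bandwidth (Mb/s) needed if every device streams at once."""
    return devices * kbps_per_device / 1000

need = required_bandwidth_mbps(50_000, 64)
print(f"{need:.0f} Mb/s required")  # 3200 Mb/s
```

The "every device at once" assumption is the worst case; real capacity planning would scale this by the expected fraction of concurrently active devices.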

5. Security

For cloud services, security is often defined as the set of control-based technologies and policies designed to adhere to regulatory compliance rules and to protect the information, data, applications, and infrastructure associated with cloud computing. These processes should also include a business continuity and data backup plan in case of a cloud security breach.

Oftentimes, cloud security is categorized into multiple areas: security standards, access control, data protection (data unavailability and data loss prevention), and network protection against denial-of-service attacks (DoS or DDoS).

6. Capacity

Capacity is the size of the workload compared to the infrastructure available for that workload in the cloud. For example, capacity requirements can be calculated by tracking average utilization of workloads with varying demand over time, and working from the mean to find the capacity needed to handle 95% of all workloads. If the workload grows beyond a point, more capacity must be added - which increases costs.
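The 95% approach above can be sketched as a simple percentile over utilization samples. This uses the nearest-rank method and hypothetical hourly CPU-utilization data:

```python
def p95_capacity(samples):
    """Capacity covering 95% of observed utilization samples
    (nearest-rank percentile)."""
    ordered = sorted(samples)
    rank = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[rank]

# Hypothetical hourly CPU-utilization samples (percent)
samples = [30, 35, 40, 42, 45, 50, 55, 60, 62, 65,
           70, 72, 75, 78, 80, 82, 85, 88, 90, 95]
print(p95_capacity(samples))  # 90
```

Provisioning at the 95th percentile (here, 90% utilization) rather than the peak avoids paying for capacity that is needed only in rare spikes, at the cost of occasionally degraded service during those spikes.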

7. Scalability

Scalability refers to the ability to serve a growing number of users - the degree to which the service or system can support a defined growth scenario.

In cloud systems, scalability is often quoted as supporting tens of thousands, hundreds of thousands, millions, or even more simultaneous users. That means that at full capacity (usually marked as 80%), the system can handle that many users without failing any individual user or crashing as a whole from resource exhaustion. The better an application's scalability, the more users the cloud system can handle simultaneously.
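The 80% convention mentioned above translates directly into planning arithmetic - the usable user count is the headroom fraction of the theoretical maximum. A small sketch with a hypothetical one-million-user ceiling:

```python
def usable_users(max_users, headroom_pct=80):
    """Users the system should serve when planning to 80% of peak capacity."""
    return int(max_users * headroom_pct / 100)

print(usable_users(1_000_000))  # 800000
```

The remaining 20% is the buffer that absorbs traffic spikes while additional capacity is being brought online.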

Closing Thoughts

Cloud service providers often publish their performance metrics - but one needs to dig deeper and understand how these metrics impact the applications running on that cloud.

