Thursday, May 26, 2011

Capacity Management for Cloud Services

When setting up a cloud infrastructure, it is essential to plan for a desired capacity - how many servers, how much storage and how much network bandwidth. Because cloud services must be designed for elastic demand, capacity planning becomes quite tricky.

Fundamentally capacity planning is driven by three major parameters:

1. Economics of Cloud Computing: Clouds must be cheaper than hosted environments
2. Performance Requirements: Hosted applications must meet certain performance targets
3. Ability to meet elastic demands: Ability to take on extra workloads without compromising service quality.

The cloud capacity decision is also influenced by:

1. Security & Reliability Requirements: The DR (Disaster Recovery) strategy has a huge impact on the required capacity. Data security requirements also affect the required capacity.
2. Software Requirements: Depending on the type of software that is deployed on the cloud, the underlying capacity of the infrastructure needs to be determined.
3. Time to deploy additional resources: It takes time to bring additional servers into the cloud. During that time, the existing cloud capacity must be able to handle additional loads. This is also called buffer capacity.
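
As a rough illustration of buffer capacity, the sketch below sizes the headroom needed to cover load growth while new servers are being brought in. The function name, its parameters and the sample figures are hypothetical planning inputs, not values from any real deployment.

```python
import math

def buffer_servers(growth_per_day: float, lead_time_days: float,
                   capacity_per_server: float) -> int:
    """Servers of headroom needed while additional hardware is being deployed.

    All three inputs are hypothetical planning figures:
      growth_per_day      - expected load increase per day (e.g. transactions/minute)
      lead_time_days      - days needed to bring a new server into the cloud
      capacity_per_server - load one server can carry at the target utilization
    """
    extra_load = growth_per_day * lead_time_days
    return math.ceil(extra_load / capacity_per_server)

# Assumed figures: load grows by 200 tpm per day, new servers take 14 days.
print(buffer_servers(growth_per_day=200, lead_time_days=14, capacity_per_server=1000))  # 3
```

The longer the deployment lead time, the larger the buffer that has to sit idle, which is why lead time feeds directly into the capacity decision.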

Capacity Planning: Initial steps

The goal of capacity planning is to ensure that you always have sufficient but not excessive resources to meet customers' needs in a timely fashion.

This requires three main details:

1. Get a measure of what is going on in the cloud. This entails getting information on customers' current workloads.

2. Forecast the future workloads: how many new clients will be added, and what their projected workloads are.

3. Develop a model that will meet the customer requirements - response times or SLAs - at the lowest possible cost.

The actual capacity planning is an iterative process. In each iteration, you need to validate your assumptions with the actual data and then constantly revalidate your model.

For example, if the current load per customer is (say) 5000 transactions per minute, and the customer's estimate for next year is 10000 transactions per minute, then you need to keep validating these projections against the measured load before the next round of capacity planning.
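
A minimal sketch of that validation step, assuming (purely for illustration) that the load is expected to grow in a straight line between today's figure and the customer's estimate. The function names and the measured value of 8250 are hypothetical.

```python
def projected_load(start: float, end: float, months_total: int, month: int) -> float:
    """Planned load at a given month, assuming linear growth (an assumption)."""
    return start + (end - start) * month / months_total

def forecast_error(actual: float, projected: float) -> float:
    """Relative forecast error; positive means the load is ahead of plan."""
    return (actual - projected) / projected

# Figures from the example: 5000 tpm today, 10000 tpm estimated a year out.
plan_month_6 = projected_load(5000, 10000, months_total=12, month=6)
print(plan_month_6)  # 7500.0

# Hypothetical measurement at month 6: 8250 tpm, i.e. 10% ahead of plan.
print(round(forecast_error(actual=8250, projected=plan_month_6), 3))  # 0.1
```

If the error keeps the same sign over several iterations, the model itself needs to be revised rather than just the next data point.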

As this is a model based on human judgment and guesswork, one needs to be cautious in making assumptions. It is not always possible to get a full measure of the current workloads of all customers. Therefore a representative sample of customers must be used for the calculations.

The model has to be verified over a range of real-life conditions before it becomes useful for prediction.

Queue Time & Utilization

All capacity planning is based on queuing theory, which was originally developed by Agner Krarup Erlang in 1909. This theory explains the powerful relationship between resource utilization and response times. The theory was originally used to plan the capacity of telephone switches and continues to be used for designing routers in packet-switched networks. The same theory is applicable to cloud capacity planning as well.

Queues form because resources are limited and demand fluctuates. Suppose there is only one computer and only one workload, so the computer is servicing that workload. If the workload is finished before the next workload arrives, the computer becomes idle - thus unutilized. On the other hand, if a new workload arrives before the current workload is finished, the new workload will have to wait in a queue for its turn. As more workloads arrive, the computer gets busier and busier: the utilization of the resource increases, and the wait time rises more and more steeply as utilization approaches 100%. The classic queue length (response time) versus utilization curve looks like the curve shown in Figure-1.

[Figure-1: Queue length (response time) versus utilization]

Now consider this:

Response Time = Waiting Time (Queuing Time) + Processing Time
and
Waiting Time = Time to Service One Request × Queue Length

Now, consider a case where a SQL query takes 100 milliseconds to complete. If the queue is empty, the query will be serviced in 100 milliseconds. But if there are 40 queries ahead in the queue, it will take 100 x 40 + 100 = 4100 milliseconds: the wait time grows with every request ahead in the queue. If there were two servers sharing the queue, the time drops to 2100 milliseconds; with 4 servers, it drops to 1100 milliseconds. So as you can see, increasing the number of resources causes a dramatic reduction in wait times.
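
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the two formulas given earlier; the function name is made up, and it assumes the queue is split evenly across identical servers.

```python
import math

def response_time_ms(service_ms: int, queue_length: int, servers: int = 1) -> int:
    """Response time = waiting time + processing time.

    Waiting time = time to service one request x queue length, where the
    queue ahead is assumed to be split evenly across identical servers.
    """
    ahead_per_server = math.ceil(queue_length / servers)
    return service_ms * ahead_per_server + service_ms

for n in (1, 2, 4):
    print(n, response_time_ms(100, 40, servers=n))
# 1 4100
# 2 2100
# 4 1100
```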

In reality, computers are not busy serving requests all the time. There will be times when computers are idle, waiting for workloads. Say a computer is busy 98% of the time; adding another server will make both servers busy only 49% of the time - i.e., 51% of the time the servers are idle, consuming power, and the resources are wasted. Adding resources will definitely decrease the wait times, but it also results in wastage. From the utilization curve, we can see that the knee point occurs at about 65% utilization - i.e., pushing utilization above 65% makes wait times climb steeply.
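
To get a feel for the shape of the curve, here is a small sketch that assumes the simplest textbook queue, a single-server M/M/1 model, where the mean response time is the service time divided by (1 - utilization). The article does not say which queuing model the curve is based on, so treat the model choice and the function name as assumptions.

```python
def mm1_response_time(service_time: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho).

    The M/M/1 model is an assumption made for this illustration.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

# Service time of 100 ms, as in the SQL query example.
for rho in (0.49, 0.65, 0.80, 0.90, 0.98):
    print(f"{rho:.0%} busy -> {mm1_response_time(100, rho):.0f} ms")
```

Under this model a 100 ms request takes roughly 196 ms at 49% utilization, about 286 ms at 65%, and about 5000 ms at 98%, which is the trade-off between wasted capacity and wait time described above.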

In a cloud system there are several resources to be considered - CPU, disks, storage, network latency, bandwidth etc. - and there is an associated wait time for each resource. Proper planning of a cloud infrastructure means allocating the right amount of each resource to deliver a good overall customer experience. Also, in a multi-tenant system the customer demands are not in sync - therefore service providers can move resources from one customer to another depending on customer workloads.

Closing Thoughts

Cloud computing is a capacity planner's dream. The beauty of cloud-based resources is that you can dynamically and elastically expand and contract the resources reserved for your use, and you only pay for what you have reserved.

Wednesday, May 25, 2011

BASIC REQUIREMENTS OF A CLOUD COMPUTING SERVICE

Cloud computing is here to rule. Right now, many small and medium enterprises have gone 100% to the cloud. I have seen several startups which are using cloud services for all their computing needs. But large enterprises are reluctant to move to cloud services, and rightly so. Many companies are just testing the waters and have held back on full-scale deployment of cloud IT services.

Today's cloud services have several deficiencies in areas which, from an enterprise perspective, are basic requirements for even considering cloud services. In this article I have written about 6 basic requirements that must be met for enterprises to adopt cloud services in a big way.

1. Availability - with lossless DR

Customers want their IT services to be up and available at all times. But in reality, computers sometimes fail. This implies that the service provider should have implemented a reliable disaster recovery (DR) mechanism - wherein the service provider can move the customer from one data center to another seamlessly and the customer does not even have to know about it.

A cloud service provider is under enormous pressure to minimize costs by optimally utilizing all of its IT infrastructure. The traditional Active-Passive DR strategy is very expensive and cost-inefficient. Instead, service providers will have to create an Active-Active disaster recovery mechanism - where more than one data center is active at all times, so that the data and services can be accessed by the customer from either of the data centers seamlessly.

Today, there are several solutions available to do just that. The EMC VPLEX solution is one way to maintain Active-Active data centers. Another approach is to implement a Hadoop/Hive stack for data-intensive applications such as email, messaging and data store services.

In an ideal scenario, the customer of the cloud service should not notice any change at all, and the movement of all their data & applications from one data center to another must be transparent to the end user.

2. Portability of Data & Applications

Customers hate to be locked into a service or a platform. Ideally a cloud offering must be able to allow customers to move out their data & applications from one service provider to another - just like customers can switch from one telephone service provider to another.

As applications are being written on standard platforms - Java, PHP, Python, etc. - it should be possible to move customer-owned applications from one service provider to another. Customers should also take care to use only open standards and tools, and avoid vendor-specific tools. Azure and Google services offer several tools/applications/utilities which are valuable - but they also create customer lock-in, as the customer who uses these vendor-specific tools cannot migrate to another service provider without rewriting the applications.

To illustrate this: today in India, customers can move from one cell phone service provider to another without changing their handsets, but in the US, if one were to move from AT&T to Verizon, one needs to pay for a new handset - which acts as a customer lock-in instrument.

With public cloud services, customers should be able to move their data & applications from one cloud to another - without disrupting the end user's IT services. This movement should be transparent to the end user.

The Cloud Computing Interoperability Forum (CCIF) was formed by organizations such as Intel, Sun, and Cisco in order to enable a global cloud computing ecosystem whereby organizations are able to seamlessly work together for the purpose of wider industry adoption of cloud computing technology. The development of the Unified Cloud Interface (UCI) by CCIF aims at creating a standard programmatic point of access to an entire cloud infrastructure.

Recently, at EMC World 2011, EMC demonstrated moving several active VMs & applications from an EMC data center to a CSC data center without disruption of service. This was just a proof of concept, but to make this commonplace, some amount of regulation and business coordination will be required.

However, in their current form, most cloud computing services and platforms do not employ standard methods of storing user data and applications. Consequently, they do not interoperate and user data is not portable.

3. Data Security

Security is the key concern for all customers - since the applications and the data reside in the public cloud, it is the responsibility of the service provider to provide adequate security. In my opinion, security for customer data/applications becomes a key differentiator when it comes to selecting the cloud service provider. When it comes to IT security, customers tend to view cloud service providers the way they view banks. The service provider carries the primary responsibility for user security, but there are certain responsibilities that the customer also needs to take on.

The service provider must have a robust Information Security Risk Management process which is well understood by the customer, and the customer must clearly know their own responsibilities as well. As there are several types of cloud offerings (SaaS, PaaS, IaaS etc.), there will be different sets of responsibilities for the customer and the service provider depending on the cloud service offering.

When it comes to security, cloud service providers can offer better security than the customer's own data center. This is akin to banks - where banks can offer far greater security than any individual or company. Security in the cloud is much higher due to: centralized monitoring, enhanced incident detection/forensics, logging of all activity, greater security/vulnerability testing, centralized authentication testing (aka password protection/assurance), secure builds & testing of patches before deployment, and lastly better security software/systems.

Cloud service providers know that security is the key to their success - and hence invest more in security. The amount of effort/money invested by cloud service providers will be greater than the amount most individual companies can spend.

Security issues will also be addressed through legal & regulatory systems. Despite the best IT security, breaches can happen, and when they happen, the laws and rules of the land where the data resides play an important role. For example, specific cryptography techniques cannot be used in some countries because they are not permitted there. Similarly, country laws can require that sensitive data, such as patient health records, be stored within national borders. Therefore customers need to pay attention to legal and regulatory issues when selecting service providers.

4. Manageability

Management of the cloud infrastructure, from the customer's perspective, must be under the control of the customer's admin. Customers of cloud services must be able to create new accounts, provision various services, and do all the user account monitoring - monitoring for end user usage, SLA breaches, data usage etc. The end users would like to see the availability, performance and configuration/provisioning data for the set of infrastructure they are using in the cloud.

Cloud service providers will have various management tools for availability management, performance management, configuration management and security management of applications and infrastructure (storage, servers, and network). Customers want to know how the entire infrastructure is being managed - and, if possible, have that management information shared with them, along with alerts on any outage, slow service, or breach of SLA as it happens. This allows customers to take corrective action - either move the applications to another cloud or enable their contingency plans.

Sharing the application performance and resource management information will help improve utilization and consequently optimize usage by customers. This will improve ROI for the customers and encourage them to adopt cloud services.

As customers buy cloud services from multiple vendors, it will become a necessity to have a unified management system to manage all the cloud services they have. This implies that cloud service providers must embrace XML-based reporting formats to provide management information to customers, and customers can then build their own management dashboards.
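
To make the idea concrete, here is a minimal sketch of how a customer-side dashboard might read such a report. There is no standard report format that I know of, so the XML layout, the element and attribute names, and the function name below are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout; element and attribute names are illustrative only.
REPORT = """\
<report provider="example-cloud" period="2011-05">
  <service name="email" availability="99.95" sla="99.9"/>
  <service name="erp" availability="99.2" sla="99.5"/>
</report>
"""

def sla_breaches(xml_text: str) -> list[str]:
    """Return the names of services whose measured availability is below the SLA."""
    root = ET.fromstring(xml_text)
    return [
        svc.get("name")
        for svc in root.iter("service")
        if float(svc.get("availability")) < float(svc.get("sla"))
    ]

print(sla_breaches(REPORT))  # ['erp']
```

If every provider published reports in one agreed format, the same few lines could feed a single dashboard covering all the clouds a customer uses.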

5. Elasticity

Customers on cloud computing have dynamic computing loads. At times of high load, they need a greater amount of computing resources available to them on demand, and when the workloads are low, the computing resources are released back to the cloud pool. Customers expect the service provider to charge them only for what they have actually used in the process.

Customers also want a self-service, on-demand resource provisioning capability from the service provider. This feature enables users to directly obtain services from clouds, such as spawning the creation of a server and tailoring its software, configurations, and security policies, without interacting with a human system administrator. This eliminates the need for the more time-consuming, labor-intensive, human-driven procurement processes familiar to many in IT.

This implies that the dynamic provisioning system should be the basic part of cloud management software - through which users can easily interact with the system.

To provide elastic computing resources, the service provider must be able to dynamically provision resources as needed and have adequate chargeback systems to bill the customer.
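
A simplified sketch of what dynamic provisioning plus chargeback could look like is given below. The function names, the per-server capacity, the hourly rate and the 65% utilization target are assumptions chosen for illustration; a real provider's provisioning and billing rules will differ.

```python
import math

def servers_needed(load: float, capacity_per_server: float,
                   target_utilization: float = 0.65) -> int:
    """Smallest server count that keeps utilization at or below the target."""
    return max(1, math.ceil(load / (capacity_per_server * target_utilization)))

def chargeback(hourly_loads: list[float], capacity_per_server: float,
               rate_per_server_hour: float) -> float:
    """Bill only for the server-hours provisioned in each hour."""
    server_hours = sum(servers_needed(load, capacity_per_server) for load in hourly_loads)
    return server_hours * rate_per_server_hour

# Hypothetical hourly loads (transactions per minute) with a peak in the middle.
loads = [1000, 1000, 6000, 9000, 2000]
print([servers_needed(load, 1000) for load in loads])
print(chargeback(loads, 1000, rate_per_server_hour=0.10))
```

The customer is billed for the servers held in each hour, so the bill follows the load up and back down instead of being fixed at peak capacity.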

In reality, it may not be possible for any single cloud service provider to build an infinitely scalable infrastructure, and hence customers will have to rely on a federated system of multiple cloud service providers sharing the customer loads (just like a power grid, where the load gets distributed to other power plants during peak loads).

6. Federated System

There are several reasons why customers will need a federated cloud system. Customers may have to buy services from several cloud service providers - email from Google, online sales transaction services from Amazon, ERP from another vendor etc. In such cases customers want their cloud applications to interact with services from several vendors to provide seamless end-to-end IT services.

This implies that each of the cloud services must have an interface with other cloud services for load sharing & application interoperability.

In a federated environment there is potentially an infinite pool of resources. To build such a system, there should be inter-cloud framework agreements between multiple service providers, and adequate chargeback systems in place.

Having a federated system helps customers to move their data/applications across different cloud service providers and prevents customer lock-in.

The need for interoperability of applications across different cloud services has led to the creation of standard APIs. But these APIs are cumbersome to use, and that has led to the creation of the Cloud Integration Bus - based on the Enterprise Service Bus (ESB).

As of today, the integration issues are still being worked out, and there are no universal standards for interoperability between different cloud applications.

Closing Thoughts

Cloud services are still in their infancy, and if cloud services are to attract large enterprise customers, then providers need to do a lot more than they do today to address data/application portability, federated scalable systems, complete end-to-end interoperability and security issues.

Watch this space as I will write more about cloud computing from business and management point of view.