Sunday, December 25, 2011

Challenges in Operations Management of Virtual Infrastructure

Corporate IT has embraced virtualization as a means to save costs, modernize infrastructure, and offer a greater range of services. Virtualization consolidated resources and workloads, which led to productivity gains, improved efficiency of applications and IT infrastructure, and reduced operating costs.

Virtualization broke the traditional silos of dedicated computing resources for specific applications, and it also broke the silos of operations management by forcing IT administrators to look at servers (compute), networks, and storage as a unified resource pool.

This has created new operations management challenges that cannot be solved by traditional approaches. The new challenges fall mostly into the following areas:

  1. Virtualization eliminated the IT silo boundaries between applications, network, compute, and storage. This has made the IT stack more sensitive to changes in any one of its components. For example, changing a network setting could have an adverse impact on applications or on data store speeds, and a change in storage configuration could leave applications unresponsive.

  2. Virtualization can easily push resource utilization beyond the safe operating boundaries, causing random performance issues or even random hardware failures.

  3. Applications running in a virtualized environment see dynamic changes in the resources available to them. As one application starts to consume more resources, the other applications see a corresponding reduction in the resources available to them, as sketched below. This causes random performance variations and in many cases disrupts entire business operations.
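To make point 3 concrete, here is a minimal Python sketch of proportional sharing under contention. The host capacity, VM names, and demand figures are all hypothetical; a real tool would read them from the hypervisor:

    # Hypothetical host capacity and VM demands. When total demand exceeds
    # capacity, each VM's effective share shrinks proportionally.

    HOST_CPU_CORES = 16.0  # assumed physical capacity of the host

    def effective_shares(demands, capacity=HOST_CPU_CORES):
        """Scale demands down proportionally when the host is oversubscribed."""
        total = sum(demands.values())
        if total <= capacity:
            return dict(demands)  # everyone gets what they asked for
        factor = capacity / total
        return {vm: round(d * factor, 2) for vm, d in demands.items()}

    # "app-b" ramps from 4 to 12 cores; "app-a" never changes, yet loses capacity.
    for app_b in (4.0, 8.0, 12.0):
        print(app_b, effective_shares({"app-a": 6.0, "app-b": app_b, "app-c": 4.0}))

Notice that app-a's share drops from 6.0 to about 4.4 cores without any change in its own behavior; that is exactly the kind of random performance variation described above.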

Managing virtualized infrastructure needs new tools and technologies to handle these new factors of complexity. Given the dynamic nature of a virtualized IT infrastructure, the new management tools must be scalable, unified, automated, proactive, and user friendly.

It is also very important to ensure that the cost of virtual infrastructure management tools stays lower than the cost of the failures they prevent. Though this sounds simple, in reality the cost of infrastructure management can rise sky-high, so one needs to be cautious in choosing the right set of tools.

Traditional Operations Management

Ever since the beginning of IT operations, management of the IT infrastructure has been organized around resource silos, with a dedicated team to manage each of:

1. Servers - Physical machines & Operating systems
2. Network - LAN, VLAN, WAN
3. Storage - SAN, NAS, Storage Arrays
4. Applications - CRM, ERP, database, Exchange, security, etc. In large organizations there are teams to manage each application type.

Each of the resource management silos operated independently and ran its own operations management cycle: monitor, analyze, control, and change resources. Each team had its own set of tools, processes, and procedures to manage the resources under its purview.

Since each group had little idea of the needs and requirements of the other groups, they often built excess capacity to handle growing business needs and peak loads.

This silo-based approach led to inefficiencies and waste. Virtualization eliminates such waste and improves efficiency.

Virtualization Disrupted Operations Management

Virtualization is a game changer for operations management. It eliminates the boundaries between the compute, storage, and network resource silos and views the entire set of IT resources as a single pool.

The hypervisor carves the physical resources into virtual machines (VMs) that can process workloads. This resource-sharing architecture dramatically improves resource utilization and allows flexible scaling of workloads and of the resources available to those workloads.

Virtualization creates new operations management challenges:

1. Virtual machines share the physical resources, so when one VM increases its resource usage, it impacts the performance of applications running on other VMs that share the same resource. This interference can be random and sporadic, leading to complex performance management challenges.

2. The hypervisor has an abstract view of the real physical infrastructure. Often the real capacity of the underlying infrastructure is not what the hypervisor sees; as a result, when new VMs are added, resources become under-provisioned and major performance bottlenecks appear.

3. The hypervisor allows consolidation of workload streams to achieve higher resource utilization. But if the workloads are correlated, i.e., an increase in one workload creates a corresponding increase in another, then their peaks compound and the system runs out of resources and/or develops enormous bottlenecks (see the sketch after this list).

4. VMs need dynamic resource allocation in order for their applications to meet performance and SLA requirements. This dynamic resource allocation requires active, automatic resource management.

5. Because the hypervisor has an abstract view of the real physical infrastructure, configuration management appears deceptively simple at the hypervisor layer. In reality, configuration changes have to be coordinated across the different resource types (compute, network, storage).

6. Virtualization removes the silo boundaries across the resource types (compute, network, and storage). This creates cross-element interference in applications, so when an application fails to respond, the root cause of the failure cannot be easily identified.
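The correlated-workload problem in point 3 is easy to see in a few lines of Python. This sketch uses hypothetical utilization samples; in practice they would come from a monitoring system. When two workloads rise and fall together, their combined peak is far worse than either peak alone:

    # Hypothetical utilization samples (%) for two workloads over time.
    from statistics import mean

    def pearson(xs, ys):
        """Pearson correlation of two equal-length series."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy)

    web = [20, 40, 80, 95, 60, 30]   # web tier utilization over the day
    db  = [15, 35, 75, 90, 55, 25]   # database rises and falls with it

    print("correlation:", round(pearson(web, db), 2))
    print("combined peak:", max(w + d for w, d in zip(web, db)), "%")
    # A high correlation plus a combined peak near or over 100% means these
    # two workloads should not be consolidated onto the same resources.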

Virtualization creates a new set of operations management challenges, but solving them will produce seamless, cross-domain management solutions that reduce costs by automating various management functions and eliminating the costly cross-silo coordination between teams. Managing a virtualized infrastructure will need automated solutions that reduce the need for today's labor-intensive management systems.

Virtualization and Utilization

The greatest benefit of virtualization is resource optimization. IT administrators were able to retire old, inefficient servers and move their applications to virtualized servers running on newer hardware. This optimization helped administrators reduce operating costs, reduce energy use, and increase the utilization of existing hardware.

The cost savings achieved by server consolidation and higher resource utilization were a prime driver for virtualization. The old problem of over-provisioning had led to low server utilization; with virtualization, utilization can be raised to as high as 80%.

While the higher utilization rate may sound exciting, it also creates major performance problems.

Virtualization consolidates multiple workloads on a single physical server, increasing that server's utilization. But workloads are never stable; they have their peaks and lows. If one or more workloads hits a peak, utilization can quickly reach 100% and create gridlock for the other workloads, adversely affecting performance. Severe congestion can lead to data loss and even hardware failures.

For example, virtual machines typically use a virtual network: virtual network interfaces, subnets, and bridging software that maps the virtual interfaces to the physical interfaces. If a server has a limited number of physical network interfaces, running multiple VMs with network-intensive applications can easily choke those interfaces and cause massive congestion in the system. Similar congestion can occur with CPU, memory, or storage I/O resources as well.
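A minimal sketch of this oversubscription check, with hypothetical bandwidth figures; a real check would read them from the hypervisor's performance counters:

    # Assumed uplink capacity of the host and per-VM network demand (Gbps).
    PHYSICAL_NIC_GBPS = 10.0

    vm_network_demand_gbps = {
        "web-01": 2.5,
        "backup-02": 6.0,   # a backup job can nearly saturate the link alone
        "etl-03": 3.5,
    }

    total = sum(vm_network_demand_gbps.values())
    if total > PHYSICAL_NIC_GBPS:
        over = total - PHYSICAL_NIC_GBPS
        print(f"Uplink oversubscribed by {over:.1f} Gbps; "
              f"expect congestion for every VM on this host.")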

These resource congestion problems can be intermittent and random, which makes the contention issues even harder to debug and solve.

To solve these performance problems, one first has to find the bottleneck behind each of them.

In a virtualized environment, finding these performance bottlenecks is a big challenge because the symptoms of congestion can show up in one area while the real congestion is somewhere else.

In the non-virtualized world, resource allocation was done in silos, and each silo had to accommodate all fluctuations in its workloads. This led to excess capacity, since each silo planned for peak workloads, so performance management was never a major issue. With virtualization, active performance management is critical: the virtual infrastructure must be constantly monitored for performance, and corrective actions must be taken as needed, by moving VMs from a loaded server to a lightly loaded one (sketched below) or by dynamically provisioning additional resources to absorb the peaks.
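Here is a minimal sketch of that corrective action: pick the most loaded host and move its smallest adequate VM to the least loaded host. Host names, loads, and the 85% threshold are all assumptions; a real tool would drive the hypervisor's migration API:

    # Hypothetical hosts, each with a utilization figure and per-VM loads
    # expressed as a fraction of host capacity.
    hosts = {
        "esx-01": {"util": 0.92, "vms": {"crm": 0.30, "mail": 0.25, "batch": 0.37}},
        "esx-02": {"util": 0.35, "vms": {"dev": 0.35}},
    }

    def pick_migration(hosts, high=0.85):
        """Suggest one VM migration that brings the hottest host under the threshold."""
        hot = max(hosts, key=lambda h: hosts[h]["util"])
        cool = min(hosts, key=lambda h: hosts[h]["util"])
        if hot == cool or hosts[hot]["util"] < high:
            return None  # nothing to do
        excess = hosts[hot]["util"] - high
        # Move the smallest VM that covers the excess load.
        for vm, load in sorted(hosts[hot]["vms"].items(), key=lambda kv: kv[1]):
            if load >= excess:
                return vm, hot, cool
        return None

    print(pick_migration(hosts))  # ('mail', 'esx-01', 'esx-02')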

Dynamic provisioning requires a deeper understanding of resource utilization: which application consumes which resource, and when those resources are being used. To understand this better, consider this example:

In an enterprise there are several workloads, but a few have marked peak behavior. The sales system has peak demand between 6 PM and 9 PM, the HR system between 1 PM and 5 PM, and the inventory management (ERP) system between 9 AM and 1 PM. On further analysis, it is found that the sales system's peak demand falls on the network and storage IOPS, the ERP system's on servers and storage, and the HR system's on servers and storage IOPS.

Knowing this level of detail helps system administrators provision additional VMs for ERP between 9 AM and 1 PM by borrowing the VMs allocated to HR, and then move those VMs back to HR between 1 PM and 5 PM.
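A minimal sketch of how such a time-window swap could be expressed. The peak windows follow the example above; everything else is an assumption:

    # Peak windows per system, as (start_hour, end_hour) in 24h time.
    PEAK_WINDOWS = {
        "inventory/erp": (9, 13),
        "hr": (13, 17),
        "sales": (18, 21),
    }

    def donors_for(hour):
        """Systems outside their peak window can lend VMs to the one inside it."""
        active = [s for s, (a, b) in PEAK_WINDOWS.items() if a <= hour < b]
        idle = [s for s in PEAK_WINDOWS if s not in active]
        return active, idle

    for hour in (10, 14, 19):
        active, idle = donors_for(hour)
        print(f"{hour:02d}:00 peak={active} can borrow capacity from {idle}")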

Solving the sales peak load problem may require additional networking hardware and more bandwidth, which will result in lower utilization. But it is better to have excess capacity sitting idle during off-peak times than to have performance bottlenecks.

There are more complex cases as well. Suppose the HR system issues many random writes while the sales system issues a series of sequential reads. The sales application will then see delays or performance degradation even though both workloads are normal: the SAN network gets choked with writes from the HR system, but the performance problem gets reported by the sales application's administrator.

Resolving such correlated workload performance issues requires special tools that provide deeper insight into the system. Essentially, IT administrators must be able to map each application to the resources it uses and then monitor the entire data path for performance management.
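One way to sketch such a mapping: record each application's data path end to end, then scan every element on that path for congestion. All names and utilization figures here are hypothetical:

    # Each application's data path, from VM through host and SAN to the LUN.
    DATA_PATHS = {
        "sales-app": ["vm-sales", "esx-01", "san-switch-a", "array-lun-7"],
        "hr-app":    ["vm-hr",    "esx-02", "san-switch-a", "array-lun-7"],
    }

    # Latest utilization per element, as a monitoring system might report it.
    UTILIZATION = {
        "vm-sales": 0.40, "esx-01": 0.55, "san-switch-a": 0.97, "array-lun-7": 0.60,
        "vm-hr": 0.70, "esx-02": 0.45,
    }

    def suspect_elements(app, threshold=0.90):
        """Walk the app's data path and flag any element over the threshold."""
        return [e for e in DATA_PATHS[app] if UTILIZATION.get(e, 0) > threshold]

    # The sales admin sees slowness, but the hot spot is the shared SAN switch.
    print(suspect_elements("sales-app"))  # ['san-switch-a']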

Fundamental Operations Management Issues with Virtualization

Virtualization creates several fundamental systems management problems. These are new problems, and they cannot be solved by silo-based management tools.

  1. Fragmented Configuration Management

    Configuration and provisioning tools are still silo based: there are separate tools for server, network, and storage configuration. This has left organizations with fragmented configuration management that is neither dynamic nor fast enough to meet the demands of virtualization.

  2. Lack of Scalability in Monitoring Tools

    Fault and performance monitoring tools are also still silo based, and as the infrastructure gets virtualized, the number of virtual entities increases dramatically. The set of virtual entities is also dynamic and varies with time. Silo-based, per-domain management tools are intrinsically non-scalable for such a system.

  3. Hardware Faults Due to High Utilization

    Virtualization leads to higher resource utilization, which often stresses the underlying hardware beyond its safe operating limits and eventually causes hardware failures. Such sustained high utilization is not detected by current monitoring systems, so administrators are forced into breakdown repairs (see the sketch after this list).

  4. Hypervisor Complexities

    A typical virtualization environment will have multiple virtualization solutions: VMware, Xen, etc. The hypervisor mechanisms themselves create management problems (see: http://communities.vmware.com/docs/DOC-4960 ). A multi-vendor approach to virtualization further increases hypervisor management complexity.

  5. Ambiguity

    Performance issues arising in a virtualized environment are often ambiguous: a fault or bottleneck seen in one system may have its root cause in another. This makes complete cross-domain (compute, network, storage) management tools mandatory for finding the root cause.

  6. Interference

    VM workloads share resources. As a result, increasing one workload can interfere with the performance of another. These interference problems are very difficult to identify and manage.
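Returning to point 3 above, here is a minimal sketch of a sustained-utilization alarm that flags hardware running hot long enough to risk failure, before it breaks. The threshold and window size are assumptions:

    from collections import deque

    class SustainedLoadAlarm:
        """Fires when every sample in the window is above the unsafe threshold."""
        def __init__(self, threshold=0.90, window=12):
            self.threshold = threshold            # utilization considered unsafe
            self.samples = deque(maxlen=window)   # e.g. 12 x 5-minute samples

        def observe(self, utilization):
            self.samples.append(utilization)
            full = len(self.samples) == self.samples.maxlen
            return full and min(self.samples) >= self.threshold

    alarm = SustainedLoadAlarm(threshold=0.90, window=4)
    for u in (0.95, 0.93, 0.97, 0.94):
        if alarm.observe(u):
            print("Host has run above 90% for the whole window; schedule relief.")

A spot check that only looks at the latest sample would miss this pattern; it is the duration of the stress, not any single reading, that wears the hardware out.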

Closing Thoughts

Virtualization is a great way to save costs and improve resource utilization. However, it may require fundamentally changing the way the IT infrastructure is managed: new workflows will have to be developed and new management tools will be needed.
