Magic Effect of SIOC (VMware Storage Input Output Control) in vSphere 5 : A practical study.
Recently I got a chance to implement and test the Enterprise Plus feature in vSphere called SIOC. It is really an amazing feature…..
Application performance can be impacted when servers contend for I/O resources in a shared storage environment. There is a crucial need for isolating the performance of critical applications from other, less critical workloads by appropriately prioritizing access to shared I/O resources. Storage I/O Control (SIOC), a new feature offered in VMware vSphere 4 & 5, provides a dynamic control mechanism for managing I/O resources across VMs in a cluster.
Datacenters based on VMware’s virtualization products often employ a shared storage infrastructure to service clusters of vSphere hosts. NFS, ISCSI, and Storage area networks (SANs) expose logical storage devices (LUNs) that can be shared across a cluster of ESX hosts. Consolidating VMs’ virtual disks onto a single VMFS datastore, or NFS datastore, backed by a higher number of disks has several advantages—ease of management, better resource utilization, and higher performance (when storage is not bottlenecked).
With vSphere 4.1 we can only use SIOC with FC/ISCSI datastores, but with vSphere 5 we can use NFS.
However, there are instances when a higher than expected number of I/O-intensive VMs that share the same storage device become active at the same time. During this period of peak load, VMs contend with each other for storage resources. In such situations, lack of a control mechanism can lead to performance degradation of the VMs running critical workloads as they compete for storage resources with VMs running less critical workloads.
Storage I/O Control (SIOC), provides a fine-grained storage control mechanism by dynamically allocating portions of hosts’ I/O queues to VMs running on the vSphere hosts based on shares assigned to the VMs. Using SIOC, vSphere administrators can mitigate the performance loss of critical workloads during peak load periods by setting higher I/O priority (by means of disk shares) to those VMs running them. Setting I/O priorities for VMs results in better performance during periods of congestion.
So What is the Advantage for an VM administrator/Organization? Now a days, with vSphere 5 we can have 64TB of single LUN and with high end SAN with FLASH we are achieving more consolidation. That’s great !! but when we do this…. just like the CPU/MEMORY resource pools ensure the Computing SLA, the SIOC ensures the virtual disk STORAGE SLA and its response time.
In short, below are the benefits;
— SIOC prioritizes VMs’ access to shared I/O resources based on disk shares assigned to them. During the periods of I/O congestion, VMs are allowed to use only a fraction of the shared I/O resources in proportion to their relative priority, which is determined by the disk shares.
— If the VMs do not fully utilize their portion of the allocated I/O resources on a shared datastore, SIOC redistributes the unutilized resources to those VMs that need them in proportion to VMs’ disk shares. This results in a fair allocation of storage resources without any loss in their utilization.
— SIOC minimizes the fluctuations in performance of a critical workload during periods of I/O congestion, as much as a 26% performance benefit compared to that in an unmanaged scenario.
How Storage I/O Control Works
SIOC monitors the latency of I/Os to datastores at each ESX host sharing that device. When the average normalized datastore latency exceeds a set threshold (30ms by default), the datastore is considered to be congested, and SIOC kicks in to distribute the available storage resources to virtual machines in proportion to their shares. This is to ensure that low-priority workloads do not monopolize or reduce I/O bandwidth for high-priority workloads. SIOC accomplishes this by throttling back the storage access of the low-priority virtual machines by reducing the number of I/O queue slots available to them. Depending on the mix of virtual machines running on each ESX server and the relative I/O shares they have, SIOC may need to reduce the number of device queue slots that are available on a given ESX server.
It is important to understand the way queuing works in the VMware virtualized storage stack to have a clear understanding of how SIOC functions. SIOC leverages the existing host device queue to control I/O prioritization. Prior to vSphere 4.1, the ESX server device queues were static and virtual-machine storage access was controlled within the context of the storage traffic on a single ESX server host. With vSphere 4.1 & 5, SIOC provides datastore-wide disk scheduling that responds to congestion at the array, not just on the hostside HBA.. This provides an ability to monitor and dynamically modify the size of the device queues of each ESX server based on storage traffic and the priorities of all the virtual machines accessing the shared datastore. An example of a local host-level disk scheduler is as follows:
Figure 1 shows the local scheduler influencing ESX host-level prioritization as two virtual machines are running on the same ESX server with a single virtual disk on each.
Figure 1. I/O Shares for Two Virtual Machines on a Single ESX Server (Host-Level Disk Scheduler)
In the case in which I/O shares for the virtual disks (VMDKs) of each of those virtual machines are set to different values, it is the local scheduler that prioritizes the I/O traffic only in case the local HBA becomes congested. This described host-level capability has existed for several years in ESX Server prior to vSphere 4.1 & 5. It is this local-host level disk scheduler that also enforces the limits set for a given virtual-machine disk. If a limit is set for a given VMDK, the I/O will be controlled by the local disk scheduler so as to not exceed the defined amount of I/O per second.
vSphere 4.1 onwards it has added two key capabilities: (1) the enforcement of I/O prioritization across all ESX servers that share a common datastore, and (2) detection of array-side bottlenecks. These are accomplished by way of a datastore-wide distributed disk scheduler that uses I/O shares per virtual machine to determine whether device queues need to be throttled back on a given ESX server to allow a higher-priority workload to get better performance.
The datastore-wide disk scheduler totals up the disk shares for all the VMDKs that a virtual machine has on the given datastore. The scheduler then calculates what percentage of the shares the virtual machine has compared to the total number of shares of all the virtual machines running on the datastore. As described before, SIOC engages only after a certain device-level latency is detected on the datastore. Once engaged, it begins to assign fewer I/O queue slots to virtual machines with lower shares and more I/O queue slots to virtual machines with higher shares. It throttles back the I/O for the lower-priority virtual machines, those with fewer shares, in exchange for the higher-priority virtual machines getting more access to issue I/O traffic.
However, it is important to understand that the maximum number of I/O queue slots that can be used by the virtual machines on a given host cannot exceed the maximum device-queue depth for the device queue
of that ESX host.
What are the conditions required for the SIOC to work ?
— A large datastore, and lot of VMs in it and this datastore is shared between multiple ESX hosts.
— Need to set disk shares, for the vm’s inside the datastore.
— Based on the LUN/NFS type (made of SSD,SAS,SATA), select the SIOC threshold, and enable SIOC.
— The SIOC will monitor the datastore IOPS usage, latency of VM’s and monitor overall STORAGE Array performance/Latency.
SIOC don’t check the below;
How much Latency created by other physical systems to the array, that is the Array is shared for the PHYSICAL hosts also, and other backup applications/jobs etc. So it gives false ALARMS like “VMware vCenter – Alarm Non-VI workload detected on the datastore] An unmanaged I/O workload is detected on a SIOC-enabled datastore”
Tricky Question – Will it work for one ESXi host ? Simple answer – SIOC is for multiple ESXi hosts, Datastore is exposed to many hosts and in that Datastore contains a lot of VM’s. For single host also we can enable the SIOC if we have the license, but it is not intended for this CASE.
So what is the solution for a single ESXi host ? – Simple just use shares for the VMDK and the ESXi host use Host-Level Disk Scheduler and ensures the DISK SLA.
So is there any real advantage of the SIOC in Real time scenario ? the below experience of me PROVES this..
My scenario: I have 4 ESXi 5 hosts in the cluster, and many FC LUNS from the HP 3PAR storage arrays is mounted. In one or the LUN we have put around 16 VMs and this includes our vCenter and it Database, and inside that Datastore we have a less critical VM (Analytics engine, which is from the vCOPS appliance) and this consumes a lot of IOPS from the storage and eventually effects other critical VM. We enabled the SIOC and monitored for 10 days then we compared the performance of the VMDK and its LATENCY during the PEAK hours.
Below is the Datastore with many VMS and it is shared across 4 HOSTS.
With SIOC disabled.
With SIOC enabled.
So what is the Take away with SIOC ?
Better I/O response time, reduced latency for critical VMS and ensured disk SLA for critical VM’s. So bottom line is the crtical VM’s wont get affected during the event of storage contention and in highly consolidated environment.