VMware I/O queues, “micro-bursting”, and multipathing
Was in Singapore this last week and was talking with the VMware SEs and Cisco SEs there – sharing best practices, tools, and “dos and don’ts”. There was an interesting discussion/whiteboard around the topic of storage network design around FC/FCoE (though this applies to iSCSI as well). The Cisco folks made some really interesting analogies with VoIP and TP (Telepresence) “micro-bursting” that I thought were awesome and I wanted to share.
It’s also the reason why there is so much discussion around LUN queues and queue management in vSphere. It’s not just EMC with PowerPath/VE, but I’ve heard 3Par start to talk about “adaptive queuing”, and last week Dell/EqualLogic announced their own addition to the Pluggable Storage Architecture. It’s all about the vStorage APIs, baby 🙂
It’s apropos based on the recent discussion over queuing and comparisons of NFS and block-based storage options here.
Lastly, it’s also a topic of discussion based on a recent post here about whether MPPs like PowerPath/VE are there to help “legacy” arrays and whether “virtualized” arrays get any benefit from more host-side sophistication. I’ve also heard the implication that NMP + Round Robin + ALUA are good enough in all cases – if you choose the right array 🙂
Vaughn and I tend to agree a LOT – and we do, much more than we disagree. We both are super into VMware, which means there’s a lot of things we share. But there are times when we disagree – and this is one of them.
But this isn’t about EMC or NetApp, and isn’t personal – it’s about core technical fundamentals, and they apply to all storage vendors. If you want to understand and learn more – read on!
Ok – like many Virtual Geek posts – this starts with fundamentals, and will go deep. I know this makes these long, hard slogs, but it’s how I learn… (feedback/critiques welcome!). Also, it’s really important to know – we’re talking about stuff that will NOT affect most customers. As a general principle, I’m a big believer of “start simple, keep it as simple as you can, but understand enough that you know what you need when it gets complicated”.
This is just stuff good to know so that you can diagnose issues, and determine truth from reality in the era of info overload from a bazillion sources based on your own understanding.
Let’s follow a block I/O from a VM through to the back-end disk in the shared storage array.
- The vSCSI queues are the SCSI queues internal to the guest OS. They exist – just like in a physical host. This is why with very large I/O VMs – a recommendation to have multiple vSCSI adapters shows up in the VMware/EMC Reference Architectures for larger Tier 1 VM workloads.
- The VMkernel then has admittance and management for multiple disk requests against a target (LUN) – this is configured in the advanced settings, “Disk.SchedNumReqOutstanding” – shown in the screenshot below.
- Next – you hit the LUN queues (a critical element for reasons which become clear later). This is an HBA-specific setting, but in general is 32. This means 32 outstanding queued I/Os (unlike networks, which measure in buffer size in bytes, HBAs measure it in I/Os which can vary in size)
- Then, there are HBA-wide queues for all the LUNs it supports
- Then, the I/O makes its merry way to the FC/FCoE (even iSCSI) switch, where there are ingress port buffers
- On its way out, it goes out an egress port buffer
- Then it hits the storage array. The storage array has in essence an HBA, but in target mode, as its front-end port – so just like on the host, the array has port-wide target queues (which define the port maximum)
- So far, everything is the same across almost everything under the sun. Next, there are “Storage Processor” maximums (every vendor calls these something different) – basically how fast the brains can chew up the I/Os coming in. While there are array maximums, this is usually far more a function of the next things…
- All array software these days is “virtualized” in many ways – the storage object presented (in this case a LUN) is a meta object (every vendor calls these something different) – composed and presented from internal constructs. This meta-object model is needed for all sorts of things – like thin provisioning, expansion/shrink, dynamic reconfigurations/tiering, deduplication, heck – snapshots were one of the earliest examples of this model.
- These meta objects are always composed of some sort of element object (again, every vendor does this differently – sometimes it’s a “page”, sometimes a “component” – but there’s some sub-unit of granularity). In some cases, this relationship can be a couple of levels deep (the element object itself is a meta object composed of sub-elements).
- Then, leaving the array software stack (which at the lowest level invariably addresses brown spinny things – aka disk devices – on a bus of some sort), the I/O exits on a back-end loop (sometimes this is switched) which has its own buffering/queueing mechanisms
- Finally something gets written out to a disk – and all disks have their own disk queues.
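To make that chain concrete, here’s a toy sketch in Python. The stage names follow the list above, but every depth is an illustrative placeholder of my own, not a vendor default – the point is simply that an I/O crosses many queues, and the shallowest one is where a burst overflows first:

```python
# Sketch of the queue stages an I/O passes through, host to disk.
# All depths are illustrative placeholders, NOT vendor defaults.

IO_PATH = [
    ("guest vSCSI queue",            32),
    ("Disk.SchedNumReqOutstanding",  32),
    ("per-LUN HBA queue",            32),
    ("HBA-wide queue",             1024),
    ("switch port buffers",         512),   # really measured in bytes, shown here as I/Os
    ("array target port queue",    1600),
    ("array LUN / meta-object",      88),
    ("back-end loop / disk queue",   64),
]

def bottleneck(path):
    """Return the stage with the shallowest queue -- the first place
    a micro-burst will overflow and force the stack to back off."""
    return min(path, key=lambda stage: stage[1])

if __name__ == "__main__":
    name, depth = bottleneck(IO_PATH)
    print(f"shallowest queue: {name} ({depth} I/Os)")
```

Notice the shallowest stages sit at the host side – which is exactly where the real-world trouble below shows up.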
This is why I say that queues exist everywhere.
In all networking, while we plan and design for average periods, the picture is intrinsically very dynamic. That’s where all these buffers and queues come in. Remember that a buffer and a queue are the same thing – just one has a depth in “bytes”, and the other in “I/Os” (which in turn are variable amounts of bytes).
Buffering/queuing allows the network to “absorb” small spikes where things are momentarily really, really busy. BUT if a queue overflows, the whole protocol stack backs off – all the way at the top, and usually fairly sharply. This is true of all networks, and is basic networking concept. It’s true of TCP (where it’s done via TCP windowing), Ethernet (where it’s done via flow control) as it is of FC. In the Ethernet case, flow control will try to back off, but if worst comes to worst, dropping an Ethernet frame is perfectly OK (due to TCP retransmit). BTW – that’s what the whole IEEE Datacenter Bridging standard (aka lossless Ethernet) would add – per priority pause – in essence a flow control state of “wait”, and don’t drop the Ethernet frame on the floor.
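Here’s a toy model of that absorb-then-back-off behavior (all numbers invented): a fixed-depth queue draining at a steady service rate. Bursts that fit within the depth are absorbed cleanly; anything beyond it overflows – which in a real stack is what triggers the back-off (QFULL, TCP window shrink, Ethernet pause):

```python
# Toy model: a fixed-depth queue draining at a steady rate per tick.
# A burst that fits within the depth is absorbed; a larger one
# overflows, which in a real stack forces the protocol to back off.

def absorb(bursts, depth, drain_per_tick):
    """Feed per-tick bursts through a queue of the given depth.
    Returns the number of I/Os that overflowed (couldn't be queued)."""
    queued, overflowed = 0, 0
    for burst in bursts:
        accepted = min(burst, depth - queued)   # only what fits
        overflowed += burst - accepted           # the rest overflows
        queued += accepted
        queued -= min(queued, drain_per_tick)    # steady drain
    return overflowed

# A modest spiky workload fits in a 32-deep queue:
print(absorb([10, 10, 10], depth=32, drain_per_tick=8))   # 0
# One sharp burst of 40 against the same queue overflows:
print(absorb([40, 0, 0], depth=32, drain_per_tick=8))     # 8
```

Same total work in both cases is not the point – the *shape* of the arrivals is.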
Ok – why is this important?
Consider an example with a DMX4 customer who was having a performance problem with VMware. This customer was suffering poor VM performance – but when we looked at the array, the service time (how long it took to service I/O requests) was a healthy 6ms, and the spindles weren’t really huffing and puffing. The array front-end ports (FA) were doing just fine. So what’s going on – isn’t this the fastest enterprise storage array on the market?
Q: So what was happening?
A: What was happening is that there were very short transient periods – measured in timeframes much shorter than the service time (6ms) – where the number of simultaneous IO requests was very high.
Q: But shouldn’t this have shown up as the spindles being hot and busy?
A: No. Remember that every metric has a timescale. IOps is in seconds. Disk service time is in ms (5-20ms for traditional disk, about 1ms for EFD). If an I/O is served from cache, it’s in microseconds. Switch latencies are in microseconds. Here, the I/O periods were so short that they filled up the ESX LUN queues instantly, causing a “back-off” effect for the guest. These were happily serviced by the SAN and the storage array, which had no idea anything bad was going on.
Q: This seems like it must have been some crazy workload – how many bazillion IOPs was it?
A: That’s the key – it wasn’t that “big”. It was just “spiky”. Measured on a normal timescale, it was easily supported by 5 15K drives. But it was a bunch of small, very low-IOps VMs that just happened to issue their I/Os at moments that coincided.
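You can see why averages miss this with a quick sketch (timings invented for illustration): 100 I/Os in one second is only 100 IOPS on either timescale, but whether they’re evenly spread or pile up in the same millisecond makes all the difference to a 32-deep queue:

```python
# Why "average IOPS" hides micro-bursts: the same 100 I/Os per second
# look tame on a 1-second timescale but can all land in the same few
# milliseconds. Arrival times below are illustrative.

def peak_concurrency(arrival_ms, window_ms=1):
    """Max number of I/Os arriving within any window_ms-wide window."""
    arrivals = sorted(arrival_ms)
    best = 0
    for i, t in enumerate(arrivals):
        j = i
        while j < len(arrivals) and arrivals[j] < t + window_ms:
            j += 1
        best = max(best, j - i)
    return best

# 100 I/Os over one second -- identical "average IOPS" either way:
spread  = list(range(0, 1000, 10))               # evenly spaced
aligned = [500] * 64 + list(range(0, 360, 10))   # 64 coincide at t=500ms

print(peak_concurrency(spread))    # 1  -- nothing ever queues
print(peak_concurrency(aligned))   # 64 -- fills a 32-deep LUN queue twice over
```

The second pattern fills the default ESX LUN queue instantly while every per-second counter still looks idle.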
This is where the Cisco folks in the room perked up. They then said what I thought was a perfect analogy:
“This is an effect we’re very familiar with. When looking at a network, people generally think in bandwidth ‘will a DS-3 carry it, or do I need an OC-3?’, or ‘will this need 100Mbps/1Gbps/10Gbps?’ But with VoIP and Telepresence (Video) workloads, sometimes a workload that on a normal timescale looks like it would only need 200kbps, when examined at very short timescales, bursts to 400Mbps. We call this ‘microbursting’. It’s one of the factors that sometimes makes customers go from Catalyst 3750 to Catalyst 6500 series switches because they have deeper port buffers that can absorb the microbursts.”
OK – so what makes this effect worse and more likely to affect a customer?
- Any one thing in the path having a really shallow queue/buffer.
- Unbalanced paths – where the service time of one path is different than another – you will tend to get the worst one affecting the others. This can happen anywhere along the path – on the host, the switch, or the array
- Traffic patterns that are bursty. A datastore with many small VMs is more likely to exhibit this statistical pattern than a datastore with a single massive IOPs generator. Remember, it’s not about the IOps or MBps per se, but the “bursty-ness” of the pattern.
- The blended workload of a lot of different hosts (the diagram is hyper-simplified, because in general a storage network – or any network for that matter – has a TON of hosts – each generating all sorts of different IO sizes and patterns).
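That “many small VMs coinciding” effect is easy to put rough numbers on. A sketch with invented figures: treat each VM as having some small probability of issuing an I/O in any given millisecond, and ask how often enough of them line up to fill a 32-deep queue:

```python
# Rough model (numbers illustrative): n_vms small VMs, each with
# probability p of issuing an I/O in any given millisecond tick.
# Binomial tail gives the chance that at least k coincide in a tick.

from math import comb

def p_at_least(n_vms, p, k):
    """P(at least k of n_vms issue an I/O in the same tick)."""
    return sum(comb(n_vms, i) * p**i * (1 - p)**(n_vms - i)
               for i in range(k, n_vms + 1))

# 10 VMs can never fill a 32-deep queue:
print(p_at_least(10, 0.05, 32))    # 0
# 400 VMs averaging only ~20 concurrent I/Os still occasionally
# burst past 32 in a single millisecond -- a few times per second:
print(p_at_least(400, 0.05, 32))
```

The average load is well inside the queue depth, yet the coincidences still happen often enough to matter.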
This is why I think a response that the need for better queue management is somehow solely a function of the array, or that NMP+Round Robin+ALUA are equivalent to adaptive queuing mechanisms, is WAY, WAY off.
Let’s take these one at a time.
1. A shallow queue somewhere.
A shallow queue somewhere in the IO path (or an overflowing port buffer) will cause the I/O to back off. You need the queues to be deep enough to withstand the bursts – sometimes increasing the queue depth is important. Now, if the problem isn’t actually the bursts, but the I/O service time not being sufficient for the sustained workload (aka you have a slow, or underconfigured array), increasing the queue depth will help for only a fraction of a second, after which the deeper queue will still fill up, and now you just have increased the latency even more.
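The “deeper queue just adds latency” point falls straight out of Little’s law: once a queue stays full under *sustained* overload, the latency a newly arriving I/O sees is the queue depth divided by the array’s service rate. A quick sketch (numbers invented):

```python
# Little's law, applied: with a sustained overload keeping the queue
# full, a new I/O waits queue_depth / service_rate before it is
# serviced. A deeper queue under sustained overload = more waiting.

def queued_latency_ms(queue_depth, array_iops):
    """Latency (ms) seen by a new I/O entering a full queue,
    given the array's sustained service rate in IOPS."""
    return queue_depth * 1000.0 / array_iops

# 32-deep queue, array sustaining 4000 IOPS on this LUN:
print(queued_latency_ms(32, 4000))    # 8.0 ms
# Deepen the queue 8x without fixing the array:
print(queued_latency_ms(256, 4000))   # 64.0 ms -- strictly worse
```

So: deepen queues for *bursts*; fix the array for *sustained* shortfalls.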
While most customers will never run into this problem, some do. In VMware land – this usually comes down to the fact that the default LUN queue (and corresponding Disk.SchedNumReqOutstanding value) is 32 – which for most use cases is just fine, but when you have a datastore with many small VMs sitting on a single LUN, microbursting patterns become more likely.
This is covered in this whitepaper, and summarized in this table (which I’ve referred to along with Vaughn). In both this table, and the real world, the column on the left (outstanding I/O per LUN) is generally not the factor that determines the Maximum number of VMs – it’s the “LUN queue on each ESX host” depth column.
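The back-of-envelope math behind that table (my reading of it, not the whitepaper’s exact formula) is simple division – and it shows why the per-host LUN queue depth, not the array, is usually the cap:

```python
# Rough sizing sketch: how many VMs can share one LUN before their
# combined outstanding I/O fills the per-host LUN queue.
# My simplification of the whitepaper's table, not its exact formula.

def max_vms_per_host(lun_queue_depth, avg_outstanding_per_vm):
    """VMs one host can put on a LUN before bursts overflow its queue."""
    return lun_queue_depth // avg_outstanding_per_vm

# Default 32-deep LUN queue, VMs averaging 2 outstanding I/Os each:
print(max_vms_per_host(32, 2))   # 16
# Deepen the queue to 64 and the same math allows twice as many:
print(max_vms_per_host(64, 2))   # 32
```

Note the array-side LUN queue (88 in the CLARiiON example discussed below) never enters this calculation for a single host.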
If you think you might be running into this problem – it’s pretty easy to diagnose. Launch esxtop and press “u” to switch to the disk device screen. You’ll see a table like this – the QUED column shows the number of queued I/Os.
If this shows as 32 all the time or during “bad performance periods” – check the array service time. If it’s low (6-10ms), you should probably increase the queue depth. If you have a high array service time, then you should consider changing the configuration (usually adding more spindles to the meta object).
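That diagnostic rule is worth writing down as a sketch – the thresholds are my illustrative choices, not official guidance:

```python
# The QUED diagnosis from above as a decision procedure.
# Thresholds are illustrative choices, not official guidance.

def diagnose(qued, queue_depth, array_service_ms):
    """qued: value from esxtop; queue_depth: configured LUN queue;
    array_service_ms: service time reported by the array."""
    if qued < queue_depth:
        return "queue not saturated -- look elsewhere"
    if array_service_ms <= 10:
        return ("array is fast: deepen the LUN queue "
                "(and Disk.SchedNumReqOutstanding)")
    return "array is slow: fix the array config (e.g. add spindles) first"

print(diagnose(32, 32, 6))    # pegged queue, fast array
print(diagnose(32, 32, 25))   # pegged queue, slow array
print(diagnose(4, 32, 6))     # healthy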
What about the array itself? Vaughn pointed out that the CLARiiON has an internal queue for element objects (metaLUN components) and a total queue for a meta-object (the LUN), whereas NetApp FAS filers have a “global queue” (and also pointed out that their target ports have a deeper queue). BTW – the idea of a “global queue” is common on devices that don’t use internal block semantics beyond the meta-object. It’s the same on EMC Celerra serving an iSCSI LUN, for example. That doesn’t make one “virtualized” and one “legacy” – it just makes them different.
More importantly – if you understand the topic above, you can see why it’s not right to extrapolate this out. Even in the very small example he pointed out from the CLARiiON document (which is an applied guide, not a chest-beating document), the array LUN queue is 88 – larger than the maximum that could be configured on the ESX host. Now, if the cluster was wide enough (3 or more nodes) – the array LUN queue could indeed be a bottleneck.
From Vaughn’s post:
“It sure seems that the need for PP VE is a direct result of a less than VMware friendly design within the storage controller.
Maybe a better decision would be to look a NetApp virtualized storage array for your vSphere deployments. By design our arrays don’t have LUN queues. Instead they have target port queues, which are global, very deep, and when combined with RR the queues are aggregated. As I stated earlier each port has a queue of 2,000, or a single dual port target adapter has a queue of 4,000.
This design allows NetApp arrays to avoid the issues that can arise with such shallow queues. The virtualized architecture of a NetApp array is the ideal design for use with VMware’s Round Robin PSP as we don’t have the challenges associated with traditional legacy arrays.”
I won’t go the negative way. Heck, it would be nice to even use reasonable language – like a few qualifiers in that last paragraph… What do they serve in the water over there? 🙂
I will say this – with all arrays – EMC’s, HP’s, HDS’s IBM’s, and NetApp’s – all of them, the back-end design DOES matter, and you should look at it closely – and they are all wildly different architecturally. Each has their advantages and disadvantages – and usually what makes them an advantage in one circumstance is a disadvantage with another. This is why I’m personally of the opinion that it’s more about knowing how to leverage what you happen to have.
In the example he used, PP/VE would not help materially if there were more than 3 ESX hosts, as it would be a likely case of “underconfigured array” – not host-side queuing.
Let’s break this down:
The target maximums are a red herring for the most part. This is the 2000 vs. 1600 he points out – though I would call out that a dual-port example like they use is still a toy compared with a 4-port CX4 Ultraflex I/O module, and the smallest CX4s can have MANY of those I/O modules (EMC is still the only vendor to my knowledge that enables you to dynamically and non-disruptively change and reconfigure ports). The target port maximums are a red herring because they are generally not a limit except in the largest enterprise cases – which aren’t caused by one host, but literally hundreds or thousands of hosts attached to the array. Generally in those cases, customers aren’t looking at NetApp or EMC CLARiiON for block storage, but rather enterprise arrays like IBM, HDS and EMC Symmetrix.
If you find your array service time is long, or the array LUN queue (if your array has one) is a problem – you need to fix that before you look at queue depths and multipathing. On EMC arrays – this can be done easily and is included as a basic array function. In the negative example Vaughn used, the document clearly calls out how it can be remedied – you could use a Virtual LUN operation, or expand the RAID group. While not an expert on Netapp arrays, I trust Vaughn when he says that they don’t have a LUN queue (like the similar architectural model of block objects on other filesystem-oriented devices like the Celerra or Openfiler). I’d imagine that in their case, storage service time would be a function of the underlying FlexVol/Aggregate/RAID Group configuration, the workload that FlexVol/Aggregate was supporting as well as FAS platform type. I’m sure that one could relatively easily, and non-disruptively change the underlying aggregate/RAID Group configuration of a FlexVol containing a LUN in some fashion? I haven’t read the NetApp guides as closely as NetApp seems to read EMC’s, but I’m sure someone out there could provide the procedure or link to the document in the comments. If you know how to do this, or can link to a post or whitepaper, please do.
The important point is that the queuing generally happens (even in that rinky-dinky 4+1 example that was referenced) at the host, far, far earlier than at the array LUN.
2. Unbalanced paths – this is a case where the queue depth, the number of hops and their corresponding buffers, or the array internals are non-symmetrical.
Remember that ALUA is “Asymmetric Logical Unit Access”. Asymmetric. ALUA is a standard which enables a mid-range array to use an internal interconnect between Storage Processors to service I/Os that arrive on the non-owning storage processor. Enterprise arrays have very large bandwidth between storage engines/directors (or whatever the vendor calls them) – so performance models across all ports can be linear and symmetrical. Every mid-range array does this internal interconnect differently. I don’t claim to be an expert on how anyone else does it, but on a CX4, this is an internal PCIe 4- or 8-lane bus depending on the unit. I believe that this is a fairly large interconnect for a mid-range array. It’s an important architectural element of an ALUA configuration. Does anyone else know what it would be on the modern NetApp FAS family? If you do, and would like to comment, feel welcome. Now, the amount of bandwidth is high, sure, but compared to the internal bandwidth and latency of the “better” path through the storage processor owning the LUN – it’s decidedly asymmetrical.
To do an ALUA version of the diagram above, I’ve added purple lines to the diagram, showing data flowing down the “less good” paths, then across the internal bus.
ALUA is a good thing when a host MUST for one reason or another have an “active-active” model (and “just because I like the sound of it” isn’t a rational reason – and now that path persistence is fixed in vSphere, the old “MRU path non-persistence” isn’t a good one either) – but without adaptive queuing it is BAD. Simple round robin I/O distribution will drag down the performance to the level of the “less good” paths. That’s why I disagree with the statement I’ve heard others make: NMP+RR+ALUA = NMP+adaptive queuing+ALUA.
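A toy calculation makes the “dragged down to the worst path” point concrete. Service times below are invented, and the model ignores pipelining and queuing – it just shows the shape of the effect:

```python
# Toy illustration of why plain round robin gates throughput on the
# slowest path, while a latency-aware (adaptive) policy does not.
# Per-I/O service times are invented; model ignores pipelining.

def completion_rr(path_ms, n_ios):
    """Plain RR: I/Os split evenly; finish when the slowest path does."""
    per_path = n_ios // len(path_ms)
    return max(ms * per_path for ms in path_ms)

def completion_adaptive(path_ms, n_ios):
    """Load spread in proportion to each path's service rate."""
    return n_ios / sum(1.0 / ms for ms in path_ms)

# Two optimized paths (1 ms/I/O) and two ALUA non-optimized paths
# crossing the inter-SP bus (4 ms/I/O):
paths = [1, 1, 4, 4]
print(completion_rr(paths, 400))        # 400 ms -- gated by slow paths
print(completion_adaptive(paths, 400))  # 160 ms
```

Same paths, same work – the only difference is whether the policy knows the paths are asymmetric.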
3. Traffic patterns that are bursty.
Queue full conditions can be the root of slow steady-state (at the “large timescale”) performance on single-LUN datastores with many individually small IO VMs when there is no adaptive multipathing – particularly comparing a single LUN VMFS datastore vs a NFS datastore. It’s not the SCSI reservation mechanism – which is widely, incorrectly FUDed as the root cause of this. SCSI reservations are the root cause for slow snapshots and VM create/delete on datastores undergoing many meta data updates (again, not a problem with many, many customers, just pointing it out for completeness). Why doesn’t NFS suffer similar behavior? The answer is in the diagram below:
While so many of the elements are the same, and there are still all the same stages of buffering/queuing, some key items are different. There is no LUN queue at the ESX host. The only LUN queues that matter here are those that are behind the meta/element objects (which in the case of NAS are the filesystem/volume management/block layout elements). This means that it’s all about network link speed and network buffers. When Vaughn and I were originally authoring the NFS joint post, we debated about how to say this – does it “scale better”? Making one of those multi-vendor posts requires healthy back and forth. In the end, we agreed – it “scales differently”. If “scale” is peak IOps, or peak MBps/$, or lower CPU, or lower latency, or failover period, then NFS scales worse. If “scale” is “number of instantaneous I/Os in a short period within network port buffer limits”, it scales better.
It’s notable that NFS datastores are more analogous to spanned VMFS use cases – where the filesystem is supported by many back-end LUN objects. If you have multiple LUNs supporting a VMFS volume (see article here), you get parallelism of the LUN queues, and can support larger numbers of VMs in that same datastore. And note – the availability is statistically no better, and no worse, than multiple standalone VMFS datastores. Making spanned VMFS easier is an area of work (along with longer-term NFS client improvements) between the storage vendors and VMware.
But – in the end, this is why I personally think that combinations of NAS and block storage models are the most flexible for VMware (and frankly in general). Most customers have VMs with all sorts of different SLAs. If you have a bunch of smaller IO VMs for whom the longer timeouts of NFS are good enough, you will be able to put more of them on NFS than you will on a single LUN VMFS datastore. Conversely, if you have a bunch of VMs with a large-bandwidth, consistent I/O pattern that needs lower latency, VMFS will generally win. And if you have VMs that need a very fast failover model, with a ton of predictability, generally VMFS on block wins. But matching the good behavior (no host-side queues) of an NFS datastore without the bad would require a spanned VMFS datastore with many LUN queues and adaptive multipathing to handle the “spikiness” in the combined IO of many (hundreds) of VMs. That’s why I keep saying – “one way, all the time – it’s not the right answer”.
4. The blended workload of a lot of different hosts
This is where things in the original post that prompted this get outright silly – and I think reflect a misunderstanding of block storage. The diagrams I’ve shown to date are ridiculously oversimplified because they show one host only, one IO only. In the real world, people have many VI3.x and vSphere 4 clusters, each with many hosts hitting the network and the array. Further, they have other hosts of all types hitting the array. Each of those are transmitting I/Os of all sizes all the time. It’s organized chaos.
You can see immediately that even if the array could somehow perfectly serve I/Os, perfectly balanced, at all times (which none of us – none – can do) – there will be varying amounts of queuing and buffering at each element of the stack – from each host, to each host LUN, to each HBA, to each port in the entire fabric, to each array target.
These are the things that drive the need for adaptive queuing models.
One thing I was worried about as I wrote this was that it makes it sound REALLY complicated. It’s not – it’s very simple – I’m just exposing the deep bowels of how this works.
Heck, we humans all look pretty simple on the surface – two eyes, mouth, ears, nose, skin, arms, legs – so on and so forth, but open us up, and the innards get complex. If you start to look at how our neural networks work – well, no one understands that entirely 🙂
BUT – we all know how to basically operate, and for most – that’s enough. Like right now – I know I’m hungry 🙂
- Keep it simple. For almost all cases – you can use a single LUN per VMFS, and be happy with many, many more VMs than most people think (which most people incorrectly think is 12-16 based on ancient best practices)
- in the VMware View 3 reference architecture – we had 64 VMs per datastore with a LUN with a basic 4+1 configuration with no customization of queues – and it did just fine. With more spindles, deeper queues, spanned VMFS – you could do hundreds.
- When I did a VMTN podcast call with John Troyer, I said you could have more than what people think – 32 and more. You’d think it was blasphemy the way people started to say “no way – 10 only, ever!” and stuff like that 🙂
- With block workloads – you need to keep an eye on QUED as one of the primary “performance metrics”.
- This effect is worse if you don’t use DRS – having a whole bunch of VMs on one host, one LUN queue is worse than spreading it around a cluster.
- All arrays deal with queue full conditions differently – some better, some worse, but in all cases, you should try to avoid it. If you see QUED full – check the array service time. If it’s bad, make it faster – any block array worth its salt should enable you to do that non-disruptively. If it’s good, consider making the host queues (and the advanced parameter Disk.SchedNumReqOutstanding) larger.
- Having more queues can help (spanned VMFS) – this puts many backing LUNs behind a datastore, and since VMs are spread across them, so are the bursts.
Multipathing behavior is important when it comes to improving this queue condition.
- Round robin is better than manual static load balancing (ergo vSphere 4 is better than VI3.x). Adaptive (or dynamically weighted round robin) is better than simple round robin (ergo PP/VE vs. NMP). Predictive (where the array target provides input to the dynamic, adaptive algorithm along with the host queue depth state – as EMC block platforms do) is best.
- Automated path policy selection is better than manual. Today, using RR requires MANUALLY selecting it for every target and every LUN unless a vendor writes their own SATP. I would encourage NetApp and every vendor to consider doing that – it seems that Dell/EqualLogic was the next to do it after EMC. The best is where all path configuration – both path selection and path state management – happens automatically, which is what PP/VE does.
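To show the flavor of what an adaptive policy does (this is a minimal “least outstanding I/O” selector of my own, loosely inspired by the idea – emphatically NOT PowerPath/VE’s actual algorithm, and the path names are invented):

```python
# Minimal "least outstanding I/O" path selector -- an illustration of
# adaptive path selection, NOT PowerPath/VE's actual algorithm.
# Path names are invented examples in vmhba notation.

class PathSelector:
    def __init__(self, paths):
        # Track outstanding (dispatched, not yet completed) I/Os per path.
        self.outstanding = {p: 0 for p in paths}

    def dispatch(self):
        """Send the next I/O down the least-loaded path."""
        path = min(self.outstanding, key=self.outstanding.get)
        self.outstanding[path] += 1
        return path

    def complete(self, path):
        """An I/O on this path finished."""
        self.outstanding[path] -= 1

sel = PathSelector(["vmhba1:C0:T0", "vmhba2:C0:T0"])
a = sel.dispatch()   # both idle -> first path
b = sel.dispatch()   # the other path is now least loaded
print(a, b)
```

A slow path naturally accumulates outstanding I/Os and gets chosen less often – which is the whole point: the feedback is built in, with no per-LUN manual tuning.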
EMC absolutely supports native multipathing wherever we can. We support, embrace NMP, MPIO, ALUA.
We also look at things and say “how could we make this better”. This gives customers CHOICE. (including choosing that we’re wrong on all counts of course!)
PowerPath/VE makes vSphere multipathing better – for EMC arrays and 3rd party arrays including HP, HDS, and (with RPQs) IBM. It abstracts out ALL this stuff. You install it, then don’t need to configure it. It requires no LUN by LUN changes – which at any reasonable scale, starts to become well, unwieldy.
Lastly – my advice – don’t listen when anyone calls their array “virtualized” and implies that this solves everything under the sun (EMC – and, if I look closely in the mirror, me – is as guilty of this as anyone; heck, look at V-Max). While I guess there’s always some need for marketing, in every respect that matters all arrays are “virtualized”.
This stuff has lots of moving parts under the covers.
Thanks for investing the time here – and I hope this helps!