EMC – Silly things you can do with stress testing – Part 2

I’ve got a bunch of graphs that indicate you can do some bad things to EFDs when you run certain SQLIO stress tests against them and compare the results to FC disks. But EMC is pushing back on the results I’ve gotten for a number of reasons. So in the interests of keeping things civil I’m not going to publish them – because I’m not convinced the results are necessarily valid and I’ve run out of time and patience to continue testing. Which might be what EMC hoped for – or I might just be feeling a tad cynical.

What I have learnt, though, is that it’s very easy to generate QFULL errors on a CX4 if you follow the EMC best practice configs for QLogic HBAs and set the execution throttle to 256. In fact, you might even be better off leaving it at 16 unless you have a real requirement to set it higher. I’m happy for someone to tell me why EMC suggests it be set to 256, because I’ve not found a good reason for it yet. Of course, this is dependent on a number of environmental factors, but the 256 figure still has me scratching my head.

Another thing that we uncovered during stress testing had to do with the queue depth of LUNs. For our initial testing, we had a Storage Pool created with 30 * 200GB EFDs, 70 * 450GB FC spindles, and 15 * 1TB SATA-II spindles, with FAST-VP enabled. The LUNs on the EFDs were set to no data movement, so everything sat on the EFDs. We were getting kind of underwhelming performance stats out of this config, and it seems like the main culprit was the LUN queue depth. In a traditional RAID Group setup, the queue depth of the LUN is (14 * the number of data drives in the LUN) + 32. So for a RAID 5 (4+1) LUN, which has 4 data drives, the queue depth is 88. If, for some reason, you want to drive a LUN harder, you can increase this by using MetaLUNs, with the sum of the components providing the LUN’s queue depth. What we observed on the Pool LUN, however, was that this seemed to stay fixed at 88, regardless of the number of internal RAID Groups servicing the Pool LUN. That seems like a bad thing, but it’s probably why EMC quietly say that you should stick to traditional MetaLUNs and RAID Groups if you need particular performance characteristics.
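
If you want to play with the numbers, here’s a rough Python sketch of the queue depth arithmetic described above. The function names are my own invention for illustration and the four-component MetaLUN is a made-up example, not something from our array. It only covers the traditional RAID Group and MetaLUN cases, since the Pool LUN figure appeared to stay fixed in our testing.

```python
def raid_group_lun_queue_depth(data_drives: int) -> int:
    """Queue depth of a traditional RAID Group LUN.

    Uses the rule of thumb quoted above: (14 * data drives) + 32.
    """
    if data_drives < 1:
        raise ValueError("a LUN needs at least one data drive")
    return 14 * data_drives + 32


def metalun_queue_depth(component_data_drives: list[int]) -> int:
    """Queue depth of a MetaLUN: the sum of its components' queue depths.

    component_data_drives holds the data drive count for each component LUN.
    """
    return sum(raid_group_lun_queue_depth(n) for n in component_data_drives)


if __name__ == "__main__":
    # RAID 5 (4+1) has 4 data drives.
    print(raid_group_lun_queue_depth(4))      # 88
    # Hypothetical MetaLUN built from four RAID 5 (4+1) components.
    print(metalun_queue_depth([4, 4, 4, 4]))  # 352
```

So a MetaLUN built from four RAID 5 (4+1) components would, by this arithmetic, have four times the queue depth of the 88 we saw on the Pool LUN.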

So what’s the point I’m trying to get at? Storage Pools and FAST-VP are awesome for the majority of workloads, but sometimes you need to use more traditional methods to get what you want. Which is why I spent last weekend using the LUN Migration tool to move 100TB of blocks around the array to get back to the traditional RAID Group / MetaLUN model. Feel free to tell me if you think I’ve gotten this arse-backwards too, because I really want to believe that I have.