What have I been doing? – Part 1

I recently had the “pleasure” of working on a project before Christmas that had a number of, er, interesting elements involved.  During the initial scoping, the only things mentioned were two new arrays (with MirrorView/Asynchronous), a VMware ESX upgrade, and a few new ESX hosts.  But here’s what there really was:

– 4 NetWare 6.5 hosts in 2 NCS clusters;
– An EMC CLARiiON CX200 (remember them?) hosting a large amount (around 5TB) of NetWare and VMware data;
– A single McData switch running version 7 firmware;
– 2 new Dell hosts with CPUs that were incompatible with the existing 2950 hosts;
– A memory upgrade to the two existing nodes that meant one host had 20GB and the other had 28GB;
– A MirrorView target full of 1TB SATA-II spindles;
– A DR target with only one switch;
– Singly-attached (i.e. one HBA) hosts everywhere;
– An esXpress installation that needed to be upgraded / re-installed;
– A broken VUM implementation.

Hmmm, sound like fun? It kind of was, just because some of the things I had to do to get it to work were things I wouldn’t normally expect to do.  I don’t know whether this is such a good thing.  There’re a number of things that popped up during the project, each of which would benefit from dedicated blog posts.  But given that I’m fairly lazy, I think I’ll try and cram it all into one post.

Single switches and single HBAs are generally a bad idea

<rant> When I first started working on SANs about 10 minutes ago, I was taught that redundancy in a mid-range system is a good thing. The components that go into your average mid-range system, while being a bit more reliable than your average gamedude’s gear, are still prone to failure. So you build a level of redundancy into the system such that when, for whatever reason, a component fails (such as a disk, fibre cable, switch or HBA), the system stays up and running. On good systems, the only people who know there’s a failure are the service personnel called out to replace the broken component in question. On a cheapy system, like the one you keep the Marketing Department’s critical morning tea photos on, a few more people might know about it. Mid-range disk arrays can run into the tens and hundreds of thousands of dollars, so sometimes people think that they can save a bit of cash by cutting a few corners: for example, leaving the nodes with single HBAs, or having only one switch at the DR site, or using SATA as a replication target. But given you’re spending all of this cash on a decent mid-range array, why wouldn’t you do all you can to ensure it’s available all the time? Saying “My cluster provides the resiliency / We’re not that mission critical / I needed to shave $10K off the price” strikes me as counterproductive to the goal of providing reliable, available and sustainable infrastructure solutions. </rant>

All that said, I do understand that sometimes the people making the purchasing decisions aren’t necessarily the best-equipped people to understand the distinction between single- and dual-attached hosts, and what good performance is all about. All I can suggest is that you start with a solid design, and do the best you can to keep that design through to deployment. So what should you be doing? For a simple FC deployment (let’s assume two switches, one array, two HBAs per host), how about something like this?


Notice that there’s no connection between the two FC switches here. That’s right kids, you don’t want to merge these fabrics. The idea is that if you munt the config on one switch, it won’t automatically pass that muntedness on to the peer switch. This is a good thing if you, like me, like to do zoning from the CLI but occasionally forget to check the syntax and spelling before you make changes. And for the IBM guys playing at home, the “double redundant loops” excuse doesn’t apply to the CLARiiON. So do yourself a favour, and give yourself 4 paths to the box, dammit!
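For anyone who wants to see where the 4 comes from, here is a rough Python sketch that counts paths in a dual-fabric layout. It is only an illustration under assumptions: each HBA plugs into its own switch, each switch has one port on each storage processor, and there is no link between the switches. The names (hba0, fabric_a, spa_0 and so on) are made up for the example and don't come from any real config.

```python
from itertools import product

# Hypothetical dual-fabric layout: two unlinked switches, one HBA on each,
# and each switch cabled to one port on SP A and one port on SP B.
DUAL_FABRIC = {
    "fabric_a": {"hbas": ["hba0"], "sp_ports": ["spa_0", "spb_1"]},
    "fabric_b": {"hbas": ["hba1"], "sp_ports": ["spa_1", "spb_0"]},
}

# The corner-cutting version: one HBA, one switch.
SINGLE_FABRIC = {
    "fabric_a": {"hbas": ["hba0"], "sp_ports": ["spa_0", "spb_1"]},
}


def host_paths(fabrics):
    """List every (hba, switch, sp_port) path a host has to the array.

    Paths never cross fabrics because the switches are not connected.
    """
    paths = []
    for switch, members in fabrics.items():
        for hba, port in product(members["hbas"], members["sp_ports"]):
            paths.append((hba, switch, port))
    return paths


def surviving_paths(paths, failed):
    """Return the paths that do not use the failed HBA, switch, or SP port."""
    return [path for path in paths if failed not in path]


if __name__ == "__main__":
    for label, layout in (("dual", DUAL_FABRIC), ("single", SINGLE_FABRIC)):
        paths = host_paths(layout)
        print(label, "total paths:", len(paths))
        for component in ("hba0", "fabric_a"):
            left = surviving_paths(paths, component)
            print("  lose", component, "->", len(left), "paths left")
```

With the dual layout, losing any one HBA or switch still leaves the host with a way to reach both storage processors. With the single layout, the same failure leaves nothing.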

And don’t listen to Apple and get all excited about just using one switch either – not that they’re necessarily saying that, of course … Or that they’re necessarily saying anything much at all about storage any more, unless Time Capsules count as storage. But I digress …