Random Short Take #85

Welcome to Random Short Take #85. Let’s get random.

VMware, EMC PowerPath/VE, PSODs and a hot-fix

I’m not sure why I missed this, but I’m nothing if not extremely ignorant from time to time. We’ve been getting occasional Purple Screens on our ESXi hosts running 4.1 and EMC PowerPath/VE 5.4 SP2. Seems it might be a problem that needs some hot fixin’. Annoyingly, you need to contact EMC support to get hold of the patch. That’s not a major issue, because if you’re running PowerPath you’d know how to do this, but it would be nice if they just belted it out on their support site. But what do I know about the complexities of patch release schedules? Hopefully the patch will be incorporated into the next dot release of PP/VE; otherwise, you should consider getting hold of it if you’re having the issues described by VMware here and in EMC’s KB article emc263739 (sorry, you’ll have to search for it once you’re logged in).
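
If you’re not sure whether a host already has the fix, the vCLI’s vihostupdate.pl will list the bulletins installed on an ESXi 4.1 host, and – assuming the fix ships as an offline bundle the way the main PowerPath/VE package does – it’s also what you’d use to push it out. A rough sketch only, with the host name and bundle filename below being placeholders (the real bundle comes from EMC support):

vihostupdate.pl --server esxi01.example.com --username root --query
vihostupdate.pl --server esxi01.example.com --username root --install --bundle /tmp/PowerPathVE-hotfix.zip

The first command lists what’s already installed; the second installs an offline bundle. You’ll want the host in maintenance mode before installing, and a reboot afterwards.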

VMware – Some old school support

We had a network connectivity issue on one of our remote hosts at work the other day and Ops asked for my help. It turned out to be a cabling issue following a cyclone up north. I think the picture speaks for itself. Oh yeah, rocking with the MUI once again.

VMware – Installing APC PowerChute Network Shutdown

I’ll admit I haven’t had to install APC’s PowerChute Network Shutdown product on my ESX(i) hosts lately, as the DCs I’m hosting stuff in don’t cater for the UPS-in-a-rack scenario. That said, I’ve had the (mis)fortune to have to install and configure this software in years gone by. As luck would have it, VMware published a KB article recently that points to APC’s support website for information on the how and what. There are two PDF files on the APC support page – one for ESX and one for ESXi. So, if you’re into that kind of thing, you should find this useful.

EMC PowerPath 5.4 SPx, ESXi 4.1 and HP CIM offline bundle

If you find yourself having problems registering EMC PowerPath 5.4.1 (unsupported) or 5.4.2 (supported) on your HP blades running ESXi 4.1, consider uninstalling the HP offline bundle hpq-esxi4.1uX-bundle-1.0. We did, and PowerPath was magically able to talk to the ELM server and retrieve served licenses. I have no idea why CIM-based tools would have this effect, but there you go. Apparently a fix is on the way from HP, but I haven’t verified that yet. I’ll update as soon as I know more.
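
For the record, pulling the bundle off again is a vihostupdate job too. Something along these lines, with the host name as a placeholder – run the query first so you remove the exact bulletin ID the HP bundle shows up as:

vihostupdate.pl --server esxi01.example.com --username root --query
vihostupdate.pl --server esxi01.example.com --username root --remove --bulletin <bulletin-ID-from-the-query>

Pop the host into maintenance mode first, and give it a reboot once the bulletin is gone.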

File system Alignment redux

So I wrote a post a little while ago about filesystem alignment, and why I think it’s important. You can read it here. Obviously, the issue of what to do with guest OS file systems comes up from time to time too. When I asked a colleague to build some VMs for me in our lab environment with the system disks aligned, he dismissed the request out of hand and called it an unnecessary overhead. I’m kind of at that point in my life where the only people who dismiss my ideas so quickly are my kids, so I called him on it. He promptly reached for a tattered copy of EMC’s TechBook entitled “Using EMC CLARiiON Storage with VMware vSphere and VMware Infrastructure” (EMC P/N h2197.5 – get it on Powerlink). He then pointed me to this nugget from the book.

I couldn’t let it go, so I reached for my copy (version 4 versus his version 3.1), and found this:

We both thought this wasn’t terribly convincing one way or another, so we decided to test it out. The testing wasn’t super scientific, nor was it particularly rigorous, but I think we got the results that we needed to move forward. We used Passmark’s PerformanceTest 7.0 to perform some basic disk benchmarks on 2 VMs – one aligned and one not. These are the settings we used for Passmark:

As you can see, it’s a fairly simple setup that we’re running with. Now here are the results of the unaligned VM benchmark.

And here are the results of the aligned VM.

We ran the tests a few more times and got similar results. So, yeah, there’s a marginal difference in performance. And you may not find it worthwhile pursuing. But I would think, in a large environment like ours where we have 800+ VMs in Production, surely any opportunity to reduce the workload on the array should be taken? Of course, this all changes with Windows Server 2008, which aligns new partitions to a 1MB boundary by default. So maybe you should just sit tight until then?
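
In the meantime, if you do want to align a disk in a pre-2008 Windows guest, it’s a diskpart job done before the volume is formatted – for a system disk that means doing it from WinPE during the build, or building from an already-aligned template. A rough sketch of the idea, with the disk number and drive letter below being placeholders:

diskpart
DISKPART> select disk 1
DISKPART> create partition primary align=1024
DISKPART> assign letter=E
DISKPART> exit

The align value is in KB, so align=1024 puts the partition on a 1MB boundary, which keeps it nicely on the CLARiiON’s 64KB element boundaries. Format the volume afterwards as you normally would.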

vSphere 4.1 GA

VMware vSphere 4.1 is now available for download. Release notes can be found here. Enjoy!

Broken sVMotion

It’s been a very long few weeks, and I’m looking forward to taking a few weeks off. I’ve been having a lot of fun doing a data migration from a CLARiiON CX3-40f and CX700 to a shiny, new CX4-960. To give you some background, instead of doing an EMC-supported data-in-place upgrade in August last year, we decided to buy another CX4-960 (on top of the already purchased CX4-960 upgrade kit) and migrate the data manually to the new array. The only minor problem with this is that there’s about 100TB of data in various forms that I need to get onto the new array. Sucks to be me, but I am paid by the hour.

I started off by moving VMs from one cluster to the new array using sVMotion, as there was a requirement to be a bit non-disruptive where possible. Unfortunately, on the larger volumes attached to fileservers, I had a few problems. I’ll list them, just for giggles:

There were three 800GB volumes, five 500GB volumes and one 300GB volume that had 0% free space on the VMFS. And I mean 0%, not 5MB or 50MB – 0%. That’s not cool for a few reasons. ESX likes to update journaling data on VMFS, because it’s a journaling filesystem. If you don’t give it space to do this, it can’t, and you’ll find volumes start to get remounted with very limited write access. If you try to storage VMotion these volumes, you’ll again be out of luck, as it wants to keep a dmotion file on the filesystem to track any changes to the vmdk file while the migration is happening. I found my old colleague Leo’s post helpful when a few migrations failed, but unfortunately the symptoms he described were not the same as mine; in my case the VMs fell over entirely. More info from VMware can be had here.
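
The moral of that story: check the datastore’s free space before you kick anything off. vmkfstools will tell you, with the datastore name below being a placeholder:

vmkfstools -P -h /vmfs/volumes/FILESERVER_VMFS_01

If the free space is at or near zero, sort that out first – move something small off the volume or grow it – before asking sVMotion to drop a dmotion file on it.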

If you want to move just a single volume, you can try your luck with this method, which I’ve used successfully before. But I was tired, and wanted to use vmkfstools since I’d already had an unexpected outage and had to get something sorted.

The problem with vmkfstools is that there’s no restartable copy option – as far as I know. So when you get 80% through a 500GB file and it fails, well, that’s 400GB of pointless copying and time you’ll never get back. Multiply that out over three 800GB volumes and a few ornery 500GB vmdks and you’ll start to get a picture of the kind of week I had.
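
For what it’s worth, the vmkfstools approach amounted to powering the VM off and cloning each disk across to the new datastore, then re-pointing the VM at the copies. Something like the below, with the datastore and file names being placeholders:

vmkfstools -i /vmfs/volumes/OLD_DS/fileserver01/fileserver01_1.vmdk /vmfs/volumes/NEW_DS/fileserver01/fileserver01_1.vmdk -d zeroedthick

It works well enough, but as I said, if it dies at 80% you start again from scratch.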

After suffering through a number of failures, I ended up taking one node out of the 16-node cluster and placing it and its associated datastores (the ones I needed to migrate) in their own Navisphere storage group. That way, there’d be no “busy-ness” affecting the migration process (we had provisioned about 160 LUNs to the cluster at this stage and we were, obviously, getting a few “SCSI reservation conflict” and “resource temporarily unavailable” issues). This did the trick, and I was able to get some more stuff done. Now there’s only about 80TB to go before the end of April. Fun times.
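
For those playing along at home, the storage group shuffle is simple enough from naviseccli – the SP address, group name, host name and LUN numbers below are all made up:

naviseccli -h 10.0.0.1 storagegroup -create -gname ESX_Migration
naviseccli -h 10.0.0.1 storagegroup -connecthost -host esxhost07 -gname ESX_Migration -o
naviseccli -h 10.0.0.1 storagegroup -addhlu -gname ESX_Migration -hlu 0 -alu 123

Rescan the HBAs on the host afterwards and carry on. You can obviously do the same thing by dragging things around in Navisphere Manager.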

And before you ask why I didn’t use SAN Copy: I don’t know; I suppose I’ve never had the opportunity to test it with ESX, and while I know the underlying technology is the same as MirrorView, I just didn’t feel I was in a position to make that call. I probably should have just done it, but I didn’t really expect that I’d have as much trouble as I did with sVMotion and/or vmkfstools. So there you go.

Locked vmdk files

Somehow, a colleague of mine put an ESX host in a cluster into maintenance mode while VMs were still running. Or maybe it just happened to crash when she was about to do this. I don’t know how, and I’m not sure I still believe it, but I saw some really weird stuff last week. The end result was that VMs powered off ungracefully, the host became unresponsive, and things were generally bad. We started adding VMs back to other hosts, but one VM had locked files. Check out this entry at Gabe’s Virtual World on how to address this, but basically you want to ps, grep and kill -9 some stuff.

ps -elf | grep vmname

kill -9 PID

And you’ll find that it’s probably the vmdk files that are locked, not necessarily the vmx file.
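
In concrete terms, on the host that still thinks it owns the VM, it looks something like this (with fileserver01 and the process ID being placeholders):

ps -elf | grep fileserver01
kill -9 12345

The first command finds the orphaned VM process and its process ID; the second kills it, substituting the real PID. Once the stale process is gone, the locks on the vmdk files should clear and the VM can be registered and powered on elsewhere.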

vmkfstools sometimes not so nice

I’m hoping to do some posts on the vCLI when I have some time in the near future. But I was trolling through the release notes and came across this particular gem in the known issues section: 

“Running vmkfstools -C does not prompt for confirmation. When you run vmkfstools -C to create a VMFS (Virtual Machine File System) on a partition that already has a VMFS on it, the command erases the existing VMFS and creates the new VMFS without prompting for confirmation.
Workaround: No workaround. Check the command carefully before running it.”

Nice! I can imagine a certain special consternation one would feel when running this on the wrong production LUN …
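
So if you’re in the habit of creating VMFS volumes from the command line, it pays to check what’s already living on the target device first. A rough sketch from the host’s console, with the label and the naa identifier below being placeholders:

esxcfg-scsidevs -m
vmkfstools -C vmfs3 -b 8m -S NEW_DS /vmfs/devices/disks/naa.600601601234567890:1

The first command maps existing VMFS volumes to the devices and partitions they sit on; the second creates the new VMFS. As the release notes say, there’s no ‘are you sure?’ in between, so triple-check that naa identifier before you hit enter.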