I’m not sure why I missed this, but I’m nothing if not extremely ignorant from time to time. We’ve been getting occasional Purple Screens on our ESXi hosts running 4.1 and EMC PowerPath/VE 5.4 SP2. Seems it might be a problem that needs some hot fixin’. Annoyingly, you need to contact EMC support to get hold of the patch. That’s not a major issue, because if you’re running PowerPath you’ll know how to do this, but it would be nice if they just belted it out on their support site. But what do I know about the complexities of patch release schedules? Hopefully the patch will be incorporated into the next dot release of PP/VE; otherwise you should consider getting hold of it if you’re having the issues described by VMware here and in EMC’s KB article emc263739 (sorry, you’ll have to search for it once you’re logged in).
We haven’t been doing this in our production configurations, but if you want to change the behaviour of SRM with regard to the “snap-xxx” prefix on replica datastores, you need to modify an advanced setting in SRM. In the vSphere client, go to the SRM plugin, right-click on Site Recovery and select Advanced Settings. Under SanProvider, there’s an option called “SanProvider.fixRecoveredDatastoreNames” with a little checkbox that needs to be ticked to prevent the recovered datastores being renamed with the unsightly prefix.
You can also do this when manually mounting snapshots or mirrors with the help of the esxcfg-volume command – but that’s a story for another time.
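The short version, for the impatient, looks something like this – this is from memory, so verify it against a volume you don’t care about first, and make sure you understand the difference between mounting and resignaturing before you go anywhere near -r:

esxcfg-volume -l (lists volumes that have been detected as snapshots or replicas)
esxcfg-volume -M <VMFS UUID or label> (persistently mounts the copy, keeping its existing signature)
esxcfg-volume -r <VMFS UUID or label> (writes a new signature so the copy can live alongside the original)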
I’ll admit I haven’t had to install APC’s PowerChute Network Shutdown product on my ESX(i) hosts lately, as the kind of DCs I’m hosting stuff in don’t cater for the UPS-in-a-rack scenario. That said, I’ve had the (mis)fortune of installing and configuring this software in years gone by. As luck would have it, VMware published a KB article recently that points to APC’s support website for information on the how and what. There are two PDF files on the APC support page – one for ESX and one for ESXi. So, if you’re into that kind of thing, you should find this useful.
VMware recently updated one of its KB articles (the catchily titled “Enabling EVC on a cluster when vCenter is running in a virtual machine”), so I thought I’d include the link here for reference. This is a useful process to understand when you already have a virtualised vCenter server and want to enable EVC. It’s a little bit unwieldy, but sometimes, particularly in legacy environments, you find yourself in these sorts of situations.
In my previous post I discussed some strange behaviour we’d observed in our PoC lab with ESXi 4.1 and the workarounds I’d used to resolve the immediate issues we had with connectivity. We spoke to our local VMware SE and his opinion was pictures or it didn’t happen (my words, not his). In his defence, we had no logs or evidence that the problem happened. Fortunately, we had the problem occur again while doing some testing with OTV.
We took a video of what happened – I’ll post that in the next week when I’ve had time to edit it to a reasonable size. In the meantime my colleague posted on the forums and also raised an SR with VMware. You can find the initial posts we looked at here and here. You can find the post my colleague put up here.
The response from VMware was the best though …
“Thank you for your Support Request.
As we discussed in the call, this is a known issue with the product ESXi 4.1, ie DCUI shows the IP of vmk0 as Management IP, though vmk0 is not configure for Management traffic.
Vmware product engineering team identified the root cause and fixed the issue in the upcoming major release for ESXi. As of now we don’t have ETA on release for this product.
The details on product releases are available in http://www.vmware.com.
I’ll put the video up shortly to demonstrate the symptoms, and hopefully will have the opportunity to follow this up with our TAM this week some time :)
I haven’t had a lot of time to find out what caused some weird behaviour in our lab recently, nor whether what I saw was expected or not. And unfortunately I don’t have screenshots. So you’ll just have to believe me. I’m following the issue up with our local VMware team this week, so hopefully I can provide a KB or something.
In our lab we have some ESXi 4.1 hosts attached to Cisco 3120 switches. Each host has a single, ether-channelled vSwitch, with portgroups and vmkernel ports for the Management Network and vMotion. For whatever reason, the network nerds in our team had to do some IOS firmware updates on the switch stack that the blades were connected to. We didn’t shut anything down, because we wanted to see what would happen.
What we saw was some really weird behaviour. Four of the eight hosts (one test data centre) had no connectivity issues at all. In the other test data centre, one of the four hosts showed no signs of a problem, another two eventually “came good” after a few hours had elapsed, and one simply wouldn’t play ball. Logging in to the DCUI showed that the Management Network now had the VLAN ID associated with the vMotion network, and had also taken on the vMotion network’s IP address. Why we have a routable vMotion network in the first place I’m not so sure, but it _appears_ that the ESXi host had simply decided to go with it. We could connect to the host directly with the vSphere client using the vMotion IP address. No matter how many times I tried to change the IP via the DCUI (reboots included), it wouldn’t stick.
Not good. To get the host sorted out, I had to remove the vMotion portgroup, re-assign the correct IP addresses from the command line, and then re-create the vMotion portgroup. Here’s how you do it:
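The removal goes something like this – from memory, so double-check the syntax and adjust the portgroup names to suit your environment (the same -d against “Management Network” applies if you need to re-create that vmknic as well, as I did below):

esxcfg-vmknic -d "vMotion"
esxcfg-vswitch -D "vMotion" vSwitch0
esxcfg-vswitch -A "vMotion" vSwitch0

With the old portgroup gone and a fresh one added back, re-create the vmkernel interfaces with the right addresses and VLAN IDs: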
esxcfg-vmknic -a "Management Network" -i 192.168.0.31 -n 255.255.255.0
esxcfg-vswitch -v 84 -p "Management Network" vSwitch0
esxcfg-vmknic -a "vMotion" -i 192.168.1.31 -n 255.255.255.0
esxcfg-vswitch -v 86 -p "vMotion" vSwitch0
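At this point it’s worth confirming the host has actually taken the settings before going any further – the standard listing commands will show the vmkernel NICs and the portgroup VLAN IDs:

esxcfg-vmknic -l
esxcfg-vswitch -l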
Then log in to vMA and run this command:
vicfg-vmknic -h labhost31.poc.com -E vMotion
And we’re back up and running. I hope to have a follow-up post when I’ve had a chance to talk it over with VMware.
If you find yourself having problems registering EMC PowerPath 5.4.1 (unsupported) or 5.4.2 (supported) on your HP blades running ESXi 4.1, consider uninstalling the HP offline bundle hpq-esxi4.luX-bundle-1.0. We did, and PowerPath was magically able to talk to the ELM server and retrieve served licenses. I have no idea why CIM-based tools would have this effect, but there you go. Apparently a fix is on the way from HP, but I haven’t verified that yet. I’ll update as soon as I know more.
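In case you haven’t had to do it before, removing an offline bundle from an ESXi 4.1 host is done with vihostupdate from vMA or the vSphere CLI – something along these lines, with the host in maintenance mode (the host name below is made up, and you’ll want to run the query first to get the exact bulletin ID rather than trusting my memory):

vihostupdate --server esxihost01 --query
vihostupdate --server esxihost01 --remove --bulletin <bulletin-ID-from-the-query>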
So I wrote a post a little while ago about filesystem alignment, and why I think it’s important. You can read it here. Obviously, the issue of what to do with guest OS file systems comes up from time to time too. When I asked a colleague to build some VMs for me in our lab environment with the system disks aligned he dismissed the request out of hand and called it an unnecessary overhead. I’m kind of at that point in my life where the only people who dismiss my ideas so quickly are my kids, so I called him on it. He promptly reached for a tattered copy of EMC’s Techbook entitled “Using EMC CLARiiON Storage with VMware vSphere and VMware Infrastructure” (EMC P/N h2197.5 – get it on Powerlink). He then pointed me to this nugget from the book.
I couldn’t let it go, so I reached for my copy (version 4 versus his version 3.1), and found this:
We both thought this wasn’t terribly convincing one way or another, so we decided to test it out. The testing wasn’t super scientific, nor was it particularly rigorous, but I think we got the results that we needed to move forward. We used Passmark’s PerformanceTest 7.0 to perform some basic disk benchmarks on 2 VMs – one aligned and one not. These are the settings we used for Passmark:
As you can see, it’s a fairly simple setup that we’re running with. Now here are the results of the unaligned VM benchmark.
And here are the results for the aligned VM.
We ran the tests a few more times and got similar results. So, yeah, there’s a marginal difference in performance, and you may not find it worthwhile pursuing. But in a large environment like ours, with 800+ VMs in Production, surely any opportunity to reduce the workload on the array should be taken? Of course, this all changes with Windows Server 2008, which aligns new partitions to a 1 MB offset by default. So maybe you should just sit tight until then?
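As an aside, if you want to check whether an existing Windows guest is already aligned before rebuilding anything, wmic will tell you – a StartingOffset that divides evenly by 65536 (or 1048576) is what you’re after, while the old default of 32256 means the partition isn’t aligned:

wmic partition get Name, Index, StartingOffset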
Ever since I was a boy, or, at least, ever since I started working with CLARiiON arrays (when R11 was, er, popular), I’ve been aware of the need to align file systems that live on the array. I didn’t come to this conclusion myself, but instead found it written in some performance-focused whitepapers on Powerlink. I used to use diskpar.exe with Windows 2000, and fdisk for Linux hosts. As time moved on, Microsoft introduced diskpart.exe, which did a bunch of other partition things as well. So it sometimes surprises me that people still debate the issue, at least from a CLARiiON perspective. I’m not actually going to go into why you should do it, but I am going to include a number of links that I think are useful when it comes to this issue.
It pains me to say this, but Microsoft have probably the best publicly available article on the issue here. The succinctly titled “Disk performance may be slower than expected when you use multiple disks in Windows Server 2003, in Windows XP, and in Windows 2000” is a pretty thorough examination of why you may or may not see dodgy performance from that expensive array you just bought.
Of course, it doesn’t mean that the average CLARiiON owner gets any less cranky with the situation. I can only assume that the sales guy has given them such a great spiel about how awesome their new array is that they couldn’t possibly need to do anything further to improve its performance. If you have access to the EMC Community forums, have a look at this and this.
If you have access to Powerlink you should really read the latest performance whitepaper relating to FLARE 29. It has a bunch of great stuff in it that goes well beyond file system alignment. And if you have access to the knowledge base, look for emc143897 – Do disk partitions created by Windows 2003 64-bit servers require file system alignment? – Hells yes they do.
emc151782 – Navisphere Analyzer reports disk crossings even after aligning disk partitions using the DISKPAR tool. – Disk crossings are bad. Stripe crossings are not.
emc135197 – How to align the file system on an ESX volume presented to a Windows Virtual Machine (VM). Basic stuff, but important to know if you’ve not had to do it before.
Finally, Duncan Epping’s post on VM disk alignment has some great information in an easy-to-understand diagram. I also recommend you look at the comments section, because that’s where the fun starts.
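And if you’ve never actually had to do the alignment yourself, the diskpart version looks something like this – a sketch only, using a 1024 KB offset on a hypothetical second disk, so check your array documentation for the offset it actually recommends and substitute the right disk number:

diskpart
select disk 1
create partition primary align=1024
assign letter=E
exit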
Kids, if someone says that file system alignment isn’t important, punch them in the face. In a Windows environment, get used to using diskpart.exe. In an ESX environment, create your VMFS using the vSphere client (which aligns the VMFS itself to a 64 KB boundary for you), and then make sure you’re aligning the file systems of the guests as well. Next week I’ll try and get some information together about why stripe crossings on a CLARiiON aren’t the end of the world, but disk crossings are the first sign of the apocalypse. That is all.