I’m not sure why I missed this, but I’m nothing if not extremely ignorant from time to time. We’ve been getting occasional Purple Screens on our ESXi hosts running 4.1 and EMC PowerPath/VE 5.4 SP2. Seems it might be a problem that needs some hot fixin’. Annoyingly, you need to contact EMC support to get hold of the patch. That’s not a major issue, because if you’re running PowerPath you’ll know how to do this, but it would be nice if they just belted it out on their support site. But what do I know about the complexities of patch release schedules? Hopefully the patch will be incorporated into the next dot release of PP/VE; otherwise you should consider getting hold of it if you’re having the issues described by VMware here and in EMC’s KB article emc263739 (sorry, you’ll have to search for it once you’re logged in).
We haven’t been doing this in our production configurations, but if you want to change the behaviour of SRM with regard to the “snap-xxx” prefix on replica datastores, you need to modify an advanced setting in SRM. In the vSphere client, go to the SRM plugin, right-click on Site Recovery and select Advanced Settings. Under SanProvider, there’s an option called “SanProvider.fixRecoveredDatastoreNames” with a little checkbox that needs to be ticked to prevent the recovered datastores being renamed with the unsightly prefix.
You can also do this when manually mounting snapshots or mirrors with the help of the esxcfg-volume command – but that’s a story for another time.
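The short version, for the impatient, looks something like this – this is from memory, so verify it against a volume you don’t care about first, and make sure you understand the difference between mounting and resignaturing before you go anywhere near -r:

esxcfg-volume -l (lists volumes that have been detected as snapshots or replicas)
esxcfg-volume -M <VMFS UUID or label> (persistently mounts the copy, keeping its existing signature)
esxcfg-volume -r <VMFS UUID or label> (writes a new signature so the copy can live alongside the original)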
I’ll admit I haven’t had to install APC’s PowerChute Network Shutdown product on my ESX(i) hosts lately, as the kind of DCs I’m hosting stuff in don’t cater for the UPS-in-a-rack scenario. That said, I’ve had the (mis)fortune of installing and configuring this software in years gone by. As luck would have it, VMware published a KB article recently that points to APC’s support website for information on the how and what. There are two PDF files on the APC support page – one for ESX and one for ESXi. So, if you’re into that kind of thing, you should find this useful.
VMware recently updated one of its KB articles (the catchily titled “Enabling EVC on a cluster when vCenter is running in a virtual machine”), so I thought I’d include the link here for reference. This is a useful process to understand when you already have a virtualised vCenter server and want to enable EVC. It’s a little bit unwieldy, but sometimes, particularly in legacy environments, you find yourself in these sorts of situations.
In my previous post I discussed some strange behaviour we’d observed in our PoC lab with ESXi 4.1 and the workarounds I’d used to resolve the immediate issues we had with connectivity. We spoke to our local VMware SE and his opinion was pictures or it didn’t happen (my words, not his). In his defence, we had no logs or evidence that the problem happened. Fortunately, we had the problem occur again while doing some testing with OTV.
We took a video of what happened – I’ll post that in the next week when I’ve had time to edit it to a reasonable size. In the meantime my colleague posted on the forums and also raised an SR with VMware. You can find the initial posts we looked at here and here. You can find the post my colleague put up here.
The response from VMware was the best though …
“Thank you for your Support Request.
As we discussed in the call, this is a known issue with the product ESXi 4.1, ie DCUI shows the IP of vmk0 as Management IP, though vmk0 is not configure for Management traffic.
Vmware product engineering team identified the root cause and fixed the issue in the upcoming major release for ESXi. As of now we don’t have ETA on release for this product.
The details on product releases are available in http://www.vmware.com.
I’ll put the video up shortly to demonstrate the symptoms, and hopefully will have the opportunity to follow this up with our TAM this week some time :)
I haven’t had a lot of time to find out what caused some weird behaviour in our lab recently, nor whether what I saw was expected or not. And unfortunately I don’t have screenshots. So you’ll just have to believe me. I’m following the issue up with our local VMware team this week, so hopefully I can provide a KB or something.
In our lab we have some ESXi 4.1 hosts attached to Cisco 3120 switches. Each host has a single, ether-channelled vSwitch, with portgroups and vmkernel ports for the Management Network and vMotion. For whatever reason, the network nerds in our team had to do some IOS firmware updates on the switch stack that the blades were connected to. We didn’t shut anything down, because we wanted to see what would happen.
What we saw was some really weird behaviour. Four of the eight hosts (one test data centre) had no connectivity issues at all. In the other test data centre, one of the four hosts showed no signs of a problem, another two eventually “came good” after a few hours had elapsed, and one simply wouldn’t play ball. Logging in to the DCUI showed that the Management Network now had the VLAN ID associated with the vMotion network, and had also taken on the vMotion network’s IP address. Why we have a routable vMotion network in the first place I’m not so sure, but it _appears_ that the ESXi host had simply decided to go with it. We could connect to the host directly with the vSphere client using the vMotion IP address. No matter how many times I tried to change the IP via the DCUI (reboots included), it wouldn’t stick.
Not good. To get the host sorted out, I had to remove the vMotion portgroup, re-assign the correct IP addresses from the command line, and then re-create the vMotion portgroup. Here’s how you do it:
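The removal goes something like this – from memory, so double-check the syntax and adjust the portgroup names to suit your environment (the same -d against “Management Network” applies if you need to re-create that vmknic as well, as I did below):

esxcfg-vmknic -d "vMotion"
esxcfg-vswitch -D "vMotion" vSwitch0
esxcfg-vswitch -A "vMotion" vSwitch0

With the old portgroup gone and a fresh one added back, re-create the vmkernel interfaces with the right addresses and VLAN IDs: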
esxcfg-vmknic -a "Management Network" -i 192.168.0.31 -n 255.255.255.0
esxcfg-vswitch -v 84 -p "Management Network" vSwitch0
esxcfg-vmknic -a "vMotion" -i 192.168.1.31 -n 255.255.255.0
esxcfg-vswitch -v 86 -p "vMotion" vSwitch0
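At this point it’s worth confirming the host has actually taken the settings before going any further – the standard listing commands will show the vmkernel NICs and the portgroup VLAN IDs:

esxcfg-vmknic -l
esxcfg-vswitch -l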
Then log in to vMA and run this command:
vicfg-vmknic -h labhost31.poc.com -E vMotion
And we’re back up and running. I hope to have a follow-up post when I’ve had a chance to talk it over with VMware.
If you find yourself having problems registering EMC PowerPath 5.4.1 (unsupported) or 5.4.2 (supported) on your HP blades running ESXi 4.1, consider uninstalling the HP offline bundle hpq-esxi4.luX-bundle-1.0. We did, and PowerPath was magically able to talk to the ELM server and retrieve served licenses. I have no idea why CIM-based tools would have this effect, but there you go. Apparently a fix is on the way from HP, but I haven’t verified that yet. I’ll update as soon as I know more.
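In case you haven’t had to do it before, removing an offline bundle from an ESXi 4.1 host is done with vihostupdate from vMA or the vSphere CLI – something along these lines, with the host in maintenance mode (the host name below is made up, and you’ll want to run the query first to get the exact bulletin ID rather than trusting my memory):

vihostupdate --server esxihost01 --query
vihostupdate --server esxihost01 --remove --bulletin <bulletin-ID-from-the-query>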
So I wrote a post a little while ago about filesystem alignment, and why I think it’s important. You can read it here. Obviously, the issue of what to do with guest OS file systems comes up from time to time too. When I asked a colleague to build some VMs for me in our lab environment with the system disks aligned he dismissed the request out of hand and called it an unnecessary overhead. I’m kind of at that point in my life where the only people who dismiss my ideas so quickly are my kids, so I called him on it. He promptly reached for a tattered copy of EMC’s Techbook entitled “Using EMC CLARiiON Storage with VMware vSphere and VMware Infrastructure” (EMC P/N h2197.5 – get it on Powerlink). He then pointed me to this nugget from the book.
I couldn’t let it go, so I reached for my copy (version 4 versus his version 3.1), and found this:
We both thought this wasn’t terribly convincing one way or another, so we decided to test it out. The testing wasn’t super scientific, nor was it particularly rigorous, but I think we got the results that we needed to move forward. We used Passmark’s PerformanceTest 7.0 to perform some basic disk benchmarks on 2 VMs – one aligned and one not. These are the settings we used for Passmark:
As you can see, it’s a fairly simple setup that we’re running with. Now here are the results of the unaligned VM benchmark.
And here are the results for the aligned VM.
We ran the tests a few more times and got similar results. So, yeah, there’s a marginal difference in performance, and you may not find it worthwhile pursuing. But in a large environment like ours, with 800+ VMs in Production, surely any opportunity to reduce the workload on the array should be taken? Of course, this all changes with Windows Server 2008, which aligns new partitions to a 1 MB offset by default. So maybe you should just sit tight until then?
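As an aside, if you want to check whether an existing Windows guest is already aligned before rebuilding anything, wmic will tell you – a StartingOffset that divides evenly by 65536 (or 1048576) is what you’re after, while the old default of 32256 means the partition isn’t aligned:

wmic partition get Name, Index, StartingOffset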
Ever since I was a boy, or, at least, ever since I started working with CLARiiON arrays (when R11 was, er, popular), I’ve been aware of the need to align file systems that live on the array. I didn’t come to this conclusion myself, but instead found it written in some performance-focused whitepapers on Powerlink. I used to use diskpar.exe with Windows 2000, and fdisk for Linux hosts. As time moved on, Microsoft introduced diskpart.exe, which did a bunch of other partition things as well. So it sometimes surprises me that people still debate the issue, at least from a CLARiiON perspective. I’m not actually going to go into why you should do it, but I am going to include a number of links that I think are useful when it comes to this issue.
It pains me to say this, but Microsoft have probably the best publicly available article on the issue here. The succinctly titled “Disk performance may be slower than expected when you use multiple disks in Windows Server 2003, in Windows XP, and in Windows 2000” is a pretty thorough examination of why you may or may not see dodgy performance from that expensive array you just bought.
Of course, it doesn’t mean that the average CLARiiON owner gets any less cranky with the situation. I can only assume that the sales guy has given them such a great spiel about how awesome their new array is that they couldn’t possibly need to do anything further to improve its performance. If you have access to the EMC Community forums, have a look at this and this.
If you have access to Powerlink you should really read the latest performance whitepaper relating to FLARE 29. It has a bunch of great stuff in it that goes well beyond file system alignment. And if you have access to the knowledge base, look for emc143897 – Do disk partitions created by Windows 2003 64-bit servers require file system alignment? – Hells yes they do.
emc151782 – Navisphere Analyzer reports disk crossings even after aligning disk partitions using the DISKPAR tool. – Disk crossings are bad. Stripe crossings are not.
emc135197 – How to align the file system on an ESX volume presented to a Windows Virtual Machine (VM). Basic stuff, but important to know if you’ve not had to do it before.
Finally, Duncan Epping’s post on VM disk alignment has some great information in an easy-to-understand diagram. I also recommend you look at the comments section, because that’s where the fun starts.
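And if you’ve never actually had to do the alignment yourself, the diskpart version looks something like this – a sketch only, using a 1024 KB offset on a hypothetical second disk, so check your array documentation for the offset it actually recommends and substitute the right disk number:

diskpart
select disk 1
create partition primary align=1024
assign letter=E
exit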
Kids, if someone says that file system alignment isn’t important, punch them in the face. In a Windows environment, get used to using diskpart.exe. In an ESX environment, create your VMFS using the vSphere client (which aligns the VMFS itself to a 64 KB boundary for you), and then make sure you’re aligning the file systems of the guests as well. Next week I’ll try and get some information together about why stripe crossings on a CLARiiON aren’t the end of the world, but disk crossings are the first sign of the apocalypse. That is all.