Enabling EVC on a cluster when vCenter is running in a virtual machine

VMware recently updated one of its KB articles (the catchily titled “Enabling EVC on a cluster when vCenter is running in a virtual machine”), so I thought I’d include the link here for reference. This is a useful process to understand when you already have a virtualised vCenter server and want to enable EVC. It’s a little bit unwieldy, but sometimes, particularly in legacy environments, you find yourself in these sorts of situations.

ESXi 4.1 network weirdness – why local TSM is really handy

I haven’t had a lot of time to find out what caused some weird behaviour in our lab recently, nor whether what I saw was expected or not. And unfortunately I don’t have screenshots. So you’ll just have to believe me. I’m following the issue up with our local VMware team this week, so hopefully I can provide a KB or something.

In our lab we have some ESXi 4.1 hosts attached to Cisco 3120 switches. Each host has a single, ether-channelled vSwitch, with portgroups and vmkernel ports for the Management Network and vMotion. For whatever reason, the network nerds in our team had to do some IOS firmware updates on the switch stack that the blades were connected to. We didn’t shut anything down, because we wanted to see what would happen.

What we saw was some really weird behaviour. 4 of the 8 hosts (one test data centre) had no issues with connectivity at all. In the other test data centre, 1 of the 4 hosts showed no signs of a problem. Another 2 hosts eventually “came good” after a few hours had elapsed. And one simply wouldn’t play ball. Logging in to the DCUI showed that the Management Network now had a VLAN ID associated with the vMotion network, and had also taken on the IP address of the vMotion network. Now why we have a routable vMotion network in the first place – I’m not so sure. But it _appears_ that the ESXi host had simply decided to go with it. We could connect to the host directly using the vSphere client connecting to the vMotion IP address. No matter how many times / reboots / etc I tried to change the IP via the DCUI, it wouldn’t change.

Not good. In order to get the host sorted out, I had to remove the vMotion portgroup, re-assign the correct IP address using some commands, and then re-create the vMotion portgroup. Here’s how you do it:

esxcfg-vmknic -d vMotion
esxcfg-vswitch -D vMotion vSwitch0

esxcfg-vmknic -a -i <ip-address> -n <netmask> "Management Network"

esxcfg-vswitch -v 84 -p "Management Network" vSwitch0

esxcfg-vmknic -a -i <vmotion-ip> -n <netmask> "vMotion"

esxcfg-vswitch -v 86 -p "vMotion" vSwitch0

Then log in to vMA and run this command:

vicfg-vmknic -h labhost31.poc.com -E vMotion

And we’re back up and running. I hope to have a follow-up post when I’ve had a chance to talk it over with VMware.
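For reference, the whole recovery sequence can be sketched as a dry-run script. This just echoes the commands rather than running them, and the IP addresses and netmask are placeholders rather than our lab values; on the actual host you’d run them directly in the local Tech Support Mode shell.

```shell
# Dry-run sketch of the recovery sequence. run() echoes each command
# instead of executing it; swap it out on a real host.
# MGMT_IP, VMO_IP and NETMASK are placeholder values.
run() { echo "+ $*"; }

MGMT_IP="10.0.84.31"        # placeholder management IP
VMO_IP="10.0.86.31"         # placeholder vMotion IP
NETMASK="255.255.255.0"     # placeholder netmask

run esxcfg-vmknic -d vMotion                                 # drop the vmknic that took the wrong IP
run esxcfg-vswitch -D vMotion vSwitch0                       # remove the vMotion portgroup
run esxcfg-vmknic -a -i "$MGMT_IP" -n "$NETMASK" "Management Network"
run esxcfg-vswitch -v 84 -p "Management Network" vSwitch0    # re-tag the management VLAN
run esxcfg-vmknic -a -i "$VMO_IP" -n "$NETMASK" "vMotion"
run esxcfg-vswitch -v 86 -p "vMotion" vSwitch0               # re-tag the vMotion VLAN
```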

2009 and penguinpunk.net

It was a busy year, and I don’t normally do this type of post, but I thought I’d try a year-in-review so I can look back at the end of 2010 and see what kind of promises I’ve broken. Also, the Exchange Guy will no doubt enjoy the size comparison. You can see what I mean by that here.

In any case, here’re some broad stats on the site. In 2008 the site had 14966 unique visitors according to Advanced Web Statistics 6.5 (build 1.857). But in 2009, it had 15856 unique visitors – according to Advanced Web Statistics 6.5 (build 1.857). That’s an increase of some 890 unique visitors, also known as year-on-year growth of approximately 5.95%. I think. My maths are pretty bad at the best of times, but I normally work with storage arrays, not web statistics. In any case, most of the traffic is no doubt down to me spending time editing posts and uploading articles, but it’s nice to think that it’s been relatively consistent, if a little lower than I’d hoped. This year (2010 for those of you playing at home) will be the site’s first full year using Google Analytics, so assuming I don’t stuff things up too badly, I’ll have some prettier graphs to present this time next year. That said, MYOB / smartyhost are updating the web backend shortly, so I can’t make any promises that I’ll have solid stats for this year, or even a website :)
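For anyone who wants to check my arithmetic, the growth figure falls out of a one-liner using the visitor counts quoted above:

```shell
# Year-on-year growth from the unique visitor counts quoted above.
old=14966
new=15856
growth=$(awk -v o="$old" -v n="$new" 'BEGIN { printf "%.2f", (n - o) / o * 100 }')
echo "increase: $((new - old)) visitors, growth: ${growth}%"
# prints: increase: 890 visitors, growth: 5.95%
```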

What were the top posts? Couldn’t tell you. I do, however, have some blogging-type goals for the year:

1. Blog with more focus and frequency – although this doesn’t mean I won’t throw in random youtube clips at times.

2. Work more on the promotion of the site. Not that there’s a lot of point promoting something if it lacks content.

3. Revisit the articles section and revise where necessary. Add more articles to the articles page.

On the work front, I’m architecting the move of my current employer from a single data centre to a 2+1 active / active architecture (from a storage and virtualisation perspective). There’s more blades, more CLARiiON, more MV/S, some vSphere and SRM stuff, and that blasted Cisco MDS fabric stuff is involved too. Plus a bunch of stuff I’ve probably forgotten. So I think it will be a lot of fun, and a great achievement if we actually get anything done by June this year. I expect there’ll be some moments of sheer boredom as I work my way through 100s of incremental SAN Copies and sVMotions. But I also expect there will be moments of great excitement when we flick the switch on various things and watch a bunch of visio illustrations turn into something meaningful.

Or I might just pursue my dream of blogging about the various media streaming devices on the market. Not sure yet. In any case, thanks for reading, keep on reading, tell your friends, and click on the damn Google ads.

VCP410 exam pass

I passed my VCP410 exam yesterday with a score of 450. I’m pleased to have finally gotten it out of the way as, even though I had signed up for the second-shot voucher with VMware, there seem to be no free slots in the 4 testing centres in Brisbane this month. After the epic fail of my previous employer to stay afloat, I also had to pony up the AU$275 myself, so I felt a little bit more pressure than I normally would when taking one of these exams.

I found the following resources of particular use:

Brian’s list of “Useful study material”;

Duncan’s VCP 4 post;

Simon Long’s blog and practice exams;

and the VCP4 Mock Exam from VMware on the mylearn page.

I also recommend you read through as much reference material and as many admin guides as you can, and remember that what you’re taught in the course doesn’t always correlate with what you see in the exam. Good luck!

So you’ve changed the IP address and munted something, now what?

Yesterday a colleague of mine was having some issues performing sVMotions on guests sitting in a development ESX 3.5 cluster. He kept getting an error along the lines of:

“IP address change for 10.x.x.x to 10.x.x.y not handled, SSL certificate verification is not enabled.”

They had changed the Service Console IP address of the host manually to perform some “secure” guest migrations previously (don’t ask me why – there’s always my way or the hard way), and basically the IP address of the host hadn’t been updated in the vpxa.cfg file. VMware has a 2-3 step process to resolve the issue, which ultimately will require you to pull the host out of the cluster and re-add it to vCenter. It’s not a big deal, but it can be confusing when things seem to be working, but aren’t really. You can read more about it here.

The Basics AKA Death by Screenshot

I’ve created a new page, imaginatively titled “Articles”, that has a number of articles I’ve done recently covering various simple operational or implementation-focused tasks. You may or may not find them useful. I hope this doesn’t become my personal technical documentation graveyard, although I have a feeling that a number of the documents will probably stay at version 0.01 until such time as the underlying technology no longer exists. Enjoy!

sVMotion with snapshot bad

You know when it says in the release notes, and pretty much every forum on the internet, that doing sVMotion migrations with snapshots attached to a vmdk is bad? Turns out they were right, and you might just end up munting your vmdk file in the process. So you might just need this link to recreate the vmdk. You may find yourself in need of this process to commit the snapshot as well. Or, if you’re really lucky, you’ll find yourself with a vmsn file that references a missing vmdk file. Wow, how rad! To work around this, I renamed the vmsn to .old, ran another snapshot, and then committed the snapshots. I reiterate that I think snapshots are good when you’re in a tight spot, in the same way that having a baseball bat can be good when you’re attacked in your home. But if you just go around swinging randomly, something’s going to get broken. Bad analogy? Maybe, but I think you get what I’m saying here.

To recap, when using svmotion.pl with VIMA, here’s the syntax:

svmotion.pl --datacenter=network.internal --url=https://virtualcenter.network.internal/sdk --username=vcadmin --vm="[VMFS_02] host01/host01.vmx:VMFS_01"
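The --vm argument is the part that trips people up: it’s the source datastore in square brackets, then the path to the .vmx, then a colon and the destination datastore. A quick sketch of how the string is assembled (the datastore and VM names are just the examples from the command above):

```shell
# Assemble the svmotion.pl --vm argument: "[source_ds] path/to.vmx:dest_ds"
src_ds="VMFS_02"
vmx_path="host01/host01.vmx"
dst_ds="VMFS_01"

vm_arg="[${src_ds}] ${vmx_path}:${dst_ds}"
echo "--vm=\"${vm_arg}\""
# prints: --vm="[VMFS_02] host01/host01.vmx:VMFS_01"
```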

Of course, my preferred method is here:

svmotion --interactive


VMware VirtualCenter 2.5 Update 4

I’ve been nuts deep in a SAN migration project recently and promptly missed the announcement that VMware VirtualCenter 2.5 Update 4 is now available for download. I haven’t had time to put it through its paces yet, but noticed in the release notes that some plugins have been updated, some more useful things have been added to Virtual Machine monitoring, and this little nugget with esxcfg-mpath (a command dear to my heart) still isn’t fixed. But, hey, it’s still better than Sun’s CAM.

What have I been doing? – Part 3

CPU ID Masking

Sometimes vendors (in this case Dell) sell servers that have the same model name (in this case Dell PowerEdge 2950), but with CPUs from Intel that are incompatible as far as VMotion is concerned. In this case we had two existing nodes, purchased about 12 months ago, running Intel Xeon E5310 CPUs, and two new nodes running Intel Xeon E5410 CPUs. Even though they’re both dual-socket, quad-core hosts, and even though they are ostensibly the same machines, running the same BIOS revisions, VMotion doesn’t like it. This isn’t Dell’s fault, necessarily; they like to sell whatever CPU is cheap and performance-focused at the time. It just so happens that, in the last 12 months, Intel have made a number of developments, and have started selling CPUs that do more for the same price as the older ones. I have a friend who used to work at Intel and knew all of the model names and codes and compatibilities. I don’t though, so have a look here for the rundown on which Xeon chips are which. Basically, moving from Clovertown to Harpertown has caused the problem. Awesome.
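One quick way to see the mismatch from the Service Console is to compare the CPU feature flags (from /proc/cpuinfo) on an old and a new host. The flag lists below are abbreviated, illustrative examples rather than output captured from our hosts, but the gist is that the Harpertown generation advertises features (notably SSE4.1) that Clovertown lacks, which is exactly what VMotion objects to:

```shell
# Compare illustrative feature-flag lists for the two CPU generations.
# These lists are abbreviated examples, not full /proc/cpuinfo output.
clovertown_flags="fpu tsc msr sse sse2 ssse3 cx16"
harpertown_flags="fpu tsc msr sse sse2 ssse3 cx16 sse4_1"

new_only=""
for f in $harpertown_flags; do
  case " $clovertown_flags " in
    *" $f "*) ;;                      # present on both generations
    *) new_only="$new_only $f" ;;     # only on the newer CPUs
  esac
done
echo "flags on the new hosts only:$new_only"
# prints: flags on the new hosts only: sse4_1
```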

When we tried to VMotion a Virtual Machine from an existing host to a new host, we got this error:


VMware has a spiel on the issue here, and some possible solutions, depending on the version of ESX you’re running. A few other people have had the issue, and have discussed it here and here.

What’s frustrating is that we were able to VMotion from the Acer hosts running dual-core CPUs to the existing quad-core Dell hosts with no problem. To think that we couldn’t then go from the Dell hosts to the new Dell hosts seems just, well, silly.

I didn’t want to setup the CPU ID masking on each Virtual Machine, so I elected to make the change on the VirtualCenter host. I edited the vpxd.cfg file, which is located by default in “C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter”. Here’s what I setup on the VirtualCenter host:


I may have put too many settings in, but it worked fine. That said, if I were the customer and the consultant had to resort to this, I’d be chasing my sales guy for an upgrade to some compatible CPUs.

Upgrading VMware Tools on Netware 6.5

If you find yourself working on what some uncharitably call the dodo of network operating systems – NetWare 6.5 – and need to upgrade the VMware Tools – these instructions will help get you on your way.

1. Select “Install VMware Tools” on the VM.

2. The VMware Tools volume automounts on the NetWare guest.

3. Run vmwtools:\setup.ncf.

4. Some stuff happens (sorry, I forgot to take screenshots).

5. Once the upgrade is complete, you’ll see “VMware Tools for NetWare are now running”. You should then reboot the guests.

esXpress upgrade notes

esXpress from PHD Technologies is a neat backup and recovery tool. According to the website, it is “The ONLY scalable and completely fault tolerant backup, restoration and disaster recovery solution for Virtual Infrastructure 3.x. Whether you have 1TB of data or 50TB, esXpress makes a 100% complete backup of your entire virtual farm every night”. So maybe they’re blowing their own trumpet a little, but whenever I’ve had to use esXpress it’s been a pleasant experience. More so than some enterprise backup tools that I won’t name at this time. Well, okay, it’s nicer than reconfiguring devices in EMC NetWorker, or getting any kind of verbose logging out of Symantec Backup Exec.

There are some straightforward documents on the website that will get you started. The first thing you should look at is the Installation Guide. The next thing you should be looking at is the User’s Manual, and, well, you should probably consider reading up on the whole recovery process too.

The existing nodes already had an older version of esXpress installed, so a lot of the initial setup (ie the hard bit) had been done for me already. Hooray for other people doing work and saving me time!

To upgrade your version of esXpress, you need to uninstall the old version of the rpm and install the new one. Removing the old version can be achieved by running the following commands:

rpm -e esxpress
rpm -e esxpressVBA

This will remove the application but not delete your backups.

Then install the new version of esXpress and the VBA and import the previous configuration. Running phd-import will import the previous esXpress settings, including licenses and so on. It can also import settings from another host, which will save some time.
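The whole upgrade then boils down to a short sequence, sketched here as a dry-run (run() echoes rather than executes, and the new package filenames are hypothetical placeholders; substitute whatever PHD actually ships):

```shell
# Dry-run of the esXpress upgrade; run() echoes instead of executing.
# The *-new.rpm package filenames are hypothetical placeholders.
run() { echo "+ $*"; }

run rpm -e esxpress                 # remove the old version (backups are kept)
run rpm -e esxpressVBA
run rpm -ivh esxpress-new.rpm       # hypothetical new package filename
run rpm -ivh esxpressVBA-new.rpm    # hypothetical new VBA package filename
run phd-import                      # import previous settings and licenses
```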

It’s good stuff, and it works, so check it out.

What have I been doing? – Part 1

I recently had the “pleasure” of working on a project before Christmas that had a number of, er, interesting elements involved.  During the initial scoping, the only thing mentioned was two new arrays (with MirrorView/Asynchronous), a VMware ESX upgrade, and a few new ESX hosts.  But here’s what there really was:

– 4 NetWare 6.5 hosts in 2 NCS clusters;
– An EMC CLARiiON CX200 (remember them?) hosting a large amount (around 5TB) of NetWare and VMware data;
– A single McData switch running version 7 firmware;
– 2 new Dell hosts with incompatible CPUs with the existing 2950 hosts;
– A memory upgrade to the two existing nodes that meant one host had 20GB and the other had 28GB;
– A MirrorView target full of 1TB SATA-II spindles;
– A DR target with only one switch;
– Singly-attached (ie one HBA) hosts everywhere;
– An esXpress installation that needed to be upgraded / re-installed;
– A broken VUM implementation.

Hmmm, sound like fun? It kind of was, just because some of the things I had to do to get it to work were things I wouldn’t normally expect to do.  I don’t know whether this is such a good thing.  There’re a number of things that popped up during the project, each of which would benefit from dedicated blog posts.  But given that I’m fairly lazy, I think I’ll try and cram it all into one post.

Single switches and single HBAs are generally a bad idea

<rant> When I first started working on SANs about 10 minutes ago, I was taught that redundancy in a mid-range system is a good thing. The components that go into your average mid-range system, while being a bit more reliable than your average gamedude’s gear, are still prone to failure. So you build a level of redundancy into the system such that when, for whatever reason, a component fails (such as a disk, fibre cable, switch or HBA), the system stays up and running. On good systems, the only people who know there’s a failure are the service personnel called out to replace the broken component in question. On a cheapy system, like the one you keep the Marketing Department’s critical morning tea photos on, a few more people might know about it. Mid-range disk arrays can run into the tens and hundreds of thousands of dollars, so sometimes people think they can save a bit of cash by cutting a few corners – for example, leaving the nodes with single HBAs, or having only one switch at the DR site, or using SATA as a replication target. But I would argue that, given you’re spending all of this cash on a decent mid-range array, why wouldn’t you do all you can to ensure it’s available all the time? Saying “My cluster provides the resiliency / We’re not that mission critical / I needed to shave $10K off the price” strikes me as counter-intuitive to the goal of providing reliable, available and sustainable infrastructure solutions. </rant>

All that said, I do understand that sometimes the people making the purchasing decisions aren’t necessarily the best-equipped people to understand the distinction between single- and dual-attached hosts, and what good performance is all about. All I can suggest is that you start with a solid design, and do the best you can to keep that design through to deployment. So what should you be doing? For a simple FC deployment (let’s assume two switches, one array, two HBAs per host), how about something like this?


Notice that there’s no connection between the two FC switches here. That’s right kids, you don’t want to merge these fabrics. The idea is that if you munt the config on one switch, it won’t automatically pass that muntedness on to the peer switch. This is a good thing if you, like me, like to do zoning from the CLI but occasionally forget to check the syntax and spelling before you make changes. And for the IBM guys playing at home, the “double redundant loops” excuse doesn’t apply to the CLARiiON. So do yourself a favour, and give yourself 4 paths to the box, dammit!

And don’t listen to Apple and get all excited about just using one switch either – not that they’re necessarily saying that, of course … Or that they’re necessarily saying anything much at all about storage any more, unless Time Capsules count as storage. But I digress …