VMware Lab Manager, ssmove.exe and why I don’t care

Sounds like a depressing topic, but really it’s not all bad. As I’d mentioned previously, I’ve spent a good chunk of the past 4 months commissioning a CLARiiON CX4-960 array and migrating data from our production CX3-40f and CX700. All told, there’s about 112TB in use, and I’ve moved about 90TB so far. I’ve had to use a number of different methods, including Incremental SAN Copy, sVMotion, vmkfstools and, finally, ssmove. For those of you who pay attention to more knowledgeable people’s blogs, Scott Lowe had a succinct but useful summary of how to use the ssmove utility here. So I had to move what amounted to about 3TB of SATA-II configs in a Lab Manager 3.0.1 environment. You can read the VMware KB article for the instructions, but ultimately it’s a very simple process. Except when it doesn’t work. And by “doesn’t work” I mean “wait 25 hours and see no progress” doesn’t work. So I got to spend about 6 hours on the phone with the Live queue, and the SR took a long time to resolve. The utility doesn’t provide a lot in terms of logging, nor does it give you much information when it’s not working and has ultimately timed out. It’s always the last 400GB that we get stuck on with data migrations, isn’t it?

The solution involved manually migrating the vmdk files and then updating the database. There’s an internal-only KB article that refers to the process, but VMware don’t really want to tell you about it, because it’s a bit hairy. Hairier still was the fact that we only had a block replica of the environment, and rolling back would have meant losing all the changes I’d made over the weekend. The fortunate thing is that this particular version of ssmove does a copy, not a move, so we were able to cancel the failed ssmove process and still use the original, problematic configuration. If you find yourself needing to migrate LM datastores and ssmove isn’t working for you, let me know and I can send you the KB reference for the process to do it manually.

So to celebrate the end of my involvement in the project, I thought I’d draw a graph. Preston is a lot better at graphs than I am, but I thought this one summed up quite nicely my feelings about this project.

Broken sVMotion

It’s been a very long few weeks, and I’m looking forward to taking a few weeks off. I’ve been having a lot of fun doing a data migration from a CLARiiON CX3-40f and CX700 to a shiny, new CX4-960. To give you some background, instead of doing an EMC-supported data-in-place upgrade in August last year, we decided to buy another CX4-960 (on top of the already purchased CX4-960 upgrade kit) and migrate the data manually to the new array. The only minor problem with this is that there’s about 100TB of data in various forms that I need to get onto the new array. Sucks to be me, but I am paid by the hour.

I started off by moving VMs from one cluster to the new array using sVMotion, as there was a requirement to be a bit non-disruptive where possible. Unfortunately, on the larger volumes attached to fileservers, I had a few problems. I’ll list them, just for giggles:

There were three 800GB volumes, five 500GB volumes and one 300GB volume that had 0% free space on the VMFS. And I mean 0%, not 5MB or 50MB, 0%. That’s not cool for a few reasons. ESX likes to update journaling data on VMFS, because it’s a journaling filesystem. If you don’t give it space to do this, it can’t, and you’ll find volumes start to get remounted with very limited writability. If you try to storage VMotion these volumes, you’ll again be out of luck, as it wants to keep a dmotion file on the filesystem to track any changes to the vmdk file while the migration is happening. I found my old colleague Leo’s post to be helpful when a few migrations failed, but unfortunately the symptoms he described were not the same as mine; in my case the VMs fell over entirely. More info from VMware can be had here.
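
As an aside, it’s worth checking how much headroom a VMFS volume actually has before you kick anything off. Something like this from the service console does the trick (the datastore name here is made up, obviously):

vmkfstools -P /vmfs/volumes/big_fileserver_vmfs
# -P / --queryfs reports the volume's capacity and free space, along with
# which vmhba partitions it spans

If the free space figure is already at zero, sort that out before you even think about a storage VMotion.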

If you want to move just a single volume, you can try your luck with this method, which I’ve used successfully before. But I was tired, and wanted to use vmkfstools since I’d already had an unexpected outage and had to get something sorted.
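
If memory serves, the svmotion command that ships with the VI Remote CLI is the least painful way to do a single VM, and its interactive mode saves you from getting the datastore path syntax wrong:

svmotion --interactive
# walks you through the VirtualCenter URL, credentials, the VM's current
# datastore path and the destination datastore, then kicks off the relocation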

The problem with vmkfstools is that there’s no restartable copy option – as far as I know. So when you get 80% through a 500GB file and it fails, well, that’s 400GB of pointless copying and time you’ll never get back. Multiply that out over three 800GB volumes and a few ornery 500GB vmdks and you’ll start to get a picture of what kind of week I had.
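
For reference, the sort of thing I was running looked roughly like this (paths are made up, and the destination directory needs to exist first):

mkdir /vmfs/volumes/new_datastore/fileserver01
vmkfstools -i /vmfs/volumes/old_datastore/fileserver01/fileserver01.vmdk \
  /vmfs/volumes/new_datastore/fileserver01/fileserver01.vmdk -d thick
# -i clones the source vmdk to the destination; there's no resume, so if it
# dies at 80% you delete the partial copy and start the whole thing again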

After suffering through a number of failures, I ended up taking one node out of the 16-node cluster and placing it and its associated datastores (the ones I needed to migrate) in their own Navisphere storage group. That way, there’d be no “busy-ness” affecting the migration process (we had provisioned about 160 LUNs to the cluster at this stage and we were, obviously, getting a few “SCSI reservation conflicts” and “resource temporarily unavailable” issues). This did the trick, and I was able to get some more stuff done. Now there’s only about 80TB to go before the end of April. Fun times.
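
If you’re wondering whether you’re hitting the same thing, the vmkernel log on the service console is where those errors turn up:

grep -i "reservation conflict" /var/log/vmkernel*
# on a busy cluster with a lot of LUNs presented you'll see these scattered
# through the log whenever hosts are fighting over the same LUNs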

And before you ask why I didn’t use SAN Copy: I don’t know. I suppose I’ve never had the opportunity to test it with ESX, and while I know the underlying technology is the same as MirrorView, I just didn’t feel I was in a position to make that call. I probably should have just done it, but I didn’t really expect that I’d have as much trouble as I did with sVMotion and / or vmkfstools. So there you go.

Locked vmdk files

Somehow, a colleague of mine put an ESX host in a cluster into maintenance mode while VMs were still running. Or maybe it just happened to crash when she was about to do this. I don’t know how, and I’m not sure I still believe it, but I saw some really weird stuff last week. The end result was that VMs powered off ungracefully, the host became unresponsive, and things were generally bad. We started adding VMs back to other hosts, but one VM had locked files. Check out this entry at Gabe’s Virtual World on how to address this, but basically you want to ps, grep and kill -9 some stuff.

ps -elf | grep vmname

kill -9 PID

And you’ll find that it’s probably the vmdk files that are locked, not necessarily the vmx file.
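
If killing the world doesn’t free things up, it helps to work out which host actually owns the lock. If I remember right, vmkfstools -D dumps the lock information, including the MAC address of the owning host, into the vmkernel log (paths here are placeholders):

vmkfstools -D /vmfs/volumes/datastore/vmname/vmname-flat.vmdk
tail /var/log/vmkernel
# the owner field in the lock output gives you the MAC address of the ESX
# host holding the lock, which tells you where to do the ps and kill from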

VCP410 exam pass

I passed my VCP410 exam yesterday with a score of 450. I’m pleased to have finally gotten it out of the way as, even though I had signed up for the second-shot voucher with VMware, there seemed to be no free slots in the 4 testing centres in Brisbane this month. After the epic fail of my previous employer to stay afloat, I also had to pony up the AU$275 myself, so I felt a little bit more pressure than I normally would when taking one of these exams.

I found the following resources of particular use:

Brian’s list of “Useful study material”;

Duncan’s VCP 4 post;

Simon Long’s blog and practice exams;

and the VCP4 Mock Exam from VMware on the mylearn page.

I also recommend you read through as much of the reference material / admin guides as you can, and remember that what you’re taught in the course doesn’t always correlate with what you see in the exam. Good luck!

vmkfstools sometimes not so nice

I’m hoping to do some posts on the vCLI when I have some time in the near future. But I was trawling through the release notes and came across this particular gem in the known issues section:

“Running vmkfstools -C does not prompt for confirmation. When you run vmkfstools -C to create a VMFS (Virtual Machine File System) on a partition that already has a VMFS on it, the command erases the existing VMFS and creates the new VMFS without prompting for confirmation.
Workaround: No workaround. Check the command carefully before running it.”

Nice! I can imagine a certain special consternation one would feel when running this on the wrong production LUN …
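
So if you do need to run it, it’s worth double-checking what’s actually behind the target first. Something along these lines (device and datastore names are made up):

vmkfstools -P /vmfs/volumes/the_datastore_i_think_it_is
# the "Partitions spanned" line tells you which vmhba device backs the volume
vmkfstools -C vmfs3 -b 1m -S new_datastore vmhba1:0:20:1
# only run the -C once you're certain vmhba1:0:20:1 isn't holding anything you love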

VMware ESX Cluster in a Box

This is very old news, but one of the neat things about ESX is that you can build clusters for testing and don’t have to shell out for a split-bus DAS or SAN space. My colleague, who’s been using ESX for good and evil (testing Veritas Cluster Server), needed to build one in our dev environment the other day. The last time I did this was to cluster VirtualCenter 2.0.2, so I was a bit rusty on the process.

So, for my reference, do this:

Create a shared vmdk to act as the quorum disk using vmkfstools (it needs to be in thick format)

vmkfstools -c 512m -a lsilogic -d thick /vmfs/volumes/datastore/quorum.vmdk

-c to create a vmdk and what size it should be
-a to specify the adapter type (lsilogic or buslogic)
-d to specify the disk format (thick in this case); it can also specify some neat things, like RDM settings, etc.

Make another vmdk if you’d like to, well, share some data between the nodes.

Add the disk to the guest as an existing device and change the SCSI ID of the card to something like SCSI (1:0). This adds another SCSI adapter to the guest. Set the sharing mode on the adapter to virtual or physical, depending on whether you want the cluster in the box or sharing with another physical / virtual host outside of the ESX host. Note that if you don’t specify the -a option when you create the vmdk, it defaults to buslogic. Obviously, if you want to share the disk with another physical host, you can’t use a vmdk.
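
For what it’s worth, the end result in the guest’s .vmx file looks something like this (values are from memory, so treat them as a rough guide rather than gospel):

scsi1.present = "TRUE"
scsi1.virtualDev = "lsilogic"
scsi1.sharedBus = "virtual"
scsi1:0.present = "TRUE"
scsi1:0.fileName = "/vmfs/volumes/datastore/quorum.vmdk"
scsi1:0.deviceType = "scsi-hardDisk"

Change sharedBus to "physical" if the other node lives outside the ESX host.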

Do the same on the other guest / physical host / etc. Install MSCS or VCS or whatever passes as a clustering solution in your life. Enjoy. I also recommend the man page for vmkfstools – it’s an invaluable reference.