It’s been a very long few weeks, and I’m looking forward to taking a few weeks off. I’ve been having a lot of fun doing a data migration from a CLARiiON CX3-40f and CX700 to a shiny, new CX4-960. To give you some background, instead of doing an EMC-supported data-in-place upgrade in August last year, we decided to buy another 4-960 (on top of the already purchased 4-960 upgrade kit) and migrate the data manually to the new array. The only minor problem with this is that there’s about 100TB of data in various forms that I need to get on to the new array. Sucks to be me, but I am paid by the hour.
I started off by moving VMs from one cluster to the new array using sVMotion, as there was a requirement to be a bit non-disruptive where possible. Unfortunately, on the larger volumes attached to fileservers, I had a few problems. I’ll list them, just for giggles:
There were three 800GB volumes, five 500GB volumes and one 300GB volume that had 0% free space on the VMFS. And I mean 0%, not 5MB or 50MB, 0%. So that’s not cool for a few reasons. ESX likes to update journaling data on VMFS, because it’s a journaling filesystem. If you don’t give it space to do this, it can’t do it, and you’ll find volumes start to get remounted with very limited writeability. If you try to Storage VMotion these volumes, you’ll again be out of luck, as it wants to keep a dmotion file on the filesystem to track any changes to the vmdk file while the migration is happening. I found my old colleague Leo’s post to be helpful when a few migrations failed, but unfortunately the symptoms he described were not the same as mine: in my case the VMs fell over entirely. More info from VMware can be had here.
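If I’d checked for full datastores before kicking anything off, I’d have saved myself some grief. Here’s a minimal sketch of that kind of pre-flight check in Python. The function names and the 5% threshold are my own assumptions for illustration (not a VMware recommendation), and it assumes the datastores are visible as ordinary mounted paths wherever you run it.

```python
import shutil


def free_fraction(total_bytes, free_bytes):
    """Return the fraction of a volume that is still free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def datastores_too_full(paths, min_free_fraction=0.05, usage=shutil.disk_usage):
    """Return the paths whose free space is below min_free_fraction.

    min_free_fraction=0.05 is an arbitrary illustrative threshold.
    `usage` is injectable so the check can be exercised without real volumes.
    """
    flagged = []
    for path in paths:
        stats = usage(path)  # has .total, .used and .free, all in bytes
        if free_fraction(stats.total, stats.free) < min_free_fraction:
            flagged.append(path)
    return flagged
```

You’d call it with something like `datastores_too_full(["/vmfs/volumes/fileserver01"])`, where the datastore name is made up, and deal with anything it returns before attempting a migration.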
If you want to move just a single volume, you can try your luck with this method, which I’ve used successfully before. But I was tired, and wanted to use vmkfstools since I’d already had an unexpected outage and had to get something sorted.
The problem with vmkfstools is that there’s no restartable copy option, as far as I know. So when you get 80% through a 500GB file and it fails, well, that’s 400GB of pointless copying and time you’ll never get back. Multiply that out over three 800GB volumes and a few ornery 500GB vmdks and you’ll start to get a picture of what kind of week I had.
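To show what I mean by “restartable”, here’s a rough Python sketch of a copy that picks up where a failed run left off instead of starting from zero. It is not a replacement for vmkfstools, which understands VMDK formats and VMFS in ways a plain byte copy doesn’t. The function name and chunk size are my own, and it assumes whatever is already in the destination file is a valid prefix of the source.

```python
import os

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per read; an arbitrary choice


def resumable_copy(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst, resuming from dst's current length if it exists.

    Assumes any existing bytes in dst are a correct prefix of src.
    Returns the number of bytes copied during this run.
    """
    # Resume point: however much of the destination is already on disk.
    offset = os.path.getsize(dst) if os.path.exists(dst) else 0
    copied = 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(offset)
        while True:
            block = fin.read(chunk_size)
            if not block:
                break  # reached the end of the source
            fout.write(block)  # append mode always writes at the end of dst
            copied += len(block)
    return copied
```

With something like this, a failure at the 80% mark of a 500GB file would cost you the remaining 100GB on the retry rather than the whole 500GB again.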
After suffering through a number of failures, I ended up taking one node out of the 16-node cluster and placing it and its associated datastores (the ones I needed to migrate) in their own Navisphere storage group. That way, there’d be no “busy-ness” affecting the migration process (we had provisioned about 160 LUNs to the cluster at this stage and we were, obviously, getting a few “SCSI reservation conflicts” and “resource temporarily unavailable” issues). This did the trick, and I was able to get some more stuff done. Now there’s only about 80TB to go before the end of April. Fun times.
And before you ask, why didn’t I use SAN Copy? I don’t know. I suppose I’ve never had the opportunity to test it with ESX, and while I know that the underlying technology is the same as MirrorView, I just really didn’t feel I was in a position to make that call. I probably should have just done it, but I didn’t really expect that I’d have as much trouble as I did with sVMotion and / or vmkfstools. So there you go.