VMware Lab Manager, ssmove.exe and why I don’t care

Sounds like a depressing topic, but really it’s not all bad. As I’d mentioned previously, I’ve spent a good chunk of the previous 4 months commissioning a CLARiiON CX4-960 array and migrating data from our production CX3-40f and CX700. All told, there’s about 112TB in use, and I’ve moved about 90TB so far. I’ve had to use a number of different methods, including Incremental SAN Copy, sVMotion, vmkfstools, and, finally, ssmove. For those of you who pay attention to more knowledgeable people’s blogs, Scott Lowe had a succinct but useful summary of how to use the ssmove utility here. So I had to move what amounted to about 3TB of SATA-II configs in a Lab Manager 3.0.1 environment. You can read the VMware KB article for the instructions, but ultimately it’s a very simple process. Except when it doesn’t work. By doesn’t work I mean wait for 25 hours and see no progress doesn’t work. So I got to spend about 6 hours on the phone with the Live queue, and the SR took a long time to resolve. The utility really doesn’t provide a lot in terms of logging, nor does it provide a lot of information if it’s not working but has ultimately timed out. It’s always the last 400GB that we get stuck on with data migrations, isn’t it?

The solution involved manually migrating the vmdk files and then updating the database. There’s an internal-only KB article that refers to the process, but VMware don’t really want to tell you about it, because it’s a bit hairy. Hairier still was the fact that we only had a block replica of the environment, and rolling back would have meant losing all the changes that I’d made over the weekend. The fortunate thing is that this particular version of ssmove does a copy, not a move, so we were able to cancel the failed ssmove process and still use the original, problematic configuration. If you find yourself needing to migrate LM datastores and ssmove isn’t working for you, let me know and I can send you the KB reference for the process to do it manually.

So to celebrate the end of my involvement in the project, I thought I’d draw a graph. Preston is a lot better at graphs than I am, but I thought this one summed up quite nicely my feelings about this project.

By the numbers – Part 1

As I mentioned in the previous post, I’ve been working on a large data migration project. After a brief hiatus, I’m back at work, and thought I’d take a moment to share what I’ve done so far.

  • Attached 56 hosts via dual paths to the new array.
  • Created 234 new zonesets.
  • Created 23 Storage groups.
  • Created 131 RAID Groups.
  • Added 26 hot spare disks.
  • Designed and provisioned 620 LUNs. This includes 52 4-component MetaLUNs.
  • Established 33 Incremental SAN Copy Sessions.

I don’t know how many sVMotions I’ve done so far, but it feels like a lot. I can’t exactly say how many TB I’ve moved yet either, but by the end we’ll have moved over 112TB of configured storage. Once I’ve finished this project – by end of June this year – I’ll tally up the final numbers and make a chart or something.

Broken sVMotion

It’s been a very long few weeks, and I’m looking forward to taking a few weeks off. I’ve been having a lot of fun doing a data migration from a CLARiiON CX3-40f and CX700 to a shiny, new CX4-960. To give you some background, instead of doing an EMC-supported data-in-place upgrade in August last year, we decided to buy another 4-960 (on top of the already purchased 4-960 upgrade kit) and migrate the data manually to the new array. The only minor problem with this is that there’s about 100TB of data in various forms that I need to get on to the new array. Sucks to be me, but I am paid by the hour.

I started off by moving VMs from one cluster to the new array using sVMotion, as there was a requirement to be a bit non-disruptive where possible. Unfortunately, on the larger volumes attached to fileservers, I had a few problems. I’ll list them, just for giggles:

There were three 800GB volumes, five 500GB volumes and one 300GB volume that had 0% free space on the VMFS. And I mean 0%, not 5MB or 50MB, 0%. So that’s not cool for a few reasons. ESX likes to update journaling data on VMFS, because it’s a journaling filesystem. If you don’t give it space to do this, it can’t do it, and you’ll find volumes start to get remounted with very limited writeability. If you try to Storage VMotion these volumes, you’ll again be out of luck, as it wants to keep a dmotion file on the filesystem to track any changes to the vmdk file while the migration is happening. I found my old colleague Leo’s post to be helpful when a few migrations failed, but unfortunately the symptoms he described were not the same as mine; in my case the VMs fell over entirely. More info from VMware can be had here.
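If you want to catch this before you start a migration rather than after a VM falls over, a quick pre-flight check is easy to sketch. The snippet below is only an illustration: the datastore labels and the 1% threshold are my own assumptions, not anything VMware publishes, and you would feed it capacity and free-space figures from wherever you normally collect them.

```python
from typing import List, Tuple

# (label, capacity in GB, free space in GB); labels are hypothetical examples
Datastore = Tuple[str, float, float]


def free_percent(capacity_gb: float, free_gb: float) -> float:
    """Return free space as a percentage of datastore capacity."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return free_gb / capacity_gb * 100


def too_full_to_migrate(datastores: List[Datastore],
                        min_free_percent: float = 1.0) -> List[str]:
    """List datastores whose free space is below the (assumed) threshold."""
    return [
        label
        for label, capacity_gb, free_gb in datastores
        if free_percent(capacity_gb, free_gb) < min_free_percent
    ]


datastores = [
    ("fileserver_800gb", 800, 0),   # completely full, like the volumes above
    ("fileserver_500gb", 500, 0),
    ("healthy_example", 500, 50),   # 10% free
]
print(too_full_to_migrate(datastores))
```

Anything the check flags needs space freed up (or a different migration method) before you point sVMotion at it.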

If you want to move just a single volume, you can try your luck with this method, which I’ve used successfully before. But I was tired, and wanted to use vmkfstools since I already had an unexpected outage and had to get something sorted.

The problem with vmkfstools is that there’s no restartable copy option – as far as I know. So when you get 80% through a 500GB file and it fails, well, that’s 400GB of pointless copying and time you’ll never get back. Multiply that out over three 800GB volumes and a few ornery 500GB vmdks and you’ll start to get a picture of what kind of week I had.

After suffering through a number of failures, I ended up taking one node out of the 16-node cluster and placing it and its associated datastores (the ones I needed to migrate) in their own Navisphere storage group. That way, there’d be no “busy-ness” affecting the migration process (we had provisioned about 160 LUNs to the cluster at this stage and we were, obviously, getting a few “SCSI reservation conflicts” and “resource temporarily unavailable” issues). This did the trick, and I was able to get some more stuff done. Now there’s only about 80TB to go before the end of April. Fun times.

And before you ask why didn’t I use SAN Copy? I don’t know, I suppose I’ve never had the opportunity to test it with ESX, and while I know that underlying technology is the same as MirrorView, I just really didn’t feel I was in a position to make that call. I probably should have just done it, but I didn’t really expect that I’d have as much trouble as I did with sVMotion and / or vmkfstools. So there you go.

2009 and penguinpunk.net

It was a busy year, and I don’t normally do this type of post, but I thought I’d try to do a year in review type thing so I can look back at the end of 2010 and see what kind of promises I’ve broken. Also, the Exchange Guy will no doubt enjoy the size comparison. You can see what I mean by that here.

In any case, here’re some broad stats on the site. In 2008 the site had 14966 unique visitors according to Advanced Web Statistics 6.5 (build 1.857). But in 2009, it had 15856 unique visitors – according to Advanced Web Statistics 6.5 (build 1.857). That’s an increase of some 890 unique visitors, also known as year-on-year growth of approximately 5.95%. I think. My maths are pretty bad at the best of times, but I normally work with storage arrays, not web statistics. In any case, most of the traffic is no doubt down to me spending time editing posts and uploading articles, but it’s nice to think that it’s been relatively consistent, if not a little lower than I’d hoped. This year (2010 for those of you playing at home), will be the site’s first full year using Google Analytics, so assuming I don’t stuff things up too badly, I’ll have some prettier graphs to present this time next year. That said, MYOB / smartyhost are updating the web backend shortly so I can’t make any promises that I’ll have solid stats for this year, or even a website :)
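Since I don’t trust my own maths, here’s a quick sanity check of the year-on-year figure using the two visitor counts above. The function name is just something I made up for the example; the calculation is the usual (current - previous) / previous.

```python
def yoy_growth_percent(previous: int, current: int) -> float:
    """Year-on-year growth as a percentage of the previous year's figure."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous * 100


visitors_2008 = 14966
visitors_2009 = 15856

increase = visitors_2009 - visitors_2008
growth = yoy_growth_percent(visitors_2008, visitors_2009)

print(increase)          # 890
print(round(growth, 2))  # 5.95
```

So 890 extra visitors on a base of 14966 works out to a bit under 6% growth.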

What were the top posts? Couldn’t tell you. I do, however, have some blogging-type goals for the year:

1. Blog with more focus and frequency – although this doesn’t mean I won’t throw in random youtube clips at times.

2. Work more on the promotion of the site. Not that there’s a lot of point promoting something if it lacks content.

3. Revisit the articles section and revise where necessary. Add more articles to the articles page.

On the work front, I’m architecting the move of my current employer from a single data centre to a 2+1 active / active architecture (from a storage and virtualisation perspective). There’s more blades, more CLARiiON, more MV/S, some vSphere and SRM stuff, and that blasted Cisco MDS fabric stuff is involved too. Plus a bunch of stuff I’ve probably forgotten. So I think it will be a lot of fun, and a great achievement if we actually get anything done by June this year. I expect there’ll be some moments of sheer boredom as I work my way through 100s of incremental SAN Copies and sVMotions. But I also expect there will be moments of great excitement when we flick the switch on various things and watch a bunch of visio illustrations turn into something meaningful.

Or I might just pursue my dream of blogging about the various media streaming devices on the market. Not sure yet. In any case, thanks for reading, keep on reading, tell your friends, and click on the damn Google ads.

VCP410 exam pass

I passed my VCP410 exam yesterday with a score of 450. I’m pleased to have finally gotten it out of the way as, even though I had signed up for the second-shot voucher with VMware, there seem to be no free slots in the 4 testing centres in Brisbane this month. After the epic fail of my previous employer to stay afloat, I also had to pony up the AU$275 myself, so I felt a little bit more pressure than I normally would when taking one of these exams.

I found the following resources of particular use:

Brian’s list of “Useful study material”;

Duncan’s VCP 4 post;

Simon Long’s blog and practice exams;

and the VCP4 Mock Exam from VMware on the mylearn page.

I also recommend you read through as many reference materials and admin guides as you can, and remember that what you’re taught in the course doesn’t always correlate with what you see in the exam. Good luck!

So you’ve changed the IP address and munted something, now what?

Yesterday a colleague of mine was having some issues performing sVMotions on guests sitting in a development ESX 3.5 cluster. He kept getting an error along the lines of:

“IP address change for 10.x.x.x to 10.x.x.y not handled, SSL certificate verification is not enabled.”

They had changed the Service Console IP address of the host manually to perform some “secure” guest migrations previously (don’t ask me why – there’s always my way or the hard way), and basically the IP address of the host hadn’t been updated in the vpxa.cfg file. VMware has a 2-3 step process to resolve the issue, which ultimately will require you to pull the host out of the cluster and re-add it to vCenter. It’s not a big deal, but it can be confusing when things seem to be working, but aren’t really. You can read more about it here.

MV/S consistency groups and multiple secondary images

I recently completed a migration for a client from a CX3-20 to a CX4-240. I’d done similar work for them in the past, moving their primary site from a CX500 to a CX4-240. This time things were simpler, as the secondary site I was working on contained primarily replicas of LUNs from the primary site. For those LUNs that weren’t mirrors, I used either Incremental SAN Copy, or sVMotion to migrate the data. The cool thing about MirrorView/Synchronous on the CLARiiON is that you can send replicas from one source to multiple (2) targets. As my client was understandably nervous about the exposure of not having current replicas if there was a problem, we decided that adding a secondary image on the CX4 and waiting for it to synchronize before removing the CX3 image would be the safest. The minor issue was that most of these LUNs were in MirrorView Consistency groups. If you’ve played with VMware’s Site Recovery Manager, you’ll know what these are. But for those of you that don’t, let me drop a little knowledge.

Consistency Groups are in essence a group of secondary images on a CLARiiON that are treated as a single entity. As a result of this treatment, the remote images are consistent, but may contain information that is ever so slightly older than information on the primary images. The thinking behind this is fairly obvious – you don’t want your SQL database LUN to be out of sync with the SQL logs LUN, in the same way that you wouldn’t want the Exchange mailbox LUN to be out of sync with the transaction logs. That would be silly. And it would make recovery in the event of a site failure really difficult. And thus Consistency Groups were born. Hooray.

But there are a few rules that you need to follow with Consistency Groups:

  • Up to 16 groups per CLARiiON;
  • Up to 16 mirrors per group;
  • Primary image LUNs must all be on the same CLARiiON;
  • Secondary image LUNs must all be on the same CLARiiON;
  • Images must be of the same type (you can’t mix sync and async in the same group);
  • Operations happen on all mirrors at the same time.
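If you like to check this sort of thing before you start clicking around in Navisphere, the rules above are simple enough to sketch as a validator. This is only an illustration of the list, not an EMC tool or API: the Mirror class, the array names and the function names are all made up for the example, and the limits are just the numbers quoted above.

```python
from dataclasses import dataclass
from typing import List

MAX_GROUPS_PER_ARRAY = 16   # "Up to 16 groups per CLARiiON"
MAX_MIRRORS_PER_GROUP = 16  # "Up to 16 mirrors per group"


@dataclass(frozen=True)
class Mirror:
    """Hypothetical description of one mirror for planning purposes."""
    name: str
    primary_array: str
    secondary_array: str
    mode: str  # "sync" or "async"


def check_group(mirrors: List[Mirror]) -> List[str]:
    """Return a list of rule violations for one proposed consistency group."""
    problems = []
    if len(mirrors) > MAX_MIRRORS_PER_GROUP:
        problems.append("more than 16 mirrors in the group")
    if len({m.primary_array for m in mirrors}) > 1:
        problems.append("primary images span more than one array")
    if len({m.secondary_array for m in mirrors}) > 1:
        problems.append("secondary images span more than one array")
    if len({m.mode for m in mirrors}) > 1:
        problems.append("sync and async mirrors mixed in one group")
    return problems


def check_group_count(group_count: int) -> bool:
    """True if the number of groups on one array is within the limit."""
    return group_count <= MAX_GROUPS_PER_ARRAY


# Example: SQL data and log mirrors pointing at two different secondary arrays.
proposed = [
    Mirror("sql_data", "primary_cx4", "old_cx3", "sync"),
    Mirror("sql_logs", "primary_cx4", "new_cx4", "sync"),
]
print(check_group(proposed))
```

The example group fails the "same secondary array" rule, which is exactly the sort of layout you end up wanting during an array migration.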

As I mentioned previously, you can have a 1:2 LUN:replica ratio when using MV/S. Unfortunately, when Consistency Groups are in use, the ratio goes back to 1:1. So our precautionary strategy for replica migration was already under pressure. As well as this, we couldn’t have multiple CLARiiONs providing replica images in the same consistency group. The only option that my tired brain could really think of was to remove the replicas from the Consistency Groups, re-sync the new replicas with the primary copies, and then add the new replicas into the Consistency Groups. By using this methodology, we risked having inconsistent volumes on a host, but we didn’t have the risk of not having replicas at all. If anyone has a better approach, I’d love to hear about it.

sVMotion with snapshot bad

You know when it says in the release notes, and pretty much every forum on the internet, that doing sVMotion migrations with snapshots attached to a vmdk is bad? Turns out they were right, and you might just end up munting your vmdk file in the process. So you might just need this link to recreate the vmdk. You may find yourself in need of this process to commit the snapshot as well. Or, if you’re really lucky, you’ll find yourself with a vmsn file that references a missing vmdk file. Wow, how rad! To work around this, I renamed the vmsn to .old, ran another snapshot, and then committed the snapshots. I reiterate that I think snapshots are good when you’re in a tight spot, in the same way that having a baseball bat can be good when you’re attacked in your home. But if you just go around swinging randomly, something’s going to get broken. Bad analogy? Maybe, but I think you get what I’m saying here.

To recap, when using svmotion.pl with VIMA, here’s the syntax:

svmotion.pl --datacenter=network.internal --url=https://virtualcenter.network.internal/sdk --username=vcadmin --vm="[VMFS_02] host01/host01.vmx:VMFS_01"

Of course, my preferred method is here:

svmotion --interactive
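
If you have a pile of these to do and want to script the non-interactive form, the fiddly bit is the --vm argument, which takes the form "[source datastore] path/to/vm.vmx:target datastore". Here’s a minimal sketch that just builds the argument list using the same flags as the example above. The function name is my own invention, and it only prints the command; actually running it (and supplying a password) is left to you.

```python
import shlex
from typing import List


def build_svmotion_args(datacenter: str, url: str, username: str,
                        source_datastore: str, vmx_path: str,
                        target_datastore: str) -> List[str]:
    """Build the svmotion.pl argument list for moving one VM."""
    # "[source datastore] path/to/vm.vmx:target datastore"
    vm_spec = f"[{source_datastore}] {vmx_path}:{target_datastore}"
    return [
        "svmotion.pl",
        f"--datacenter={datacenter}",
        f"--url={url}",
        f"--username={username}",
        f"--vm={vm_spec}",
    ]


args = build_svmotion_args(
    datacenter="network.internal",
    url="https://virtualcenter.network.internal/sdk",
    username="vcadmin",
    source_datastore="VMFS_02",
    vmx_path="host01/host01.vmx",
    target_datastore="VMFS_01",
)
# shlex.join (Python 3.8+) quotes the --vm argument so it survives a shell.
print(shlex.join(args))
```

Keeping the arguments as a list means the space inside the --vm value never needs hand-quoting if you later pass it to something like subprocess.run.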