I’ve been a bit behind on my VNX OE updates, and have only recently read docu59127_VNX-Operating-Environment-for-Block-05.33.000.5.102-and-for-File-220.127.116.11,-EMC-Unisphere-18.104.22.168.0096-Release-Notes covering VNX OE 5.33…102. Checking out the fixed problems, I noticed the following item.
The problem, you see, came to light some time ago when a few of our (and no doubt other) VNX2 customers started having disk failures on reasonably busy arrays. EMC have a KB on the topic on the support site – VNX2 slow disk rebuild speeds with high host I/O (000187088). To quote EMC: “The code has been written so that the rebuild process is considered a lower priority than the Host IO. The rebuild of the new drive will take much longer if the workload from the hosts are high”. Which sort of makes sense, because host I/O is a pretty important thing. But, as a number of customers pointed out to EMC, there’s no point prioritising host I/O if you’re in jeopardy of a data unavailable or data loss event because your private RAID group rebuilds have taken so long to complete.
Previously, the solution was to “[r]educe the amount of host I/O if possible to increase the speed of the drive rebuild”. Now, however, updated code comes to the rescue. So, if you’re running a VNX2, upgrade to the latest OE if you haven’t already.
I had a question about this come up this week and thought I’d already posted something about it. Seems I was lazy and didn’t. If you have access, have a look at Primus article emc311319 on EMC’s support site. If you don’t, here’s the rough guide to what it’s all about.
When a Storage Pool is created, a large number of private LUNs are bound across all the Pool drives, and these are divided up between SP A and SP B. When a Pool LUN is created, its allocation owner determines which SP’s private LUNs are used to store the Pool LUN slices. If the default and current owner are not the same as the allocation owner, the I/O has to be passed over the CMI bus between SPs to reach the Pool private FLARE LUNs. This is a bad thing, and can lead to higher response times and general I/O bottlenecks.
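To see whether you’re affected, you can compare the owner values for each Pool LUN. Here’s a rough sketch of the idea using a mocked-up fragment of `lun -list`-style output – the field names and layout here are my assumptions, so verify them against what naviseccli actually returns on your array:

```shell
# Hypothetical excerpt of `naviseccli -h <SP> lun -list` output; the field
# names and values are assumptions - check them against your own array.
sample="LOGICAL UNIT NUMBER 12
Current Owner:  SP A
Default Owner:  SP A
Allocation Owner:  SP B"

# Pull out the current and allocation owner fields
current=$(printf '%s\n' "$sample" | awk -F':' '/^Current Owner/ {gsub(/^ +/, "", $2); print $2}')
alloc=$(printf '%s\n' "$sample" | awk -F':' '/^Allocation Owner/ {gsub(/^ +/, "", $2); print $2}')

# Flag LUNs whose I/O will be crossing the CMI bus to reach the private LUNs
if [ "$current" != "$alloc" ]; then
  echo "LUN 12: current owner ($current) != allocation owner ($alloc)"
fi
```

Run that across all your Pool LUNs and you’ve got a list of the ones doing the CMI round trip.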
OMG, I might have this issue, what should I do? You can change the default owner of a LUN by accessing the LUN properties in Unisphere. You can also change the default owner of a LUN thusly.
naviseccli -h <SP A or B> chglun -l <LUN number> -d owner <0|1>
-d owner 0 = SP A
-d owner 1 = SP B
But what if you have too many LUNs where the allocation owner sits on one SP? And when did I start writing blog posts in the form of a series of questions? I don’t know the answer to the latter question. But for the first, the simplest remedy is to create a LUN on the alternate SP and use EMC’s LUN migration tool to get the LUN to the other SP. Finally, to match the current owner of a LUN to the default owner, simply trespass the LUN to the default owner SP.
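If you want a quick feel for how lopsided the allocation owners are before you start migrating things, counting them up is trivial. A minimal sketch, again assuming a mocked-up `Allocation Owner:` field format rather than real array output:

```shell
# Hypothetical combined `lun -list` output for three pool LUNs; the
# "Allocation Owner:" field format is an assumption - verify on your array.
sample="Allocation Owner:  SP A
Allocation Owner:  SP B
Allocation Owner:  SP A"

# Count how many pool LUNs each SP is the allocation owner for
spa=$(printf '%s\n' "$sample" | grep -c 'Allocation Owner:.*SP A')
spb=$(printf '%s\n' "$sample" | grep -c 'Allocation Owner:.*SP B')
echo "Allocation owners - SP A: $spa, SP B: $spb"
```

If one SP owns the lion’s share, that’s your candidate list for LUN migration.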
Note that this is a problem from CX4 arrays through to VNX2 arrays. It does not apply to traditional RAID Group FLARE LUNs though, only Pool LUNs.
If you’re trying to do an OE upgrade on a VNX you might get the following error after you’ve run through the “Prepare for Installation” phase.
Turns out you just need to upgrade USM to the latest version. You can do this manually or let USM update itself. Further information on this error can be found on support.emc.com by searching for the following Primus ID: emc321171.
Incidentally, I’d just like to congratulate EMC on how much simpler it is to upgrade FLARE / VNX OE nowadays than it was when I first started on FC and CX arrays. Sooo much nicer …
I’m a bit behind on my news, but just a quick note to say that FLARE 33 for the next-generation VNX (Block) has received an update to version 05.33.000.5.038. The release notes are here. Notably, a fix related to ETA 175619 is included. This is a good thing. As always, go to EMC’s Support site for more info, and talk to your local EMC people about whether it’s appropriate to upgrade.
In my previous post I talked about how error codes on a CX POST actually mean something, and can assist in your troubleshooting activities. But what does that stupid alphabet soup mean when you see a CLARiiON boot up?
Turns out all of those crazy letters mean something too, and you can refer to this PDF file I put together to work out at what point the array is failing its tests. Please note that I haven’t verified if these apply to the CX3, CX4 or VNX.
Despite the fact that I’ve written over 270 posts in the past 5 years on this blog, one of my most popular posts has been my article on CLARiiON CX700 FLARE Recovery. I’ve been assisting someone via e-mail over the past month or so who was having problems getting a CX700 he’d acquired to boot. He’s a smart guy, but hasn’t used a CLARiiON before. And I was working from muscle memory and unable to eyeball the console for myself. So it was an interesting challenge, combined with varying time zones.
Anyway, I thought it would be interesting to do one or two posts on some basic CX stuff that may or may not assist people who are doing this for the first time. This isn’t going to be a comprehensive series but rather a few notes and examples as I think of them.
In this instance, my correspondent had a terminal connection to the array, and was seeing the following output:
EndTime: 07/29/2013 04:11:59
.... Storage System Failure - Contact your Service Representative ...
Device: LCC 0 UART
FRU: STORAGE PROCESSOR
Description: LCC slot indicator Error!
Error detected when handling LCC READ command
ErrorTime: 07/29/2013 04:11:38
Basically, the key thing was that error code ending in 142. According to this list of CX error codes I dug up, it indicates some sort of problem with the LCC. What wasn’t clear until much later, unfortunately, was which SP my correspondent was connected to. It turned out that SP A was faulty and needed to be replaced. There’s also an LCC 0 UART Sub-Menu available from the Diagnostics section of the Utility Partition. You can perform LCC diagnostics at this point to verify the POST errors you’re seeing. In short, pulling SP A allowed the system to boot, and error codes mean something to someone. Please note that I haven’t verified if these apply to the CX3, CX4 or VNX.
Just a quick one to start the year off on the right note. I was installing updated Utility Partition software on our lab CX4s today and noticed that USM was a bit confused as to when it had started installing a bit of the code. Notice the Time started and Time elapsed section. Well, I thought it was amusing.
I’ve been commissioning some new CX4-960s recently (it’s a long story), and came across a few things that I’d forgotten about for some reason. If you’re running older disks, and they get replaced by EMC, there’s a good chance they’ll be a higher capacity. In our case I was creating a storage pool with 45 300GB FC disks and kept getting the following error.
This error was driving me nuts for a while, until I realised that one of the 300GB disks had, at some point, been replaced with a 450GB drive. Hence the error.
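If you want to find the odd disk out before the pool creation falls over, you can eyeball the disk capacities in Unisphere, or script it. Here’s a rough sketch over a mocked-up capacity listing – the format is invented for illustration, not real getdisk output, so adjust the parsing to whatever your array actually prints:

```shell
# Hypothetical per-disk capacity listing; the format is a made-up
# approximation, not real naviseccli output.
sample="Bus 0 Enclosure 0 Disk 0: 274845 MB
Bus 0 Enclosure 0 Disk 1: 274845 MB
Bus 0 Enclosure 0 Disk 2: 412102 MB"

# Report any disk whose capacity differs from the most common capacity
odd=$(printf '%s\n' "$sample" | awk -F': ' '
  { count[$2]++; line[NR] = $0; cap[NR] = $2 }
  END {
    best = ""
    for (c in count) if (best == "" || count[c] > count[best]) best = c
    for (i = 1; i <= NR; i++) if (cap[i] != best) print "Odd one out: " line[i]
  }')
echo "$odd"
```

In my case this would have pointed straight at the 450GB interloper.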
The other thing I came across was the restriction that Private LUNs (Write Intent Log, Reserved LUN Pool, MetaLUN Components) have to reside on traditional RAID Groups and can’t live in storage pools. Not a big issue, but I hadn’t really planned to use RAID Groups on these arrays. If you search for emc254739 you’ll find a handy KB article on WIL performance considerations, including this nugget “Virtual Provisioning LUNs are not supported for the WIL; RAID group-based LUNs or metaLUNs should be used”. Which clarifies why I was unable to allocate the 2 WIL LUNs I’d configured in the pool.
*Edit* I re-read the KB article and realised it doesn’t address the problem I saw. I had created thick LUNs on a storage pool, but these weren’t able to be allocated as WIL LUNs. Even though the article states “[The WIL LUNs] can either be RAID-group based LUNs, metaLUNs or Thick Pool LUNs”. So I don’t really know. Maybe it’s a VNX vs CX4 thing. Maybe not.
Just a quick note to advise that FLARE 31 for the VNX has received an update to version .716. I had a quick glance at the release notes and noted a few fixes that take care of some SP panics. If you want a bit of a giggle, go and look at the KB article for emc291837 – it’s a scream. Or maybe I’m just being cynical. As always, go to Powerlink for more info, and talk to your local EMC people about whether it’s appropriate to upgrade your VNX’s code.
A few months ago someone asked me if I had documentation on how to do FLARE upgrades on a CLARiiON. I’d taken a video last year, but realised that it used the old Navisphere Service Taskbar and covered the upgrade of a CX700 to FLARE 26. So, basically, my doco was a little out of date.
I recently had the opportunity to upgrade some of our CX4-120s to the latest release of FLARE 30 (.524), so thought it might be an opportune moment to document the process in a visual sense. Once I’d completed the articles, I realised this may have been done better with a series of videos. Maybe next time. In any case, here’s a four-part series (part 1, part 2, part 3, and part 4) on how to upgrade FLARE on a CX4 using Unisphere Service Manager. It’s a classic death-by-screenshot scenario, and I apologise in advance for the size of the files. While we’re talking about documentation, have a look through the rest of the articles page, there might be something useful there. And if you want something covered specifically, I do take requests.