ESXi 4.1 network weirdness – Part 3

So the only real update we’ve had on this ESXi problem is that it’s a bug. But I can’t give you a KB to reference, because, well, there isn’t one. And we’re not alone in feeling some pain about this, as Erik Zandboer noted on his blog. So I guess at this stage we’ll seriously reconsider if and how we use host profiles in the environment and wait for the new version of ESXi. It’s underwhelming but there you go.

ESXi 4.1 network weirdness – Part 2

In my previous post I discussed some strange behaviour we’d observed in our PoC lab with ESXi 4.1 and the workarounds I’d used to resolve the immediate issues we had with connectivity. We spoke to our local VMware SE and his opinion was pictures or it didn’t happen (my words, not his). In his defence, we had no logs or evidence that the problem happened. Fortunately, we had the problem occur again while doing some testing with OTV.

We took a video of what happened – I’ll post that in the next week when I’ve had time to edit it to a reasonable size. In the meantime my colleague posted on the forums and also raised an SR with VMware. You can find the initial posts we looked at here and here. You can find the post my colleague put up here.

The response from VMware was the best though …

“Thank you for your Support Request.

As we discussed in the call, this is a known issue with the product ESXi 4.1, i.e. DCUI shows the IP of vmk0 as Management IP, though vmk0 is not configured for Management traffic.

Vmware product engineering team identified the root cause and fixed the issue in the upcoming major release for ESXi. As of now we don’t have ETA on release for this product.

The details on product releases are available in http://www.vmware.com.

Based on our last communication, it appears that this case is ready to be closed. I will proceed with closing your Support Request today.

Regards,”
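
If you want to sanity-check what the DCUI is reporting against what’s actually configured on the host, the quickest way we found was from Tech Support Mode. A minimal sketch (the interface numbering and portgroup names will obviously differ in your environment):

# List the vmkernel interfaces, their IPs and the portgroups they sit on
~ # esxcfg-vmknic -l

# Show the vSwitch / portgroup layout for comparison
~ # esxcfg-vswitch -l

If vmk0 isn’t the interface sitting on your management portgroup but the DCUI is still showing its IP, you’re probably seeing the same thing we did.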

I’ll put the video up shortly to demonstrate the symptoms, and hopefully will have the opportunity to follow this up with our TAM this week some time :)

Mozy avoids further tirades – uses personal touch …

So my hat goes off to the Mozy support and marketing people – they are extremely good at turning a PR debacle into a positive customer experience. After my little rant about credit card issues and my surly response to the patient frontline support person (Steve), I was contacted by Mozy’s UK Support people via Twitter asking for my number so we could talk it over. In the meantime, the L1 support technician had gotten back to me via a case update to say that it was a Mozy issue and they were looking into it further.

Then at 11pm last night a nice support manager from Mozy UK (Ireland) named Damien rang me to discuss the issue and apologise for any inconvenience caused. He’d manually sorted out my account and I was all good to go. The short of it was that one server wasn’t talking to another and that’s why the system was doing rude things with my account.

I think the point here is that I wasn’t frustrated with the Mozy product’s performance this time round, but with the system-generated e-mails that seemed to ignore my responses. I give Mozy extra credit for the quick response and the multiple methods of communication, and the icing on the cake was a phone call and a follow-up e-mail. It’s not often that we get to air our frustrations and have someone respond personally to say that they’re on it and they’re sorting it out.

MozyHome cloudfail …

So Mozy opened a support case to let me know that my account has been disabled. Bad move.

So I figured I’d tell them what I thought.

I didn’t think I was being too mean. I’ve updated my CC details again, but I expect that this adventure is only just beginning.

Really Mozy? But I gave you my details 3 times already …

Mozy has decided to threaten my account with suspension, because they keep losing my credit card details, and I keep putting them in.

I think someone’s going to get a rude e-mail soon …

EMC PowerPath 5.4 SPx, ESXi 4.1 and HP CIM offline bundle

If you find yourself having problems registering EMC PowerPath 5.4.1 (unsupported) or 5.4.2 (supported) on your HP blades running ESXi 4.1, consider uninstalling the HP offline bundle hpq-esxi4.1uX-bundle-1.0. We did, and PowerPath was magically able to talk to the ELM (Electronic License Manager) server and retrieve served licenses. I have no idea why CIM-based tools would have this effect, but there you go. Apparently a fix is on the way from HP, but I haven’t verified that yet. I’ll update as soon as I know more.
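
If you’re wondering how to actually pull an offline bundle off an ESXi 4.1 host, one way is via the vSphere CLI from a management workstation. A rough sketch (the host name is made up, and the bulletin ID should come from the query output, not from here):

# See which bulletins are currently installed on the host
vihostupdate.pl --server esxhost01 --username root --query

# Remove the offending bundle by its bulletin ID, then reboot the host
vihostupdate.pl --server esxhost01 --username root --remove --bulletin <bulletin-id-from-query>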

Broken sVMotion

It’s been a very long few weeks, and I’m looking forward to taking a few weeks off. I’ve been having a lot of fun doing a data migration from a CLARiiON CX3-40f and CX700 to a shiny, new CX4-960. To give you some background, instead of doing an EMC-supported data-in-place upgrade in August last year, we decided to buy another CX4-960 (on top of the already purchased CX4-960 upgrade kit) and migrate the data manually to the new array. The only minor problem with this is that there’s about 100TB of data in various forms that I need to get on to the new array. Sucks to be me, but I am paid by the hour.

I started off by moving VMs from one cluster to the new array using sVMotion, as there was a requirement to keep things as non-disruptive as possible. Unfortunately, on the larger volumes attached to fileservers, I had a few problems. I’ll list them, just for giggles:

There were three 800GB volumes, five 500GB volumes and one 300GB volume that had 0% free space on the VMFS. And I mean 0%, not 5MB or 50MB, 0%. That’s not cool for a few reasons. ESX likes to update journaling data on VMFS, because it’s a journaling filesystem. If you don’t give it space to do this, it can’t, and you’ll find volumes start to get remounted with very limited writeability. If you try to Storage VMotion these volumes, you’ll again be out of luck, as a dmotion file needs to be kept on the filesystem to track any changes to the vmdk while the migration is happening. I found my old colleague Leo’s post helpful when a few migrations failed, but unfortunately the symptoms he described were not the same as mine; in my case the VMs fell over entirely. More info from VMware can be had here.
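
As an aside, it’s worth checking just how full a VMFS volume is before kicking off a migration. A minimal sketch from the host console (the datastore name is made up):

# Query capacity and free space on a VMFS volume
vmkfstools -P /vmfs/volumes/fileserver_ds01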

If you want to move just a single volume, you can try your luck with this method, which I’ve used successfully before. But I was tired, and wanted to use vmkfstools since I’d already had an unexpected outage and had to get something sorted.

The problem with vmkfstools is that there’s no restartable copy option – as far as I know. So when you get 80% through a 500GB file and it fails, well, that’s 400GB of pointless copying and time you’ll never get back. Multiply that out over three 800GB volumes and a few ornery 500GB vmdks and you’ll start to get a picture of what kind of week I had.
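
For what it’s worth, the sort of copy I’m talking about here is just a plain vmkfstools clone, something along these lines (the paths are examples, and the disk format is whatever suits your target):

# Clone a vmdk from the old datastore to the new one.
# If it dies at 80%, you start again from 0%; there is no resume.
vmkfstools -i /vmfs/volumes/old_ds/fileserver01/fileserver01.vmdk \
           /vmfs/volumes/new_ds/fileserver01/fileserver01.vmdk \
           -d zeroedthick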

After suffering through a number of failures, I ended up taking one node out of the 16-node cluster and placing it and its associated datastores (the ones I needed to migrate) in their own Navisphere storage group. That way, there’d be no “busy-ness” affecting the migration process (we had provisioned about 160 LUNs to the cluster at this stage and we were, obviously, getting a few “SCSI reservation conflicts” and “resource temporarily unavailable” issues). This did the trick, and I was able to get some more stuff done. Now there’s only about 80TB to go before the end of April. Fun times.
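
If you’ve not set one up before, carving out a dedicated storage group with naviseccli only takes a minute. A rough sketch (the SP address, group name, host name and LUN numbers are all made up):

# Create a dedicated storage group for the migration host
naviseccli -h 192.168.10.20 storagegroup -create -gname Migration_SG

# Connect the ESX host that was pulled out of the cluster
naviseccli -h 192.168.10.20 storagegroup -connecthost -host esxhost01 -gname Migration_SG -o

# Add just the LUNs that need to be migrated (HLU as the host sees it, ALU as the array knows it)
naviseccli -h 192.168.10.20 storagegroup -addhlu -gname Migration_SG -hlu 0 -alu 211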

And before you ask: why didn’t I use SAN Copy? I don’t know. I suppose I’ve never had the opportunity to test it with ESX, and while I know the underlying technology is the same as MirrorView, I just didn’t feel I was in a position to make that call. I probably should have just done it, but I didn’t really expect that I’d have as much trouble as I did with sVMotion and / or vmkfstools. So there you go.

CLARiiON CX700 FLARE Recovery – Part 3

In the previous posts in this series, I covered the despair and ultimate joy of getting to a point where I could recover a munted CLARiiON CX700. In this post, I’ll cover the process of recovering the array to a working state and the steps required to get it functioning at a useful level.

Having successfully performed a Utility Partition Boot, it’s necessary to get the LAN service ports on the array configured in order to be able to ftp the recovery image to the array. Obviously, you’ll need the array and your service laptop plugged into a network-type thing that will enable frank communication between the array and you.

===============================================================================
CLARiiON Utility Toolkit Main Menu
===============================================================================
1) About the Utility Toolkit
2) About this Array
3) Reset Storage Processor
4) Image Repository Sub-Menu
5) Plugin Sub-Menu
6) NVRAM Sub-Menu
7) Enable LAN Service Port
8) Enable Engineering Mode
9) Install Images

Enter Option: 7
===============================================================================
Please enter the network settings you wish to use for this SP
===============================================================================
IP Address:  192.168.0.2
Subnet Mask:  255.255.255.0
Default Gateway:  192.168.0.255
Host Name:  spa
Domain Name:  

===============================================================================
Confirm Network Settings
===============================================================================
IP Address:      192.168.0.2  
Subnet Mask:     255.255.255.0
Default Gateway: 192.168.0.255
Host Name:       spa          
Domain Name:                  

Enable LAN Service Port with these settings? y/n [y] 
The LAN Service Port has been enabled

Automatically enable the LAN Port with these settings in the future? y/n [y] n

Press the Enter key to continue… 

Once you’ve enabled the LAN port on the SP you’re connected to, you need to ftp the image to the SP’s repository. The username to use is Clariion, and the password is clariion!. Once you’ve logged in, run a put command to put the file up there. It doesn’t really matter what you call it, but it should be a file of type mif. Here’s a pointless text capture of the ftp login process:

C:\>ftp 192.168.0.2
Connected to 192.168.0.2.
220-FileZilla Server version 0.8.3 beta test release 1
220-written by Tim Kosse ([email protected])
220 Please visit http://sourceforge.net/projects/filezilla/
User (192.168.0.2:(none)): Clariion
331 Password required for clariion
Password:
230 Logged on
ftp> ls
200 Port command successful
150 Opening data channel for directory list.
FLARE.mif
226 Transfer OK
ftp: 11 bytes received in 0.00Seconds 11000.00Kbytes/sec.
ftp>
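
The upload itself is just a standard binary put once you’re logged in. Something like this (the local path is wherever you’ve stashed your recovery image):

ftp> binary
ftp> put C:\recovery\FLARE.mif FLARE.mif
ftp> bye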

Once you’ve successfully uploaded the recovery image, you’ll be good to go. It’s also important to note that the FLARE recovery image should be for the release that you intend to run. I didn’t consider uploading a Release 19 image, as I knew that these arrays had run Release 26 previously. In any case, jumping back into the Image menu on the terminal, it’s now time to copy the image from the RAM disk and then load it.

===============================================================================
CLARiiON Utility Toolkit Main Menu
===============================================================================
1) About the Utility Toolkit
2) About this Array
3) Reset Storage Processor
4) Image Repository Sub-Menu
5) Plugin Sub-Menu
6) NVRAM Sub-Menu
7) View LAN Service Port Settings
8) Enable Engineering Mode
9) Install Images

Enter Option: 4

===============================================================================
CLARiiON Utility Toolkit Image Repository Menu
===============================================================================
1) Back to the Main Menu
2) List Image Repository Contents
3) Delete Files from the Image Repository
4) Copy Files from the RAM Disk to the Image Repository
5) Copy Files from the Image Repository to the RAM Disk

Enter Option: 4

===============================================================================
Select files to copy to the Image Repository
===============================================================================
1) FLARE.mif

Enter comma separated list of options: 1

Copying FLARE.mif to the Image Repository… Success

Press the Enter key to continue…
 
===============================================================================
CLARiiON Utility Toolkit Image Repository Menu
===============================================================================
1) Back to the Main Menu
2) List Image Repository Contents
3) Delete Files from the Image Repository
4) Copy Files from the RAM Disk to the Image Repository
5) Copy Files from the Image Repository to the RAM Disk

Enter Option: 1

===============================================================================
CLARiiON Utility Toolkit Main Menu
===============================================================================
1) About the Utility Toolkit
2) About this Array
3) Reset Storage Processor
4) Image Repository Sub-Menu
5) Plugin Sub-Menu
6) NVRAM Sub-Menu
7) View LAN Service Port Settings
8) Enable Engineering Mode
9) Install Images

Enter Option: 9

===============================================================================
Select Images to Install
===============================================================================
1) FLARE.mif

Enter comma separated list of options: 1

===============================================================================
Confirm Image Installation
===============================================================================

  FLARE.mif

You need to install this only on the SP that you have visibility of, as troubleshooting the installation to both SPs is tricky.

Are you sure you want to install these images? y/n [n] y

===============================================================================
Select Storage Processors to install images for
===============================================================================
1) This SP (SP A)
2) Peer SP (SP B)
3) Both SP’s

Enter Option: 1
 
Installing Data Directory Boot Service 02.12
0%..10%..20%..30%..40%..50%..60%..70%..80%..90%..100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
The COPY operation has completed successfully.

Installing BIOS 03.70
0%..10%..20%..30%..40%..50%..60%..70%..80%..90%..100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
The COPY operation has completed successfully.

Installing Extended POST 02.38
0%..10%..20%..30%..40%..50%..60%..70%..80%..90%..100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
The COPY operation has completed successfully.

Installing FLARE Image 02.26.700.5.005
0%..10%..20%..30%..40%..50%..60%..70%..80%..90%..100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
The COPY operation has completed successfully.

Press the Enter key to continue… 

Once the copy has completed successfully, the system needs to be reset, and you’ll see the SP reboot up to three times before it’s useable.

===============================================================================
CLARiiON Utility Toolkit Main Menu
===============================================================================
1) About the Utility Toolkit
2) About this Array
3) Reset Storage Processor
4) Image Repository Sub-Menu
5) Plugin Sub-Menu
6) NVRAM Sub-Menu
7) View LAN Service Port Settings
8) Enable Engineering Mode
9) Install Images

Enter Option: 3

Requesting System Reset         

Once this is complete, you can either load the recovery image to the other SP via Navisphere in Engineering Mode, or you can use the same method as described above. Note that, once the image is copied to the repository, it is not necessary to re-upload it, as both SPs have access to the files.

From here, the normal process for initialising an array applies. In my case I initialised security, set up IP addresses for the SPs, logged in, committed the FLARE (R26.005), enabled Access Logix, configured cache settings, upgraded FLARE to the latest release (R26.028), reloaded the latest Utility Partition and Recovery images, and went about loading the appropriate enablers for the array. And now we have a working lab :)
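
Most of those steps are standard naviseccli territory once the SPs are back on the network. For example, checking what’s installed and committing the FLARE bundle looks roughly like this (the SP address is an example, and this is from memory rather than a capture):

# See which packages are installed and whether anything is uncommitted
naviseccli -h 192.168.10.20 ndu -list

# Commit the active FLARE bundle
naviseccli -h 192.168.10.20 ndu -commit

# Sanity-check the SP agent and revision afterwards
naviseccli -h 192.168.10.20 getagent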

CLARiiON CX700 FLARE Recovery – Part 2

In Part 1 of this series I talked about what looked like a borked CX700 and some of the options available to me to get it up and running again. I trawled Powerlink looking for solutions and came across a number of articles that talked about ordering Vault packs and getting EMC CEs involved. As the arrays were no longer under maintenance, I didn’t have high hopes that this would be a process we could undertake.

My trawling for solutions, however, did yield a rather interesting nugget of information. For those of you with access to Powerlink, there’s an article entitled “CX700 array unmanaged and fails to display its serial number after changing WWN seed array“. This article also goes by the ID emc119598 and discusses the process to rectify the array’s WWN seed after a conversion from a CX500 to a CX700. The great thing about this article was not so much the solution provided as the alternative method it describes for accessing the CLARiiON’s Diagnostics Menu. To wit, using the password “SHIP_it” yields a menu subsystem that is dramatically different from the one provided with “DB_key“. The results are below; the full transcript can be downloaded here:

Copyright (c) EMC Corporation , 2007
Disk Array Subsystem Controller
Model: CX700: SAN GBFCC4
DiagName: Extended POST
DiagRev: Rev. 02.39
Build Date: Fri Jul 13 16:36:03 2007
StartTime: 01/21/2010 22:04:18
SaSerialNo: LKE00051202843

AabcdefgBC

EndTime: 01/21/2010 22:04:19
…. Storage System Failure – Contact your Service Representative …

*******
******  Aborting!!!!  ******

 
Hit ESC to begin running diagnostic menu…

Entering the alternative password, we see the following output:

Diagnostic Menu
1)  Reset Controller          21) BE1 FCC Sub-Menu
2)  Enter Debugger            22) CMI0 FCC Sub-Menu
3)  Display Warnings/Errors   23) CMI1 FCC Sub-Menu
4)  Boot OS                   24) AUX0 FCC Sub-Menu
5)  POST Sub-Menu             25) AUX1 FCC Sub-Menu
6)  Display/Change Privilege  26) FE0 FCC Sub-Menu
7)  Boot UProc Sub-Menu       27) FE1 FCC Sub-Menu
8)  Ap UProc Sub-Menu         28) FE2 FCC Sub-Menu
9)  Real Time Clock Sub-Menu  29) FE3 FCC Sub-Menu
10) Pers. Module Sub-Menu     30) POST ROM Sub-Menu
11) RAM Sub-Menu              31) BIOS ROM Sub-Menu
12) NOVRAM Sub-Menu           32) System Test Sub-Menu
13) Console UART Sub-Menu     33) Image Sub-Menu
14) SPS UART Sub-Menu         34) Disk Sub-Menu
15) LCC 0 UART Sub-Menu       35) Resume PROM Sub-Menu
16) LCC 1 UART Sub-Menu       36) Voltage Margin Sub-Menu
17) LCC 2 UART Sub-Menu       37) Information Display
18) LCC 3 UART Sub-Menu       38) ICA Sub-Menu
19) LAN Service Port Sub-Menu 39) DDBS Service Sub-Menu
20) BE0 FCC Sub-Menu          40) FCC Boot Sub-Menu

Enter Option : 33

Option 33 is what we’re interested in to start with. From here you can perform a Utility Partition Boot.

Image Sub-Menu
1)  Init Loop                 6)  Exit Loop
2)  Serial Download           7)  Relocate/Run Image
3)  Load from disk            8)  Display Sector Protection
4)  Save to disk              9)  Utility Partition Boot
5)  Update Firmware

                    0) Exit

Enter Option :  9
Relocating Data Directory Boot Service (DDBS: Rev. 02.12)…
DDBS: K10_REBOOT_DATA: Count = 0
DDBS: K10_REBOOT_DATA: State = 0
DDBS: K10_REBOOT_DATA: ForceDegradedMode = 0

DDBS: Read default MDDE off disk 1
DDBS: MDDE (Rev 2) on disk 1
DDBS: Read default DDE (0x40000F) off disk 1
DDBS: Read default MDDE off disk 3
DDBS: MDDE (Rev 2) on disk 3
DDBS: Read default DDE (0x400010) off disk 3

DDBS: MDB read from both disks.
DDBS: DD invalid on both disks, continuing…
DDBS: Disk WWN seeds match each other but not chassis WWN seed.
DDBS: First disk is valid for boot.
DDBS: Second disk is valid for boot.

[snip]

int13 – GET DRIVE PARAMETERS (Extended) (1437)
ICA::UtilityFrontEnd
(c) EMC Corporation 2001-2004 All Rights Reserved
DiagName: ICA::UtilityFrontEnd
DiagRev: 02.16.700.5.001
StartTime: 01/21/10 22:08:37

OS Type……………………….WinXP
SMBUS…………………………Running
SPID………………………….Running
ASIDC…………………………Running
ASIRAMDisk…………………….Running
ICA…………………………..Running
FileZilla Server……………….Running
Connecting to ICA………………Success
SP Type……………………….CX700
SP ID…………………………A
SP Signature…………………..0x08291953
Checking Image Repository……….
    ICA::IRFS no valid Volume was found on this system
    ICA::IRFS Creating new Volume
    ICA::IRFS Finished creating new volume
    ICA::IRFS Checking Volume for consistency
Sizing Image Repository…………1024 MB
Sizing RAM Disk………………..2039 MB
Discovering Management LAN Port….ManagementPort0
Checking LAN Port State…………Not Configured
Checking LAN Port Config………..Not Found
Loading Plugins………………..Done

EndTime: 01/21/10 22:09:03

Now that’s what I wanted to see :) – from here we just need to reload the FLARE image with ftp. I’ll cover this in Part 3.

CLARiiON CX700 FLARE Recovery – Part 1

I’ve broken this post into three parts not because I’m trying to be a tease, but rather so it’s a little easier to digest and you can head straight for the money shot if you so desire. Part 1 covers the initial failure and subsequent troubleshooting. Part 2 covers the eventual workaround to the problem. Part 3 covers the work that needed to be done once the problem was resolved.

It’s strange to think of EMC’s CX700 CLARiiON array as a “legacy” array. Yet it’s now two generations behind EMC’s flagship mid-range array – the CX4-960. Our project was given access to two CX700s to use as test arrays for a multi-site data centre project we’re working on. That’s cool, as the CX700 is still a reasonably well-specced array, with multiple back-end loops and a fair bit of useable cache (at least compared to the CX4-120). So after the data centre guys MacGyvered the kit into racks that were too big for the rails, we cabled the lot up and thought it would be a fairly trivial process to get everything up and running.

As usual, I was wrong. The department that had provided these hand-me-down arrays had bought a service from EMC whereby the data was securely erased. For those of you playing at home, this is known as the “Certified Data Erasure Service“, and you can grab the datasheet from here. So these arrays were saved from the scrapheap, but not before they were rendered basically unbootable.

When we powered them up, we got the following output via the terminal:

Disk 2 Read Error 0x00000187
Number Sectors: 1
LBA: 0x0002284B
Buffer: 0x1000A114

DDBS: Can’t read MDB from first disk.
DDBS: Can’t read MDB from second disk.
DDBS: Using first disk for boot – but inaccessible.

FLARE image (0x00400007) located at sector LBA 0x0002284C

Disk Set: 0
ErrorCode: 0x0000018D
ErrorDesc:
Device: BOOT PATH
FRU: STORAGE PROCESSOR
Description: Dual-Mode Fibre Driver Exchange Error!
DualMode Driver Exchange Status: 0x1000000C
Target ID: 0x00
EndError:
ErrorTime: 01/19/2010 05:07:11

(the full boot log can be found here)

So I tried booting to the utility partition via the DDBS submenu. You’d be familiar with this submenu if you’ve ever performed an out-of-family array conversion; it’s where you go to run the conversion image over the new array and tell FLARE that its brain is, er, bigger. In any case, this can be accessed by pressing ESC during the initial POST and then typing in “DB_key“. Note that on newer CX3 and CX4 arrays, you don’t press ESC anymore; rather, CTRL-C is used to break the boot. You’ll then be presented with menus that look something like this:

Copyright (c) EMC Corporation , 2007
Disk Array Subsystem Controller
Model: CX700: SAN GBFCC4
DiagName: Extended POST
DiagRev: Rev. 02.39
Build Date: Fri Jul 13 16:36:03 2007
StartTime: 01/19/2010 05:38:05
SaSerialNo: LKE00051202843

AabcdefgBCDEabFabcdGHabIabcJabKabLab

EndTime: 01/19/2010 05:38:20
…. Storage System Failure – Contact your Service Representative …

******
******  Aborting!!!!  ******

 
Hit ESC to begin running diagnostic menu…

                Diagnostic Menu
1)  Reset Controller          3)  DDBS Service Sub-Menu
2)  Display Warnings/Errors   4)  FCC Boot Sub-Menu

Enter Option : 3

So I select option 3, attempt the Utility Partition Boot, and get the following:

1)  Drive Slot ID Check       2)  Utility Partition Boot

                    0) Exit

Enter Option : 2

DDBS: K10_REBOOT_DATA: Count = 0
DDBS: K10_REBOOT_DATA: State = 0
DDBS: K10_REBOOT_DATA: ForceDegradedMode = 0

DDBS: Read default MDDE off disk 1
DDBS: MDDE (Rev 2) on disk 1
DDBS: Read default DDE (0x40000F) off disk 1
DDBS: Read default MDDE off disk 3
DDBS: MDDE (Rev 2) on disk 3
DDBS: Read default DDE (0x400010) off disk 3

DDBS: Can’t read MDB from first disk.
DDBS: Can’t read MDB from second disk.
DDBS: Using first disk for boot – but inaccessible.

Utility Partition image (0x0040000F) located at sector LBA 0x00BE804C

Disk Set: 1
ErrorCode: 0x00000187
ErrorDesc:
Device: DIAG MENU
FRU: STORAGE PROCESSOR
Description: Disk not logged in Error!
Target ID: 0x01
Targets Found: 0xF000FF53
EndError:
ErrorTime: 01/19/2010 05:39:52

(the full boot log can be found here)

Okay, so that’s not cool. I had hoped that I would be able to boot from the Utility Partition, because the process to load the Recovery Image either from the repository or via ftp is fairly simple. At this point we started to think of a number of whacky alternatives that could be used, including, but not limited to, reconstructing the FLARE disks from another CX700’s hot spares, using the Vault disks from a CX300 and performing an in-place conversion to a CX700, and begging and pleading with our local EMC office for a Vault pack. None of these options really struck us as awesome ideas. Read Part 2 for the solution to the problem.