Storage – Erasure Coding and RAID – A Few Good Links

Erasure coding has been around for a little while now, and if you’ve ever sat through a presentation from a cloud storage provider talking about resiliency of data at scale, you may have heard it mentioned. It occurred to me that I’ve just assumed that people know what it is, and that’s not fair. I was going to do a post explaining what it is, but figured a quick post with links to some articles I found helpful would be more useful. Because what’s the point of the internet if I can’t be lazy and link to things on it?

Here’re some useful research papers to start with:

The “press” also has some useful articles on the topic. I recommend you have a look at these two:

Some of my preferred analysts have written a bit on the topic:

Josh has also done a great deep-dive on the Nutanix version of erasure coding (EC-X) that you can see here.

My favourite post, though, is this one: Dummies Guide to Erasure Coding.



QNAP – Add swap to your NAS for large volume fsck activities

That’s right, another heading from the Department of not terribly catchy blog article titles. I’ve been having a mighty terrible time with one of my QNAP arrays lately. After updating to 4.1.2, I’ve been getting some weird symptoms. For example, every time the NAS reboots, the filesystem is marked as unclean. Worse, it mounts as read-only from time to time. And it seems generally flaky. So I’ve spent the last week trying to evacuate the data with the thought that maybe I can re-initialize it and clear out some of the nasty stuff that’s built up over the last 5 years. Incidentally, while we all like to moan about how slow SATA disks are, try moving a few TB via a USB2 interface. The eSATA seems positively snappy after that.

Of course, QNAP released version 4.1.3 of their platform recently, and a lot of the symptoms I’ve been experiencing have stopped occurring. I’m going to continue down this path though, as I hadn’t experienced these problems on my other QNAP, and just don’t have a good feeling about the state of the filesystem. And you thought that I would be all analytical about it, didn’t you?

In any case, I’ve been running e2fsck on the filesystem fairly frequently, particularly when it goes read-only and I have to stop the services, unmount and remount the volume.

[/] # cd /share/MD0_DATA/
[/share/MD0_DATA] # cd Qmultimedia/    
[/share/MD0_DATA/Qmultimedia] # mkdir temp         
mkdir: Cannot create directory `temp': Read-only file system
[/share/MD0_DATA/Qmultimedia] # cd /
[/] # /etc/init.d/ stop
Stop qpkg service: chmod: /share/MD0_DATA/.qpkg: Read-only file system
Shutting down Download Station: OK
Disable QUSBCam ... 
Shutting down SlimServer... 
Error: Cannot stop, SqueezeboxServer is not running.
WARNING: rc.ssods ERROR: script /opt/ssods4/etc/init.d/K20slimserver failed.
Stopping thttpd-ssods .. OK.
rm: cannot remove `/opt/ssods4/var/run/': Read-only file system
WARNING: rc.ssods ERROR: script /opt/ssods4/etc/init.d/K21thttpd-ssods failed.
Shutting down QiTunesAir services: Done
Disable Optware/ipkg
Stop service: snmp nfs .
[/] # umount /dev/md0


So then I run e2fsck to check the filesystem. But on a large volume (in this case 8 and a bit TB), it uses a lot of RAM. And invariably runs out of swap space.

[/] # e2fsck /dev/md0
e2fsck 1.41.4 (27-Jan-2009)
/dev/md0: recovering journal
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error allocating block bitmap (4): Memory allocation failed
e2fsck: aborted


So here’s what I did to enable some additional swap on a USB stick (courtesy of a QNAP forum post from RottUlf).

Insert a USB stick with more than 3GB of space. Create a swap file on it.

[/] # dd if=/dev/zero of=/share/external/sdi1/myswapfile bs=1M count=3072

Set it as a swap file.

[/] # mkswap /share/external/sdi1/myswapfile

Enable it as swap for the system.

[/] # swapon /share/external/sdi1/myswapfile

Check it.

[/] # cat /proc/swaps
Filename Type Size Used Priority
/dev/md8 partition 530040 8216 -1
/share/external/sdi1/myswapfile file 3145720 12560 -2
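
If you'd rather not eyeball that output every time, here's a minimal sketch that adds up the Size column of /proc/swaps so a script can confirm the extra swap is live before kicking off e2fsck. The function names are mine, not anything QNAP ships, and it assumes swap file paths without spaces.

```shell
# Sum the Size column (in KB) of a /proc/swaps-style listing.
# $1 is the file to read; it defaults to /proc/swaps.
# Assumption: swap paths contain no spaces, so Size is always field 3.
swap_total_kb() {
    awk 'NR > 1 { total += $3 } END { printf "%d\n", total }' "${1:-/proc/swaps}"
}

# Succeed only if at least $1 KB of swap is active.
# $2 is an optional listing file, passed through to swap_total_kb.
have_swap_kb() {
    [ "$(swap_total_kb "$2")" -ge "$1" ]
}
```

Something like `have_swap_kb 3000000 && e2fsck /dev/md0` would then only start the check once roughly 3GB of swap is showing.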

You should then be able to run e2fsck. Note that the example I linked to used e2fsck_64, but this isn’t available on the TS639 Pro II. Once you’ve fixed your filesystem issues, you’ll want to disable the swap file on the stick, remount the volume and restart your services.

[/] # swapoff /share/external/sdi1/myswapfile
[/] # mount /dev/md0
mount: can't find /dev/md0 in /etc/fstab or /etc/mtab

Oh no …

[/] # mount /dev/md0 /share/MD0_DATA/

Yeah, I don’t know what’s going on there either. I’ll report back in a while when I’ve wiped it and started again.



QNAP – Increase RAID rebuild speed with mdadm

I recently upgraded some disks in my TS-412 NAS and the rebuild was taking some time. I vaguely recalled playing with the md speed_limit_min and speed_limit_max settings on the TS-639. Here’s a link to the QNAP forums on how to do it. The key is the min setting, and, as explained in the article, it really depends on how much you want to clobber the CPU. Keep in mind, also, that you can only do so much with a 3+1 RAID 5 configuration. I had my max set to 200000, and the min was set to 1000 (both values are in KB/s). As a result I was getting about 20MB/s, and each disk was taking a little less than 24 hours to rebuild. I bumped up the min setting to 50000, and it’s now rebuilding at about 40MB/s. The CPU is hanging at around 100%, but the NAS isn’t used that frequently.

To check your settings, use the following commands:

cat /proc/sys/dev/raid/speed_limit_max
cat /proc/sys/dev/raid/speed_limit_min

To increase the min setting, issue the following command:

echo 50000 >/proc/sys/dev/raid/speed_limit_min

And you’ll notice that, depending on the combination of disks, CPU and RAID configuration, your rebuild will go a wee bit faster than before.
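
If you want something a little safer than a bare echo, here's a rough sketch of a helper that only ever raises the floor and never pushes it past the ceiling. The function name and the optional directory argument are my own inventions (the directory is there so you can try it against dummy files rather than the live /proc entries).

```shell
# Raise the md resync floor to $1 KB/s, but never lower it and never
# exceed the current ceiling. $2 is the directory holding speed_limit_min
# and speed_limit_max; it defaults to the real /proc location.
raise_speed_limit_min() {
    target=$1
    dir=${2:-/proc/sys/dev/raid}
    cur_min=$(cat "$dir/speed_limit_min") || return 1
    cur_max=$(cat "$dir/speed_limit_max") || return 1
    # Clamp the requested floor to the current ceiling.
    if [ "$target" -gt "$cur_max" ]; then
        target=$cur_max
    fi
    # Only write when the new floor is actually higher than the old one.
    if [ "$target" -gt "$cur_min" ]; then
        echo "$target" > "$dir/speed_limit_min"
    fi
    # Report whatever the floor ended up as.
    cat "$dir/speed_limit_min"
}
```

Run as root, `raise_speed_limit_min 50000` does the same job as the echo above, and prints the value that's now in effect.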

QNAP – How to repair RAID brokenness – Redux

I did a post a little while ago (you can see it here) that covered using mdadm to repair a munted RAID config on a QNAP NAS. So I popped another disk recently, and took the opportunity to get some proper output. Ideally you’ll want to use the web interface on the QNAP to do this type of thing but sometimes it no worky. So here you go.

Stop everything on the box.

[~] # /etc/init.d/ stop
Stop service: nfs snmp .
Stop qpkg service: Disable Optware/ipkg
Shutting down SlimServer...
Stopping SqueezeboxServer 7.5.1-30836 (please wait) .... OK.
Stopping thttpd-ssods .. OK.
/etc/rcK.d/QK107Symform: line 48: /share/MD0_DATA/.qpkg/Symform/ No such file or directory

(By the way it really annoys me when I’ve asked software to remove itself and it doesn’t cleanly uninstall – I’m looking at you Symform plugin)

Unmount the volume

[~] # umount /dev/md0

Stop the array

[~] # mdadm -S /dev/md0
mdadm: stopped /dev/md0

Reassemble the volume

[~] # mdadm --assemble /dev/md0 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 /dev/sdf3
mdadm: /dev/md0 has been started with 5 drives (out of 6).

Wait, wha? What about that other disk that I think is okay?

[~] # mdadm --detail /dev/md0
Version : 00.90.03
Creation Time : Fri May 22 21:05:28 2009
Raid Level : raid5
Array Size : 9759728000 (9307.60 GiB 9993.96 GB)
Used Dev Size : 1951945600 (1861.52 GiB 1998.79 GB)
Raid Devices : 6
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Wed Dec 14 19:09:25 2011
State : clean, degraded
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 7c440c84:4b9110fe:dd7a3127:178f0e97
Events : 0.4311172
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 0 0 1 removed
2 8 35 2 active sync /dev/sdc3
3 8 51 3 active sync /dev/sdd3
4 8 67 4 active sync /dev/sde3
5 8 83 5 active sync /dev/sdf3

Or in other words

[~] # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid5 sda3[0] sdf3[5] sde3[4] sdd3[3] sdc3[2]
9759728000 blocks level 5, 64k chunk, algorithm 2 [6/5] [U_UUUU]
md6 : active raid1 sdf2[2](S) sde2[3](S) sdd2[4](S) sdc2[1] sda2[0]
530048 blocks [2/2] [UU]
md13 : active raid1 sdb4[2] sdc4[0] sdf4[5] sde4[4] sdd4[3] sda4[1]
458880 blocks [6/6] [UUUUUU]
bitmap: 0/57 pages [0KB], 4KB chunk
md9 : active raid1 sdf1[1] sda1[0] sdc1[4] sdd1[3] sde1[2]
530048 blocks [6/5] [UUUUU_]
bitmap: 34/65 pages [136KB], 4KB chunk
unused devices: <none>

So, when you see [U_UUUU] you’ve got a disk missing, but you knew that already. You can add it back in to the array thusly.

[~] # mdadm --add /dev/md0 /dev/sdb3
mdadm: re-added /dev/sdb3

So let’s check on the progress.

[~] # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid5 sdb3[6] sda3[0] sdf3[5] sde3[4] sdd3[3] sdc3[2]
9759728000 blocks level 5, 64k chunk, algorithm 2 [6/5] [U_UUUU]
[>....................] recovery = 0.0% (355744/1951945600) finish=731.4min speed=44468K/sec
md6 : active raid1 sdf2[2](S) sde2[3](S) sdd2[4](S) sdc2[1] sda2[0]
530048 blocks [2/2] [UU]
md13 : active raid1 sdb4[2] sdc4[0] sdf4[5] sde4[4] sdd4[3] sda4[1]
458880 blocks [6/6] [UUUUUU]
bitmap: 0/57 pages [0KB], 4KB chunk
md9 : active raid1 sdf1[1] sda1[0] sdc1[4] sdd1[3] sde1[2]
530048 blocks [6/5] [UUUUU_]
bitmap: 34/65 pages [136KB], 4KB chunk
unused devices: <none>
[~] #

And it will rebuild. Hopefully. Unless the disk is really truly dead. You should probably order yourself a spare in any case.
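
As a sanity check on that finish estimate, the arithmetic is just the blocks left to rebuild divided by the current speed. Here's a small sketch (the function name is made up) that does it with shell integer maths, so it rounds down to whole minutes.

```shell
# Estimate whole minutes left in an md rebuild.
# $1 = total KB to rebuild, $2 = KB already done, $3 = current speed in KB/s.
rebuild_minutes_left() {
    # Refuse a zero or negative speed rather than divide by zero.
    [ "$3" -gt 0 ] || return 1
    echo $(( ($1 - $2) / $3 / 60 ))
}
```

Feeding it the figures from the mdstat output above, `rebuild_minutes_left 1951945600 355744 44468` prints 731, which lines up with the finish=731.4min the kernel reported.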

EMC – Sometimes RAID 6 can be a PITA

This is really a quick post to discuss how RAID 6 can be a bit of a pain to work with when you’re trying to combine traditional CLARiiON / VNX DAEs and Storage Pool best practices. It’s no secret that EMC strongly recommend using RAID 6 when you’re using SATA-II / NL-SAS drives that are 1TB or greater. Which is a fine and reasonable thing to recommend. However, as you’re no doubt aware, the current implementation of FAST VP uses Storage Pools that require homogeneous RAID types. So you need multiple pools if you want to run both RAID 1/0 and RAID 6. If you want a pool that can leverage FAST to move slices between EFD, SAS, and NL-SAS, it all needs to be RAID 6. There are a couple of issues with this. Firstly, given the price of EFDs, a RAID 6 (6+2) of EFDs is going to feel like a lot of money down the drain. Secondly, if you stick with the default RAID 6 implementation for Storage Pools, you’ll be using 6+2 in the private RAID groups. And then you’ll find yourself putting private RAID groups across backend ports. This isn’t as big an issue as it was with the CX4, but it still smells a bit ugly.

What I have found, however, is that you can get the CLARiiON to create non-standard sized RAID 6 private RAID groups. If you create a pool with 10 spindles in RAID 6, it will create a private RAID group in an 8+2 configuration. This seems to be the magic number at the moment. If you add 12 disks to the pool it will create two 4+2 private RAID groups, and if you use 14 disks it will create a 6+2 and a 4+2 RAID group. Now, the cool thing about 10 spindles in a private RAID group is that you could, theoretically (I’m extrapolating from the VNX Best Practices document here), split the 8+2 across two DAEs in a 5+5. In this fashion, you can improve the rebuild times slightly in the event of a disk failure, and you can also draw some sensible designs that fit well in a traditional DAE4P.

Of course, creating your pools in increments of 10 disks is going to be a pain, particularly for larger Storage Pools, and particularly as there is no re-striping of data done after a pool expansion. But I’m sure EMC will be focussing on this issue in the future, as a lot of customers have had a problem with the initial approach. The downside to all this, of course, is that you’re going to suffer a capacity and, to a lesser extent, performance penalty by using RAID 6 across the board. In this instance you need to consider whether FAST VP is going to give you the edge over split RAID pools or traditional RAID groups.
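
To put a rough number on the capacity side of that penalty, here's a quick sketch (the function name is mine) that works out what percentage of the raw spindles in a data+parity group is left for data. It's integer maths, so it rounds down.

```shell
# Percentage of raw capacity left for data in a D+P RAID group,
# e.g. "6 2" for RAID 6 (6+2). Integer result, rounded down.
usable_percent() {
    echo $(( $1 * 100 / ($1 + $2) ))
}
```

So `usable_percent 8 2` gives 80, `usable_percent 6 2` gives 75 and `usable_percent 4 2` gives 66, which is why the 8+2 layout hurts a little less than the other two when you're paying for EFDs.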

I personally like the idea of Storage Pools, and I’m glad EMC have gotten on-board with them in their midrange stuff. I’m also reasonably optimistic that they’re working on addressing a lot of issues that have come up in the field. I just don’t know when that will be.

EMC CLARiiON VNX7500 Configuration guidelines – Part 2

In this episode of EMC CLARiiON VNX7500 Configuration Guidelines, I thought it would be useful to discuss Storage Pools, RAID Groups and Thin things (specifically Thin LUNs). But first you should go away and read Vijay’s blog post on Storage Pool design considerations. While you’re there, go and check out the rest of his posts, because he’s a switched-on dude. So, now you’ve done some reading, here’s a bit more knowledge.

By default, RAID groups should be provisioned in a single DAE. You can theoretically provision across buses for increased performance, but oftentimes you’ll just end up with crap everywhere. Storage Pools obviously change this, but you still don’t want to bind the Private RAID Groups across DAEs. But if you did, for example, want to bind a RAID 1/0 RAID Group across two buses – for performance and resiliency – you could do it thusly:

naviseccli -h <sp-ip> createrg 77 0_1_0 1_1_0 0_1_1 1_1_1

Where the numbers refer to the standard format Bus_Enclosure_Disk.

The maximum number of Storage Pools you can configure is 60. It is recommended that a pool should contain a minimum of 4 private RAID groups. While it is tempting to just make the whole thing one big pool, you will find that segregating LUNs into different pools may still be useful for FAST cache performance, availability, etc. Remember kids, look at the I/O profile of the projected workload, not just the capacity requirements. The mixing of drives with different performance characteristics in a homogeneous pool is also contra-indicated. When you create a Storage Pool the following Private RAID Group configurations are considered optimal (depending on the RAID type of the Pool):

  • RAID 5 – 4+1
  • RAID 1/0 – 4+4
  • RAID 6 – 6+2

Pay attention to this, because you should always ensure that a Pool’s private RAID groups align with traditional RAID Group best practices, while sticking to these numbers. So don’t design a 48 spindle RAID 5 Pool. That will be, er, non-optimal.
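
If you want to check a proposed drive count before you commit to it, something like the following sketch would do. It only encodes the 4+1, 4+4 and 6+2 figures listed above; the function names are hypothetical and this isn't anything Unisphere or naviseccli provides.

```shell
# Print the preferred private RAID group width (total drives per group)
# for a pool RAID type: 5 for RAID 5 (4+1), 8 for RAID 1/0 (4+4),
# 8 for RAID 6 (6+2). Pass the RAID type as 5, 10 or 6.
group_width() {
    case "$1" in
        5)  echo 5 ;;
        10) echo 8 ;;
        6)  echo 8 ;;
        *)  return 1 ;;
    esac
}

# Succeed if a pool of $2 drives divides evenly into preferred groups
# for RAID type $1.
pool_is_aligned() {
    width=$(group_width "$1") || return 1
    [ $(( $2 % width )) -eq 0 ]
}
```

For example, `pool_is_aligned 5 48` fails because 48 drives leaves three over after nine 4+1 groups, while `pool_is_aligned 5 50` is happy.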


EMC recommend that if you’re going to blow a wad of cash on SSDs / EFDs, you should do it on FAST cache before making use of the EFD Tier.


With current revisions of FLARE 30 and 31, data is not re-striped when the pool is expanded. It’s also important to understand that preference is given to using the new capacity rather than the original storage until all drives in the Pool are at the same level of capacity. So if you have data on a 30-spindle Pool, and then add another 15 spindles to the Pool, the data goes to the new spindles first to even up the capacity. It’s crap, but deal with it, and plan your Pool configurations before you deploy them. For RAID 1/0, avoid private RAID Groups of 2 drives.

A Storage Pool on the VNX7500 can be created with or expanded by 180 drives at a time, and you should keep the increments the same. If you are considering drives larger than 1TB, use RAID 6. When FAST VP is working with Pools, remember that you’re limited to one type of RAID in a pool. So if you want to get fancy with different RAID Types and tiers, you’ll need to consider using additional Pools to accommodate this. It is, however, possible to mix thick and thin LUNs in the same Pool. It’s also important to remember that the consumed capacity for Pool LUNs = (User Consumed Capacity * 1.02) + 3GB. This can have an impact as capacity requirements increase.
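
The consumed capacity formula is easy enough to wrap in a one-liner if you're sizing a lot of LUNs. This is just a sketch of the arithmetic quoted above, with a made-up function name, using awk because plain shell can't do the 1.02 multiplication.

```shell
# Pool capacity consumed by a LUN, in GB: (user GB * 1.02) + 3.
# $1 is the user consumed capacity in GB; output has two decimal places.
pool_consumed_gb() {
    awk -v user="$1" 'BEGIN { printf "%.2f\n", user * 1.02 + 3 }'
}
```

A 1000GB LUN comes out at 1023.00GB consumed, so the overhead is small per LUN but adds up across a few hundred of them.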


A LUN’s tiering policy can be changed after the initial allocation of the LUN. FAST VP has the following data placement options: Lowest, Highest, Auto and No Movement. This can present some problems if you want to create a 3-tier Pool. The only workaround I could come up with was to create the Pool with 2 tiers and place LUNs at highest and lowest. Then add the third tier and place those highest tier LUNs on the highest tier and change the middle tier LUNs to No Movement. What would be a better solution is to create the Pool with the tiers you want, put all of your LUNs on Auto placement, and let FAST VP sort it out for you. But if you have a lot of LUNs, this can take time.


For thin NTFS LUNs – use Microsoft’s sdelete to zero free space. When using LUN Compression – Private LUNs (Meta Components, Snapshots, RLP) cannot be compressed. EMC recommends that compression only be used for archival data that is infrequently accessed. Finally, you can’t defragment RAID 6 RAID Groups – so pay attention when you’re putting LUNs in those RAID Groups.

QNAP – How to repair RAID brokenness

I use a QNAP 639 Pro NAS at home to store my movies on. It’s a good unit and overall I’ve found it to be relatively trouble-free. I was recently upgrading the disks in the RAID set from 1TB to 2TB drives and it was going swimmingly. But then I heard a beep while the RAID was rebuilding disk 5 of 6 in the set. And I started to get some concerned e-mails from the NAS.

Server Name: qnap639
 IP Address:
 Date/Time: 2011/06/09 16:27:33
 Level:  Error
 [RAID5 Disk Volume: Drive 1 2 3 4 5 6] Error occurred while accessing Drive 3.

Server Name: qnap639
 IP Address:
 Date/Time: 2011/06/09 16:27:40
 Level:  Error
 [RAID5 Disk Volume: Drive 1 2 3 4 5 6] Error occurred while accessing the devices of the volume in degraded mode.

Server Name: qnap639
 IP Address:
 Date/Time: 2011/06/09 16:29:32
 Level:  Warning
 [RAID5 Disk Volume: Drive 1 2 3 4 5 6] Mount the file system read-only.

Server Name: qnap639
 IP Address:
 Date/Time: 2011/06/09 16:31:41
 Level:  Warning
 [RAID5 Disk Volume: Drive 1 2 3 4 5 6] Rebuilding skipped.

Basically, it looks like the NAS thought one of the disks had popped. You can see this thing all over the QNAP forums – here’s a good example – and it’s usually because of incompatibility between the QNAP firmware and various green hard disks. But I’d checked that my disks were on the Official QNAP HCL, and, well, that couldn’t be it. So I rebooted a bunch of times and ran S.M.A.R.T. scans on the allegedly failed disk. I pulled it out and erased it on an XP box and put it back in. The NAS wanted no part of it though. So it was time to get dirty with mdadm.

Firstly, make sure there’s nothing going on on the NAS, stop the running services and unmount the RAID device.

/etc/init.d/ stop
umount /dev/md0

Once the volume’s unmounted you can stop the volume.

mdadm -S /dev/md0

Now for the bit where you hold your breath for a while – the reassembly of the volume with the components you want.

mdadm --assemble /dev/md0 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 /dev/sdf3

To see the progress, you can use a couple of different commands.

mdadm --detail /dev/md0
cat /proc/mdstat
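
If you'd rather have a script tell you which arrays are still short a member, here's a minimal sketch that scans /proc/mdstat for status strings with an underscore in them (each U is a healthy member, each _ is a missing one). The function name is my own, and it assumes the usual layout where the status string appears on a line after the array name.

```shell
# List md arrays whose member status string (e.g. [U_UUUU]) contains "_".
# $1 is the file to read; it defaults to /proc/mdstat.
degraded_arrays() {
    awk '
        # Remember the array name from lines like "md0 : active raid5 ..."
        /^md[0-9]+ :/ { name = $1 }
        # The first [UU_U]-style string after that name belongs to the array.
        name != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

An empty result means every array has all of its members in sync.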

Once that’s complete, it’s best to run a filesystem check.

e2fsck -f /dev/md0

If there’re no errors, mount the volume and check that your stuff is still there.

mount /dev/md0 /share/MD0_DATA

I then rebooted and confirmed that everything started up correctly and my data was still there. But when I added the 6th drive, I got an error about a missing superblock and there didn’t seem to be any mdadm magic that would solve this problem. So like a good admin I rebooted, and the NAS started rebuilding the volume with the 6th disk. Now if I can only fix the problem where smbd kills the CPU and disconnects guests from the share we’ll be gold.