OpenMediaVault – Good Times With mdadm

Happy 2019. I’ve been on holidays for three full weeks and it was amazing. I’ll get back to writing about boring stuff soon, but I thought I’d post a quick summary of some issues I’ve had with my home-built NAS recently and what I did to fix them.

Where Have The Disks Gone?

I got an email one evening with a message along these lines (paraphrased from mdadm’s standard notification template):
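This is an automatically generated mail message from mdadm
running on openmediavault

A Fail event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sda[0] sdb[1] sdi[6] sdh[7] ...
      11720297472 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/4] [UU____UU]

unused devices: <none>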

I do enjoy the “Faithfully yours, etc” and the postscript is the most enlightening bit. See where it says [UU____UU]? Yeah, that’s not good. There are 8 disks that make up that device (/dev/md0), so it should look more like [UUUUUUUU]. But why would 4 out of 8 disks just up and disappear? I thought it was a little odd myself. I had a look at the ITX board everything was attached to and realised that those 4 drives were plugged into a PCI SATA-II card. It seems that either the slot on the board or the card is now failing intermittently. I say “seems” because that’s all I can think of, as the S.M.A.R.T. status of the drives is fine.
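If you want to poke at the SMART status yourself from the shell, smartmontools (which OpenMediaVault ships with for its disk monitoring) will do the job. The device names below are from my box, so adjust to suit:

sudo smartctl -H /dev/sda    # quick overall health verdict (PASSED/FAILED)
sudo smartctl -a /dev/sda    # full attribute dump if the verdict looks suspicious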

Resolution, Baby

The short-term fix to get the filesystem back online and usable was the classic “assemble” switch with mdadm. Long-time readers of this blog may have witnessed me doing something similar with my QNAP devices from time to time. After I’d panic-rebooted the box a number of times (a silly thing to do, really), it finally responded to pings. Checking out /proc/mdstat wasn’t encouraging, though.

dan@openmediavault:~$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
unused devices: <none>

Notice the lack of, erm, devices there? That’s non-optimal. The fix requires a forced assembly of the devices comprising /dev/md0.

dan@openmediavault:~$ sudo mdadm --assemble --force --verbose /dev/md0 /dev/sd[abcdefhi]
[sudo] password for dan:
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sde is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdf is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdh is identified as a member of /dev/md0, slot 7.
mdadm: /dev/sdi is identified as a member of /dev/md0, slot 6.
mdadm: forcing event count in /dev/sdd(2) from 40639 upto 40647
mdadm: forcing event count in /dev/sdc(3) from 40639 upto 40647
mdadm: forcing event count in /dev/sdf(4) from 40639 upto 40647
mdadm: forcing event count in /dev/sde(5) from 40639 upto 40647
mdadm: clearing FAULTY flag for device 3 in /dev/md0 for /dev/sdd
mdadm: clearing FAULTY flag for device 2 in /dev/md0 for /dev/sdc
mdadm: clearing FAULTY flag for device 5 in /dev/md0 for /dev/sdf
mdadm: clearing FAULTY flag for device 4 in /dev/md0 for /dev/sde
mdadm: Marking array /dev/md0 as 'clean'
mdadm: added /dev/sdb to /dev/md0 as 1
mdadm: added /dev/sdd to /dev/md0 as 2
mdadm: added /dev/sdc to /dev/md0 as 3
mdadm: added /dev/sdf to /dev/md0 as 4
mdadm: added /dev/sde to /dev/md0 as 5
mdadm: added /dev/sdi to /dev/md0 as 6
mdadm: added /dev/sdh to /dev/md0 as 7
mdadm: added /dev/sda to /dev/md0 as 0
mdadm: /dev/md0 has been started with 8 drives.

In this example you’ll see that /dev/sdg isn’t included in my command. That device is the SSD I use to boot the system. Sometimes Linux device conventions confuse me too. If you’re in this situation and you think this is just a one-off thing, then you should be okay to unmount the filesystem, run fsck over it, and re-mount it. In my case, this has happened twice already, so I’m in the process of moving data off the NAS onto some scratch space and have procured a cheap little QNAP box to fill its role.
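For the record, the one-off recovery looks something like this. The mount point below is illustrative (OpenMediaVault mounts data filesystems under /srv or /media depending on the version), and I’m assuming an ext4 filesystem on the array:

sudo umount /srv/dev-disk-by-label-data   # unmount the filesystem on the array
sudo fsck -f /dev/md0                     # force a full filesystem check
sudo mount /srv/dev-disk-by-label-data    # remount it (assuming an fstab entry exists)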


Conclusion

My rush to replace the homebrew device with a QNAP isn’t a knock on the OpenMediaVault project by any stretch. OMV itself has been very reliable and has done everything I needed it to do. Rather, my ability to build semi-resilient devices on a budget has simply proven quite poor. I’ve seen some nasty stuff happen with QNAP devices too, but at least any issues will be covered by some kind of manufacturer’s support team and warranty. My NAS is only covered by me, and I’m just not that interested in working out what could be going wrong here. If I’d built something decent I’d get some alerting back from the box telling me what’s happened to the card that keeps failing. But then I would have spent a lot more on this box than I would have wanted to.

I’ve been lucky thus far in that I haven’t lost any data of real import (the NAS devices are used to store media that I have on DVD or Blu-Ray – the important documents are backed up using Time Machine and Backblaze). It is nice, however, that a tool like mdadm can bring you back from the brink of disaster in a pretty efficient fashion.

Incidentally, if you’re a macOS user, you might have a bunch of .DS_Store files on your filesystem, or other hidden metadata files of that ilk. These things are fine, but macOS doesn’t seem to like them when you’re trying to move folders around. This post provides some handy guidance on how to get rid of those files in a jiffy.
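The linked post has the details, but the usual quick fix is a one-liner along these lines (point it at your own folder, obviously):

find /path/to/folder -name '.DS_Store' -type f -delete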

As always, if the data you’re storing on your NAS device (be it home-built or off the shelf) is important, please make sure you back it up. Preferably in a number of places. Don’t get yourself in a position where this blog post is your only hope of getting back the only copy of your firstborn’s first-day-of-school pictures.

OpenMediaVault – Expanding the Filesystem

I recently had the opportunity to replace a bunch of ageing 2TB drives in my OpenMediaVault NAS with some 3TB drives. I run it in a 6+2 RAID-6 configuration (yes, I know, RAID is dead). I was a bit cheeky and replaced 2 drives at a time and let it rebuild; this isn’t something I recommend you do in the real world. Everything came up clean after the drives were replaced, and I even got to modify the mdadm.conf file again to tell it I had 0 spares. The problem was that the filesystem in OpenMediaVault was the same size as before. When you click on Grow in the web interface, it expects you to be adding drives, so before you can grow the filesystem you first need to expand the underlying RAID device to use the full capacity of the new drives. I recommend taking a backup before you do this, and I unmounted my shares first too.

If you’re using a bitmap, you’ll need to remove it first.

mdadm --grow /dev/md0 --bitmap none
mdadm --grow /dev/md0 --size max
mdadm --wait /dev/md0
mdadm --grow /dev/md0 --bitmap internal
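The mdadm --wait command blocks until the resync triggered by the resize finishes. If you’re curious how it’s tracking, you can check from another shell:

watch cat /proc/mdstat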

In this example, /dev/md0 is the device you want to grow, and yours is most likely called the same thing (check /proc/mdstat if you’re not sure). Note, also, that this will take some time to complete. The next step is to expand the filesystem to fit the RAID device. It’s a good idea to run a filesystem check before you do this (resize2fs will insist on a recent full check anyway).

fsck -f /dev/md0

Then it’s time to resize (assuming you had no problems in the last step).

resize2fs /dev/md0

You should then be able to remount the device and see the additional capacity. Big thanks to kernel.org for having some useful instructions here.
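To confirm the extra capacity has shown up after the remount, df does the trick (substitute your actual mount point):

df -h /srv/dev-disk-by-label-data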

OpenMediaVault – Annoying mdadm e-mails after a rebuild

My homebrew NAS running OpenMediaVault (based on Debian) started writing to me recently. I’d had a disk failure and replaced the disk in the RAID set with another one. Everything rebuilt properly, but then this mdadm chap started sending me messages daily.

This is an automatically generated mail message from mdadm
running on openmediavault

A SparesMissing event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdi[0] sda[8] sdb[6] sdc[5] sdd[4] sde[3] sdf[2] sdh[1]
      11720297472 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]

unused devices: <none>

Which was nice of it to get in touch. But I’d never had spares configured on this md device. The fix is simple, and is outlined here and here. In short, you’ll want to edit /etc/mdadm/mdadm.conf and change spares=1 to spares=0. This assumes you don’t want spares configured and are relying on parity for resilience. If you do want spares configured, then it’s probably best you look into the problem a little more.
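For reference, the line you’re after in /etc/mdadm/mdadm.conf looks something like the below. The name and UUID here are made-up placeholders; on your system only the spares= value should change:

# before
ARRAY /dev/md0 metadata=1.2 spares=1 name=openmediavault:0 UUID=00000000:00000000:00000000:00000000
# after
ARRAY /dev/md0 metadata=1.2 spares=0 name=openmediavault:0 UUID=00000000:00000000:00000000:00000000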

QNAP – Increase RAID rebuild speeds with mdadm

I recently upgraded some disks in my TS-412 NAS and the rebuilds were taking some time. I vaguely recalled playing with the min and max settings on the TS-639. Here’s a link to the QNAP forums on how to do it. The key is the min setting, and, as explained in the article, it really depends on how much you want to clobber the CPU. Keep in mind, also, that you can only do so much with a 3+1 RAID 5 configuration. I had my max set to 200000 and my min set to 1000. As a result I was getting about 20MB/s, and each disk was taking a little less than 24 hours to rebuild. I bumped up the min setting to 50000, and it’s now rebuilding at about 40MB/s. The CPU is hanging at around 100%, but the NAS isn’t used that frequently.
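As a back-of-the-envelope check, rebuild time is simply the per-disk data size divided by the sustained rate. Assuming roughly 1.5TB of data per member, 1,500,000MB / 20MB/s is about 75,000 seconds, or 21-ish hours, which lines up with the day per disk I was seeing; doubling the rate roughly halves it.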

To check your settings, use the following commands:

cat /proc/sys/dev/raid/speed_limit_max
cat /proc/sys/dev/raid/speed_limit_min

To increase the min setting, issue the following command:

echo 50000 >/proc/sys/dev/raid/speed_limit_min
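These limits are also exposed as sysctls (dev.raid.speed_limit_min and dev.raid.speed_limit_max), so on a box with a full userland the tidier equivalent is the command below. I can’t vouch for the QNAP firmware shipping sysctl, though:

sysctl -w dev.raid.speed_limit_min=50000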

And you’ll notice that, depending on the combination of disks, CPU and RAID configuration, your rebuild will go a wee bit faster than before.

QNAP – How to repair RAID brokenness – Redux

I did a post a little while ago (you can see it here) that covered using mdadm to repair a munted RAID config on a QNAP NAS. I popped another disk recently, so I took the opportunity to get some proper output. Ideally you’ll want to use the web interface on the QNAP to do this type of thing, but sometimes it no worky. So here you go.

Stop everything on the box.

[~] # /etc/init.d/services.sh stop
Stop service: recycled.sh mysqld.sh atalk.sh ftp.sh bt_scheduler.sh btd.sh ImRd.sh init_iTune.sh twonkymedia.sh Qthttpd.sh crond.sh nfs smb.sh lunportman.sh iscsitrgt.sh nvrd.sh snmp rsyslog.sh qsyncman.sh iso_mount.sh antivirus.sh .
Stop qpkg service: Disable Optware/ipkg
Shutting down SlimServer...
Stopping SqueezeboxServer 7.5.1-30836 (please wait) .... OK.
Stopping thttpd-ssods .. OK.
/etc/rcK.d/QK107Symform: line 48: /share/MD0_DATA/.qpkg/Symform/Symform.sh: No such file or directory

(By the way, it really annoys me when I’ve asked software to remove itself and it doesn’t cleanly uninstall. I’m looking at you, Symform plugin.)

Unmount the volume.

[~] # umount /dev/md0

Stop the array.

[~] # mdadm -S /dev/md0
mdadm: stopped /dev/md0

Reassemble the volume.

[~] # mdadm --assemble /dev/md0 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 /dev/sdf3
mdadm: /dev/md0 has been started with 5 drives (out of 6).

Wait, wha? What about that other disk that I think is okay?

[~] # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Fri May 22 21:05:28 2009
     Raid Level : raid5
     Array Size : 9759728000 (9307.60 GiB 9993.96 GB)
  Used Dev Size : 1951945600 (1861.52 GiB 1998.79 GB)
   Raid Devices : 6
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Dec 14 19:09:25 2011
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 7c440c84:4b9110fe:dd7a3127:178f0e97
         Events : 0.4311172

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       0        0        1      removed
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
       4       8       67        4      active sync   /dev/sde3
       5       8       83        5      active sync   /dev/sdf3

Or, in other words:

[~] # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid5 sda3[0] sdf3[5] sde3[4] sdd3[3] sdc3[2]
      9759728000 blocks level 5, 64k chunk, algorithm 2 [6/5] [U_UUUU]

md6 : active raid1 sdf2[2](S) sde2[3](S) sdd2[4](S) sdc2[1] sda2[0]
      530048 blocks [2/2] [UU]

md13 : active raid1 sdb4[2] sdc4[0] sdf4[5] sde4[4] sdd4[3] sda4[1]
      458880 blocks [6/6] [UUUUUU]
      bitmap: 0/57 pages [0KB], 4KB chunk

md9 : active raid1 sdf1[1] sda1[0] sdc1[4] sdd1[3] sde1[2]
      530048 blocks [6/5] [UUUUU_]
      bitmap: 34/65 pages [136KB], 4KB chunk
unused devices: <none>

So, when you see [U_UUUU] you’ve got a disk missing, but you knew that already. You can add it back into the array thusly.

[~] # mdadm --add /dev/md0 /dev/sdb3
mdadm: re-added /dev/sdb3

So let’s check on the progress.

[~] # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid5 sdb3[6] sda3[0] sdf3[5] sde3[4] sdd3[3] sdc3[2]
      9759728000 blocks level 5, 64k chunk, algorithm 2 [6/5] [U_UUUU]
      [>....................] recovery = 0.0% (355744/1951945600) finish=731.4min speed=44468K/sec

md6 : active raid1 sdf2[2](S) sde2[3](S) sdd2[4](S) sdc2[1] sda2[0]
      530048 blocks [2/2] [UU]

md13 : active raid1 sdb4[2] sdc4[0] sdf4[5] sde4[4] sdd4[3] sda4[1]
      458880 blocks [6/6] [UUUUUU]
      bitmap: 0/57 pages [0KB], 4KB chunk

md9 : active raid1 sdf1[1] sda1[0] sdc1[4] sdd1[3] sde1[2]
      530048 blocks [6/5] [UUUUU_]
      bitmap: 34/65 pages [136KB], 4KB chunk
unused devices: <none>
[~] #

And it will rebuild. Hopefully. Unless the disk is really truly dead. You should probably order yourself a spare in any case.