ReadyNas disaster recovery

Last night my ReadyNas suddenly went read-only with the ominous sounding log message Jan 30, 2023 16:54:47 Volume: The volume data encountered an error and was made read-only. It is recommended to backup your data. - it looks like it ran out of RAM which ultimately caused BTRFS corruption. Despite 3 hours of attempting to recover, I was unable to repair the underlying BTRFS filesystem, so decided the only option was to reset the NAS (which would recreate the filesystem) and execute the disaster recovery (DR) plan.

The unerldying BTRFS has, irrecoverable according to the manual, parent transid verify failed errors - visible by enabling SSH access and logging in to run dmesg. The consensus on the NetGear forums also appears to be “it’s screwed, wipe and rebuild from backup is the only way forward”. Worryingly this seems like a common problem - NetGear suggest it should only happen with a disk fault (which is itself worrying given this is supposed to be RAID protected) but, like other users in the forums, there’s no fault logged by the ReadyNAS, ATA error counts for all 4 disks are 0 (zero) and SMART checks report they are healthy disks.

The system log is also unhelpful in diagnosing the cause - a normal ntp update it logged (these happen about every 2 minutes) then the filesystem has gone read-only during a deletion:

Jan 30 17:34:13 isolinear connmand[2116]: ntp: adjust (slew): +0.000215 sec
Jan 30 17:34:45 isolinear smbd[3655]: [2023/01/30 17:34:45.092926,  0] ../source3/modules/vfs_fruit.c:4160(fruit_unlink)
Jan 30 17:34:45 isolinear smbd[3655]:   fruit_unlink: Forced unlink of [2022-11-02 From Dell XPS (Windows 10) pre re-install/Projects/PGCHE/.git/index:AFP_Resource] failed [Read-only file system]

Oddly no errors at all were in system.log around the time the error was logged.

I think I found the smoking gun in kernel.log - at 16:44 the out-of-memory killer was triggered, and the “workqueue” does mention btrfs:

Jan 30 16:44:26 isolinear kernel: kworker/u8:7 invoked oom-killer: gfp_mask=0x2400840, order=0, oom_score_adj=0
Jan 30 16:44:27 isolinear kernel: kworker/u8:7 cpuset=/ mems_allowed=0
Jan 30 16:44:27 isolinear kernel: CPU: 3 PID: 5904 Comm: kworker/u8:7 Tainted: P        W  O    4.4.218.alpine.1 #1
Jan 30 16:44:27 isolinear kernel: Hardware name: Annapurna Labs Alpine
Jan 30 16:44:27 isolinear kernel: Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
[...]
Jan 30 16:44:28 isolinear kernel: Out of memory: Kill process 25419 (rsync) score 424 or sacrifice child
Jan 30 16:44:28 isolinear kernel: Killed process 25419 (rsync) total-vm:1445248kB, anon-rss:1346512kB, file-rss:756kB

A few minutes later, at 16:53 BTRFS fell over (approximately when it appeared in the UI log):

Jan 30 16:53:53 isolinear kernel: BTRFS error (device md127): parent transid verify failed on 33592936824832 wanted 23746221 found 23869946
Jan 30 16:53:53 isolinear kernel: BTRFS error (device md127): parent transid verify failed on 33592936824832 wanted 23746221 found 23869946
Jan 30 16:53:53 isolinear kernel: BTRFS warning (device md127): Skipping commit of aborted transaction.
Jan 30 16:53:53 isolinear kernel: BTRFS: error (device md127) in cleanup_transaction:1864: errno=-5 IO failure
Jan 30 16:53:53 isolinear kernel: BTRFS info (device md127): forced readonly
Jan 30 16:53:53 isolinear kernel: BTRFS: error (device md127) in btrfs_drop_snapshot:9420: errno=-5 IO failure
Jan 30 16:53:53 isolinear kernel: BTRFS info (device md127): delayed_refs has NO entry
Jan 30 16:54:46 isolinear kernel: BTRFS error (device md127): parent transid verify failed on 33593071370240 wanted 23869946 found 23869944

I did feedback on the ReadyNas forum about this problem.

Doing the rebuild and DR

Fortunately in setting up my home-lab, I have tested and documented bare-metal DR so, in addition to off-site backups, I have a rehearsed DR plan that I have confidence in.

As the disk has mounted read-only, I was able to take copies the files created since the last off-site backup - I used the find command I notes in a previous post to find all the files made in January and then made a copy with rsync:

find /media/isolinear -newermt '2023-01-01T00:00:00'

I also took screenshots of the configuration pages to refer do during the rebuild.

Checking the last off-site backup

Following my previous DR instructions, I plugged the off-site disk into the backup server and opened the encrypted volume read-only:

cryptsetup luksOpen --readonly /dev/disk/by-partlabel/backuppc-offsite? backuppc-recovery-pv

Rebuilding the NAS

Once satisfied the backup was good, I did a factory reset on the NAS. I had to plug it into a non-LAG port due to losing the bonded connection in order to connect

I then:

Completed the initial wizard (including configuring SMTP and the admin password).
Connected to the UPS.
Re-added all of the users.
Created the iSCSI volume for backups - using the same initiator code as the old one, so it just worked with the existing server. (Don’t forget to assign the LUN to the group, or you will wonder why no volume appears when you login later…)
Created the bonded network interface and re-attached the NAS to the LCAP/LAG ports

Recovering the backup store

I attached the new iSCSI volume using the standard approach (iscsiadm --login). Then I partitioned the block device and set it up according to my original instructions for building the iSCSI volume - following the steps for:

partitioning.
encryption setup.
LVM setup. (Although I did note that lvcreate -n store -l 3725G backuppc in the notes should have been lvcreate -n store -L 3725G backuppc)

I then cloned the off-site backup disk to the new volume using e2image (the same tool used to create the off-site backup). As an optimisation, I removed the -c flag I usually use that would check if the destination block matches the source (to not copy if unnecessary) as the new volume is blank so all the check will do is slow it down:

e2image -ra -p /dev/mapper/backuppc-recovery-pv /dev/mapper/backuppc-store

Then wait a while (~16hrs) for the backup to copy back.

Restoring files from backup

Once the copy was complete, I did an fsck on the restored copy, unmounted the off-site disk (which can now be returned to the off-site location) and mounted the iSCSI volume:

fsck -t ext4 /dev/mapper/backuppc-store -f -n
cryptsetup luksClose backuppc-recovery-pv
udisksctl power-off -b "/dev/$( lsblk -no pkname "/dev/disk/by-partlabel/backuppc-offsite?" )"
mount /var/lib/backuppc

Before starting backuppc, I set $Conf{BackupsDisable} = 2; in /etc/backuppc/config.pl to stop any new backups being started until the restores are complete. Remember to set it back to $Conf{BackupsDisable} = 0; when done to resume scheduling backups.

Then started backuppc, to begin restoring files from the last backup:

systemctl start backuppc

As the NAS’ volumes are usually read-only to backuppc, I made them read/write then set them back to read-only as I restored each folder one-by-one.

Once the main backup files were restored, I used rsync to sync back the rescued files (changed since 1st January) and finally restored the “non-backed up” (in the sense of no versioned or automated backups) files from their respective copies.

The restoring was finally completed on 12th February, so from disaster (30th January) total time to recover was 13 days.