Resuming suspend-to-disk with missing initrd

This morning two unfortunate things coincided:

Consequently, the machine failed to come up at all, showing a kernel panic about being unable to find the root device.

As I have way more volatile state in a running session than I'd like, I made every effort to resume the session, which (spoilers!) eventually worked. Note that this is not a tutorial, just a report that roughly these steps worked for me.

It's worth pointing out that one likely gets only one shot at this. Had I gotten it half right, so that the init system could open the encrypted root partition but looked for the resume image in the wrong place, the system would have booted normally and invalidated whatever hibernation image there was. (An issue I ran into several years ago, where the system would try to resume from the stale image on a subsequent boot and wreck everything, has been resolved as far as I understand.)

Common setup

The computer I recovered is running Debian GNU/Linux sid; the helper device is on stable (buster).

Both have their internal SSD (nvme0n1) partitioned into a UEFI partition, a (too small!) boot partition, and a LUKS-encrypted LVM PV that holds at least a root and a swap partition.

Initial error

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

(Note that these don't come from screenshots, but from a reconstruction I did afterward, so some details might be off.)

This had me startled for a moment, fearing for my disk, but the lack of a Loading initial ramdisk ... line in the grub phase pointed me to the first-stage culprit: the grub entry did not have an initrd configured. (And none of the backup kernels had one either.)

initrd transfusion

Fortunately, I have a second device nearby with a very similar setup. I copied its initrd to a USB stick and tried to boot from there.

echo 'Loading Linux 5.6.1-1-amd64'
linux /vmlinuz-5.6.0-1-amd64 root=-/dev/mapper/hephaistos--vg-root ro quiet
echo 'Manually added initrd'
initrd (hd0)/initrd.img-5.6.0-0.bpo.2-amd64

No luck though: the initrd shell showed barely anything in /dev, and lsmod revealed no loaded modules. It turned out that the systems were not identical enough; the kernel versions didn't match, so the initrd carried no modules for the kernel I had booted. A second run with matching kernel versions fixed this:

echo 'Loading Linux 5.6.1-1-amd64'
linux (hd0)/vmlinuz-5.6.0-0.bpo.2-amd64 root=-/dev/mapper/hephaistos--vg-root ro quiet
echo 'Manually added initrd'
initrd (hd0)/initrd.img-5.6.0-0.bpo.2-amd64
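A mismatch like the one in the first attempt can be spotted before rebooting by comparing the version suffixes of the two file names. This is only a minimal sketch, assuming the Debian naming scheme (vmlinuz-VERSION and initrd.img-VERSION); the function name versions_match is made up for illustration.

```shell
# Hypothetical helper: succeed only if the kernel and initrd file names
# carry the same version suffix (Debian naming scheme assumed).
versions_match() {
    kernel=${1##*/}    # drop any directory or "(hd0)" prefix
    initrd=${2##*/}
    [ "${kernel#vmlinuz-}" = "${initrd#initrd.img-}" ]
}

if versions_match "(hd0)/vmlinuz-5.6.0-0.bpo.2-amd64" \
                  "(hd0)/initrd.img-5.6.0-0.bpo.2-amd64"; then
    echo "kernel and initrd versions match"
fi
```

It compares names only; it cannot tell whether the modules inside the initrd were really built for that kernel.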

UUIDs

A plain boot attempt from there did not work: the initrd waited for the other computer's disk, which it knew by UUID, to show up so that it could start decrypting it.

At this stage, I could extract the two UUIDs I knew I would later need:

# blkid /dev/nvme0n1p3
/dev/nvme0n1p3: UUID="b47fac73-66d5-42c6-a370-f9f1dce497d1" TYPE="crypto_LUKS" PARTUUID="76773950-442c-4b81-bf70-9b9da41ec5bf"
# cryptsetup luksOpen /dev/nvme0n1p3 decrypted
[...]
# blkid /dev/mapper/hephaistos--vg-swap_1
/dev/mapper/hephaistos--vg-swap_1: UUID="3c7f3271-f54f-41f0-8ec5-aee20977418e" TYPE="swap"
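Copying UUIDs by hand between machines is error-prone, so it may help to extract them mechanically from saved blkid output. The following is a sketch, not what I did at the time; the function name uuid_from_blkid_line is made up, and it assumes the default blkid output format shown above. (Where the util-linux blkid is available, blkid -s UUID -o value DEVICE prints the bare UUID directly.)

```shell
# Hypothetical helper: read one line of default blkid output on stdin
# and print the value of its UUID="..." field.
# The leading space in the pattern keeps PARTUUID="..." from matching.
uuid_from_blkid_line() {
    sed -n 's/^[^:]*:.* UUID="\([^"]*\)".*$/\1/p'
}

line='/dev/mapper/hephaistos--vg-swap_1: UUID="3c7f3271-f54f-41f0-8ec5-aee20977418e" TYPE="swap"'
swap_uuid=$(printf '%s\n' "$line" | uuid_from_blkid_line)
echo "resume=UUID=$swap_uuid"
```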

With this, I could take the USB stick back to the helper PC.

initrd fine tuning

As both the encrypted and the resume partition are named in the initrd, I tried building a suitable initrd.

On the helper PC, I changed /etc/crypttab to reflect the UUID of the NVMe partition, and /etc/initramfs-tools/conf.d/resume to match the name that the swap partition would have once cryptsetup had opened the encrypted volume according to the crypttab. (The swap partition's UUID did not go in there yet.)

I used dpkg-reconfigure initramfs-tools to rebuild the helper's initramfs images. That may not have been a particularly wise move, but trusting that that device wouldn't need a reboot soon, I took the risk temporarily. (Those images are a bit pesky to edit by hand; were that easier, I could have edited them on the USB stick and been done with it.)

Fortunately, the update-initramfs process run in that update pointed out an issue to me: it couldn't find the indicated resume device (how would it, it's on another machine), and fell back to a sane default.

That meant that resume would not work -- the gravest danger of this operation, as it would mean I'd lose the resume state.

Final startup

Fortunately, the resume partition can also be specified on the kernel command line, which is where the swap partition's UUID came in handy:

echo 'Loading Linux 5.6.1-1-amd64'
linux (hd0)/vmlinuz-5.6.0-0.bpo.2-amd64 root=-/dev/mapper/hephaistos--vg-root ro quiet resume=UUID=3c7f3271-f54f-41f0-8ec5-aee20977418e
echo 'Manually added initrd'
initrd (hd0)/initrd.img-5.6.0-0.bpo.2-amd64
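Since a missing or mistyped resume= parameter is exactly what would cost the hibernation image, a candidate command line can be checked mechanically before it is typed into grub. This is a rough sketch with a made-up function name (resume_param); it splits the line on whitespace only and does not handle quoted parameters.

```shell
# Hypothetical helper: print the value of the resume= parameter found in
# the kernel command line passed as $1; return non-zero if there is none.
resume_param() {
    for word in $1; do    # intentionally unquoted: split on whitespace
        case $word in
            resume=*)
                printf '%s\n' "${word#resume=}"
                return 0
                ;;
        esac
    done
    return 1
}

# After a successful boot, the same check can be run on the live system:
#   resume_param "$(cat /proc/cmdline)"
resume_param "ro quiet resume=UUID=3c7f3271-f54f-41f0-8ec5-aee20977418e"
```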

With that, my system came up as it should.

Of course, the first thing I did there was to run dpkg-reconfigure initramfs-tools to ensure it would come up again.

Second thing, I restored the helper PC, or I'd have needed to do the same thing the other way 'round once more later.

Tags:blog-chrysn