Getting a clean boot

Apr 14, 2025 lenovo homelab kubernetes nfsroot read-only-root tmpfs /var shared

Putting together a homelab Kubernetes cluster in my own stubborn way. I’m assuming a reader who’s basically me before I embarked on this little expedition, so I won’t go into minute detail about day-to-day Linux setup and administration - only the things that are new to me and have changed since I last encountered them.

Sections added as I actually proceed with this!

Cleaning up

One thing to address from part 4: it seems that ping is literally the only binary on the image using capabilities. Is that weird? It seems weird to me…

$ sudo su -
# cd /clients
# find . -type f -executable -exec getcap {} \;
./usr/bin/ping cap_net_raw=ep

…but it does seem to be true. Good, I guess? If that was the only thing using them then giving it the setuid bit should be sufficient. Given that I’ve got no security on my NFS share within my local network it’s definitely not the biggest security issue at hand to grant it that way.
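
For the record, switching ping over would look something like this on the gateway, inside the /clients tree (a sketch rather than the exact commands; setcap and getcap come from Debian's libcap2-bin package):

# As root on the gateway, in the master image
cd /clients
setcap -r usr/bin/ping      # strip the cap_net_raw=ep file capability
chmod u+s usr/bin/ping      # plain setuid root will do on this insecure LAN
getcap usr/bin/ping         # should now print nothing
ls -l usr/bin/ping          # ...and the mode should show an 's' for the owner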

I checked on my laptop running Ubuntu as well, and that does have a few extra items (some of them binaries within containers): lots of copies of ping, plus a handful of others including newuidmap and newgidmap.

Most of that doesn’t matter because it’s unlikely to ever run in this cluster, but I’ll remember¹ to look out for issues with newgidmap and newuidmap as the Kubernetes stuff is definitely going to involve some container namespaces!

I should probably keep an eye open for any capabilities settings with the Kubernetes binaries when I get to them as well.

So what else is sad?

Nothing concerning in the dmesg log output, but what about the systemd units?

dcminter@worker-node-448a5bddd8ba:~$  systemctl --failed
  UNIT                      LOAD   ACTIVE SUB    DESCRIPTION                           
● apt-daily-upgrade.service loaded failed failed Daily apt upgrade and clean activities
● apt-daily.service         loaded failed failed Daily apt download activities
● logrotate.service         loaded failed failed Rotate log files

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
3 loaded units listed.

Well, that makes sense for apt-daily-upgrade and apt-daily. I should probably just turn off the apt updates. This would be a horrible thing in a production system as I wouldn’t get security updates, but we’ve established this cluster isn’t going to be secure anyway. Once I figure out how to get it running at all I’ll worry about how to make sure the master read-only worker node image gets updated.
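
Turning them off properly means disabling the timer units rather than the services, and on this setup that has to happen against the master image on the gateway (a worker’s /etc is read-only). Something like this on the gateway ought to do it, using systemctl’s --root option to point at the /clients tree:

# Disable the periodic apt jobs in the shared read-only image
systemctl --root=/clients disable apt-daily.timer apt-daily-upgrade.timer
# ...or mask them outright so nothing can quietly re-enable them
systemctl --root=/clients mask apt-daily.timer apt-daily-upgrade.timer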

Logrotate failing is also not surprising; the logs are on a read-only filesystem, so (a) I can’t write logs in the first place and (b) log rotation can’t do anything to them. Time to make a decision there…

Sorting out /var

According to the Linux Foundation’s Filesystem Hierarchy Standard (FHS)…

Some portions of /var are not shareable between different systems. For instance, /var/log, /var/lock, and /var/run. Other portions may be shared, notably /var/mail, /var/cache/man, /var/cache/fonts, and /var/spool/news.

Uh… what’s actually under /var on these worker nodes?

dcminter@worker-node-448a5bddd8ba:~$ ls -al /var
total 44
drwxr-xr-x 11 root root 4096 Jun 29  2024 .
drwxrwxr-x 17 root root 4096 Oct 12 20:18 ..
drwxr-xr-x  2 root root 4096 Jun 30  2024 backups
drwxr-xr-x  7 root root 4096 Jun 29  2024 cache
drwxr-xr-x 14 root root 4096 Jun 29  2024 lib
drwxr-xr-x  2 root root 4096 Oct 12 21:12 local
lrwxrwxrwx  1 root root    9 Jun 29  2024 lock -> /run/lock
drwxr-xr-x  6 root root 4096 Oct 11 16:00 log
drwxrwsr-x  2 root mail 4096 Jun 29  2024 mail
drwxr-xr-x  2 root root 4096 Jun 29  2024 opt
lrwxrwxrwx  1 root root    4 Jun 29  2024 run -> /run
drwxr-xr-x  3 root root 4096 Jun 29  2024 spool
drwxrwxrwt  3 root root 4096 Oct 12 00:00 tmp

Right, and tmp seems to be a symlink to /tmp as well. That’s on the read-only filesystem and that’s surely going to cause trouble too.

Oh, and there are some tmpfs filesystems already…

dcminter@worker-node-448a5bddd8ba:~$ mount | grep tmpfs
udev on /dev type devtmpfs (rw,nosuid,relatime,size=8114380k,nr_inodes=2028595,mode=755,inode64)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=1628108k,mode=755,inode64)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,inode64)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=1628108k,nr_inodes=407027,mode=700,uid=1000,gid=1000,inode64)

So I can disregard lock and run because they’re already symlinks pointing at tmpfs mounts.

I’m going to do this:

/tmp and /var/log are going to get their own tmpfs mounts.

But by default the following will resolve to directories within the worker node’s own /var/local hierarchy (itself an NFS mount of the node’s directory under /workers): /var/backups, /var/cache, /var/lib, /var/opt, and /var/spool.

For the remaining shareable directories (/var/mail, /var/cache/man, /var/cache/fonts, and /var/spool/news) I will create additional NFS mounts adjacent to /workers/home on the cluster gateway.

Note that some of those overlap (/var/cache is per-node, for example, while /var/cache/man inside it is shared), but with this mount ordering it ought to be ok… I think?
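
As a sanity check once it’s all in place, something like this on a worker should list exactly the mounts described above (findmnt is part of util-linux and already on the image):

# Show only the tmpfs and NFS mounts, with where they come from
findmnt -t tmpfs,nfs,nfs4 -o TARGET,SOURCE,FSTYPE,OPTIONS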

Creating the tmpfs mounts

Adding the tmpfs mounts works fine - after a long digression because I’d forgotten to prefix the mount points with /root in the initrd mount script! Logs and tmp files are getting captured AOK (and this actually made it clearer that systemd is managing logs, not the old syslogd). Here are the lines added to my mount_node_nfs script:

# Mount the extra tmpfs filesystems
kprint "Mounting /tmp as a tmpfs with TMPSIZE=$TMPSIZE"	
mount -t tmpfs -o "nodev,noexec,nosuid,size=${TMPSIZE:-5%},mode=0777" tmpfs /root/tmp
kprint "Mounting /var/log as a tmpfs with LOGSIZE=$LOGSIZE"
mount -t tmpfs -o "nodev,noexec,nosuid,size=${LOGSIZE:-5%},mode=0775" tmpfs /root/var/log

I’m allowing each one an extra 5% of memory by default - at the moment TMPSIZE and LOGSIZE aren’t set anywhere, but I’m allowing for them to be overridden by some other config or boot command line parameter later.
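
That override isn’t wired up yet; when I get to it, it’ll probably be a couple of lines like these near the top of the script, pulling (made-up) tmpsize= and logsize= parameters off the kernel command line so that the ${TMPSIZE:-5%} defaults only apply when they’re absent:

# Optional overrides from the boot command line, e.g. tmpsize=512m logsize=256m
TMPSIZE=$(/bin/sed -n 's/.*tmpsize=\([^ ]*\).*/\1/p' /proc/cmdline)
LOGSIZE=$(/bin/sed -n 's/.*logsize=\([^ ]*\).*/\1/p' /proc/cmdline)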

Incidentally that problem where I’d forgotten to prefix the mountpoints with /root was really annoying to debug, because there weren’t any errors or anything - it’s just that there weren’t any tmpfs mounts in the booted system after the pivot. Hopefully I won’t forget about that again!

Per the plan, I basically zapped all the original directories and re-symlinked them under the /clients directory on the gateway machine, so that they’d appear as symlinks in the appropriate places once NFS-mounted on the worker nodes.
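
The gist of it on the gateway was something like this (a sketch rather than the exact commands; the five directory names are the ones the init script expects to find under /var/local):

# In the master image: replace the per-node directories with symlinks
# into /var/local, which each worker mounts from its own directory on
# the /workers NFS share
cd /clients/var
for d in backups cache lib opt spool; do
    rm -rf "$d"
    ln -s "local/$d" "$d"
done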

If I do this but nothing else then systemd seems to get a bit upset … the worker nodes still boot but terribly slowly and there are some complaints in the dmesg log output:

...
[    9.611374] systemd[1]: systemd-random-seed.service: Main process exited, code=exited, status=1/FAILURE
[    9.611651] systemd[1]: systemd-random-seed.service: Failed with result 'exit-code'.
[    9.611953] systemd[1]: Failed to start systemd-random-seed.service - Load/Save Random Seed.
[    9.612479] systemd[1]: first-boot-complete.target - First Boot Complete was skipped because of an unmet condition check (ConditionFirstBoot=yes).
[    9.621888] systemd[1]: Finished systemd-sysctl.service - Apply Kernel Variables.
[    9.642473] systemd[1]: Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
[    9.642804] systemd[1]: Reached target local-fs-pre.target - Preparation for Local File Systems.
[    9.643025] systemd[1]: Reached target local-fs.target - Local File Systems.
[    9.645500] systemd[1]: systemd-binfmt.service - Set Up Additional Binary Formats was skipped because of an unmet condition check (ConditionPathIsMountPoint=/proc/sys/fs/binfmt_misc).
[    9.645643] systemd[1]: systemd-machine-id-commit.service - Commit a transient machine-id on disk was skipped because of an unmet condition check (ConditionPathIsMountPoint=/etc/machine-id).
...

However, I then also added the following lines into the initrd image’s mount_node_nfs script to create these target directories under the share:

# Make sure the various /var/ mount points are OK
mkdir -p /root/var/local/backups
mkdir -p /root/var/local/cache
mkdir -p /root/var/local/lib
mkdir -p /root/var/local/opt
mkdir -p /root/var/local/spool

With this change the reboot runs swiftly, and after the reboot I can see that various additional subdirectories have been populated in the worker-specific directories; before the changes the /workers NFS share content was:

dcminter@cluster-gateway:/workers$ tree
.
├── 0023249434ae
│   ├── cmdline
│   ├── hostname
│   └── hosts
├── 448a5bddd8ba
│   ├── cmdline
│   ├── hostname
│   └── hosts

After modifying the init script - and recalling that the init script itself only directly creates 10 of these directories (2 nodes x 5 additions) - the contents of the /workers share have become:

dcminter@cluster-gateway:/workers$ tree
.
├── 0023249434ae
│   ├── backups
│   ├── cache
│   │   └── private  [error opening dir]
│   ├── cmdline
│   ├── hostname
│   ├── hosts
│   ├── lib
│   │   ├── dbus
│   │   │   └── machine-id -> /etc/machine-id
│   │   ├── private  [error opening dir]
│   │   └── systemd
│   │       ├── coredump
│   │       ├── linger
│   │       ├── pstore
│   │       ├── random-seed
│   │       └── timers
│   │           ├── stamp-apt-daily.timer
│   │           ├── stamp-apt-daily-upgrade.timer
│   │           ├── stamp-e2scrub_all.timer
│   │           ├── stamp-fstrim.timer
│   │           └── stamp-logrotate.timer
│   ├── opt
│   └── spool
│       └── cron
│           └── crontabs  [error opening dir]
├── 448a5bddd8ba
│   ├── backups
│   ├── cache
│   │   └── private  [error opening dir]
│   ├── cmdline
│   ├── hostname
│   ├── hosts
│   ├── lib
│   │   ├── dbus
│   │   │   └── machine-id -> /etc/machine-id
│   │   ├── private  [error opening dir]
│   │   └── systemd
│   │       ├── coredump
│   │       ├── linger
│   │       ├── pstore
│   │       ├── random-seed
│   │       └── timers
│   │           ├── stamp-apt-daily.timer
│   │           ├── stamp-apt-daily-upgrade.timer
│   │           ├── stamp-e2scrub_all.timer
│   │           ├── stamp-fstrim.timer
│   │           └── stamp-logrotate.timer
│   ├── opt
│   └── spool
│       └── cron
│           └── crontabs  [error opening dir]
├── home
│   ├── dcminter
│   └── root  [error opening dir]
└── var

Ignore the “error opening dir” messages; that’s just because I didn’t run it as root, whereas on the workers some of those directories are being created for root’s eyes only.

Anyway, now the dmesg output has calmed down:

[    9.648331] systemd[1]: Starting systemd-random-seed.service - Load/Save Random Seed...
[    9.650213] systemd[1]: systemd-sysusers.service - Create System Users was skipped because no trigger condition checks were met.
[    9.651264] systemd[1]: Starting systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev...
[    9.656319] systemd[1]: Finished systemd-modules-load.service - Load Kernel Modules.
[    9.657561] systemd[1]: Starting systemd-sysctl.service - Apply Kernel Variables...
[    9.684091] systemd[1]: Finished systemd-random-seed.service - Load/Save Random Seed.
[    9.684518] systemd[1]: first-boot-complete.target - First Boot Complete was skipped because of an unmet condition check (ConditionFirstBoot=yes).
[    9.687169] systemd[1]: Finished systemd-sysctl.service - Apply Kernel Variables.
...

So that’s looking pretty solid.

Creating the shared var mountpoints

Next up are the remaining NFS mountpoints - the ones that don’t need to be exclusive to individual workers. Here I planned to copy across all of the original files rather than starting with empty directories.

I screwed up slightly though… I deleted the contents of /clients/var/cache and /clients/var/spool in the preceding step, so I don’t have any contents to copy over for those two. However, checking the contents of the gateway server’s own OS install (which, you may recall, is also a Debian 12 system), it seems they’re not populated with any actual files beyond auto-created ones, so I just create my new targets on the gateway as empty directories: /workers/var/mail, /workers/var/cache/man, /workers/var/cache/fonts, and /workers/var/spool/news.

Then I add the corresponding NFS mounts to the initrd image’s script:

# Mount the shared /var nfs paths

kprint "Mounting the shared /var nfs paths"

# /var/mail will already exist as an empty directory from the root nfs mount	
nfsmount -o rw 192.168.0.254:/workers/var/mail /root/var/mail

mkdir -p /root/var/local/cache/man
nfsmount -o rw 192.168.0.254:/workers/var/cache/man /root/var/local/cache/man

mkdir -p /root/var/local/cache/fonts
nfsmount -o rw 192.168.0.254:/workers/var/cache/fonts /root/var/local/cache/fonts

mkdir -p /root/var/local/spool/news
nfsmount -o rw 192.168.0.254:/workers/var/spool/news /root/var/local/spool/news

After the pivot the /root/... directories should correspond to those empty directories created under /workers/... on the gateway (NFS server).

I gave this a quick test by creating example files on the worker in e.g. /var/spool/news and verifying that they materialise on the gateway in /workers/var/spool/news, so that looks solid. None of these are directories I really anticipate using in the Kubernetes part of this project, but it’s good to get everything looking at least roughly correct.
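
In concrete terms the test was just something along these lines (the file name is invented for the example):

# On the worker (as root)
echo "hello from the worker" > /var/spool/news/nfs-test

# On the gateway
cat /workers/var/spool/news/nfs-test    # should print "hello from the worker"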

What’s left?

I do see the following lines from the logs emitted by journalctl at this point…

Apr 13 17:03:42 worker-node-448a5bddd8ba systemd-tmpfiles[246]: "/var/lib" already exists and is not a directory.
Apr 13 17:03:42 worker-node-448a5bddd8ba systemd[1]: Started systemd-udevd.service - Rule-based Manager for Device Events and Files.
Apr 13 17:03:42 worker-node-448a5bddd8ba systemd-tmpfiles[246]: Failed to create directory or subvolume "/root", ignoring: Read-only file system
Apr 13 17:03:42 worker-node-448a5bddd8ba systemd-tmpfiles[246]: Failed to open path '/root', ignoring: No such file or directory
Apr 13 17:03:42 worker-node-448a5bddd8ba systemd-tmpfiles[246]: "/var/cache" already exists and is not a directory.
Apr 13 17:03:42 worker-node-448a5bddd8ba systemd-tmpfiles[246]: "/var/spool" already exists and is not a directory.

It’s quite true that these paths are no longer directories; they’re symlinks now. Consulting the manpage for tmpfiles.d it looks like I can edit the files under /clients/usr/lib/tmpfiles.d and adjust the configurations so that instead of managing directories they manage symlinks. Better yet, the exact matches with those paths (rather than subdirectories under them) are all managed from the var.conf file there.
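
A quick grep over the image is enough to confirm which snippet those exact-path entries live in:

grep -nE ' /var/(cache|lib|spool) ' /clients/usr/lib/tmpfiles.d/*.conf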

Editing that file, I make the following changes (these lines are all at the end of the file). Before my changes, the d character indicates a “directory to create and clean up”:

d /var/cache 0755 - - -
d /var/lib 0755 - - -
d /var/spool 0755 - - -

After my changes, the L character indicates a “symlink to create”:

L /var/cache 0755 - - -
L /var/lib 0755 - - -
L /var/spool 0755 - - -

This knocks those errors from journalctl on the head successfully.
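
Incidentally, tmpfiles.d(5) says the L type can also take an explicit symlink target in its final argument field (the mode field is ignored for symlinks), so a more self-documenting version of those lines - assuming the symlinks point into /var/local as set up earlier - would be:

L /var/cache - - - - local/cache
L /var/lib   - - - - local/lib
L /var/spool - - - - local/spool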

That server IP address

Just one more thing I want to clean up now - the big mount_node_nfs script has the IP address of the NFS server hard-coded into it. I’d prefer to take that directly from the incoming command line. On one of the worker nodes that command line (readable from /proc/cmdline, as you may remember) ends up looking like this:

BOOT_IMAGE=vmlinuz-6.1.0-21-amd64 root=/dev/nfs nfsroot=192.168.0.254:/clients,ro ip=dhcp nfsrootdebug initrd=initrd-cluster-25_04_13_22_51_44_CEST ip=192.168.0.3:192.168.0.254:192.168.0.254:255.255.255.0 BOOTIF=01-44-8a-5b-dd-d8-ba CPU=6PVXL

The nfsroot parameter has to contain the address of the NFS server. At some point I will want to separate the NFS server from the gateway server, so I’d rather take it from there than assume that one of the fields of the ip parameter (which carries the node’s IP, the boot server’s IP, the gateway’s IP, and the netmask) happens to be the right value. A bit of sed magic should do the trick there…

sed -n 's/.* nfsroot=\([0-9.]*\).*/\1/p'

That assumes an IPv4 address rather than a host name or an IPv6 address, but those seem like safe assumptions here; I’m not going to add IPv6 to an already long-winded project.

First I pop in the sed script and a diagnostic output…

NFS_SERVER=$(/bin/sed -n 's/.* nfsroot=\([0-9.]*\).*/\1/p' /proc/cmdline)
kprint "NFS server IP address is $NFS_SERVER"

Checking the dmesg log after a reboot…

...
[    8.892871] dcminter: NFS server IP address is 192.168.0.254
...

That looks right. The last thing to do is rewrite the explicit uses of 192.168.0.254 in the script to use the $NFS_SERVER variable instead (just a search and replace in the script).
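
The search and replace itself is nothing fancy; something like this against the copy of the script that goes into the initrd image does it (single quotes so that $NFS_SERVER lands in the file as literal text rather than being expanded):

sed -i 's/192\.168\.0\.254/$NFS_SERVER/g' mount_node_nfs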

This bit worked first time! Here’s the full init script that I’ve ended up with so far:

#!/bin/sh -e

function kprint() {
	echo "dcminter: $1" > /dev/kmsg
}

function nfs_mount_node() {
	kprint 'About to attempt to mount the node share on nfs'
	kprint "Boot variable is $BOOTIF"
	kprint "rootmnt is $rootmnt"

	SUFFIX=$(/bin/sed 's/.*BOOTIF=\(..\-..\-..\-..\-..\-..\-..\).*/\1/' /proc/cmdline | /bin/sed 's/..\-\(..\)\-\(..\)\-\(..\)\-\(..\)\-\(..\)\-\(..\)/\1\2\3\4\5\6/')
	kprint "Target hostname directory has suffix $SUFFIX"

	NFS_SERVER=$(/bin/sed -n 's/.* nfsroot=\([0-9.]*\).*/\1/p' /proc/cmdline)
	kprint "NFS server IP address is $NFS_SERVER"

	kprint "Dumping boot commandline"
	cat /proc/cmdline > /dev/kmsg

	# Note - Anything mounted under /root will be under / after the pivot!

	kprint "Create the node directory under the /workers share"
	mkdir -p /workers # Mount ephemerally - this mountpoint intentionally won't be around after the pivot!
	nfsmount -o rw $NFS_SERVER:/workers /workers
	kprint "Creating the node's directory if it does not already exist"
	mkdir -p /workers/$SUFFIX
	umount /workers
	kprint "Unmounted /workers"

	kprint "Mounting /workers/home to /root/home in preparation for pivot"
        nfsmount -o rw $NFS_SERVER:/workers/home /root/home
	kprint "Mounted /root/home"

	kprint "Mount the node's directory as the writeable /root/var/local"
	nfsmount -o rw $NFS_SERVER:/workers/$SUFFIX /root/var/local
	kprint "Mounted /root/workers"

	# Make sure the various /var/ mount points are OK
	mkdir -p /root/var/local/backups
	mkdir -p /root/var/local/cache
	mkdir -p /root/var/local/lib
	mkdir -p /root/var/local/opt
	mkdir -p /root/var/local/spool

	# Mount the shared /var nfs paths

	kprint "Mounting the shared /var nfs paths"

        # /var/mail will already exist as an empty directory from the root nfs mount	
	nfsmount -o rw $NFS_SERVER:/workers/var/mail /root/var/mail

	mkdir -p /root/var/local/cache/man
	nfsmount -o rw $NFS_SERVER:/workers/var/cache/man /root/var/local/cache/man

	mkdir -p /root/var/local/cache/fonts
	nfsmount -o rw $NFS_SERVER:/workers/var/cache/fonts /root/var/local/cache/fonts

	mkdir -p /root/var/local/spool/news
	nfsmount -o rw $NFS_SERVER:/workers/var/spool/news /root/var/local/spool/news
	
	# Mount the extra tmpfs filesystems

        kprint "Mounting /tmp as a tmpfs with TMPSIZE=$TMPSIZE"	
	mount -t tmpfs -o "nodev,noexec,nosuid,size=${TMPSIZE:-5%},mode=0777" tmpfs /root/tmp
	kprint "Mounting /var/log as a tmpfs with LOGSIZE=$LOGSIZE"
	mount -t tmpfs -o "nodev,noexec,nosuid,size=${LOGSIZE:-5%},mode=0775" tmpfs /root/var/log

	# Set the hostname (and make it sticky)
	kprint "Set the hostname"
	hostname "worker-node-$SUFFIX"
	echo "worker-node-$SUFFIX" > /root/var/local/hostname
	kprint "Hostname should be worker-node-$SUFFIX now"

	kprint "Adding loopback resolution for hostname"
	cp /etc/hosts /root/var/local/hosts
	echo "127.0.0.1		worker-node-$SUFFIX" >> /root/var/local/hosts
	echo "::1		worker-node-$SUFFIX" >> /root/var/local/hosts
	kprint "Appropriate hosts file created."

	kprint "Write cmdline to var mount to make ip address identification easier from outside the worker node"
	cat /proc/cmdline > /root/var/local/cmdline
	kprint "Written cmdline"
}

kprint 'We ran a boot script after mounting root'
nfs_mount_node

Here are all of the diagnostic outputs from that script that end up in the dmesg log after booting one of the worker nodes:

[    8.882392] dcminter: We ran a boot script after mounting root
[    8.882490] dcminter: About to attempt to mount the node share on nfs
[    8.882562] dcminter: Boot variable is 01-00-23-24-94-34-ae
[    8.882631] dcminter: rootmnt is /root
[    8.883791] dcminter: Target hostname directory has suffix 0023249434ae
[    8.884616] dcminter: NFS server IP address is 192.168.0.254
[    8.884689] dcminter: Dumping boot commandline
[    8.885342] BOOT_IMAGE=vmlinuz-6.1.0-21-amd64 root=/dev/nfs nfsroot=192.168.0.254:/clients,ro ip=dhcp nfsrootdebug initrd=initrd-cluster-25_04_13_23_35_12_CEST ip=192.168.0.4:192.168.0.254:192.168.0.254:255.255.255.0 BOOTIF=01-00-23-24-94-34-ae CPU=6PVXL
[    8.885596] dcminter: Create the node directory under the /workers share
[    8.891270] dcminter: Creating the node's directory if it does not already exist
[    8.925022] dcminter: Unmounted /workers
[    8.925099] dcminter: Mounting /workers/home to /root/home in preparation for pivot
[    8.930581] dcminter: Mounted /root/home
[    8.930656] dcminter: Mount the node's directory as the writeable /root/var/local
[    8.957847] dcminter: Mounted /root/workers
[    8.968580] dcminter: Mounting the shared /var nfs paths
[    9.065769] dcminter: Mounting /tmp as a tmpfs with TMPSIZE=
[    9.066910] dcminter: Mounting /var/log as a tmpfs with LOGSIZE=
[    9.068002] dcminter: Set the hostname
[    9.070544] dcminter: Hostname should be worker-node-0023249434ae now
[    9.070619] dcminter: Adding loopback resolution for hostname
[    9.074700] dcminter: Appropriate hosts file created.
[    9.074777] dcminter: Write cmdline to var mount to make ip address identification easier from outside the worker node
[    9.076708] dcminter: Written cmdline

The script could be a bit more elegant - it’s repetitive, and parts of it probably ought to be split out into their own dedicated init scripts - but for now this will do.
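
If I do split things up, the stock initramfs-tools boot scripts (the ones shipped under /usr/share/initramfs-tools/scripts) declare their ordering with a small prereqs preamble, so each split-out piece would presumably start something like this:

#!/bin/sh
# Standard initramfs-tools boot-script header: PREREQ names any scripts
# in the same directory that must run before this one.
PREREQ=""
prereqs() {
	echo "$PREREQ"
}
case $1 in
prereqs)
	prereqs
	exit 0
	;;
esac

# ...the actual work goes here...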

I don’t suppose it’s the last time I’ll rebuild the initrd image, but it’s a good point to move on from.

How sad are we now?

But wait, are those systemd units all working ok now?

dcminter@worker-node-0023249434ae:~$ systemctl --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.

Yes. Much less sad!

Moar Contacts!

Since I posted the last entry in this series on setting up the cluster I got another note, this time from Alexander who’s setting up something similar in overall effect, but likely a lot more robust in actual implementation. His approach is based more around overlay filesystems.

Very nice to get these alternative perspectives on a similar project - and interesting that there seem to be quite a few lone enthusiasts setting up remote booted clusters like mine.

Next

This part felt more like tidying up than anything really tricky. With other distractions it took me a while to get back and finish it, but it was pretty plain sailing once I did.

I’m pretty sure it’s going to take more than one part for the rest of getting Kubernetes up and running on this little cluster of machines. One thing I can see might be a problem is that I’ve not left myself an easy way to install new software at the OS level on these systems - but it wouldn’t be fun without a few bumps in the road. That said, it would not surprise me at all if I have to backtrack a bit and try something like the approach Alexander adopted with overlay filesystems. We’ll see.

So. Coming soon, for extremely large values of “soon”: Part 6 - Kubernetes at last


Footnotes

¹ I have a nasty feeling that this is foreshadowing and that actually I will totally forget that and have a frickin' nightmare figuring out that this is at the root of some config black hole. You can laugh at me when I get there if so.

© 2017 - 2025 Dave Minter