Following on from my post on adding the 2.5GbE network to my Proxmox cluster, this post documents setting up the new OS disks and upgrading to Proxmox 9 in the process. I did this by re-installing the nodes one by one (the disks were already physically installed).

The process I intend to follow is:

  1. Remove node to be upgraded from the Ceph, Proxmox and Vault clusters.
  2. Re-install the node with Debian 13 “Trixie” and install Proxmox 9 on top, on the new NVMe M.2 drives, using the extant 2.5” SATA SSDs as dedicated Ceph drives.
  3. Re-add the node to the clusters.
  4. Repeat until all 5 nodes have been upgraded and are using the dedicated drives for Ceph.

My new IP KVM was very useful for doing this - particularly the Secure Boot configuration. However, its virtual keyboard does not work with these computers (probably for the same reason it will not recognise keyboards connected via USB hubs), so I had to use a physical keyboard alongside it.

Remove node to be reinstalled from cluster

The recommendation I found on the Proxmox Forum is to drain the Ceph OSD before removal. The Ceph MGR and MON can be destroyed directly, after which the node can be removed from Proxmox.

  1. Migrate all VMs off the node - I did this from the Proxmox UI.

    Some VMs have USB drives (all on my pve05 node, as it happens) mapped by USB vendor and device ID, so the mapping would work on any node, but Proxmox still doesn’t allow migration and gives the error Cannot migrate VM with local resources. I removed the devices, migrated the VMs, and re-added the devices afterwards. The devices were (VM name and device list):

    • secureboot-test:
      • usb0 - 1ea8:f825 (NOT USB3)
    • starfleet-archives:
      • usb0 - 152d:0569 (USB3)
      • usb1 - 0bc2:231a (USB3)
      • usb2 - 0bc2:2344 (USB3)
  2. Remove the node from the Ceph cluster (on the node being removed):

    1. Remove the OSD:

      I used a hybrid of commands from the instructions for Destroy OSDs (Proxmox), replacing an OSD (Ceph) and removing an OSD (Ceph):

      1. With a healthy Ceph cluster, I set the OSD to be replaced to out:

        (ceph osd tree will tell you which OSDs are on which node.)

         # 1 is the OSD number - replace with the appropriate one on each node
         OSD_ID=1
         ceph osd out ${OSD_ID}
        
      2. I then waited for the cluster to rebalance (you can run sudo ceph -w to watch progress - sudo ceph osd safe-to-destroy osd.${OSD_ID} will say it cannot be safely removed until this is complete anyway). This loop from the Ceph documentation will also report progress, as the number of PGs still mapped to the OSD:

         while ! ceph osd safe-to-destroy osd.${OSD_ID} ; do sleep 10 ; done
        
      3. Stop the OSD service - Ceph will not allow it to be destroyed if it is up:

         pveceph stop --service osd.${OSD_ID}
        
      4. Destroy the OSD:

         pveceph osd destroy ${OSD_ID}
        
    2. Monitor, manager and metadata can just be destroyed:

       MON_ID=$(hostname -s)
       pveceph mon destroy ${MON_ID}
       pveceph mgr destroy ${MON_ID}
       pveceph mds destroy ${MON_ID}
      
  3. Remove the node from the Vault cluster:

    This can be done from any host with Vault access and, on my cluster, requires a root token. I generated one using the standard process (sketched after this list):

     # Replace with node to remove
     NODE_ID=pve04
     vault operator raft remove-peer ${NODE_ID}
    
  4. Remove the node from the Proxmox cluster:

    This must be done from another node or you will get the error Cannot delete myself from cluster!.

     # Replace with node to remove
     NODE_ID=pve04
     sudo pvecm delnode ${NODE_ID}
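
For reference, the standard Vault root token generation mentioned in step 3 goes roughly like this (a sketch only - the exact flow, and where the unseal key shares come from, is in the Vault documentation):

# Start a root token generation - note the Nonce and OTP it prints
vault operator generate-root -init
# Each unseal key holder submits their share (repeat until the threshold is met)
vault operator generate-root -nonce=<nonce>
# Decode the encoded token printed after the final share, using the OTP from -init
vault operator generate-root -decode=<encoded-token> -otp=<otp>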
    

Re-installing the node

I already have Ansible playbooks that fully automate (re)installing systems. However, this was the first Debian Trixie and Proxmox 9 system I was installing, so I had to make a few changes before reinstalling the first node.

The reinstall process does a basic install of Debian with LVM on an encrypted partition, which is identical regardless of the ultimate purpose of the server (host-specific configuration is done post-install), so little change was required here - just updating it to install the new stable version, Debian 13 “Trixie”.

Adding Debian Trixie support

The iPXE file that boots the Debian installer is configured using a variable for the URL to the installer, so all that was needed there was to update the variable in my install.yaml playbook:

- name: Auto install iPXE configuration is correct
  become: true
  ansible.builtin.copy:
    # [...]
  vars:
    debian_release: trixie
    # [...]

For my Debian installer preseed configurations, I modified the debian_releases variable in my debian-installer role’s settings to remove Bullseye (now old-old-stable) and add Trixie (stable). In the role’s defaults/main.yaml:

debian_releases:
  - bookworm
  - trixie

And the corresponding specification in meta/argument_specs.yaml:

debian_releases:
  short_description: List of Debian releases (codenames) to deploy preseeds for.
  # [...]
  default:
    - bookworm
    - trixie

Again, as everything in the preseed files is templated from these variables I did not need to change anything else here to add Trixie support.

Skipping additional firmware prompt and choosing network interface

The new 2.5GbE USB adaptors were causing the Debian installer to prompt for loading additional firmware. These adaptors are not involved in the PXE boot and installation, and the prompt occurs before the network is configured to fetch the preseed file, so I skipped it by adding hw-detect/load_firmware=false to the kernel options in the auto-install iPXE configuration in the install.yaml playbook.

Once I had done this, I was still getting prompted to choose which network interface to use - adding netcfg/choose_interface=auto to the kernel options, in the same file, skipped this too.
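
For illustration, the relevant lines of the generated auto-install iPXE configuration end up looking something like this (the URLs are placeholders and other options are omitted - the real file is templated from install.yaml):

#!ipxe
# Other options (preseed URL etc.) omitted - the real file is templated by install.yaml
kernel http://boot.example.org/debian/trixie/linux auto=true priority=critical hw-detect/load_firmware=false netcfg/choose_interface=auto
initrd http://boot.example.org/debian/trixie/initrd.gz
boot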

Using the first disk, if a host has more than one

These hosts now have multiple disks in them, but my preseed has only ever been used for (and hence only supported) hosts with a single disk. To get around this, I added a little script as the partman/early_command setting that finds the first disk and tells the installer to use that. It was added to the templates/preseed.cfg.j2 template in my debian-installer Ansible role:

# Force use of first disk if more than one device exists
d-i partman/early_command string                  \
  if [ "$( list-devices disk | wc -l )" -gt 1 ];  \
  then                                            \
    DISK="$( list-devices disk | head -n1 )";     \
    debconf-set partman-auto/disk "${DISK}";      \
    debconf-set grub-installer/bootdev "${DISK}"; \
  fi

Applying the installation process changes

Once all these changes had been made, I re-ran just the relevant play against my installer script server to create the installer preseed files and associated scripts for the installation process to use:

ansible-playbook -i inventory.yaml -l debian_installer_sources -t debian_installer site.yaml

Reinstalling the OS

ansible-playbook -i inventory.yaml -e REDEPLOY_HOSTS=pve01 reinstall.yaml -K

The Debian Trixie installer seems to ignore the DHCP-supplied search domains, so I had to modify my mirror configuration to use fully-qualified hostnames. This is controlled by variables in my inventory (on the domain group, as the URLs differ between my live and lab environments, which I differentiate by their domain name) in group_vars/domain_home_entek_org_uk.yaml:

local_mirror:
  debian:
    uri: http://mirror.home.entek.org.uk/debian
  debian-security:
    uri: http://mirror.home.entek.org.uk/debian-security
  hashicorp:
    uri: http://mirror.home.entek.org.uk/hashicorp
  proxmox-no-subscription:
    uri: http://mirror.home.entek.org.uk/proxmox-no-subscription
  github:
    uri: http://mirror.home.entek.org.uk/github
  openbsd:
    uri: http://mirror.home.entek.org.uk/openbsd

After the install, each system showed a blue screen headed Boot Option Restoration, with the message Press any key to stop system reset, and then rebooted, in a continuous loop until interrupted. From the menu that appeared (if you pressed “any key” in time), I selected always continue boot, in the hope it will pass through this screen automatically in the future.

I also enabled Secure Boot after the base OS install. I couldn’t get this working with Proxmox 8 but wanted to try again with Proxmox 9. So far, I have had no problems with Secure Boot enabled - even having enabled it before the Proxmox install was done on top of Debian.
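
The Secure Boot state can be confirmed from the running system, for example with mokutil (assuming the mokutil package is installed):

# Reports "SecureBoot enabled" or "SecureBoot disabled"
mokutil --sb-state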

There is a problem with the install process when the DHCP address during the install doesn’t match the DNS address (due to the DHCP config override that prevents the IP changing during the install): the eventual static IP remains in known_hosts with the old key. This should be fixable by removing the DNS IP entry when the hostname is removed after the reinstall begins, and adding the DNS IP address, as well as the hostname, back at the end of the install process. Each server also had to be rebooted again after the DHCP configuration was restored at the end of the install, to get the correct IP from the DHCP server.
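
Until that is fixed in the playbooks, clearing the stale known_hosts entry by hand is straightforward (the hostname here is just an example):

# Remove the old key for the reinstalled node (repeat for the bare IP if that is also cached)
ssh-keygen -R pve01.home.entek.org.uk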

Configuring the reinstalled node (including reinstalling Proxmox & re-adding to cluster)

My automated install creates a Debian install that is identical (i.e. static) on each host. All customisation is (currently) done post-install with my master site.yaml playbook. Before running this on the newly configured hosts, I had to make some changes to the Ansible code - some of these I made before attempting the first node, others I made as I went through the process of rebuilding the first node.

Ansible configuration changes

Logical volumes

I expanded the initial size of the root logical volume to 20G (from 8G), and removed the old ceph_osd logical volume from the proxmox_virtual_environment_hosts inventory group (since they are all identical, I could set this on the group in group_vars/proxmox_virtual_environment_hosts.yaml):

filesystems_lvm_volume_groups:
  - name: vg_
    logical_volumes:
      - name: root
        size: 20G

Static LUKs passphrase

These hosts now have multiple disks and my existing play for adding a static passphrase only caters for one encrypted device (sensibly, I realised at the time that this would not work for more than one and added an assertion that encrypted_block_devices | length == 1). Since Proxmox will be managing the Ceph device encryption, I only needed to handle one disk here. I worked around this by providing the encrypted device list as a variable; for the Proxmox hosts this just lists the main OS disk (which I can do as a group since they’re identical hardware) in group_vars/proxmox_virtual_environment_hosts.yaml:

filesystems_static_luks_unlock_devices:
  # As all the nodes are identical, we can just set this on the group.
  - /dev/nvme0n1p3

As my other hosts only have one disk, retaining the support for detecting (a single) device means no changes are needed to their configuration. I then rewrote the existing block for setting the static passphrase in site.yaml to work for both scenarios:

- name: Static LUKS unlock passphrase
  hosts: static_luks_passphrase
  tasks:
    - name: Find encrypted block device (must only be one) if device list not provided
      block:
        - name: Block device and filesystem types are known
          ansible.builtin.command: /usr/bin/lsblk -o PATH,FSTYPE -J
          register: block_path_type_json
          # Read only operation - never changes anything
          changed_when: false
          check_mode: false  # Always run, even in check mode
        - name: Encrypted block devices are known
          ansible.builtin.set_fact:
            encrypted_block_devices: >-
              {{
                  (block_path_type_json.stdout | from_json).blockdevices
                  |
                  selectattr('fstype', 'eq', 'crypto_LUKS')
                  |
                  map(attribute='path')
              }}
        - name: Only one encrypted device exists
          ansible.builtin.assert:
            that:
              - encrypted_block_devices | length == 1
        - name: Encrypted device list is stored
          ansible.builtin.set_fact:
            filesystems_static_luks_unlock_devices: "{{ [encrypted_block_devices | first] }}"
      when: filesystems_static_luks_unlock_devices is undefined
    # XXX Presumes all devices have the same passphrase (based on my automatic installer, which only sets one passphrase)
    - name: Static passphrase is set
      become: true
      community.crypto.luks_device:
        new_passphrase: "{{ lookup('community.hashi_vault.vault_read', 'kv/luks/static_passphrase').data.passphrase }}"
        passphrase: "{{ lookup('community.hashi_vault.vault_read', 'kv/hosts/' + inventory_hostname + '/luks/passphrase').data.passphrase }}"
        device: "{{ filesystems_static_luks_unlock_device }}"
      loop: '{{ filesystems_static_luks_unlock_devices }}'
      loop_control:
        loop_var: filesystems_static_luks_unlock_device

Repository format changes

Proxmox has updated its repository files from the traditional one-line format to the newer deb822 format as of version 9 (looking at their wiki, it seems their hand was forced by apt complaining about the old format starting in Debian 13 “Trixie”).

To support this, I modified my existing apt-source role to generate the new format instead of the one-line format:

- name: Configure repository
  become: yes
  ansible.builtin.copy:
    dest: /etc/apt/sources.list.d/{{ name }}.sources
    owner: root
    group: root
    mode: 00444
    # XXX deb and deb-src types can be combined (`Types: deb deb-src`) if everything else is identical.
    content: |
      Types: deb
      URIs: {{ uri }}
      Suites: {{ suite }}
      Components: {{ ' '.join(components) }}
      {% for (option, value) in apt_repository_options.items() %}
      {{ option }}: {{ value }}
      {% endfor %}

      {% if not src.no_src | default(false) -%}
      Types: deb-src
      URIs: {{ src.uri | default(uri) }}
      Suites: {{ suite }}
      Components: {{ ' '.join(src.components | default(components)) }}
      {% for (option, value) in apt_repository_options.items() %}
      {{ option }}: {{ value }}
      {% endfor %}
      {% endif -%}
  notify: update apt cache
- name: Obsolete (pre-deb822 a.k.a. 'one-line-style') source lists are removed
  become: yes
  ansible.builtin.file:
    path: /etc/apt/sources.list.d/{{ name }}.list
    state: absent
  notify: update apt cache
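
For reference, with the local mirror configuration shown earlier, the rendered file for a repository simply named debian ends up as /etc/apt/sources.list.d/debian.sources looking something like this (the components and the Signed-By option are illustrative - they come from my repository variables, not this template):

Types: deb
URIs: http://mirror.home.entek.org.uk/debian
Suites: trixie
Components: main contrib non-free-firmware
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg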

I also updated tasks/install.yaml to remove both versions of the enterprise repository (previously it only did the one-line version as that was the only one Proxmox created):

- name: Proxmox VE provided (unmanaged by Ansible) repositories are removed (one-line-style)
  become: true
  ansible.builtin.file:
    path: /etc/apt/sources.list.d/pve-enterprise.list
    state: absent
- name: Proxmox VE provided (unmanaged by Ansible) repositories are removed (deb822 style)
  become: true
  ansible.builtin.file:
    path: /etc/apt/sources.list.d/pve-enterprise.sources
    state: absent

GPG signing key location change

The other repository-related change was to update the location of the GPG signing key, which is now only published in the Proxmox Enterprise repository and not the “no subscription” repository. I had to add the enterprise repository to my mirror configuration and update the key filename template (the documented filename is also different, now a keyring rather than a single release key):

in group_vars/mirror_servers.yaml:

mirror_proxies:
# [...]
  - name: proxmox-enterprise
    upstream: https://enterprise.proxmox.com/
    description: Proxmox enterprise repositories
# [...]

in group_vars/domain_home_entek_org_uk.yaml:

local_mirror:
# [...]
  proxmox-enterprise:
    uri: http://mirror.home.entek.org.uk/proxmox-enterprise
# [...]

And then I updated the mirror server with:

ansible-playbook -i inventory.yaml -t mirror site.yaml -K
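
I won’t reproduce my whole key-handling task here, but a minimal sketch of fetching the keyring via the new mirror entry might look like this (the path below the mirror and the destination filename are assumptions taken from the Proxmox documentation - check there for the current values):

- name: Proxmox archive keyring is present
  become: true
  ansible.builtin.get_url:
    # Path below the mirror follows the upstream enterprise.proxmox.com layout (assumed)
    url: http://mirror.home.entek.org.uk/proxmox-enterprise/debian/proxmox-archive-keyring-trixie.gpg
    dest: /usr/share/keyrings/proxmox-archive-keyring.gpg
    owner: root
    group: root
    mode: '0444'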

Secure Boot support

With Secure Boot enabled, on the first boot with the new Proxmox kernel I got error: bad shim signature, but the Debian kernel still booted, so this was a case of needing the correct shim. Installing the proxmox-secure-boot-support package pulled this in, so I added it to the list of packages installed alongside the kernel. On rebooting the first node, it turned out that installing the Proxmox kernel had removed the DHCP client, so I also had to explicitly install one so the system could connect to the network - this seems to be a known issue:

- name: Proxmox kernel is installed
  become: true
  ansible.builtin.package:
    name:
      - proxmox-default-kernel
      - proxmox-secure-boot-support
      # https://forum.proxmox.com/threads/proxmox-9-0-4-doesnt-get-ip-by-dhcp.169721/
      - isc-dhcp-client
  register: kernel_updated

Cluster join using fully-qualified name

As the cluster is using Let’s Encrypt certificates, joining with the short name fails certificate validation (I am sure this was working before, so it may be that the old version appended the default domain and Proxmox 9 doesn’t?). I just appended the joining node’s detected domain to the hostname of the first contactable member of the cluster:

- name: Node is joined to cluster
  become: true
  ansible.builtin.expect:
    command: /usr/bin/pvecm add {{ pve_cluster_join_target_fqdn | ansible.builtin.quote }}
    responses:
      "Please enter superuser \\(root\\) password for '[^']+':": >-
        {{
          lookup(
            'community.hashi_vault.vault_read',
            'kv/hosts/' + pve_cluster_join_target + '/users/root'
          ).data.password
        }}
    timeout: 90  # Sometimes takes longer than 30s default
  vars:
    pve_cluster_join_target: '{{ pve_cluster_nodes_clustered | first }}'
    # Need to use FQDN with Let's Encrypt certificate
    pve_cluster_join_target_fqdn: '{{ pve_cluster_join_target }}.{{ ansible_facts.domain }}'
  when: inventory_hostname not in pve_cluster_nodes_clustered

Proxmox Ceph configuration changes

When I added the 2.5GbE network, I changed the configuration manually. That decision came back to haunt me with the reinstall, as I now had to update the automation scripts - something I wish, with hindsight, I had done at the time.

The original Ceph automation decided whether to configure a Ceph cluster based on whether there were multiple Proxmox nodes in the cluster (i.e. install and configure Ceph if this is an HA cluster, but not if it’s a stand-alone machine). As I was updating the Ceph code anyway, I changed this to be directly controlled by a new variable, pve_ceph_enabled. I also added new variables for the Ceph networks and the OSD devices to use (previously the LVM logical volume name was hardcoded in the task file). The lists of monitor and metadata servers were already configurable (used in the existing tasks/ceph.yaml and specified in my inventory data for the servers) but for some reason I had neglected to add them to the argument specification. The added section of the role’s meta/argument_specs.yaml is:

pve_ceph_enabled:
  description: Whether to configure Ceph with Proxmox or not
  type: bool
  default: false
pve_ceph_network:
  description: Network (CIDR format) to use for Ceph public network
  type: str
  required: false
pve_ceph_private_network:
  description: Network (CIDR format) to use for Ceph private network (requires pve_ceph_network)
  type: str
  required: false
pve_ceph_monitors:
  description: List of hosts to configure as Ceph monitors (required if configuring Ceph)
  type: list
  elements: str
  required: false
pve_ceph_metadata_servers:
  description: List of hosts to configure as Ceph metadata servers (required if configuring Ceph)
  type: list
  elements: str
  required: false
pve_ceph_osd_devices:
  description: List of devices to configure OSDs for. Values with a leading / are presumed to be block devices, those without are presumed to be logical volume names (e.g. vg/lv_ceph).
  type: list
  elements: str
  required: false

The only change to the defaults file, defaults/main.yaml, was to set the default value to not configure Ceph:

pve_ceph_enabled: false

For my inventory hosts, I set the values for these new variables in group_vars/proxmox_virtual_environment_hosts.yaml (alongside the existing pve_ceph_monitors and pve_ceph_metadata_servers variables):

pve_ceph_enabled: true
pve_ceph_network: '192.168.10.0/24'
pve_ceph_private_network: '172.16.0.0/28'
pve_ceph_osd_devices:
  # As all the nodes are identical, we can just set this on the group.
  - /dev/sda

New Ceph tasks

Whether or not Ceph is configured is controlled by a block in my proxmox-virtual-environment role’s tasks/ceph.yaml, which originally had when: pve_cluster_nodes is defined as the condition (always configuring Ceph if this is a clustered Proxmox install). With the new variable, this condition simply became when: pve_ceph_enabled.
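
In outline, the wrapper in tasks/ceph.yaml now looks something like this (a sketch - the block name is illustrative and the real block contains all of the tasks described below):

- name: Ceph is installed and configured
  block:
    # [... monitors, managers, metadata servers and OSDs ...]
  when: pve_ceph_enabled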

Since Ansible’s role argument validation only supports simple required/not-required for arguments, I added a check that all required variables are present with some assertions at the start of the Ceph tasks:

- name: Validate elements required for Ceph are set (servers)
  tags: always
  ansible.builtin.assert:
    that:
      - pve_ceph_monitors is defined
      - pve_ceph_monitors | length > 0
      - pve_ceph_metadata_servers is defined
      - pve_ceph_metadata_servers | length > 0
    fail_msg: If Ceph is to be configured, pve_ceph_monitors and pve_ceph_metadata_servers are required
- name: Validate elements required for Ceph are set (networks)
  tags: always
  ansible.builtin.assert:
    that:
      - pve_ceph_network is defined
    fail_msg: If Ceph private network is to be configured, both pve_ceph_network and pve_ceph_private_network are required
  when: pve_ceph_private_network is defined

Proxmox’s pveceph init command takes the networks as arguments, so configuring them is simple when the cluster is set up:

- name: Ceph cluster is initialised if no config file found
  run_once: true
  become: true
  ansible.builtin.command: >-
    /usr/bin/pveceph init
    {% if pve_ceph_network is defined %}
    --network {{ pve_ceph_network | ansible.builtin.quote }}
    {% if pve_ceph_private_network is defined %}
    --cluster-network {{ pve_ceph_private_network | ansible.builtin.quote }}
    {% endif %}
    {% endif %}
  when: not ceph_conf_stat.stat.exists

I also added a task to create the symlink from /etc/ceph/ceph.conf to /etc/pve/ceph.conf so that my Ceph query commands (which I use to ensure idempotence of the tasks) will not fail. This is necessary because the Proxmox commands (which would create the link if it is missing) do not provide the ability to query the details I need to decide whether or not to run the commands that change the cluster state.

# There is an edge-case when adding a new node (including after removing
# and rebuilding a node) to an existing cluster, where init (which
# initially creates the symlink on all nodes) will not get triggered so
# the symlink will be missing on the new one.
- name: Ceph config symlink exists
  become: true
  ansible.builtin.file:
    path: /etc/ceph/ceph.conf
    src: /etc/pve/ceph.conf
    state: link

In order to configure (potentially) multiple OSDs efficiently, the tasks for configuring an OSD needed to be pulled out into a separate task file to use with a loop (as blocks cannot be looped over). The OSD block in the main tasks/ceph.yaml became relatively simple after this:

- name: OSD is setup
  block:
    # These are used in ceph-configure-osd.yaml but only need to be looked up once (for efficiency)
    - name: Current LVM list is known
      become: true
      ansible.builtin.command: /usr/sbin/ceph-volume lvm list --format json
      register: ceph_volume_lvm_list_out
      check_mode: false  # Always run, even in check mode
      changed_when: false  # Read-only operation
    - name: This system's OSDs are known
      ansible.builtin.set_fact:
        pve_ceph_local_osds: >-
          {{
            ceph_volume_lvm_list_out.stdout
            | from_json
          }}
    - name: Configure each OSD
      include_tasks: ceph-configure-osd.yaml
      loop: '{{ pve_ceph_osd_devices | default([]) }}'
      loop_control:
        loop_var: pve_ceph_osd_device

The new tasks/ceph-configure-osd.yaml is a little more complicated, retaining support for using an existing LVM logical volume as well as raw disks (which is what I am migrating these hosts to). For full block devices, I pass --encrypted 1 to the pveceph osd create command (in the old configuration, the logical volume was on an already-encrypted LVM PV):

---
# XXX assume devices not starting with a '/' are LVM volumes and those that do are physical devices
# see: https://docs.ceph.com/en/latest/ceph-volume/lvm/prepare/#bluestore
- name: Device path is set (block device path)
  ansible.builtin.set_fact:
    pve_ceph_osd_device_path: '{{ pve_ceph_osd_device }}'
    pve_ceph_osd_device_is_lvm: false
  when: pve_ceph_osd_device.startswith('/')
- name: Device path is set (logical volume)
  ansible.builtin.set_fact:
    pve_ceph_osd_device_path: '/dev/{{ pve_ceph_osd_device }}'
    pve_ceph_osd_device_is_lvm: true
  when: not pve_ceph_osd_device.startswith('/')
- name: Device is not configured if no local OSDs
  ansible.builtin.set_fact:
    pve_ceph_osd_device_configured: false
  when: 'pve_ceph_local_osds == {}'
- name: Whether this device is configured is known (block device)
  ansible.builtin.set_fact:
    pve_ceph_osd_device_configured: "{{ pve_ceph_osd_device_configured or (pve_ceph_local_osd.value | selectattr('devices', 'ansible.builtin.contains', pve_ceph_osd_device_path) | length > 0) }}"
  vars:
    pve_ceph_osd_device_configured: false  # Fact takes precedence once defined
  loop: '{{ pve_ceph_local_osds | dict2items }}'
  loop_control:
    loop_var: pve_ceph_local_osd
  when: not pve_ceph_osd_device_is_lvm
- name: Whether this device is configured is known (logical volume)
  ansible.builtin.set_fact:
    pve_ceph_osd_device_configured: "{{ pve_ceph_osd_device_configured or (pve_ceph_local_osd.value | selectattr('path', 'eq', pve_ceph_osd_device_path) | length > 0) }}"
  vars:
    pve_ceph_osd_device_configured: false  # Fact takes precedence once defined
  loop: '{{ pve_ceph_local_osds | dict2items }}'
  loop_control:
    loop_var: pve_ceph_local_osd
  when: pve_ceph_osd_device_is_lvm
# - ansible.builtin.debug: var=pve_ceph_osd_device
# - ansible.builtin.debug: var=pve_ceph_osd_device_path
# - ansible.builtin.debug: var=pve_ceph_local_osds
# - ansible.builtin.debug: var=pve_ceph_osd_device_configured
# For plain block devices, can use pveceph
- name: OSD is configured (block device)
  become: true
  ansible.builtin.command: /usr/bin/pveceph osd create '{{ pve_ceph_osd_device | ansible.builtin.quote }}' --encrypted 1 
  when: not (pve_ceph_osd_device_is_lvm or pve_ceph_osd_device_configured)
# pveceph doesn't support adding lvm volumes directly, so have to do this
# with ceph itself. I assume that the LV has been created on an already
# encrypted PV.
- name: OSD is configured (logical volume)
  block:
    - name: Bootstrap keyring is correct
      become: true
      ansible.builtin.shell:
        cmd: umask 0077 ; ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
        creates: /var/lib/ceph/bootstrap-osd/ceph.keyring
    - name: OSD volume is added
      become: true
      ansible.builtin.command: /usr/sbin/ceph-volume lvm create --data '{{ pve_ceph_osd_device | ansible.builtin.quote }}'
  when: pve_ceph_osd_device_is_lvm and not pve_ceph_osd_device_configured
...

Applying the changes

Once the above changes were made, I just ran the main Ansible site playbook against the reinstalled node. Remarkably, it worked smoothly on all but one host without any further tweaks:

ansible-playbook -i inventory.yaml -l pve04 site.yaml

After each node was reinstalled, the backup server needed its new host key in order to be able to connect to it again:

ansible-playbook -i inventory.yaml -t backuppc -l backuppc_servers site.yaml

Other niggles

The first run failed due to a chicken-and-egg problem with Vault: if the SSL certificates aren’t set up you cannot start Vault - however, when first setting up Vault, I ordered it this way because if Vault isn’t available then you cannot get the certificates from it. I re-ordered site.yaml to set up Vault after setting up the host certificates (it was the other way around from when Vault was initially installed, but I had included a comment about exactly this problem). I think a separate “bootstrap a new Vault cluster” playbook is required for the case where no certificate exists.
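
The relevant ordering in site.yaml is now, in outline, something like this (the play, group and role names here are illustrative, not my actual ones):

# Certificates are deployed first so Vault has something to serve TLS with
- name: Host SSL certificates are deployed
  hosts: all
  roles:
    - role: host-certificates  # illustrative role name
# Vault is configured only after its certificate exists
- name: Vault is installed and configured
  hosts: vault_servers         # illustrative group name
  roles:
    - role: vault              # illustrative role name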

I also found a race condition in my new automatic SSL certificate reloading while doing this: the service to be reloaded might not yet exist during the initial run through (if the service is installed later in the play). I added ignore_errors: true to the reload/restart service tasks for certificate changes as a workaround. I am not sure I like ignoring errors, but if the service ends up not running when it should be, either monitoring or another state should pick it up.
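
The workaround in the reload task looks something like this (the service-name variable is illustrative):

- name: Dependent service is reloaded when the certificate changes
  become: true
  ansible.builtin.service:
    name: "{{ ssl_certificate_reload_service }}"  # illustrative variable name
    state: reloaded
  # The service may not be installed yet on a freshly reinstalled host, so do
  # not fail the whole run if the reload cannot happen.
  ignore_errors: true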

On one host, the 2.5GbE USB device stopped working until it was re-plugged, but I wonder if I knocked it while plugging in the IP KVM, which I was doing more by feel than by sight.

I also had a frustrating problem where ceph-mon wouldn’t rejoin the cluster on the second node I upgraded. After checking everything was correct (and getting nowhere troubleshooting it for 5 days), restarting the other mons (all of them, one by one) resolved it and it joined fine. I am not sure what the underlying problem was, and wish I had thought earlier to try restarting the mons I had not touched - I got the idea from a Proxmox forum post I found on the fifth day of trying to figure out what had gone wrong.
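
Restarting a monitor is just a systemd unit restart on its node (the unit name uses the monitor ID, which on my nodes is the short hostname):

# Run on each of the other nodes, one at a time
sudo systemctl restart ceph-mon@$(hostname -s).service
# Wait for the mon to show back in quorum (ideally HEALTH_OK) before the next one
sudo ceph -s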

On pve05, I re-added the USB devices I had removed (in order to migrate away the VMs) to the VMs that were previously on it.
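
Re-adding a passed-through USB device by vendor:device ID can be done with qm as well as from the UI - for example, for the first starfleet-archives device (the VM ID is a placeholder):

# usb3=1 flags the mapping as a USB3 device
qm set <vmid> --usb0 host=152d:0569,usb3=1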

After the last node was upgraded, I got a warning (all OSDs are running squid or later but require_osd_release < squid) in Ceph and needed to raise the minimum OSD version to clear it, with ceph osd require-osd-release squid. This is documented in Proxmox’s Ceph Reef to Squid guide.
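
For reference, the check and the fix (run on any node once every OSD has been upgraded):

# Shows the current minimum alongside the rest of the OSD map settings
ceph osd dump | grep require_osd_release
# Raise the minimum to squid to clear the warning
ceph osd require-osd-release squid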