This follows on from my post on building the cluster hosts into standalone Proxmox Virtual Environment servers, and covers SSL certificates and clustering them. I last clustered Proxmox Virtual Environment in my air-gapped lab back in 2022.

SSL certificate

Obtaining the certificate

My Let’s Encrypt configuration is still in SaltStack. I looked at adding pve.home.entek.org.uk and each host to my domains via it, but the Ansible-managed configuration deviates from the current Salt configuration sufficiently that it would take a while to unpick, so instead I manually added it to the configuration.

I need to migrate my Let’s Encrypt configuration to Ansible. At present most of the certificates are imported into my HashiCorp Vault and I am unsure what my long-term plan is for this. I might consolidate the fetching of all certificates to one hardened server that puts them into the vault (as opposed to the current means of most servers managing their own certificates), which reduces the number of systems with internet access, or I might move back towards servers managing their own certificates, which reduces the attack surface (as the certificates, and more importantly private keys, only exist on the server and the latter need never be transmitted over the network).

I requested the new certificate by adding a single line to /etc/dehydrated/domains.txt:

pve.home.entek.org.uk pve01.home.entek.org.uk pve02.home.entek.org.uk pve03.home.entek.org.uk pve04.home.entek.org.uk pve05.home.entek.org.uk

I then ran the script to generate the initial certificates:

mkdir /etc/ssl/pve.home.entek.org.uk
chown dehydrated /etc/ssl/pve.home.entek.org.uk
/usr/local/sbin/dehydrated-update-certs

Finally, I imported them into the vault manually:

vault write /kv/ssl/certs/hosts/pve.home.entek.org.uk/ca bundle=@/etc/ssl/pve.home.entek.org.uk/chain.pem
vault write /kv/ssl/certs/hosts/pve.home.entek.org.uk/certificate certificate=@/etc/ssl/pve.home.entek.org.uk/cert.pem
vault write /kv/ssl/certs/hosts/pve.home.entek.org.uk/key key=@/etc/ssl/pve.home.entek.org.uk/privkey.pem
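
To confirm the secrets landed where the later lookups expect them, they can be read back out of the vault (a quick sanity check; -field limits the output to the named key):

vault read -field=certificate kv/ssl/certs/hosts/pve.home.entek.org.uk/certificate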

Deploying the SSL certificate to Proxmox Virtual Environment

Proxmox generates its own internal certificate authority and itself manages the certificates it issues. It uses a proxy, pveproxy, to present itself to the network, and it is pveproxy that needs to be configured to use this certificate. More details can be found in the PVE Certificate Management documentation. The certificate and key files are /etc/pve/local/pveproxy-ssl.pem and /etc/pve/local/pveproxy-ssl.key respectively.
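
Once a certificate is in place, what pveproxy is actually serving on the web interface port (8006) can be checked from another machine; something along these lines (using openssl’s client) works:

echo | openssl s_client -connect pve01.home.entek.org.uk:8006 -servername pve.home.entek.org.uk 2>/dev/null | openssl x509 -noout -subject -issuer -enddate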

The /etc/pve filesystem is a FUSE-mounted view of a SQLite database, and is not a POSIX-compliant filesystem. As a result, ansible.builtin.copy will not work (as it always tries to chmod the target file). One solution is to copy the certificate (and bundle) and key to temporary files and then copy them to the target only if needed, using diff to check whether the copy is required so the tasks remain idempotent:

- name: pveproxy SSL certificates are correct, if provided
  block:
    - name: Certificate temporary file exists
      ansible.builtin.tempfile:
      register: cert_temp_file
    - name: Key temporary file exists
      ansible.builtin.tempfile:
      register: key_temp_file
    - name: Certificate to be configured is in tempfile
      ansible.builtin.copy:
        dest: '{{ cert_temp_file.path }}'
        mode: 00440
        content: "{{ pve_pveproxy_certificate.certificate }}\n{{ pve_pveproxy_certificate.ca_bundle }}"
    - name: Key to be configured is in tempfile
      ansible.builtin.copy:
        dest: '{{ key_temp_file.path }}'
        mode: 00400
        content: '{{ pve_pveproxy_certificate.key }}'
      # Do not display keys
      no_log: true
    # /etc/pve is a FUSE view of a SQLite database, permissions
    # cannot be set and are handled by the fuse driver.
    # see: https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)
    # So ansible's copy module cannot be used:
    # see: https://github.com/ansible/ansible/issues/19731
    # and: https://github.com/ansible/ansible/issues/40220
    - name: Certificate chain is correct
      become: yes
      ansible.builtin.shell: >-
        diff {{ cert_temp_file.path }} /etc/pve/local/pveproxy-ssl.pem
        && echo "Cert correct"
        || cp {{ cert_temp_file.path }} /etc/pve/local/pveproxy-ssl.pem
      register: output
      changed_when: 'output.stdout != "Cert correct"'
      notify: Restart pveproxy
    - name: Certificate key is correct
      become: yes
      ansible.builtin.shell: >-
        diff {{ key_temp_file.path }} /etc/pve/local/pveproxy-ssl.key
        && echo "Key correct"
        || cp {{ key_temp_file.path }} /etc/pve/local/pveproxy-ssl.key
      register: output
      changed_when: 'output.stdout != "Key correct"'
      notify: Restart pveproxy
      # Do not display keys
      no_log: true
    - name: Temporary certificate file is removed
      ansible.builtin.file:
        path: '{{ cert_temp_file.path }}'
        state: absent
    - name: Temporary key file is removed
      ansible.builtin.file:
        path: '{{ key_temp_file.path }}'
        state: absent
  when: pve_pveproxy_certificate is defined

This is the first variable that the proxmox_virtual_environment role accepts, so I created its meta/argument_specs.yaml:

---
argument_specs:
  main:
    short_description: Installs and configures Proxmox Virtual Environment
    author: Laurence Alexander Hurst
    options:
      pve_pveproxy_certificate:
        description: Certificate for pveproxy (user-facing interface to Proxmox Virtual Environment)
        type: dict
        required: false
        options:
          certificate:
            description: PEM encoded certificate to use
            required: true
            type: str
          ca_bundle:
            description: PEM encoded CA bundle for the certificate
            required: true
            type: str
          key:
            description: PEM encoded private key for the certificate
            required: true
            type: str
...

It also uses a handler to restart pveproxy if the certificate is updated, so I also added handlers/main.yaml:

---
- name: Restart pveproxy
  become: yes
  ansible.builtin.service:
    name: pveproxy
    state: restarted
...

Finally, I added a lookup of the certificates stored in the vault to my proxmox_virtual_environment_hosts group variables:

pve_pveproxy_certificate:
  certificate: >-
    {{
      lookup(
        'community.hashi_vault.vault_read',
        'kv/ssl/certs/hosts/pve.home.entek.org.uk/certificate'
      ).data.certificate
    }}
  ca_bundle: >-
    {{
      lookup(
        'community.hashi_vault.vault_read',
        'kv/ssl/certs/hosts/pve.home.entek.org.uk/ca'
      ).data.bundle
    }}
  key: >-
    {{
      lookup(
        'community.hashi_vault.vault_read',
        'kv/ssl/certs/hosts/pve.home.entek.org.uk/key'
      ).data.key
    }}

NTP

As noted in my last post, I needed to set up NTP (XXX This is a lie - NTP not yet setup. Should sort that.) and, as I found in my previous attempt at clustering, Proxmox VE and Ceph (Ceph in particular) are extremely sensitive to clock-skew (<0.05s).

Proxmox uses Chrony as its NTP client. When I configured a Linux NTP server in my lab network I was using SaltStack, so I had not yet made an Ansible role to configure NTP clients.
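
Given how tight that tolerance is, it is useful to be able to check the actual offset on a node by hand; with chrony that is, for example:

# Current offset, stratum and reference source
chronyc tracking
# Configured sources and their reachability
chronyc sources -v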

I created an ntp role that can configure either chrony or systemd-timesyncd, just as my SaltStack configuration could. Using the same pattern I used for configuring different DHCP server software, I used ansible.builtin.include_tasks in the role’s tasks/main.yaml to include the appropriate tasks for the chosen package:

---
- name: Include appropriate client configuration
  ansible.builtin.include_tasks: 'client-{{ ntp_client_software }}.yaml'
...

The tasks for chrony and systemd-timesyncd are broadly similar:

---
# Don't want more than one NTP client installed...
- name: timesyncd is not installed
  become: true
  ansible.builtin.package:
    name: systemd-timesyncd
    state: absent
- name: chrony is installed
  become: true
  ansible.builtin.package:
    name: chrony
    state: present
- name: NTP server is configured
  become: true
  ansible.builtin.copy:
    dest: /etc/chrony/conf.d/local-ntp-server.conf
    content: server {{ ntp_server }} iburst
  notify: Restart chrony
...

and

---
# Don't want more than one NTP client installed...
- name: chrony is not installed
  become: true
  ansible.builtin.package:
    name: chrony
    state: absent
- name: timesyncd is installed
  become: true
  ansible.builtin.package:
    name: systemd-timesyncd
    state: present
- name: NTP server is configured
  become: true
  community.general.ini_file:
    path: /etc/systemd/timesyncd.conf
    section: Time
    option: NTP
    value: '{{ ntp_server }}'
  notify: Restart systemd-timesyncd
...

The handlers/main.yaml contains the restart handlers for both packages, which are very straightforward:

---
- name: Restart systemd-timesyncd
  become: true
  ansible.builtin.service:
    name: systemd-timesyncd
    state: restarted
- name: Restart chrony
  become: true
  ansible.builtin.service:
    name: chrony
    state: restarted
...

Finally, the meta/argument_specs.yaml and defaults/main.yaml which define the two settings currently supported (package to configure, defaulting to systemd-timesyncd, and the NTP server, defaulting to ntp):

---
argument_specs:
  main:
    short_description: Manage NTP client configuration
    author: Laurence Alexander Hurst
    options:
      ntp_client_software:
        description: NTP client software package to use
        type: str
        default: systemd-timesyncd
        choices:
          - systemd-timesyncd
          - chrony
      ntp_server:
        description: IP or DNS name of NTP server to sync with
        type: str
        default: ntp
...

with defaults:

---
ntp_client_software: systemd-timesyncd
ntp_server: ntp
...

For the proxmox_virtual_environment_hosts group, I set the client software to chrony in its group_vars file and, for now, the network-specific NTP server address - as noted in the comment, I hope to replace this with per-network resolution of ntp using bind’s views:

ntp_client_software: chrony
# XXX This is temporary - once DNS views are setup, each network can have correctly resolving `ntp`
ntp_server: ntp-mgmt

I added this role to all of my hosts, via an existing hosts: all:!dummy play, since having time synced is important on most systems.
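
In the site playbook that looks roughly like this (a sketch - the real play already applies a number of other roles):

- hosts: 'all:!dummy'
  roles:
    # ...existing roles...
    - role: ntp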

Clustering the nodes

Proxmox

To make this as automated as possible, there are two scenarios that I can fully automate:

  1. The cluster is already set up and one of the other cluster nodes can be contacted, in which case the node being configured can join the cluster.

  2. The cluster is not set up, in which case a new cluster can be created (on any node). To confirm this, all nodes that should be in the cluster need to be contactable, to check that none of them are set up. The new cluster needs to be set up on only one node (then the rest can be joined to it).

Both cases require knowing the other nodes in the cluster (to find one to join and/or check they are not configured), and creating a new cluster requires a name for the cluster. So I added these to the proxmox-virtual-environment role’s argument_specs.yaml:

pve_cluster_nodes:
  description: Complete list of nodes (expected to be Ansible inventory hosts and resolvable within the proxmox nodes) that are part of the cluster
  type: list
  elements: str
  required: false
pve_cluster_name:
  description: Name to be used if creating a new cluster
  type: str
  default: pvecluster

I also added the default value to the role’s defaults/main.yaml:

---
pve_cluster_name: pvecluster
...

Initially, I put the tasks that find the status of the cluster nodes in the role’s tasks/main.yaml, but when I added the task to create a new cluster I decided to move them into their own tasks file and reuse it to rescan for the newly set up node afterwards, so the join process works the same for the other nodes whether the cluster is new or already existed.

On an unclustered node, the pvecm status command prints that /etc/pve/corosync.conf is missing and that (the missing file) suggests the node is not part of a cluster, so this seemed like a sensible test for whether a node is clustered. By delegating to each node, all of the nodes will get the facts set. The code sets two facts: a list of nodes that were reached (pve_cluster_nodes_contactable) and a list of those that are in a cluster, i.e. where /etc/pve/corosync.conf exists (pve_cluster_nodes_clustered):

---
- name: Cluster file is stated on all nodes in cluster
  delegate_to: '{{ item }}'
  ansible.builtin.stat:
    path: /etc/pve/corosync.conf
  # Don't fail if a node is down, might not be fatal - only need one node
  # available to join an existing cluster, for example.
  failed_when: false
  register: pve_corosync_conf_stat
  loop: '{{ pve_cluster_nodes }}'
- name: Clustered and contactable node lists are initialised to empty lists
  ansible.builtin.set_fact:
    pve_cluster_nodes_clustered: []
    pve_cluster_nodes_contactable: []
- name: Nodes that are clustered are known
  ansible.builtin.set_fact:
    pve_cluster_nodes_clustered: >-
      {{
        pve_cluster_nodes_clustered
        +
        [item.item]
      }}
  loop: '{{ pve_corosync_conf_stat.results }}'
  when: not item.failed and item.stat.exists
- name: Nodes that are contactable are known
  ansible.builtin.set_fact:
    pve_cluster_nodes_contactable: >-
      {{
        pve_cluster_nodes_contactable
        +
        [item.item]
      }}
  loop: '{{ pve_corosync_conf_stat.results }}'
  when: not item.failed
- ansible.builtin.debug:
    msg: 'Nodes already in cluster: {{ pve_cluster_nodes_clustered | join(", ") }}'
- ansible.builtin.debug:
    msg: 'Nodes that can be reached and should be in cluster: {{ pve_cluster_nodes_contactable | join(", ") }}'
...

Finally, I added the tasks that actually do the clustering, if the pve_cluster_nodes variable is set, to the role’s tasks/main.yaml. On a new node I found that pveproxy had not yet been restarted after the certificate was updated, which resulted in certificate errors, so I added a flush of handlers to ensure the handler to restart pveproxy had run before the clustering, if required. I also had to set a longer timeout on the cluster join command as I found it was occasionally timing out.

- name: Handlers have flushed, so pveproxy certificates are correct
  ansible.builtin.meta: flush_handlers
- name: Cluster is joined
  block:
    # There are two situations we can automatically deal with:
    # 1. There is an existing clustered node to join with.
    # or
    # 2. All of the clustered nodes are contactable and none are in
    #    a cluster (in which case a new cluster needs to be setup).
    - name: Clustered node status is known
      ansible.builtin.include_tasks: clustered-nodes-status.yaml
    - name: Cluster is setup
      run_once: true
      become: true
      ansible.builtin.command:
        cmd: /usr/bin/pvecm create {{ pve_cluster_name }}
      when: pve_cluster_nodes_clustered | length == 0 and pve_cluster_nodes_contactable | length == pve_cluster_nodes | length
    - name: Clustered node status is refreshed if newly setup
      ansible.builtin.include_tasks: clustered-nodes-status.yaml
      when: pve_cluster_nodes_clustered | length == 0 and pve_cluster_nodes_contactable | length == pve_cluster_nodes | length
    - name: Have a node to join with
      ansible.builtin.assert:
        that: pve_cluster_nodes_clustered | length > 0
        fail_msg: No existing cluster nodes found to join with.
    - name: Node is joined to cluster
      become: true
      ansible.builtin.expect:
        # Need to use FQDN with Let's Encrypt certificate
        command: /usr/bin/pvecm add {{ pve_cluster_join_target }}.{{ ansible_facts.domain }}
        # Sometimes takes longer than the module's 30s default
        timeout: 90
        responses:
          "Please enter superuser \\(root\\) password for '[^']+':": >-
            {{
              lookup(
                'community.hashi_vault.vault_read',
                'kv/hosts/' + pve_cluster_join_target + '/users/root'
              ).data.password
            }}
      vars:
        pve_cluster_join_target: '{{ pve_cluster_nodes_clustered | first }}'
      when: inventory_hostname not in pve_cluster_nodes_clustered
  when: pve_cluster_nodes is defined

In the proxmox_virtual_environment_hosts group’s variables, I set the pve_cluster_nodes variable to a lookup of all inventory hosts in that group. These details (the use of a group, as well as the group name) are site-specific, so are best kept outside the role:

pve_cluster_nodes: >-
  {{
    query(
      'ansible.builtin.inventory_hostnames',
      'proxmox_virtual_environment_hosts'
    )
  }}
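
Once this has run across all of the hosts, cluster membership and quorum can be checked from any node (as root) with pvecm itself:

# Quorum, votes and cluster information
pvecm status
# Concise list of member nodes
pvecm nodes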

Ceph

Configuring LVM with Ansible

In order to set up Ceph, I need to set up the disks for it to use. I have been putting this off, and even hacked my automated Debian preseed as I ran out of space installing ProxmoxVE in the first place. The preseed is intended to produce installs for small disks, for VM use, and my intention was always that Ansible would resize as required - keeping the preseed as a “one size fits all” bootstrap with all customisation done through Ansible. Doing this work also enabled me to set custom (LVM) filesystem sizes generally.

To configure the volumes, I added a new role called filesystems (with the intention it might be used for more than just lvm, if needed, in the future) and configured arguments that allow it to manage volume groups and logical volumes within them:

---
argument_specs:
  main:
    description: Configure filesystems on the target
    options:
      filesystems_lvm_volume_groups:
        description: Logical volume groups to configure
        type: list
        elements: dict
        default: []
        options:
          name:
            description: Name of the volume group
            type: str
            required: true
          logical_volumes:
            description: List of logical volumes to manage
            type: list
            elements: dict
            options:
              name:
                description: Name of the logical volume
                type: str
                required: true
              size:
                description: >-
                  Size of the logical volume (see
                  <https://docs.ansible.com/ansible/latest/collections/community/general/lvol_module.html#parameter-size>
                  for valid values)
                type: str
                required: false
...

The corresponding defaults/main.yaml file:

---
filesystems_lvm_volume_groups: []
...

The main.yaml file is just a loop over the volume groups; the default empty list means that nothing will happen on hosts with no filesystems_lvm_volume_groups set (making it safe to apply to all hosts):

---
- name: Logical volumes exist and are correct sizes
  ansible.builtin.include_tasks: lvm.yaml
  vars:
    filesystems_lvm_lvs: '{{ lvm_item.logical_volumes }}'
    filesystems_lvm_vg: '{{ lvm_item.name }}'
  loop: '{{ filesystems_lvm_volume_groups }}'
  loop_control:
    # Avoid clashing with inner loop
    loop_var: lvm_item
...

The lvm.yaml tasks file just ensures each logical volume in the group is correct:

---
- name: All logical volumes are correct
  become: true
  community.general.lvol:
    lv: '{{ item.name }}'
    resizefs: true
    size: '{{ item.size }}'
    vg: '{{ filesystems_lvm_vg }}'
  loop: '{{ filesystems_lvm_lvs }}'
...

I hardcoded resizefs to be true - this means that if the volume grows and its filesystem is one of those supported (ext2, ext3, ext4, ReiserFS and XFS at the time of writing) the filesystem will be resized by the module. However, if a volume with an unsupported filesystem is resized then the module will fail.

I also did not override the default force setting of false, which means attempts to shrink a volume will also cause the module to fail.

I applied the new role to all Linux systems by adding it to the existing play that targets all:!dummy (all real hosts), using an ansible.builtin.import_role task with a condition (when) so it only applies to Linux systems:

- name: Filesystems are configured on Linux systems
  ansible.builtin.import_role:
    name: filesystems
  when: ansible_facts.system == 'Linux'

Finally, I added the larger root volume and creation of the ceph_osd volume to proxmox_virtual_environment_hosts’s group variables:

filesystems_lvm_volume_groups:
  - name: vg_{{ inventory_hostname }}
    logical_volumes:
      - name: root
        size: 5G
      - name: ceph_osd
        size: 200G

Installing Ceph

I started with a block, and presumed (per the comment) that all cluster nodes will also be part of the Ceph cluster:

# XXX Presumes PVE cluster nodes will always also be ceph cluster
- name: Ceph is configured (for clusters)
  block:
  #...
  when: pve_cluster_nodes is defined

As I have previously configured the Ceph repository, all I needed to do was install the packages and reload the ProxmoxVE services (based on a post on the Proxmox forums) as the first tasks in this block:

- name: Ceph packages are installed
  become: true
  ansible.builtin.package:
    name: ceph
    state: present
  notify:
    - Reload pveproxy
    - Reload pvedaemon
- name: Ensure handlers are flushed (to reload daemons if Ceph just installed)
  ansible.builtin.meta: flush_handlers

This uses two new handlers, which I added to the role’s handlers/main.yaml file:

- name: Reload pveproxy
  become: yes
  ansible.builtin.service:
    name: pveproxy
    state: reloaded
- name: Reload pvedaemon
  become: yes
  ansible.builtin.service:
    name: pvedaemon
    state: reloaded

Initialising the Ceph cluster

I determined from running pveceph status that ProxmoxVE itself uses the presence of /etc/pve/ceph.conf to determine if Ceph has been initialised or not:

$ pveceph status
pveceph configuration not initialized - missing '/etc/pve/ceph.conf', missing '/etc/pve/ceph'

So I used the same test in my Ansible playbook:

- name: Ceph config file is stated
  ansible.builtin.stat:
    path: /etc/pve/ceph.conf
  register: ceph_conf_stat
- name: Ceph cluster is initialised if no config file found
  run_once: true
  become: true
  ansible.builtin.command: /usr/bin/pveceph init
  when: not ceph_conf_stat.stat.exists

Firewall and monitors

The next step in installing Ceph is setting up the monitors. Configuring the firewall also requires knowing whether a node is a monitor or not as the firewall requirements are slightly different.

Advice varies on how many monitors are required - Proxmox’ documentation says to deploy exactly 3:

For high availability, you need at least 3 monitors. One monitor will already be installed if you used the installation wizard. You won’t need more than 3 monitors, as long as your cluster is small to medium-sized. Only really large clusters will require more than this.

The Ceph documentation has a slightly different view:

For small or non-critical deployments of multi-node Ceph clusters, it is recommended to deploy three monitors. For larger clusters or for clusters that are intended to survive a double failure, it is recommended to deploy five monitors. Only in rare circumstances is there any justification for deploying seven or more monitors.

And explicitly says five monitors is recommended for 5 or more nodes:

A typical Ceph cluster has three or five monitor daemons that are spread across different hosts. We recommend deploying five monitors if there are five or more nodes in your cluster.

However, I found one Proxmox forum post saying that 5 monitors is overkill for a 15 node deployment:

3 monitors are fine for small to medium (and 15 OSD nodes is definitely still that category for Ceph ;)) clusters

In short, advice is somewhat contradictory. I initially followed Proxmox’ advice and set up 3 monitors but later expanded this to 5 as I am aiming for a cluster that can survive a double node failure. However, initially setting up 3 monitors means I have the right template for when I roll this out to my home lab, which has 10 nodes and will also have 5 monitors (i.e. not all nodes will be monitors).

Rather than try and be too clever and calculate which nodes will be monitors based on the number of monitors and available nodes, I decided to explicitly list the monitors by adding a variable to the proxmox_virtual_environment_hosts group_vars.

Originally, I set this to the odd numbered nodes:

# Use odd number nodes as monitors in my 5 node cluster
pve_ceph_monitors:
  - pve01
  - pve03
  - pve05

But then replaced it with all 5 nodes (done this way as the same node list will also work unmodified for the 10 node cluster):

# Use first 5 nodes as monitors
pve_ceph_monitors:
  - pve01
  - pve02
  - pve03
  - pve04
  - pve05

To make life easy, I added a task to set a boolean based on whether this node is a monitor or not:

- name: If this host should be a monitor is known
  ansible.builtin.set_fact:
    pve_is_monitor: '{{ inventory_hostname in pve_ceph_monitors }}'

Firewall

Firewalld comes with service definitions for Ceph (ceph) and Ceph monitors (ceph-mon), so I just need to enable or disable them as appropriate. The firewall needs to be configured before the monitors can be set up, as they need to be able to communicate with each other to do so.

# Firewall needs to be configured before ceph monitors are
# setup (or they won't be able to communicate).
# For firewall details, see:
# https://docs.ceph.com/en/reef/rados/configuration/network-config-ref/
- name: firewalld is configured
  block:
    # ceph and ceph-mon services come with firewalld
    # XXX Want to make this better (properly zoned, not just opened to everything via default zone)
    - name: Ceph metadata/manager/osd service is allowed
      become: yes
      ansible.posix.firewalld:
        service: ceph
        permanent: true
        state: enabled
      notify: reload firewalld
    - name: Ceph Monitor service is allowed (on monitors)
      become: yes
      ansible.posix.firewalld:
        # ceph-mon comes out of the box with ceph and/or proxmoxve
        service: ceph-mon
        permanent: true
        state: enabled
      notify: reload firewalld
      when: pve_is_monitor
    - name: Ceph Monitor service is not allowed (on non monitors)
      become: yes
      ansible.posix.firewalld:
        # ceph-mon comes out of the box with ceph and/or proxmoxve
        service: ceph-mon
        permanent: true
        state: disabled
      notify: reload firewalld
      when: not pve_is_monitor
    - name: Handlers have flushed, so ceph becomes accessible if firewall has changed
      ansible.builtin.meta: flush_handlers

Monitors

To make this idempotent, I used the ceph mon metadata command to get the current monitor configuration (in JSON format), extract the list of configured monitors and configure the local host only if it is not in that list.
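
For reference, roughly the same check the tasks below perform can be done by hand (a sketch, run as root and assuming jq is available):

# List the names of the currently configured monitors
ceph mon metadata | jq -r '.[].name'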

Ceph’s documentation recommends running the manager on all monitors, although only one manager is needed and it is not critical to the filesystem, so I did that too (using the same method to make it idempotent):

- name: Monitors are configured
  block:
    - name: Current monitor metadata is known
      become: true
      ansible.builtin.command: /usr/bin/ceph mon metadata
      register: ceph_mon_metadata_out
      changed_when: false  # Read-only operation
    - name: List of configured monitors is known
      ansible.builtin.set_fact:
        pve_configured_monitors: '{{ ceph_mon_metadata_out.stdout | from_json | map(attribute="name") }}'
    - name: Monitors are set up
      become: true
      ansible.builtin.command: /usr/bin/pveceph mon create
      when: inventory_hostname not in pve_configured_monitors
    # > In general, you should set up a ceph-mgr on each of the hosts
    # > running a ceph-mon daemon to achieve the same level of
    # > availability.
    # - <https://docs.ceph.com/en/reef/mgr/administrator/#high-availability>
    - name: Current manager metadata is known
      become: true
      ansible.builtin.command: /usr/bin/ceph mgr metadata
      register: ceph_mgr_metadata_out
      changed_when: false  # Read-only operation
    - name: List of configured managers is known
      ansible.builtin.set_fact:
        pve_configured_managers: >-
          {{
            ceph_mgr_metadata_out.stdout
            | from_json
            | map(attribute="name")
          }}
    - name: Managers are set up
      become: true
      ansible.builtin.command: /usr/bin/pveceph mgr create
      when: inventory_hostname not in pve_configured_managers
  when: pve_is_monitor
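
Once this has run on all of the monitor nodes, the resulting state can be confirmed with Ceph’s own (read-only) status commands:

# Overall cluster health, including monitor quorum and active/standby managers
ceph -s
# Just the monitor map
ceph mon stat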

OSDs

Next, I need to set up the LVM volume created earlier to be used by Ceph. This is complicated by pveceph not supporting adding logical volumes (attempting to add one results in an error, unable to get device info for '/dev/dm-...'). It can be done with Ceph’s native tools; however, some extra steps are required to allow the Ceph tools to find the configuration file (the symlink from /etc/ceph/ceph.conf to /etc/pve/ceph.conf seems to get created when pveceph adds an OSD) and to authenticate (the latter is only required to actually add the disk).

To make this idempotent, I only attempt to add the volume if there are no OSDs for the current system - this presumes there is only one OSD per host. I also hardcoded the volume group template and logical volume name in the task that adds it; I will need to revisit this if I change the layout (e.g. add a 2nd disk just for Ceph) in the future.

- name: OSD is setup
  block:
    - name: Ceph config symlink exists
      become: true
      ansible.builtin.file:
        path: /etc/ceph/ceph.conf
        src: /etc/pve/ceph.conf
        state: link
    - name: Current OSD metadata is known
      become: true
      ansible.builtin.command: /usr/bin/ceph osd metadata
      register: ceph_osd_metadata_out
      changed_when: false  # Read-only operation
    - name: This system's OSDs are known
      ansible.builtin.set_fact:
        pve_ceph_local_osds: >-
          {{
            ceph_osd_metadata_out.stdout
            | from_json
            | selectattr('hostname', 'eq', inventory_hostname)
          }}
    - name: OSD is configured
      block:
        - name: Bootstrap keyring is correct
          become: true
          ansible.builtin.shell:
            cmd: umask 0077 ; ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
            creates: /var/lib/ceph/bootstrap-osd/ceph.keyring
        - name: OSD volume is added
          become: true
          # XXX Assumes vg and lv names follow this convention...
          ansible.builtin.command: /usr/sbin/ceph-volume lvm create --data vg_{{ inventory_hostname }}/ceph_osd
      when: pve_ceph_local_osds | length == 0
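
After this has been applied to every node, there should be one OSD per host, which is easy to confirm:

# Tree of hosts and their OSDs, with status and weight
ceph osd tree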

Pool

Finally, I created a Ceph pool for VM disks based on a template I found online. Finding the documentation for the output-format options was not easy, buried as it is in an appendix to the Administration Guide and not included in the manual pages of the commands that support it.

# Based on: <https://github.com/rgl/proxmox-ve-cluster-vagrant/blob/master/provision-storage.sh>
- name: Pools are setup
  run_once: true
  block:
    - name: Current pools are known
      become: true
      ansible.builtin.command: /usr/bin/pveceph pool ls --output-format json
      register: pve_ceph_pool_ls_out
      changed_when: false  # Read-only operation
    - name: VM pool is created
      # Only needs to happen on one ceph node
      run_once: true
      become: true
      # XXX pool name should be configurable, as should other settings (e.g. size (replicas) and min_size (min replicas or pool fails))
      # Size 4, min_size 2 means can tolerate 2 node failures
      # pg_num from "optimal pg number" in `pveceph pool ls`
      ansible.builtin.command: /usr/bin/pveceph pool create ceph-vm --size 4 --pg_num 32
      when: pve_ceph_pool_ls_out.stdout | from_json | selectattr('pool_name', 'eq', 'ceph-vm') | length == 0

Storage

With Ceph up and running, I need to add storage to ProxmoxVE with Ceph as the underlying store, using pvesm. For now, I only created a single storage pool for VM disks (using the rbd PVE storage type). The pvesm status command does not support a machine-parsable output format, so I had to use its exit status (determined empirically) to detect whether the pool has been set up or not. This is a single operation for the cluster:

- name: PVE Storage is configured
  run_once: true
  block:
    - name: Status of storage is known
      become: true
      ansible.builtin.command: /usr/sbin/pvesm status --storage ceph-vm
      register: pvesm_status_cephvm
      # Returns 255 if storage doesn't exist
      failed_when: pvesm_status_cephvm.rc not in [0, 255]
      changed_when: false  # Read-only operation
    - name: VM storage is created
      become: true
      ansible.builtin.command: /usr/sbin/pvesm add rbd ceph-vm --content images --krbd 0 --pool ceph-vm --username admin
      when: pvesm_status_cephvm.rc == 255
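
The empirically determined exit code is easy to check by hand - before the storage exists the command errors (and, per the task above, exits 255); afterwards it prints the status and exits 0:

pvesm status --storage ceph-vm
echo "exit code: $?"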

Networking

Before I can create a VM, I need to create the default bridge interface that Proxmox VE expects. ProxmoxVE considers interfaces defined in /etc/network/interfaces to be managed by it, and after modifying /etc/network/interfaces, ifreload -a must be run for ProxmoxVE to re-read it.

For now, I hardcoded this for my live network in the proxmox-virtual-environment role - this is not ideal, as the general network configuration should be done at a much more general level (suitable for configuring any network interface on any system). I also hardcoded the interface names, which is another thing I had been avoiding - preferring to identify them by MAC instead in the rest of my Ansible roles.

Firstly, I needed to remove the automatic configuration from the Debian installer and then add the static configuration for the bridge vmbr0:

- name: Network is configured
  # XXX this is too hardcoded wrt. interface names and needs to be at a higher (more general) level...
  block:
    - name: DHCP configured interface is removed
      become: true
      ansible.builtin.lineinfile:
        path: /etc/network/interfaces
        regexp: '{{ item }}'
        state: absent
      loop:
        - '^iface eno1 inet dhcp'
        - '^auto eno1'
        - '^allow-hotplug eno1'
    # Note comments from ProxmoxVE supplied /etc/network/interfaces:
    # > # If you want to manage parts of the network configuration manually,
    # > # please utilize the 'source' or 'source-directory' directives to do
    # > # so.
    # > # PVE will preserve these directives, but will NOT read its network
    # > # configuration from sourced files, so do not attempt to move any of
    # > # the PVE managed interfaces into external files!
    - name: PVE default bridge is configured
      become: true
      ansible.builtin.blockinfile:
        path: /etc/network/interfaces
        block: |
          iface eno1 inet manual

          auto vmbr0
          iface vmbr0 inet static
            address 192.168.10.5{{ inventory_hostname[-1] }}/24
            gateway 192.168.10.250
            bridge-ports eno1
            bridge-stp off
            bridge-fd 0
      notify: ifreload

The ifreload handler is straightforward and just added to the proxmox_virtual_environment role’s handlers/main.yaml:

- name: ifreload
  become: yes
  ansible.builtin.command: /usr/sbin/ifreload -a
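
A quick way to confirm the bridge came up as intended after ifreload has run:

# The bridge should be up with the expected address
ip -brief addr show vmbr0
# eno1 should be shown enslaved to vmbr0
bridge link show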

Onwards and upwards…

Now I have a functioning Proxmox Virtual Environment, clustered and with Ceph storage configured ready to deploy virtual machines onto, which is my next task….