SSL and clustering my live Proxmox Virtual Environment
This follows on from my post on building the cluster hosts to standalone Proxmox Virtual Environment servers, and covers adding SSL certificates and clustering them. I last clustered Proxmox Virtual Environment in my air-gapped lab back in 2022.
SSL certificate
Obtaining the certificate
My Let’s Encrypt configuration is still in SaltStack. I looked at adding pve.home.entek.org.uk and each host to my domains via it, but the Ansible-managed configuration deviates from the current Salt configuration sufficiently that it would take a while to unpick, so instead I manually added it to the configuration.
I need to migrate my Let’s Encrypt configuration to Ansible. At present most of the certificates are imported into my HashiCorp Vault and I am unsure of my long-term plan for this. I might consolidate the fetching of all certificates onto one hardened server that puts them into the vault (as opposed to the current approach of most servers managing their own certificates), which reduces the number of systems with internet access. Or I might move back towards servers managing their own certificates, which reduces the attack surface (as the certificates, and more importantly the private keys, only exist on the server and the latter need never be transmitted over the network).
I requested the new certificate by adding a single line to /etc/dehydrated/domains.txt:
pve.home.entek.org.uk pve01.home.entek.org.uk pve02.home.entek.org.uk pve03.home.entek.org.uk pve04.home.entek.org.uk pve05.home.entek.org.uk
I then ran the script to generate the initial certificates:
mkdir /etc/ssl/pve.home.entek.org.uk
chown dehydrated /etc/ssl/pve.home.entek.org.uk
/usr/local/sbin/dehydrated-update-certs
Finally, I imported them into the vault manually:
vault write /kv/ssl/certs/hosts/pve.home.entek.org.uk/ca bundle=@/etc/ssl/pve.home.entek.org.uk/chain.pem
vault write /kv/ssl/certs/hosts/pve.home.entek.org.uk/certificate certificate=@/etc/ssl/pve.home.entek.org.uk/cert.pem
vault write /kv/ssl/certs/hosts/pve.home.entek.org.uk/key key=@/etc/ssl/pve.home.entek.org.uk/privkey.pem
Deploying the SSL certificate to Proxmox Virtual Environment
Proxmox generates its own internal certificate authority and manages certificates signed by it itself. It uses a proxy, pveproxy, to present itself to the network and it is pveproxy that needs to be configured to use this certificate. More details can be found in the PVE Certificate Management documentation. The certificate and key files are /etc/pve/local/pveproxy-ssl.pem and /etc/pve/local/pveproxy-ssl.key respectively.
The /etc/pve filesystem is a FUSE-mounted view of a SQLite database, and is not a POSIX-compliant filesystem. As a result, ansible.builtin.copy will not work (it always tries to chmod the target file). One solution is to copy the certificate (and bundle) and key to temporary files and then copy those to the target only if needed, using diff to check whether the copy is required, to maintain idempotence:
- name: pveproxy SSL certificates are correct, if provided
block:
- name: Certificate temporary file exists
ansible.builtin.tempfile:
register: cert_temp_file
- name: Key temporary file exists
ansible.builtin.tempfile:
register: key_temp_file
- name: Certificate to be configured is in tempfile
ansible.builtin.copy:
dest: '{{ cert_temp_file.path }}'
mode: 00440
content: "{{ pve_pveproxy_certificate.certificate }}\n{{ pve_pveproxy_certificate.ca_bundle }}"
- name: Key to be configured is in tempfile
ansible.builtin.copy:
dest: '{{ key_temp_file.path }}'
mode: 00400
content: '{{ pve_pveproxy_certificate.key }}'
# Do not display keys
no_log: true
# /etc/pve is a FUSE view of a SQLite database, permissions
# cannot be set and are handled by the fuse driver.
# see: https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)
# So ansible's copy module cannot be used:
# see: https://github.com/ansible/ansible/issues/19731
# and: https://github.com/ansible/ansible/issues/40220
- name: Certificate chain is correct
become: yes
ansible.builtin.shell: >-
diff {{ cert_temp_file.path }} /etc/pve/local/pveproxy-ssl.pem
&& echo "Cert correct"
|| cp {{ cert_temp_file.path }} /etc/pve/local/pveproxy-ssl.pem
register: output
changed_when: 'output.stdout != "Cert correct"'
notify: Restart pveproxy
- name: Certificate key is correct
become: yes
ansible.builtin.shell: >-
diff {{ key_temp_file.path }} /etc/pve/local/pveproxy-ssl.key
&& echo "Key correct"
|| cp {{ key_temp_file.path }} /etc/pve/local/pveproxy-ssl.key
register: output
changed_when: 'output.stdout != "Key correct"'
notify: Restart pveproxy
# Do not display keys
no_log: true
- name: Temporary certificate file is removed
ansible.builtin.file:
path: '{{ cert_temp_file.path }}'
state: absent
- name: Temporary key file is removed
ansible.builtin.file:
path: '{{ key_temp_file.path }}'
state: absent
when: pve_pveproxy_certificate is defined
This is the first variable that the proxmox_virtual_environment role accepts, so I created its meta/argument_specs.yaml:
---
argument_specs:
main:
short_description: Installs and configures Proxmox Virtual Environment
author: Laurence Alexander Hurst
options:
pve_pveproxy_certificate:
description: Certificate for pveproxy (user-facing interface to Proxmox Virtual Environment)
type: dict
required: false
options:
certificate:
description: PEM encoded certificate to use
required: true
type: str
ca_bundle:
description: PEM encoded CA bundle for the certificate
required: true
type: str
key:
description: PEM encoded private key for the certificate
required: true
type: str
...
It also uses a handler to restart pveproxy if the certificate is updated, so I also added handlers/main.yaml:
---
- name: Restart pveproxy
become: yes
ansible.builtin.service:
name: pveproxy
state: restarted
...
Finally, I added the lookup for the certificates added to the vault to my proxmox_virtual_environment_hosts group:
pve_pveproxy_certificate:
certificate: >-
{{
lookup(
'community.hashi_vault.vault_read',
'kv/ssl/certs/hosts/pve.home.entek.org.uk/certificate'
).data.certificate
}}
ca_bundle: >-
{{
lookup(
'community.hashi_vault.vault_read',
'kv/ssl/certs/hosts/pve.home.entek.org.uk/ca'
).data.bundle
}}
key: >-
{{
lookup(
'community.hashi_vault.vault_read',
'kv/ssl/certs/hosts/pve.home.entek.org.uk/key'
).data.key
}}
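These vault_read lookups run on the Ansible controller, which needs to know how to reach and authenticate to the vault. If that is not configured via environment variables, the lookup also accepts connection options inline - a minimal sketch (url and auth_method are genuine options of the community.hashi_vault.vault_read lookup; the vault address here is a placeholder, not my real one):

```yaml
certificate: >-
  {{
    lookup(
      'community.hashi_vault.vault_read',
      'kv/ssl/certs/hosts/pve.home.entek.org.uk/certificate',
      url='https://vault.example.org:8200',
      auth_method='token'
    ).data.certificate
  }}
```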
NTP
As noted in my last post I needed to setup NTP (XXX This is a lie - NTP not yet setup. Should sort that.) and, as I found in my previous attempt at clustering, Proxmox VE and Ceph in particular are extremely sensitive to clock skew (<0.05s).
Proxmox uses Chrony as its NTP client. When I configured a Linux NTP server in my lab network I was using SaltStack, so I had not yet made an Ansible role to configure NTP clients.
I created an ntp role that can configure either chrony or systemd-timesyncd, just as my SaltStack configuration could. Using the same pattern I used for configuring different DHCP server software, I used ansible.builtin.include_tasks in the role’s tasks/main.yaml to include the appropriate tasks for the package at hand:
---
- name: Include appropriate client configuration
ansible.builtin.include_tasks: 'client-{{ ntp_client_software }}.yaml'
...
The tasks for chrony and systemd-timesyncd are broadly similar:
---
# Don't want more than one NTP client installed...
- name: timesyncd is not installed
become: true
ansible.builtin.package:
name: systemd-timesyncd
state: absent
- name: chrony is installed
become: true
ansible.builtin.package:
name: chrony
state: present
- name: NTP server is configured
become: true
ansible.builtin.copy:
dest: /etc/chrony/conf.d/local-ntp-server.conf
content: server {{ ntp_server }} iburst
notify: Restart chrony
...
and
---
# Don't want more than one NTP client installed...
- name: chrony is not installed
become: true
ansible.builtin.package:
name: chrony
state: absent
- name: timesyncd is installed
become: true
ansible.builtin.package:
name: systemd-timesyncd
state: present
- name: NTP server is configured
become: true
community.general.ini_file:
path: /etc/systemd/timesyncd.conf
section: Time
option: NTP
value: '{{ ntp_server }}'
notify: Restart systemd-timesyncd
...
The handlers/main.yaml contains the restart handlers for both packages, which are very straightforward:
---
- name: Restart systemd-timesyncd
become: true
ansible.builtin.service:
name: systemd-timesyncd
state: restarted
- name: Restart chrony
become: true
ansible.builtin.service:
name: chrony
state: restarted
...
Finally, the meta/argument_specs.yaml and defaults/main.yaml define the two settings currently supported (the package to configure, defaulting to systemd-timesyncd, and the NTP server, defaulting to ntp):
---
argument_specs:
main:
short_description: Manage NTP client configuration
author: Laurence Alexander Hurst
options:
ntp_client_software:
description: NTP client software package to use
type: str
default: systemd-timesyncd
choices:
- systemd-timesyncd
- chrony
ntp_server:
description: IP or DNS name of NTP server to sync with
type: str
default: ntp
...
with defaults:
---
ntp_client_software: systemd-timesyncd
ntp_server: ntp
...
For the proxmox_virtual_environment_hosts group, I added a setting selecting the chrony software to its group_vars file and, for now, the network-specific NTP address - as noted in the comment, I hope to replace this with per-network resolution of ntp using bind’s views:
ntp_client_software: chrony
# XXX This is temporary - once DNS views are setup, each network can have correctly resolving `ntp`
ntp_server: ntp-mgmt
I added this role to all of my hosts, via an existing hosts: all:!dummy play, since having time synced is important on most systems.
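As a sketch, the relevant part of such a play might look like this (the hosts pattern and role name come from the text above; the exact structure of my real play is an assumption):

```yaml
# Existing play targeting all real hosts (everything except the
# "dummy" group), with the new ntp role added to it.
- name: Common configuration for all real hosts
  hosts: all:!dummy
  roles:
    - ntp
```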
Clustering the nodes
Proxmox
To make this as automated as possible, there are two scenarios that I can fully automate:
- The cluster is already setup and one of the other cluster nodes can be contacted, in which case the node being configured can join the cluster.
- The cluster is not setup, in which case a new cluster can be created (on any node). To confirm this, all nodes that should be in the cluster need to be contactable, to check that none of them is already setup. The new cluster needs to be created on only one node (then the rest can be joined to it).
Both cases require knowing the other nodes in the cluster (to find one to join and/or check they are not configured), and creating a new cluster requires a name for the cluster. So I added these to the proxmox-virtual-environment role’s argument_specs.yaml:
pve_cluster_nodes:
description: Complete list of nodes (expected to be Ansible inventory hosts and resolvable within the proxmox nodes) that are part of the cluster
type: list
elements: str
required: false
pve_cluster_name:
description: Name to be used if creating a new cluster
type: str
default: pvecluster
I also added the default value to the role’s defaults/main.yaml:
---
pve_cluster_name: pvecluster
...
Initially, I had the tasks that find the status of the cluster nodes in the role’s tasks/main.yaml, but when I added the task to create a new cluster I decided to put them in their own tasks file and reuse that to rescan for the newly setup node afterwards, so the join process works the same for the other nodes whether the cluster is new or already existed.
The pvecm status command prints that /etc/pve/corosync.conf is missing and that (the missing file) suggests the node is not part of a cluster, so this seemed like a sensible test for whether a node is clustered. By delegating to each node, all of the nodes will get the facts set. The code sets two facts: a list of nodes that were reached (pve_cluster_nodes_contactable) and a list of those that are in a cluster, i.e. where /etc/pve/corosync.conf exists (pve_cluster_nodes_clustered):
---
- name: Cluster file is stated on all nodes in cluster
delegate_to: '{{ item }}'
ansible.builtin.stat:
path: /etc/pve/corosync.conf
# Don't fail if a node is down, might not be fatal - only need one node
# available to join an existing cluster, for example.
failed_when: false
register: pve_corosync_conf_stat
loop: '{{ pve_cluster_nodes }}'
- name: Clustered and contactable node lists are initialised to empty lists
ansible.builtin.set_fact:
pve_cluster_nodes_clustered: []
pve_cluster_nodes_contactable: []
- name: Nodes that are clustered are known
ansible.builtin.set_fact:
pve_cluster_nodes_clustered: >-
{{
pve_cluster_nodes_clustered
+
[item.item]
}}
loop: '{{ pve_corosync_conf_stat.results }}'
when: not item.failed and item.stat.exists
- name: Nodes that are contactable are known
ansible.builtin.set_fact:
pve_cluster_nodes_contactable: >-
{{
pve_cluster_nodes_contactable
+
[item.item]
}}
loop: '{{ pve_corosync_conf_stat.results }}'
when: not item.failed
- ansible.builtin.debug:
msg: 'Nodes already in cluster: {{ pve_cluster_nodes_clustered | join(", ") }}'
- ansible.builtin.debug:
msg: 'Nodes that can be reached and should be in cluster: {{ pve_cluster_nodes_contactable | join(", ") }}'
...
Finally, I added the tasks to the role’s tasks/main.yaml to actually do the clustering if the pve_cluster_nodes variable is set. On a new node I found that pveproxy had not yet been restarted after the certificate was updated, which resulted in certificate errors, so I added a flushing of handlers to ensure the handler to restart pveproxy had run before the clustering, if required. I also had to set a longer timeout on the cluster join command as I found it was timing out occasionally.
- name: Handlers have flushed, so pveproxy certificates are correct
ansible.builtin.meta: flush_handlers
- name: Cluster is joined
block:
# There are two situations we can automatically deal with:
# 1. There is an existing clustered node to join with.
# or
# 2. All of the clustered nodes are contactable and none are in
# a cluster (in which case a new cluster needs to be setup).
- name: Clustered node status is known
ansible.builtin.include_tasks: clustered-nodes-status.yaml
- name: Cluster is setup
run_once: true
become: true
ansible.builtin.command:
cmd: /usr/bin/pvecm create {{ pve_cluster_name }}
when: pve_cluster_nodes_clustered | length == 0 and pve_cluster_nodes_contactable | length == pve_cluster_nodes | length
- name: Clustered node status is refreshed if newly setup
ansible.builtin.include_tasks: clustered-nodes-status.yaml
when: pve_cluster_nodes_clustered | length == 0 and pve_cluster_nodes_contactable | length == pve_cluster_nodes | length
- name: Have a node to join with
ansible.builtin.assert:
that: pve_cluster_nodes_clustered | length > 0
fail_msg: No existing cluster nodes found to join with.
- name: Node is joined to cluster
become: true
ansible.builtin.expect:
# Need to use FQDN with Let's Encrypt certificate
command: /usr/bin/pvecm add {{ pve_cluster_join_target }}.{{ ansible_facts.domain }}
responses:
"Please enter superuser \\(root\\) password for '[^']+':": >-
{{
lookup(
'community.hashi_vault.vault_read',
'kv/hosts/' + pve_cluster_join_target + '/users/root'
).data.password
}}
timeout: 90 # Sometimes takes longer than 30s default
vars:
pve_cluster_join_target: '{{ pve_cluster_nodes_clustered | first }}'
when: inventory_hostname not in pve_cluster_nodes_clustered
when: pve_cluster_nodes is defined
In the proxmox_virtual_environment_hosts group variables, I set the pve_cluster_nodes variable to be a lookup of all inventory hosts in that group. This (the use of a group, as well as the group name) is site-specific, so best kept outside the role:
pve_cluster_nodes: >-
{{
query(
'ansible.builtin.inventory_hostnames',
'proxmox_virtual_environment_hosts'
)
}}
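For reference, the groups magic variable offers a shorter equivalent (both forms resolve to the same host list; which to use is a style choice):

```yaml
pve_cluster_nodes: "{{ groups['proxmox_virtual_environment_hosts'] }}"
```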
Ceph
Configuring LVM with Ansible
In order to setup Ceph, I need to setup the disks for it to use. I have been putting this off, and even hacked my automated Debian preseed as I ran out of space installing ProxmoxVE in the first place. The preseed is intended to produce installs for small disks, for VM use, and my intention was always that Ansible would resize as required - keeping the preseed as a “one size fits all” bootstrap with all customisation done through Ansible. Doing this work also enabled me to set custom (lvm) filesystem sizes generally.
To configure the volumes, I added a new role called filesystems (with the intention that it might be used for more than just LVM in the future) and configured arguments that allow it to manage volume groups and logical volumes within them:
---
argument_specs:
main:
description: Configure filesystems on the target
options:
filesystems_lvm_volume_groups:
description: Logical volume groups to configure
type: list
elements: dict
default: []
options:
name:
description: Name of the volume group
type: str
required: true
logical_volumes:
description: List of logical volumes to manage
type: list
elements: dict
options:
name:
description: Name of the logical volume
type: str
required: true
size:
description: >-
Size of the logical volume (see
<https://docs.ansible.com/ansible/latest/collections/community/general/lvol_module.html#parameter-size>
for valid values)
type: str
required: false
...
The corresponding defaults/main.yaml file:
---
filesystems_lvm_volume_groups: []
...
The tasks/main.yaml file is just a loop over the volume groups; the default empty list means that nothing will happen on hosts with no filesystems_lvm_volume_groups set (making it safe to apply to all hosts):
---
- name: Logical volumes exist and are correct sizes
ansible.builtin.include_tasks: lvm.yaml
vars:
filesystems_lvm_lvs: '{{ lvm_item.logical_volumes }}'
filesystems_lvm_vg: '{{ lvm_item.name }}'
loop: '{{ filesystems_lvm_volume_groups }}'
loop_control:
# Avoid clashing with inner loop
loop_var: lvm_item
...
The lvm.yaml tasks file just ensures that each logical volume in the group is correct:
---
- name: All logical volumes are correct
become: true
community.general.lvol:
lv: '{{ item.name }}'
resizefs: true
size: '{{ item.size }}'
vg: '{{ filesystems_lvm_vg }}'
loop: '{{ filesystems_lvm_lvs }}'
...
I hardcoded resizefs to true - this means that if the volume grows and its filesystem is one of those supported (ext2, ext3, ext4, ReiserFS and XFS at the time of writing), the filesystem will be resized by the module. However, if a volume with an unsupported filesystem is resized then the module will fail.
I also did not override the default force setting of false, which means attempts to shrink a volume will also cause the module to fail.
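If shrinking ever did become necessary, the lvol module’s force option could be exposed per volume. A hypothetical extension of the task (force is a real community.general.lvol option; the per-volume item.force argument is an assumed addition to the role’s spec, not something I have implemented):

```yaml
- name: All logical volumes are correct
  become: true
  community.general.lvol:
    lv: '{{ item.name }}'
    resizefs: true
    size: '{{ item.size }}'
    vg: '{{ filesystems_lvm_vg }}'
    # force is needed to shrink a volume - destructive, so default to false
    force: '{{ item.force | default(false) }}'
  loop: '{{ filesystems_lvm_lvs }}'
```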
I applied the new role to all Linux systems by adding it to the existing play that targets all:!dummy (all real hosts), using an import_role task with a condition (when) so that it only applies to Linux systems:
- name: Filesystems are configured on Linux systems
ansible.builtin.import_role:
name: filesystems
when: ansible_facts.system == 'Linux'
Finally, I added the larger root volume and the creation of the ceph_osd volume to proxmox_virtual_environment_hosts’s group variables:
filesystems_lvm_volume_groups:
- name: vg_{{ inventory_hostname }}
logical_volumes:
- name: root
size: 5G
- name: ceph_osd
size: 200G
Installing Ceph
I started with a block, and presumed (per the comment) that all cluster nodes will also be part of the Ceph cluster:
# XXX Presumes PVE cluster nodes will always also be ceph cluster
- name: Ceph is configured (for clusters)
block:
#...
when: pve_cluster_nodes is defined
As I have previously configured the Ceph repository, all I needed to do was install the packages and reload the Proxmox VE services (based on a post on the Proxmox forums) as the first tasks in this block:
- name: Ceph packages are installed
become: true
ansible.builtin.package:
name: ceph
state: present
notify:
- Reload pveproxy
- Reload pvedaemon
- name: Ensure handlers are flushed (to reload daemons if Ceph just installed)
ansible.builtin.meta: flush_handlers
This uses two new handlers, which I added to the role’s handlers/main.yaml file:
- name: Reload pveproxy
become: yes
ansible.builtin.service:
name: pveproxy
state: reloaded
- name: Reload pvedaemon
become: yes
ansible.builtin.service:
name: pvedaemon
state: reloaded
Initialising the Ceph cluster
I determined from running pveceph status that Proxmox VE itself uses the presence of /etc/pve/ceph.conf to determine whether Ceph has been initialised:
$ pveceph status
pveceph configuration not initialized - missing '/etc/pve/ceph.conf', missing '/etc/pve/ceph'
So I used the same test in my Ansible playbook:
- name: Ceph config file is stated
ansible.builtin.stat:
path: /etc/pve/ceph.conf
register: ceph_conf_stat
- name: Ceph cluster is initialised if no config file found
run_once: true
become: true
ansible.builtin.command: /usr/bin/pveceph init
when: not ceph_conf_stat.stat.exists
Firewall and monitors
The next step in installing Ceph is setting up the monitors. Configuring the firewall also requires knowing whether a node is a monitor or not as the firewall requirements are slightly different.
Advice varies on how many monitors are required - Proxmox’s documentation says to deploy exactly 3:
For high availability, you need at least 3 monitors. One monitor will already be installed if you used the installation wizard. You won’t need more than 3 monitors, as long as your cluster is small to medium-sized. Only really large clusters will require more than this.
The Ceph documentation has a slightly different view:
For small or non-critical deployments of multi-node Ceph clusters, it is recommended to deploy three monitors. For larger clusters or for clusters that are intended to survive a double failure, it is recommended to deploy five monitors. Only in rare circumstances is there any justification for deploying seven or more monitors.
And explicitly says five monitors is recommended for 5 or more nodes:
A typical Ceph cluster has three or five monitor daemons that are spread across different hosts. We recommend deploying five monitors if there are five or more nodes in your cluster.
However, I found one Proxmox forum post saying that 5 monitors is overkill for 15 node deployment:
3 monitors are fine for small to medium (and 15 OSD nodes is definitely still that category for Ceph ;)) clusters
In short, the advice is somewhat contradictory. I initially followed Proxmox’s advice and set up 3 monitors, but later expanded this to 5 as I am aiming for a cluster that can survive a double node failure. Initially setting up 3 monitors also means I have the right template for when I roll this out to my home lab, which has 10 nodes and will also have 5 monitors (i.e. not all nodes will be monitors).
Rather than try to be too clever and calculate which nodes will be monitors based on the number of monitors and available nodes, I decided to explicitly list the monitors by adding a variable to the proxmox_virtual_environment’s group_vars.
Originally, I set this to the odd numbered nodes:
# Use odd number nodes as monitors in my 5 node cluster
pve_ceph_monitors:
- pve01
- pve03
- pve05
But then replaced it with all 5 nodes (done this way as the same node list will also work unmodified for the 10 node cluster):
# Use first 5 nodes as monitors
pve_ceph_monitors:
- pve01
- pve02
- pve03
- pve04
- pve05
To make life easy, I added a task to set a boolean based on whether this node is a monitor or not:
- name: If this host should be a monitor is known
ansible.builtin.set_fact:
pve_is_monitor: '{{ inventory_hostname in pve_ceph_monitors }}'
Firewall
Firewalld comes with service definitions for Ceph (ceph) and Ceph monitors (ceph-mon), so I just need to enable/disable them as appropriate. The firewall needs to be configured before the monitors can be set up, as they need to be able to communicate in order to configure themselves.
# Firewall needs to be configured before ceph monitors are
# setup (or they won't be able to communicate).
# For firewall details, see:
# https://docs.ceph.com/en/reef/rados/configuration/network-config-ref/
- name: firewalld is configured
block:
# ceph and ceph-mon services come with firewalld
# XXX Want to make this better (properly zoned, not just opened to everything via default zone)
- name: Ceph metadata/manager/osd service is allowed
become: yes
ansible.posix.firewalld:
service: ceph
permanent: true
state: enabled
notify: reload firewalld
- name: Ceph Monitor service is allowed (on monitors)
become: yes
ansible.posix.firewalld:
# ceph-mon comes out of the box with ceph and/or proxmoxve
service: ceph-mon
permanent: true
state: enabled
notify: reload firewalld
when: pve_is_monitor
- name: Ceph Monitor service is not allowed (on non monitors)
become: yes
ansible.posix.firewalld:
# ceph-mon comes out of the box with ceph and/or proxmoxve
service: ceph-mon
permanent: true
state: disabled
notify: reload firewalld
when: not pve_is_monitor
- name: Handlers have flushed, so ceph becomes accessible if firewall has changed
ansible.builtin.meta: flush_handlers
Monitors
To make this idempotent, I used the ceph mon metadata command to get the current monitor configuration (in JSON format), extract the list of configured monitors, and configure the local host only if it is not in that list.
Ceph’s documentation recommends running the manager on all monitors, although only one manager is needed and it is not critical to the filesystem, so I did that too (using the same method to make it idempotent):
- name: Monitors are configured
block:
- name: Current monitor metadata is known
become: true
ansible.builtin.command: /usr/bin/ceph mon metadata
register: ceph_mon_metadata_out
changed_when: false # Read-only operation
- name: List of configured monitors is known
ansible.builtin.set_fact:
pve_configured_monitors: '{{ ceph_mon_metadata_out.stdout | from_json | map(attribute="name") }}'
- name: Monitors are set up
become: true
ansible.builtin.command: /usr/bin/pveceph mon create
when: inventory_hostname not in pve_configured_monitors
# > In general, you should set up a ceph-mgr on each of the hosts
# > running a ceph-mon daemon to achieve the same level of
# > availability.
# - <https://docs.ceph.com/en/reef/mgr/administrator/#high-availability>
- name: Current manager metadata is known
become: true
ansible.builtin.command: /usr/bin/ceph mgr metadata
register: ceph_mgr_metadata_out
changed_when: false # Read-only operation
- name: List of configured managers is known
ansible.builtin.set_fact:
pve_configured_managers: >-
{{
ceph_mgr_metadata_out.stdout
| from_json
| map(attribute="name")
}}
- name: Managers are set up
become: true
ansible.builtin.command: /usr/bin/pveceph mgr create
when: inventory_hostname not in pve_configured_managers
when: pve_is_monitor
OSDs
Next, I need to set up the LVM volume created earlier to be used by Ceph. This is complicated by pveceph not supporting the addition of logical volumes (attempting to add one results in the error unable to get device info for '/dev/dm-...'). It can be done with Ceph’s native tools, however some extra steps are required to allow the Ceph tools to find the configuration file (the symlink from /etc/ceph/ceph.conf to /etc/pve/ceph.conf seems to get created when pveceph adds an OSD) and to authenticate (though the latter is only required to actually add the disk).
To make this idempotent, I only attempt to add the volume if there are no OSDs for the current system - this presumes there is only one OSD. I also hardcoded the volume group template and logical volume name in the task that adds it; I will need to revisit this if I change that (e.g. add a 2nd disk just for Ceph) in the future.
- name: OSD is setup
block:
- name: Ceph config symlink exists
become: true
ansible.builtin.file:
path: /etc/ceph/ceph.conf
src: /etc/pve/ceph.conf
state: link
- name: Current OSD metadata is known
become: true
ansible.builtin.command: /usr/bin/ceph osd metadata
register: ceph_osd_metadata_out
changed_when: false # Read-only operation
- name: This system's OSDs are known
ansible.builtin.set_fact:
pve_ceph_local_osds: >-
{{
ceph_osd_metadata_out.stdout
| from_json
| selectattr('hostname', 'eq', inventory_hostname)
}}
- name: OSD is configured
block:
- name: Bootstrap keyring is correct
become: true
ansible.builtin.shell:
cmd: umask 0077 ; ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
creates: /var/lib/ceph/bootstrap-osd/ceph.keyring
- name: OSD volume is added
become: true
# XXX Assumes vg and lv names follow this convention...
ansible.builtin.command: /usr/sbin/ceph-volume lvm create --data vg_{{ inventory_hostname }}/ceph_osd
when: pve_ceph_local_osds | length == 0
Pool
Finally, I created a Ceph pool for VM disks based on a template I found online. Finding the documentation for the output-format options was not easy, buried as it is in an appendix to the Administration Guide and not included in the manual pages of the commands that support it.
# Based on: <https://github.com/rgl/proxmox-ve-cluster-vagrant/blob/master/provision-storage.sh>
- name: Pools are setup
run_once: true
block:
- name: Current pools are known
become: true
ansible.builtin.command: /usr/bin/pveceph pool ls --output-format json
register: pve_ceph_pool_ls_out
changed_when: false # Read-only operation
- name: VM pool is created
# Only needs to happen on one ceph node
run_once: true
become: true
# XXX pool name should be configurable, as should other settings (e.g. size (replicas) and min_size (min replicas or pool fails))
# Size 4, min_size 2 means can tolerate 2 node failures
# pg_num from "optimal pg number" in `pveceph pool ls`
ansible.builtin.command: /usr/bin/pveceph pool create ceph-vm --size 4 --pg_num 32
when: pve_ceph_pool_ls_out.stdout | from_json | selectattr('pool_name', 'eq', 'ceph-vm') | length == 0
Storage
With Ceph up and running, I need to add storage to Proxmox VE with Ceph as the underlying store, using pvesm. For now, I only created a single storage pool for VM disks (using the rbd PVE storage pool type). The pvesm status command does not support a machine-parsable output format, so I had to use its exit status (determined empirically) to detect whether the pool has been set up or not. This is a single operation for the cluster:
- name: PVE Storage is configured
run_once: true
block:
- name: Status of storage is known
become: true
ansible.builtin.command: /usr/sbin/pvesm status --storage ceph-vm
register: pvesm_status_cephvm
# Returns 255 if storage doesn't exist
failed_when: pvesm_status_cephvm.rc not in [0, 255]
changed_when: false # Read-only operation
- name: VM storage is created
become: true
ansible.builtin.command: /usr/sbin/pvesm add rbd ceph-vm --content images --krbd 0 --pool ceph-vm --username admin
when: pvesm_status_cephvm.rc == 255
Networking
Before I can create a VM, I need to create the default bridge interface Proxmox VE expects. Proxmox VE will consider interfaces defined in /etc/network/interfaces to be managed by it, and after modifying /etc/network/interfaces, ifreload -a must be run for Proxmox VE to re-read it.
For now, I hardcoded this for my live network in the proxmox-virtual-environment role - this is not ideal, as general network configuration should be done at a much more general level (suitable for configuring any network interface on any system). I also hardcoded the interface names, which is another thing I had been avoiding - preferring to identify them by MAC address instead in the rest of my Ansible roles.
Firstly, I needed to remove the automatic configuration from the Debian installer, then add the static configuration for the bridge vmbr0:
- name: Network is configured
# XXX this is too hardcoded wrt. interface names and needs to be at a higher (more general) level...
block:
- name: DHCP configured interface is removed
become: true
ansible.builtin.lineinfile:
path: /etc/network/interfaces
regexp: '{{ item }}'
state: absent
loop:
- '^iface eno1 inet dhcp'
- '^auto eno1'
- '^allow-hotplug eno1'
# Note comments from ProxmoxVE supplied /etc/network/interfaces:
# > # If you want to manage parts of the network configuration manually,
# > # please utilize the 'source' or 'source-directory' directives to do
# > # so.
# > # PVE will preserve these directives, but will NOT read its network
# > # configuration from sourced files, so do not attempt to move any of
# > # the PVE managed interfaces into external files!
- name: PVE default bridge is configured
become: true
ansible.builtin.blockinfile:
path: /etc/network/interfaces
block: |
iface eno1 inet manual
auto vmbr0
iface vmbr0 inet static
address 192.168.10.5{{ inventory_hostname[-1] }}/24
gateway 192.168.10.250
bridge-ports eno1
bridge-stp off
bridge-fd 0
notify: ifreload
The ifreload handler is straightforward and just added to the proxmox_virtual_environment’s handlers/main.yaml:
- name: ifreload
become: yes
ansible.builtin.command: /usr/sbin/ifreload -a
Onwards and upwards…
Now I have a functioning Proxmox Virtual Environment cluster with Ceph storage configured, ready to deploy virtual machines onto - which is my next task….