Clustering Hashicorp Vault and SSL Ansible role improvements
Now that I can unlock my Proxmox Virtual Environment (ProxmoxVE) hosts with a static LUKS unlock passphrase, I can move my existing single-node Hashicorp Vault install to a cluster running on these hosts. This adds resilience to what has become a crucial piece of my infrastructure, and moves it from the monitoring host (in the same network segment as my desktop) to the new “services” segment, improving security.
SSL certificates
In order to set up Hashicorp Vault originally, I deployed it on an existing web-server, which already had SSL certificates deployed on it, and configured it to use the certificates deployed by my existing ssl role, which was only applied to the webservers group. When I set up my ProxmoxVE hosts, I used one SSL certificate shared by all hosts. To work with my existing host-based SSL certificate solution from the web-server role, I changed this to use per-host certificates.
Firstly, I obtained new Let’s Encrypt certificates for each host, with Subject Alternative Names for the host, vault.home.entek.org.uk and pve.home.entek.org.uk. These I stored in my existing vault using the same host-orientated structure as my existing monitoring server, so the host-based generic lookup in my ssl role will work for these too.
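For illustration, such a host-orientated lookup could be expressed along these lines. This is a hypothetical sketch only: the secret path, field names and choice of lookup plugin are assumptions, not my actual layout.
# Hypothetical sketch of a host-orientated certificate lookup; the
# secret path (ssl/<hostname>) and field names are illustrative only.
ssl_host_certificate:
  certificate: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=ssl/' ~ inventory_hostname ~ ':certificate') }}"
  ca_bundle: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=ssl/' ~ inventory_hostname ~ ':ca_bundle') }}"
  key: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=ssl/' ~ inventory_hostname ~ ':key') }}"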
In the inventory, I added the hashicorp_vault_servers, monitoring_servers and proxmox_virtual_environment_hosts groups as children of my existing ssl_host_certificate group, which is only used to set the variables the ssl role needs for host-specific certificates:
# Hosts with SSL certificates in the vault
ssl_host_certificate:
children:
hashicorp_vault_servers:
monitoring_servers:
proxmox_virtual_environment_hosts:
I applied the ssl role to this group, and removed the same task from the existing - hosts: monitoring_servers play:
- hosts: ssl_host_certificate
tasks:
- name: Setup host-based SSL certificate
ansible.builtin.import_role:
name: ssl
vars:
dhparam:
file: '{{ ssl_dhparam.file }}'
bits: '{{ ssl_dhparam.bits }}'
ssl_certificate:
certificate: "{{ ssl_host_certificate.certificate }}"
certificate_file: "{{ ssl_host_certificate_files.certificate }}"
ca_bundle: "{{ ssl_host_certificate.ca_bundle }}"
ca_bundle_file: "{{ ssl_host_certificate_files.ca_bundle }}"
full_chain_file: "{{ ssl_host_certificate_files.full_chain }}"
key: "{{ ssl_host_certificate.key }}"
key_file: "{{ ssl_host_certificate_files.key }}"
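The ssl_host_certificate_files variable referenced here holds the destination paths for the deployed files; as a purely hypothetical sketch (these are not my actual paths), it is a simple map along these lines:
# Hypothetical sketch of the per-host destination file paths.
ssl_host_certificate_files:
  certificate: /etc/ssl/certs/{{ inventory_hostname }}.pem
  ca_bundle: /etc/ssl/certs/{{ inventory_hostname }}-ca-bundle.pem
  full_chain: /etc/ssl/certs/{{ inventory_hostname }}-full-chain.pem
  key: /etc/ssl/private/{{ inventory_hostname }}.key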
Next, in the proxmox-virtual-environment role, I changed the argument that previously took the certificate itself to take the local files:
pve_pveproxy_certificate:
description: Certificate files for pveproxy (user-facing interface to Proxmox Virtual Environment)
type: dict
required: false
options:
full_chain:
description: The SSL certificate full chain file
required: true
type: str
key:
description: The SSL certificate key file
required: true
type: str
and removed the temporary files (as they are no longer needed), instead updating the certificates in the /etc/pve virtual filesystem from the local files already deployed by the existing ssl role:
---
- name: pveproxy SSL certificates are correct, if provided
block:
# /etc/pve is a FUSE view of a SQLite database, permissions
# cannot be set and are handled by the fuse driver.
# see: https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)
# So ansible's copy module cannot be used:
# see: https://github.com/ansible/ansible/issues/19731
# and: https://github.com/ansible/ansible/issues/40220
- name: Certificate chain is correct
become: yes
ansible.builtin.shell: >-
diff {{ pve_pveproxy_certificate.full_chain }} /etc/pve/local/pveproxy-ssl.pem
&& echo "Cert correct"
|| cp {{ pve_pveproxy_certificate.full_chain }} /etc/pve/local/pveproxy-ssl.pem
register: output
changed_when: 'output.stdout != "Cert correct"'
notify: Restart pveproxy
- name: Certificate key is correct
become: yes
ansible.builtin.shell: >-
diff {{ pve_pveproxy_certificate.key }} /etc/pve/local/pveproxy-ssl.key
&& echo "Key correct"
|| cp {{ pve_pveproxy_certificate.key }} /etc/pve/local/pveproxy-ssl.key
register: output
changed_when: 'output.stdout != "Key correct"'
notify: Restart pveproxy
# Do not display keys
no_log: true
when: pve_pveproxy_certificate is defined
...
Finally, I removed pve_pveproxy_certificate (which contained the “old” lookup for the certificate from the vault) from group_vars/proxmox_virtual_environment_hosts.yaml.
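The role is instead pointed at the files the ssl role already deploys. Something like the following would do it; this is a hypothetical sketch, with the play and task names illustrative rather than taken from my actual playbook:
- hosts: proxmox_virtual_environment_hosts
  tasks:
    - name: Configure Proxmox Virtual Environment
      ansible.builtin.import_role:
        name: proxmox-virtual-environment
      vars:
        # Point the role at the files deployed by the ssl role
        # (illustrative; variable names assumed from the play above).
        pve_pveproxy_certificate:
          full_chain: "{{ ssl_host_certificate_files.full_chain }}"
          key: "{{ ssl_host_certificate_files.key }}"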
Improving the SSL certificate handler
My current ssl role does not notify any handlers when the certificates are updated. While this works fine when initially building web-servers, it means that if the certificates are refreshed, nothing is notified to re-read them. This is not ideal, particularly for the relatively short-lived certificates issued by Let’s Encrypt. Fortunately, Ansible can notify an arbitrary number of handlers using the listen directive.
To ensure there is always at least one handler (to prevent any errors), I added handlers/main.yaml to the ssl role, which just prints a message that it was notified:
---
- name: Dummy debug handler so at least one certificate updated handler is defined
ansible.builtin.debug: msg="SSL certificate update handler fired"
listen: "ssl certificates updated"
...
To each of the tasks that updates the certificate files (dh_param, certificate, CA certificate, full chain or key), I added a notification to this listener:
notify: "ssl certificates updated"
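For example, the task deploying the certificate itself ends up looking something like this (a hypothetical sketch; the real task in my ssl role differs in its details):
# Hypothetical sketch of an ssl role task with the notification added;
# the ssl_certificate keys match the arguments passed to the role earlier.
- name: SSL certificate is deployed
  become: yes
  ansible.builtin.copy:
    content: "{{ ssl_certificate.certificate }}"
    dest: "{{ ssl_certificate.certificate_file }}"
    owner: root
    group: root
    mode: "0644"
  notify: "ssl certificates updated"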
In the webserver role’s handlers, I changed the Restart nginx handler to listen for this notification:
- name: Restart nginx
become: yes
ansible.builtin.service:
name: nginx
state: restarted
listen: "ssl certificates updated"
and the same in the existing hashicorp-vault role’s handler:
---
- name: restart hashicorp vault
become: yes
ansible.builtin.service:
name: vault
state: restarted
listen: "ssl certificates updated"
...
Clustering Vault
Because my ProxmoxVE hosts are not yet in DNS, I added them to the /etc/hosts file on my existing Vault server. I also temporarily replaced the 127.0.1.1 resolution for the local hostname with its actual IP address, in case this affected Vault’s internal redirection. I did not need to make any changes to my ProxmoxVE hosts because they already have one another in their /etc/hosts files and can resolve the existing server via DNS.
My original plan was to add the new nodes to a cluster with the existing Vault server, then remove the “old” server from the cluster once I was satisfied it was working. However, when I originally set up Vault I set the cluster_addr to 127.0.0.1 (as I was not clustering at the time). I updated this to the host’s fully-qualified name and restarted Vault, but vault operator raft list-peers still showed the address of the (one) host in the cluster as 127.0.0.1:8201. According to the Vault documentation, this address is used for internal traffic between the servers in a cluster, so I expected clustering not to work. I found documentation on recovering from lost quorum, which reads as though the process might correct the incorrect address, but as this is a 1-node “cluster” at present I did not want to risk breaking it.
Instead, I decided to create a new Vault cluster and restore my backup snapshot following the documented restore process. Not only does this remove the risk associated with the unknown (by me) impact of the misconfiguration of the existing server, it allows me to fully test the DR restore process with a newly initialised Vault cluster.
Setting up new Vault servers
Turning my existing single-node install Ansible playbook into a multi-node one was fairly simple. I added the ProxmoxVE hosts to the Vault server group:
hashicorp_vault_servers:
children:
proxmox_virtual_environment_hosts:
To enable the nodes to automatically attempt to rejoin the cluster they form, each needs to know the other hosts in the cluster. To configure this, I added a parameter for the cluster peers to my hashicorp-vault role’s meta/argument_specs.yaml:
vault_cluster_peers:
description: Peers to cluster with (need to be resolvable names)
required: false
default: []
type: list
elements: str
and set the default in the role’s defaults/main.yaml:
---
vault_cluster_peers: []
...
I then updated the /etc/vault.d/vault.hcl template to add a retry_join stanza for each of the peers, and changed the cluster_addr parameter to the current host’s fully-qualified domain name (rather than 127.0.0.1):
storage "raft" {
path = "/opt/vault/data"
node_id = "{{ ansible_facts.hostname }}"
{% for peer in vault_cluster_peers %}
retry_join {
leader_api_addr = "https://{{ peer }}:8200"
}
{% endfor %}
}
listener "tcp" {
address = "0.0.0.0:8200"
tls_cert_file = "{{ ssl_certificate_files.full_chain }}"
tls_key_file = "{{ ssl_certificate_files.key }}"
}
# Address to advertise to other vault servers for client
# redirection.
api_addr = "https://{{ ansible_facts.fqdn }}:8200"
cluster_addr = "http://{{ ansible_facts.fqdn }}:8201"
I also added 8201/tcp to the list of ports in the vault firewalld service, and replaced looking up the Vault address from the Ansible environment ({{ lookup('ansible.builtin.env', 'VAULT_ADDR', default='https://127.0.0.1:8200/') }}) with the local host (https://{{ ansible_facts.fqdn }}:8200), which will automatically forward requests to the leader of the cluster.
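My role manages the port via its vault firewalld service definition, but for illustration the equivalent change using the ansible.posix.firewalld module directly would be a sketch like this (not my actual task):
# Hypothetical equivalent of opening the cluster port, using
# ansible.posix.firewalld rather than a custom service definition.
- name: Vault cluster port is open
  become: yes
  ansible.posix.firewalld:
    port: 8201/tcp
    permanent: true
    immediate: true
    state: enabled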
Finally, I set the value of vault_cluster_peers in my group_vars/hashicorp_vault_servers.yaml to be all of the members of that group:
---
vault_cluster_peers: >-
{{
groups['hashicorp_vault_servers']
|
map('extract', hostvars)
|
map(attribute='ansible_facts.fqdn')
}}
...
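To sanity-check the computed list before rolling it out, a throwaway debug task can render it (not part of my playbooks, just a quick check):
- name: Show the computed Vault cluster peers
  ansible.builtin.debug:
    var: vault_cluster_peers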
With these changes, Vault installed successfully on all of my Proxmox nodes. I ran VAULT_ADDR=https://$( hostname -f ):8200/ vault operator init on one of the cluster hosts and unsealed it, but did not unseal any of the others yet.
Restoring backup
Again, this turned out to be fairly straightforward, following the “Standard procedure for restoring a Vault cluster” documentation:
- I copied the latest backup, backup-latest.snap, from the existing Vault server to one of the new cluster hosts.
- On the cluster host I copied the backup to, I logged in with the new cluster’s initial root token (from when I ran vault operator init):
  VAULT_ADDR=https://$( hostname -f ):8200/ vault login
- I then imported the snapshot:
  VAULT_ADDR=https://$( hostname -f ):8200/ vault operator raft snapshot restore -force backup-latest.snap
- I then unsealed the vault (which was now sealed again) using the restored snapshot’s unseal keys:
  VAULT_ADDR=https://$( hostname -f ):8200/ vault operator unseal
- Finally, I unsealed the other nodes in the cluster. Once the cluster reached quorum, the “Active Node Address” reported by vault status changed to one of the cluster nodes (from the “old” Vault URL).
The status of an individual node can be checked with the /sys/health API endpoint, which returns status 200 if the Vault is unsealed and this node is active, 429 if the Vault is unsealed and this node is standby, 501 if the Vault is not initialised and 503 if this node is sealed.
The API will respond to just a HEAD HTTP request:
curl -I https://$( hostname -f ):8200/v1/sys/health
or, to print just the status code:
curl -I -s -S -o /dev/null -w '%{http_code}' https://$( hostname -f ):8200/v1/sys/health
At some point, I need to revisit my Ansible playbooks and replace the existing use of the vault status command with calls to the local API. For now, I have raised an issue for this on my Git platform.
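As a rough sketch of what that replacement could look like (hypothetical task and variable names, using the status codes described above):
# Hypothetical sketch of checking a node via the API rather than 'vault status'.
- name: Check local Vault node health
  ansible.builtin.uri:
    url: "https://{{ ansible_facts.fqdn }}:8200/v1/sys/health"
    method: HEAD
    # 200 = unsealed and active, 429 = unsealed standby,
    # 501 = not initialised, 503 = sealed
    status_code: [200, 429, 501, 503]
  register: vault_health

- name: Vault node is unsealed
  ansible.builtin.assert:
    that: vault_health.status in [200, 429]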