Welcome to the fifth post in the rabbit-hole I disappeared down while trying to add a bastion host to my network for Ansible, after trying to get started with Ansible for managing iPXE configuration in October. This post is about deploying HashiCorp Vault for secret management, beginning with the PKI certificates for Icinga monitoring.

As a recap, I am currently stalled part-way through migrating the monitoring Salt roles from my states tree to Ansible (nesting indicates where one role includes another). Completed ones are crossed through:

  • server
    • remotely-accessible (installs and configures fail2ban)
    • monitoring.client (installs and configures icinga2 client)
      • monitoring.common (installs monitoring-plugins and nagios-plugins-contrib packages, installs & configures munin client and nagios’ kernel & raid checks.)
  • monitoring.server (installs and configures icinga2 server, php support for nginx, icingaweb2 and munin server)
    • webserver (installs and configures nginx)
    • monitoring.common (see above)

Mirroring the HashiCorp repository

As I am working in an air-gapped home lab, my first task was to get the Vault packages into the lab’s mirror. I presumed that this would be as simple as adding it to my mirror playbook with my debmirror role. Unfortunately, HashiCorp’s Debian mirror is stored at the root of the mirror URL, https://apt.releases.hashicorp.com/, and not in a sub-directory (such as /debian). debmirror is unable to mirror such repositories due to not supporting an empty (or not specified) root directory. This has been reported as a bug to both Ubuntu and Debian. The Debian maintainer’s, entirely reasonable, stance is that because the official repositories are not structured this way, this is a feature request rather than a bug: “As a root directory is standard on official mirrors, I see the as a wishlist item.”

Rather than hacking the debmirror script installed by the package, I found apt-mirror, which does work with these repositories but has fewer options for selecting what is mirrored. I am unsure whether this results in a larger mirror than debmirror could produce but, if it does, I am sacrificing disk space for convenience (speed of moving forwards) in this case.

One other thing with apt-mirror is that it always creates a subdirectory named after the host being mirrored, apt.releases.hashicorp.com/ in this case - just something to be mindful of when using it.

I added an apt-mirror role to my mirroring playbook repository (fetch-key.yaml is identical to the debmirror role one - a horrible piece of duplication that could be solved by moving the key fetching to its own role):

---
- name: Install apt-mirror
  become: yes
  ansible.builtin.package:
    name: apt-mirror
    state: present
- name: Make target directory
  ansible.builtin.file:
    path: "{{ target.directory }}"
    state: directory
- name: Make keys directory
  ansible.builtin.file:
    path: "{{ target.keyring_directory }}/keys.d"
    state: directory
  when: source['keys'] is defined
- name: Download keys for keyring
  include_tasks: fetch-key.yaml
  loop: "{{ source['keys'] | default([]) }}"
  loop_control:
    loop_var: key
- name: Delete any extra keys
  ansible.builtin.file:
    path: "{{ item }}"
    state: absent
  with_fileglob: "{{ target.keyring_directory }}/keys.d/*.key"
  when: item.split('/')[-1].split('.')[0] not in source['keys'] | map(attribute='name') | list
- name: Delete old keyring
  ansible.builtin.file:
    path: "{{ target.keyring_directory }}/keyring.gpg"
    state: absent
- name: Make keyring
  ansible.builtin.shell: gpg --no-default-keyring --keyring {{ target.keyring_directory }}/keyring.gpg --import {{ item }}
  with_fileglob: "{{ target.keyring_directory }}/keys.d/*.key"
- name: Make configuration directory
  ansible.builtin.file:
    path: "{{ target.mirror_list | dirname }}"
    state: directory
- name: Make temporary directory
  ansible.builtin.tempfile:
    state: directory
  register: apt_mirror_tempdir
- name: Make configuration file
  ansible.builtin.template:
    dest: "{{ target.mirror_list }}"
    src: mirror.list.j2
- name: Run apt-mirror
  ansible.builtin.command:
    argv:
      - apt-mirror
      - "{{ target.mirror_list }}"
- name: Check download path exists for any additional files
  ansible.builtin.file:
    path: "{{ target.directory }}/{{ source.url | urlsplit('hostname') }}/{{ item | dirname }}"
    state: directory
  loop: "{{ selectors.additional_files | default([]) }}"
- name: Download any additional files from the mirror
  ansible.builtin.get_url:
    url: "{{ source.url }}/{{ item }}"
    dest: "{{ target.directory }}/{{ source.url | urlsplit('hostname') }}/{{ item }}"
  loop: "{{ selectors.additional_files | default([]) }}"
- name: Remove temporary files
  ansible.builtin.file:
    path: "{{ apt_mirror_tempdir.path }}"
    state: absent
...

The mirror list template looks like this:

set base_path "{{ apt_mirror_tempdir.path }}"
set mirror_path "{{ target.directory }}"

{% for suite in selectors.suites %}
deb {{ source.url }} {{ suite }} {{ ' '.join(selectors.components) }}
{%   for architecture in selectors.additional_architectures | default([]) %}
deb-{{ architecture }} {{ source.url }} {{ suite }} {{ ' '.join(selectors.components) }}
{%   endfor %}
{% endfor %}

clean {{ source.url }}

and the argument specification follows the common pattern I adopted:

---
argument_specs:
  main:
    short_description: Main entry point for mirroring a repository with apt-mirror
    options:
      target:
        description: Locations to download to
        type: dict
        required: true
        options:
          directory:
            type: str
            required: true
            description: Directory to mirror to
          keyring_directory:
            type: str
            required: true
            description: Directory to download keys and store keyring in.
          mirror_list:
            type: str
            required: true
            description: Location to generate the config file for apt-mirror.
      source:
        description: Where to mirror from
        type: dict
        options:
          url:
            type: str
            default: ftp.debian.org
            description: Hostname to mirror from
          keys:
            type: list
            elements: dict
            options:
              name:
                type: str
                required: true
                description: Name of key (will be used for download filename)
              url:
                type: str
                required: true
                description: Where to fetch the key from
      selectors:
        type: dict
        required: yes
        options:
          suites:
            type: list
            elements: str
            required: true
            description: The list of suites to mirror
          components:
            type: list
            elements: str
            default: ['main']
            description: The list of components to mirror
          additional_architectures:
            type: list
            elements: str
            default: []
            description: The list of architectures to mirror
          additional_files:
            type: list
            elements: str
            default: []
            description: List of additional files (relative to source -> url) to download to the mirror
...

To do the mirror, I added the apt and rpm repositories to the list of repositories to mirror (note I downloaded the gpg key as well):

# HashiCorp
- type: apt-mirror
  target:
    # apt-mirror automatically creates per-host sub-directories
    directory: "{{ mirror_base_path }}/apt-mirror"
    keyring_directory: "{{ mirror_base_path }}/keyrings/hashicorp"
    mirror_list: "{{ mirror_base_path }}/apt-mirror/hashicorp-mirror.list"
  source:
    url: https://apt.releases.hashicorp.com
    keys:
    - name: hashicorp-archive-keyring
      url: https://apt.releases.hashicorp.com/gpg
  selectors:
    additional_architectures:
    - src
    suites:
    - bullseye
    components:
    - main
    additional_files:
    - gpg
- type: reposync
  target:
    directory: "{{ mirror_base_path }}"
    yum_conf: "{{ mirror_base_path }}/yum-configs/hashicorp.yum.conf"
  source:
    repos:
# EL 7 repo broken 2023-01-23 - see https://discuss.hashicorp.com/t/404-error-from-rhel-repo/14427/9
#        - name: hashicorp-el7
#          description: HashiCorp-el7
#          baseurl: https://rpm.releases.hashicorp.com/RHEL/7/x86_64/stable
#          gpgkey: https://rpm.releases.hashicorp.com/gpg
    - name: hashicorp-el8
      description: HashiCorp-el8
      baseurl: https://rpm.releases.hashicorp.com/RHEL/8/x86_64/stable
      gpgkey: https://rpm.releases.hashicorp.com/gpg

At some point I want to either submit a patch to debmirror to fix this itch, or migrate all of my debmirrored repositories to apt-mirror and only maintain one method of mirroring these repositories.

Installing Vault

Once the repository was finally available, I could install Vault. The next question I faced was “where?”. Ideally, like the Bastion host, this would be a self-contained physical host doing nothing else; however, as I previously described, I don’t have the luxury of spare physical systems (or disposable cash to buy more at the moment). The three systems I considered were my router, the monitoring system and the VM host (I do not want to run this in a VM as it will be critical to the infrastructure, so having minimal parts that could break is desirable). Like the Bastion host, I concluded the monitoring box was the “least bad”: it is the easiest to harden and, although the router is also hardened as the boundary device, putting the Vault on a non-edge system immediately adds an extra layer that has to be broken through to compromise the vault from outside. I may revisit the decision not to put it in a VM when I have had a chance to fully think through and test a DR plan for Ansible with Vault - ultimately I am thinking that I may migrate to short-lived certificates for SSH access using Vault, and I would not want to create a situation where I could not log in to the VM (which might have to be via the virtual console, accessed through the OS of the physical host the VM is running on) if something broke.

First I added the repository (these commands are taken more-or-less directly from HashiCorp’s installation tutorial, modified to use /usr/local/share/keyrings instead of spaffing over /usr/share/keyrings, and to use the local mirror - I actually used Ansible to do this, by writing an apt-repository role that performs these same steps):

mkdir -p /usr/local/share/keyrings
wget -O- http://mirror/mirrors/apt.releases.hashicorp.com/gpg | gpg --dearmor >/usr/local/share/keyrings/hashicorp-archive-keyring.gpg
gpg --no-default-keyring --keyring /usr/local/share/keyrings/hashicorp-archive-keyring.gpg --fingerprint
# Check the fingerprint matches 798A EC65 4E5C 1542 8C8E 42EE AA16 FCBC A621 E701
echo "deb [signed-by=/usr/local/share/keyrings/hashicorp-archive-keyring.gpg] http://mirror/mirrors/apt.releases.hashicorp.com $(lsb_release -cs) main" >/etc/apt/sources.list.d/hashicorp.list

Then Vault can be installed (again, I did this in Ansible):

apt update
apt install vault

Configure Vault

Vault is configured in /etc/vault.d/vault.hcl. I changed the default one to look like this:

storage "raft" {
    path = "/opt/vault/data"
    node_id = "node_hostname"
}

listener "tcp" {
    address = "0.0.0.0:8200"
    tls_cert_file = "/opt/vault/tls/tls.crt"
    tls_key_file = "/opt/vault/tls/tls.key"
}

# Address to advertise to other vault servers for client
# redirection.
# I presumed this needs to resolve to this node specifically,
# rather than a generic alias for the cluster (e.g. "vault.fqdn")
api_addr = "https://node_host_fqdn:8200"
# For now, use loopback address as not clustering
cluster_addr = "http://127.0.0.1:8201"

SSL certificates

Vault needs an SSL certificate for its listener (“In production, Vault should always use TLS to provide secure communication between clients and the Vault server. It requires a certificate file and key file on each Vault host.”). In my production environment, I use Let’s Encrypt for most of my PKI (notable exceptions are closed systems, such a Icinga and OpenVPN) but this requires internet access.

In the long term, my plan is to create an Automatic Certificate Management Environment (ACME) server inside my lab (to enable testing with an ACME system, comparable to how the live system works) - perhaps using step-ca or boulder. In the short term, I added vault as a subject alternative name to the Let’s Encrypt certificate for the live system and copied those certificates into the lab environment, adding the live hostname to /etc/hosts as a stop-gap.

I also added a new dehydrated code-rack hook to allow the dehydrated user to update the Vault certificates in /opt/vault/tls (as the vault user, which owns the certificates).

The hook script is:

#!/bin/bash

SRCDIR=/var/lib/dehydrated/certs
TGTDIR=/opt/vault/tls

set -e
echo "This script ($0) will abort on first error." >&2

cat "$SRCDIR/$DOMAIN/fullchain.pem" | sudo -u vault /usr/bin/tee "$TGTDIR/tls.crt" >/dev/null
cat "$SRCDIR/$DOMAIN/privkey.pem" | sudo -u vault /usr/bin/tee "$TGTDIR/tls.key" >/dev/null

The Sudo permissions required are:

dehydrated hostname=(vault) NOPASSWD: /usr/bin/tee /opt/vault/tls/tls.crt, /usr/bin/tee /opt/vault/tls/tls.key

Firewall

On my most recently installed systems, I have firewalld running instead of using vanilla iptables (or nftables).

To enable access to vault, I needed to add a new service and allow it:

firewall-cmd --permanent --new-service=vault
firewall-cmd --permanent --service=vault --set-description="HashiCorp Vault"
firewall-cmd --permanent --service=vault --set-short="HashiCorp Vault"
firewall-cmd --permanent --service=vault --add-port=8200/tcp
# This just adds it to the default zone - I will want to narrow this scope down in production
firewall-cmd --permanent --add-service=vault
# Reload to add changes to running config
firewall-cmd --reload

To do this with Ansible, I pushed out a service XML definition to /etc/firewalld/services/vault.xml and used the, rather limited, ansible.posix.firewalld module to add the service to the zone.

Initialise the Vault

Once vault is configured and the SSL certificates in place, I started the server:

systemctl start vault

For ease, I exported the VAULT_ADDR environment variable (otherwise every command needs prefixing with it):

export VAULT_ADDR=https://vault.fqdn:8200

and ran the initialisation command:

vault operator init

This prints out the unseal keys and an initial root token - these will be required to unlock the vault in future so do not lose them!

The vault can then be “unsealed” (this command has to be run 3 times, by default, providing 3 of the 5 keys one at a time):

vault operator unseal

The initial root key can be used with the login command to do further work:

vault login

Exploring the Vault

We can see the enabled secret engines with vault secrets list:

$ vault secrets list
Path          Type         Accessor              Description
----          ----         --------              -----------
cubbyhole/    cubbyhole    cubbyhole_78189996    per-token private secret storage
identity/     identity     identity_ac07951e     identity store
sys/          system       system_adff0898       system endpoints used for control, policy and debugging

Secret engines and their mounts provide two functions:

  1. The engine provides a capability, for example key/value storage or interaction with external storage (e.g. cloud facilities).
  2. Isolation - each mount cannot, by design, access another mount. This means there is enforced isolation even between two mounts of the same engine at different paths (see the sketch below).
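
As a sketch of the second point, the kv engine (which I enable later) could be mounted at two separate paths - the paths here are hypothetical - and each mount’s data would be completely invisible to the other:

# Two independent mounts of the same (kv) engine - hypothetical paths
vault secrets enable -path=kv-lab -description="Lab secrets" kv
vault secrets enable -path=kv-live -description="Live secrets" kv
# A policy granting access to kv-lab/* gives no access to kv-live/*, and
# secrets written to one mount cannot be read via the other.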

And authentication methods with vault auth list:

$ vault auth list
Path       Type      Accessor                 Description                 Version
----       ----      --------                 -----------                 -------
token/     token     auth_token_68d761f00     token based credentials     n/a

Enabling a secrets engine

For now, I just want to store some certificates. Vault has an entire secrets engine for certificate management but for now I just want to keep it simple and store/retrieve my existing ones - in the future I absolutely should let Vault manage the certificates dynamically.

Vault also has two versions of the key/value engine: the first stores only one version of each value (and is more performant as a result, both for storage size and speed), while the second stores several versions and has soft-delete and undelete operations. Again, for simplicity, I went with version one.

It is enabled with the vault secrets enable command:

$ vault secrets enable -description="Key/Value secrets" kv
Success! Enabled the kv secrets engine at: kv/

Setting up policies

Access controls are centred around the idea of “policies”, which are a bit like access control lists, and are linked to credentials to grant access.

I am going to start with two policies:

  1. Administrator which can create, update, delete, list and view secrets inside the kv store.
  2. Ansible which can get secrets inside the kv store.

Once configured, this will cover most uses of the vault, so I will rely on generating a “root token” using the unseal key shards for any other access, although that should not be routinely needed.

The policies can be listed with:

vault policy list

There are two default policies, root and default. The policy itself can be seen with (in this example, for default):

vault policy read default

Policies are defined in HashiCorp Configuration Language (HCL) or JavaScript Object Notation (JSON). Policies are applied using the most specific match on the path - so more specific matches will override less specific (e.g. glob) matches, as shown in the documentation example. The capabilities are generally listed on the policies concepts page, although you also need to check each secrets engine for specifics (for example, not all engines distinguish between create and update). It would have been nice if the documentation for each engine contained a simple list of the capabilities it supports, for referring to when creating policies.
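
To illustrate the path-matching rule, here is a minimal sketch (the policy name and paths are hypothetical) where the more specific path overrides the glob:

vault policy write example - <<EOF
# The glob grants read-only access to everything under kv/...
path "kv/*" {
  capabilities = ["read"]
}
# ...but for paths under kv/ansible/ this more specific match applies
# instead, also allowing writes.
path "kv/ansible/*" {
  capabilities = ["create", "update", "read"]
}
EOF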

So my Administrator policy looks like this:

path "kv/*" {
  capabilities = ["create", "update", "read", "list", "delete"]
}

and my Ansible policy is:

path "kv/*" {
  capabilities = ["read"]
}

The policy can be created (or updated) by either putting it into a file and loading with:

vault policy write administrator administrator-policy.hcl

or providing via stdin, for example:

vault policy write ansible - <<EOF
path "kv/*" {
  capabilities = ["read"]
}
EOF

Setting up user and application authentication

The accessor for the root token, generated when we initialised the vault and used to authenticate for this setup, exists in the vault and can be seen with:

vault list /auth/token/accessors

The current token (if token login is being used), including its id (the key itself), can be found with:

vault token lookup

Information about the token associated with an accessor can be seen (although the id will be hidden) with:

vault token lookup -accessor some_accessor_value

In general we do not want to use tokens - HashiCorp themselves have a good blog post on this topic - however their model would have Ansible retrieve a wrapped SecretID which is then sent to the system being managed; that system unwraps it, logs in and then retrieves secrets directly. The difficulty with this is that it does not solve the challenge of how to authenticate Ansible in the first place. To make life simple, I will give Ansible a token for now and set up a username/password for my (admin) user.

Longer term, I suspect a token may remain the right way to authenticate Ansible itself, with AppRoles used to pull secrets on the managed clients, further restricting secret access to just those hosts that consume the secrets (c.f. the model of giving Ansible access to everything to push the secrets out). Using a token, compromise of the Ansible token potentially compromises all secrets in the kv store. I have slightly mitigated this by not permitting list, so prior knowledge of what secrets can be retrieved - which could be got from the Ansible playbook’s lookups - is also required. In the scenario where the Ansible credentials can only fetch wrapped secret ids and the controlled systems retrieve their secrets directly (with each host restricted to the secrets it needs to know), one might think this is more secure. However, Ansible can access each system, so compromise of Ansible as a whole (i.e. the vault token, the inventory and playbook(s) - presuming that is sufficient to access/configure all hosts) still compromises all secrets.

Creating the token can be done directly:

vault token create -display-name=ansible -policy=ansible -orphan

or by creating a role, which is like a template for tokens, then using that to create the token:

vault write auth/token/roles/ansible allowed_policies=ansible orphan=true
vault token create -display-name=ansible -role=ansible

N.B. Making it an orphan stops it getting deleted when the parent token, which would be the token used to create it (in this case the root token), is deleted.

For the user, I enabled the userpass authentication method:

vault auth enable userpass

Then created my user (after storing a suitable password in the file passfile, to avoid passing it on the command-line):

vault write auth/userpass/users/laurence policies=administrator password=@passfile

N.B. The password file must not have a trailing newline (\n), or it will be included as part of the password. To save the file without a newline with vim, first set the mode to binary then turn off eol. This will also remove the trailing newline from an already saved file.

:set binary
:set noeol
:wq

I saw tutorials online that show setting user passwords (and storing secrets) by passing these sensitive values on the command-line. I would strongly discourage this - by default on Linux any user can see the command line of any process (via /proc). This can be restricted by setting hidepid on the /proc mount (with 1, process directories are inaccessible to all but root and the owner; with 2, process directories are hidden completely from all except root and their owner) - however, unless the system has been hardened it is unlikely to be set (and it can break some software, particularly installers, that assumes all processes are visible to all users).
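
As an alternative to the vim approach, here is a sketch of creating passfile without a trailing newline and without the password appearing on any command line (read and printf are bash builtins, so no separate process is spawned with the password in its arguments):

# Prompt for the password without echoing it...
IFS= read -r -s -p 'Password: ' password; echo
# ...then write it with no trailing newline and tighten the permissions.
printf '%s' "$password" > passfile
chmod 0600 passfile
unset password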

Adding secrets to the vault

This can be done with (this example stores the contents of the ca.crt file in the certificate key at path icinga/certs/ca/certificate):

vault kv put -mount=kv icinga/certs/ca/certificate certificate=@ca.crt

Note that it is not possible to add keys to an existing path - any new put will replace all existing keys, so one could not store the certificate’s key and then add the signed certificate at the same path without re-supplying the key to the put command. This is with version 1; version 2 of the Key/Value secrets engine has a patch command which will let you update it. However, in this particular case I might want to apply different access rules to the key and the certificate, so this restriction actually encourages a sensible segregation.
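
For example (the exact paths here are just for illustration), with version 1 a certificate and key stored together at one path must be supplied in the same put, whereas storing them at separate paths - as I have done - lets them be updated, and access-controlled, independently:

# Version 1: a put replaces the whole secret, so every key must be supplied at once...
vault kv put -mount=kv icinga/certs/ca certificate=@ca.crt key=@ca.key
# ...or store them at separate paths so they can be managed independently.
vault kv put -mount=kv icinga/certs/ca/certificate certificate=@ca.crt
vault kv put -mount=kv icinga/certs/ca/key key=@ca.key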

The secrets (and keys) can be listed with the following command (which doesn’t seem to support the new -mount= style recommended with put and get):

vault kv list /kv

and retrieved with:

vault kv get -mount=kv icinga/certs/ca/certificate

or, to get a specific key:

vault kv get -mount=kv -field=certificate icinga/certs/ca/certificate

N.B. At some future point I will investigate using Vault’s PKI support to manage the certificate generation automatically, but one step at a time…

Mass import of existing certificates

Each host’s current Icinga certificate and key are in a SaltStack Pillar data file named after the host’s fully-qualified name with the dots (.) replaced by hyphens (-). For example, the certificate and key for somehost.my.domain.tld are in the file somehost-my-domain-tld.sls. As these Pillar files are plain YAML, I wrote a simple Python script to dump the certificates out to plain files, which I could then import directly with a small bash loop.

The format of each host’s certificate data is:

---
icinga2:
  host:
    certificate: |
      certificate_data
    key: |
      key_data
...

Where certificate_data and key_data are the actual certificate and key. Note that in this YAML host is literal (the actual word “host”), so the state just uses pillar.icinga2.host.certificate and pillar.icinga2.host.key with no need to interpolate the hostname. The correct pillar file is included using the {% include 'monitoring/icinga2/certificates/' + opts.id.replace('.', '-') + '.sls' ignore missing %} Jinja2 recipe - a pattern I have used generally in my pillar for host-specific values of general settings (like certificates, where the value is host-specific but the same key is used for all hosts).

The Python script to extract the certificate and key, writing the values to files named host.crt and host.key (where host is a placeholder for the actual fully-qualified host name), is this:

#!/usr/bin/env python3

from pathlib import Path

import yaml

for f in Path('.').glob('*.sls'):
  with open(f, 'r') as fh:
    certs = yaml.safe_load(fh)
  
  # Convert the '-' based filename back to the hosts name.
  # Of course this is not right if the original hostname has a hyphen.
  hostname = f.stem.replace('-', '.')

  with open(hostname + '.crt', 'w') as fh:
    fh.write(certs['icinga2']['host']['certificate'])
  
  with open(hostname + '.key', 'w') as fh:
    fh.write(certs['icinga2']['host']['key'])

  print(hostname, "done")

The bash loop to import them all into vault is then:

for file in *.crt
do
  base="$( basename "$file" .crt)"
  vault kv put -mount=kv icinga/certs/hosts/$base/certificate certificate=@$base.crt
  vault kv put -mount=kv icinga/certs/hosts/$base/key key=@$base.key
done

Generating and importing new certificates

In the lab, the live network’s certificates do not match the domain (which has lab. prefixed) so new certificates were required. I generated them using the method documented when I set up icinga2.

After manually creating and importing the CA, I scripted doing the clients (this presumes you are already authenticated to the vault):

#!/bin/bash

# Standard bash script safety:
# - abort on error
# - no uninitialised variable use
# - disable globbing
# - errors in a pipe cause the whole pipe to fail
set -eufo pipefail

# Default to empty if no argument is given, so the usage message below is
# shown rather than set -u aborting the script.
host="${1:-}"

if [[ -z $host ]]
then
  cat >&2 <<EOF
Usage: $0 hostname.domain.tld

hostname.domain.tld should be a fully qualified name (will be the subject of the certificate)
EOF
  exit 1
fi

sudo -u nagios icinga2 pki new-cert --cn $host --key $host.key --csr $host.csr
sudo -u nagios icinga2 pki sign-csr --csr $host.csr --cert $host.crt
# chown the files so the current user can import them to vault
sudo chown $USER $host.{key,csr,crt}
vault kv put -mount=kv /icinga/certs/hosts/$host/certificate certificate=@$host.crt
vault kv put -mount=kv /icinga/certs/hosts/$host/key key=@$host.key

echo "Generated and imported key and certificate for $host"
rm $host.{crt,key,csr}

The script can be used, e.g. for the current host: bash generate-import-vault-icinga-cert.bash $( hostname -f ).

Integration with Ansible

Ansible includes a Vault lookup plugin; however, it has been replaced by a new collection. I haven’t yet mirrored any collections into my air-gapped lab, so I stuck with the bundled plugin. It requires the hvac Python library to be installed, so I added that to my requirements.txt and installed it.
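
Adding it was a case of something like:

# hvac is needed by the hashi_vault lookup plugin
echo hvac >> requirements.txt
pip install -r requirements.txt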

It is used like any other lookup plugin (for example):

ansible.builtin.copy:
  #...
  content: "{{ lookup('hashi_vault', 'secret=kv/icinga/certs/ca/certificate:certificate') }}"

The plugin will use environment variables, such as VAULT_ADDR, so no specific configuration inside the Ansible playbooks is needed. As long as the vault address is set in the environment and authentication has been done with vault login, the lookups will just work.
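
So a typical session looks something like this (the inventory and playbook names are just examples):

export VAULT_ADDR=https://vault.fqdn:8200
# Authenticate (the token is cached in ~/.vault-token, which the lookup
# plugin will also use)
vault login -method=userpass username=laurence
# Lookups in the playbook now resolve against the vault
ansible-playbook -i inventory.yaml site.yaml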

Debugging vault

To debug vault, I turned on vault’s file audit device:

# Make a directory the `vault` user can write to
mkdir /var/log/vault
chown vault /var/log/vault
vault audit enable file file_path=/var/log/vault/audit.log

It is quite verbose, so I turned it off again once I had it working:

vault audit disable file

Backing up

Obviously backing up the Vault is very important. Built-in automated backups are an Enterprise version feature, so I need to do it manually.

To do this, I first created a new policy (which I called backup) and token for backups based on a recipe I found online:

path "sys/storage/raft/snapshot" {
  capabilities = ["read"]
}

I then created the token - as before, I considered using an AppRole but it does not really gain anything over a token at this stage (there would be a static role and secret id to manage instead), as this is for a periodic cron job which will need to authenticate regularly, rather than being provided with a time-limited secret on startup and not needing it again:

vault write auth/token/roles/backup allowed_policies=backup orphan=true
vault token create -display-name=backup -role=backup

In the longer term, I think I will want to deploy an AppRole, with the role id deployed e.g. on install or manually (although Ansible could alert to its absence) and a wrapped secret id pushed out periodically by Ansible - but this requires having everything in place for Ansible to automatically update the wrapped secret in time for the backup to take place. What I don’t want to do is allow the backup job to use its own secret to refresh itself, because then compromise of the backup secret allows infinite regeneration of new backup tokens; requiring a supervisor (e.g. Ansible) to generate a secret that (for example, per the recipe above) grants a ticket which can only be used twice (once to login, once to perform one act - such as dumping the backup) significantly limits the damage that can be done with a compromised secret.
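
For example, a supervisor could mint a short-lived, use-limited token for each run rather than a long-lived static one - a sketch of the idea, not something I have deployed:

# A token that can only be used twice (login + one snapshot request) and
# expires after a day, created by the supervisor (e.g. Ansible), not by the
# backup job itself
vault token create -display-name=backup -policy=backup -use-limit=2 -ttl=24h -orphan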

Anyway, for now I used a magic token, which I placed in a file owned, and only readable, by a new user called vault-backup (least privilege: only the script doing the backup needs to know the token, but it also does not need to be root, so a new unprivileged user for the purpose seems the most secure option).

Backup script

I put settings into /etc/vault-backup.conf, which included VAULT_ADDR and VAULT_TOKEN, and made sure the file was owned and only readable by the vault-backup user:

VAULT_ADDR=https://vault.fqdn:8200
VAULT_TOKEN=12345
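
Creating the user and locking the configuration file down looked roughly like this (the exact useradd options are my assumption):

# Unprivileged system user for the backup job
useradd --system --shell /usr/sbin/nologin vault-backup
# Only vault-backup can read the token
chown vault-backup:vault-backup /etc/vault-backup.conf
chmod 0400 /etc/vault-backup.conf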

The script I placed in /usr/local/sbin (called vault-backup):

#!/bin/bash

# Standard bash script safety
set -eufo pipefail

if [[ $UID -eq 0 ]]
then
  echo "This script should be run as the vault-backup user, not root." >&2
  exit 1
fi

# Default settings, which may be overridden in /etc/vault-backup.conf
BACKUP_DIR=/srv/backups/vault  # Where to store the backups?
KEEP_BACKUPS=7  # How many backups to keep?

# Get settings (which should include VAULT_ADDR and VAULT_TOKEN)
source /etc/vault-backup.conf

# Export any 'VAULT_'y variables (variables whose names start
# VAULT_)
export ${!VAULT_*}

target_file="${BACKUP_DIR}/backup-$( date +%FT%H%M ).snap"
echo "Taking vault snapshot, saving to ${target_file}."
vault operator raft snapshot save "${target_file}"
ln -fs "${target_file}" "${BACKUP_DIR}/backup-latest.snap"

# Check to see if there are more than KEEP_BACKUPS backups
set +f  # Enable globbing
backup_files=( "${BACKUP_DIR}"/backup-[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9][0-9][0-9].snap )
set -f  # Done globbing
backup_count="${#backup_files[@]}"
echo "There are now ${backup_count} backups."

if [[ $backup_count -gt $KEEP_BACKUPS ]]
then
  echo "There are more than ${KEEP_BACKUPS} (${backup_count}) backups. Removing some:"
  # As globbing is ascii-betical (per bash man page on Pathname
  # Expansion) the files will be in ascending age order (due to the
  # format of the date and time in filename) so can just remove
  # those elements before the ones we want to keep in the array.
  for idx in ${!backup_files[@]}
  do
    if [[ $idx -lt $(( backup_count - KEEP_BACKUPS )) ]]
    then
      echo "Removing ${backup_files[idx]}..."
      rm "${backup_files[idx]}"
    fi
  done
  echo "Finished removing old backups."
fi

I then created a cronjob (in /etc/cron.d/vault-backup), using my cron-wrapper to only email output if it fails, to run the backup daily at 2am:

0 2 * * * vault-backup /usr/local/bin/cron-wrapper /usr/local/sbin/vault-backup

Monitoring backup

Finally, I added a check to Icinga that verifies the backup has been updated in the last 24 hours. This involved adding a service check by creating services-hashicorp-vault.conf:

apply Service "check-vault-backup" {
  import "generic-service"

  check_command = "file_age"
  command_endpoint = host.name // Execute on client

  vars.file_age_file = "/srv/backups/vault/backup-latest.snap"
  vars.file_age_warning_time = 86400 // 1 day (24 hours, 86,400s)
  vars.file_age_critical_time = 129600 // 1.5 days (36 hours, 129,600s)

  assign where host.vars.services && "hashicorp-vault" in host.vars.services
}

Then added hashicorp-vault to the list of services on the relevant host:

object Host "xxxxxx.domain.tld" {
  vars.services = [
    "hashicorp-vault"
  ]
}

To finish this off, I changed the ownership of the /srv/backups/vault directory to vault-backup:nagios with mode 0750 - so the nagios user can read the state of the files but not modify (e.g. delete) them. The backup files themselves are owned by vault-backup:vault-backup, so this does not grant members of the nagios group access to the contents of the backups.
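
In command form, that is roughly:

# The nagios group can list and stat the backups (0750) but not modify them
chown vault-backup:nagios /srv/backups/vault
chmod 0750 /srv/backups/vault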

Revoking the initial root token

Finally, I revoked the initial root token:

vault token revoke -self

If a new root token is required in the future, one can be generated using the unseal keys and the operator command. This process must be started with the -init option, which will print a nonce and a one-time password (OTP):

vault operator generate-root -init

The nonce value will be required to continue the process with at least 2 other unseal keys (the command will prompt for the nonce and then an unseal key - the nonce can be auto-completed):

vault operator generate-root

The final user will be given the encoded token, which will need to be passed to the -decode option with the OTP to display the actual token:

vault operator generate-root -decode=encoded_token -otp=OTP_from_init

Once Vault was up and running, I carried on with finishing migrating monitoring configuration to Ansible.