About a month and a half after setting up a proper secrets store with HashiCorp Vault, I am able to pick this work up again, crossing off a number of roles migrated from SaltStack in the process. Unlike my Salt configuration, which had secrets in the pillar data, no secrets are stored in the code. With the Vault infrastructure set up and linked to Ansible, I can continue with finishing this migration task.

As a recap, I am currently migrating my monitoring server’s Salt roles from my states tree (nesting indicates where one role includes another) to Ansible. Completed ones are crossed through:

  • server
    • remotely-accessible (installs and configures fail2ban)
    • monitoring.client (installs and configures icinga2 client)
      • monitoring.common (installs monitoring-plugins and nagios-plugins-contrib packages, installs & configures munin client and nagios’ kernel & raid checks.)
  • monitoring.server (installs and configures icinga2 server, php support for nginx, icingaweb2 and munin server)
    • webserver (installs and configures nginx)
    • monitoring.common (see above)


The key difference between the client and server systems for monitoring is that the servers are all exposed via a webserver for human monitoring (no single pane of glass, yet). The first thing that I needed to do was to set up a webserver. As before, this should be a fairly straightforward duplication of my SaltStack roles.

I started by adding a webserver role. In Salt, the sites required are added to an accumulator and applied in the webserver state. In my Ansible role, I created tasks/main.yaml to install nginx (my webserver of choice, although I am trying to design the webserver role’s arguments such that any webserver could be configured with it) and do the baseline configuration, plus a separate entry point called add_site.yaml which adds a site (creating the configuration file, from arguments given to it, and the ‘sites-enabled’ symlink to it to make it active). A handler restarts nginx, so add_site can notify it and nginx will be restarted once for all sites that are added.
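As a rough sketch of what such an add_site entry point could look like (the argument names site_name and site_configuration here are illustrative assumptions, not the role’s actual interface):

```yaml
# tasks/add_site.yaml - illustrative sketch only; the argument names
# (site_name, site_configuration) are assumptions.
- name: Create site configuration file
  become: true
  ansible.builtin.copy:
    dest: /etc/nginx/sites-available/{{ site_name }}
    owner: root
    group: root
    mode: "0644"
    content: "{{ site_configuration }}"

- name: Enable site by creating the sites-enabled symlink
  become: true
  ansible.builtin.file:
    src: /etc/nginx/sites-available/{{ site_name }}
    dest: /etc/nginx/sites-enabled/{{ site_name }}
    state: link
  notify: Restart nginx  # handler fires once, however many sites were added
```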

I used my firewalld configuration from installing Vault as a template for allowing http and https if firewalld is installed.
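A sketch of what that conditional firewalld task might look like (hedged: the task shape follows the ansible.posix.firewalld module, not the author’s actual role):

```yaml
# Sketch: open http/https only when firewalld is present.
- name: Allow http and https through firewalld, if installed
  become: true
  ansible.posix.firewalld:
    service: "{{ item }}"
    permanent: true
    immediate: true
    state: enabled
  loop:
    - http
    - https
  when: "'firewalld' in ansible_facts.packages"
```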

SSL certificates with Vault

To deploy the certificates for TLS/SSL, I imported them into HashiCorp Vault using the structure ssl/certs/hosts/<hostname>/{certificate,ca,key}, where <hostname> is the fully-qualified hostname and ca contains the CA bundle (used to generate both a CA bundle and a full-chain file). These are then pushed out by Ansible. The ca entry contains the CA bundle under a key named bundle; the others (certificate and key) contain a key matching the entry’s name.
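Importing one host’s certificate material into that structure can be done with the vault CLI along these lines (a sketch - the file names are illustrative assumptions):

```shell
# Sketch: import one host's certificate material into the
# ssl/certs/hosts/<hostname> structure (file names are illustrative).
HOST=$(hostname -f)
vault kv put "ssl/certs/hosts/${HOST}" \
    certificate=@"${HOST}.pem" \
    ca=@ca-bundle.pem \
    key=@"${HOST}.key"
```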

In Ansible, I created this data structure for the host certificate at the group level:

ssl_host_certificate:
  certificate: <certificate>
  ca_bundle: <ca_bundle>
  key: <key>
ssl_host_certificate_files:
  certificate: <path-to-store-certificate>
  ca_bundle: <path-to-store-ca-bundle>
  full_chain: <path-to-store-full-chain>
  key: <path-to-store-key>
ssl_dhparm_4096_file: <path-to-store-dhparam>

This abstracts the lookup away from the role (so the role pushing out the certificates is agnostic to where the certificate comes from, Vault in this case) and the files, which need to be known by multiple other roles for their own configuration, are configured in one place.

I originally made ssl_host_certificate_files a files key within ssl_host_certificate but it occurred to me that these paths might not be consistent across Linux distributions, and certainly not across operating systems, so I separated them. I do prefer SaltStack’s behaviour of merging, rather than overwriting, data dictionaries for cases like this.

To only attempt a lookup on hosts with certificates in the vault, I created a new group ssl_host_certificate in my Ansible inventory and applied the above variables (with vault lookup) to that group via a group_vars file and the fqdn fact, in the lookup, to pull out the host-specific certificate.
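The group_vars file for that group might look something like this (a sketch: the secret path matches the structure described above and uses the fqdn fact, but the exact lookup arguments are assumptions):

```yaml
# group_vars/ssl_host_certificate.yaml - sketch, not the actual file
ssl_host_certificate:
  certificate: "{{ lookup('community.hashi_vault.vault_read',
    'ssl/certs/hosts/' + ansible_facts.fqdn + '/certificate').data.certificate }}"
  ca_bundle: "{{ lookup('community.hashi_vault.vault_read',
    'ssl/certs/hosts/' + ansible_facts.fqdn + '/ca').data.bundle }}"
  key: "{{ lookup('community.hashi_vault.vault_read',
    'ssl/certs/hosts/' + ansible_facts.fqdn + '/key').data.key }}"
```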

I used the existing ssl-cert group to give non-root software that needs to use the host’s certificate access to the key, while managing it in one place. This theoretically allows software to purport to be any hostname in the certificate’s SAN list, but I have bigger problems if the software I’m giving access to the certificates is untrustworthy or leaks the private key. This can be mitigated by giving each piece of software its own certificate, but that creates additional overhead to manage those certificates, including more places where permissions could be incorrectly set and expose the keys. In short: pick your poison. I chose to do it this way, but there are different approaches with different risks.

The ssl-cert group needs “execute” permissions in the /etc/ssl/private directory (allowing traversal, that is access to the contents, but not read, which allows listing the contents). In effect, this means members of that group will be able to access a key file if they know (or can guess) its name - as the saying goes, “security through obscurity is not security” but it creates a small additional hurdle to compromising the key. Within the directory, the group needs read access on the key(s) that software running as users within that group should be able to access.
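The scheme can be demonstrated in a scratch directory (a stand-in for /etc/ssl/private; on the real host these commands run as root and the group owner is ssl-cert):

```shell
# Demonstrate "execute but not read" directory permissions in a
# scratch directory standing in for /etc/ssl/private.
set -e
scratch=$(mktemp -d)
mkdir "$scratch/private"
chmod 710 "$scratch/private"              # group: traverse (x) but no listing (r)
touch "$scratch/private/host.key"
chmod 640 "$scratch/private/host.key"     # group: read this specific key
stat -c '%a' "$scratch/private"           # prints 710
stat -c '%a' "$scratch/private/host.key"  # prints 640
```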

This allows me to go back to the SSL certificate for Vault that I manually deployed and point Vault at the server’s copy (they are the exact same certificate, vault.… is just an alternate name in the SAN) after adding the vault user to the ssl-cert group. This does, however, create a chicken-and-egg problem for disaster recovery [DR], as the certificates for Vault are now pushed out, by Ansible, from Vault. I will need to deal with this in my new DR plan - my current thinking is a specific DR Ansible playbook to bootstrap Vault sufficiently for it to be recovered from backup, after which Ansible can be rerun with that data. Update 2023-05-11: while still working on this blog post, the certificate I copied from the live network to my air-gapped lab expired - another scenario in which I became locked out of Vault (because the certificate had expired) and so could not use Vault to deploy the new certificate (or even import it) until I manually copied the new certificate to the Vault server and restarted the service. I also had to regenerate my ansible and backup Vault tokens because they had expired (maximum TTL is 32 days by default in Vault).

SSL certificates in nginx

Nginx’s master process runs as root, so the nginx user does not need adding to the ssl-cert group. Doing so would allow the unprivileged worker processes, potentially running server-side scripts, access to the certificate’s key which would be bad from a security perspective.

Adding all of the services as SANs to a top-level (host) certificate allows nginx to be more efficient, per this excerpt from the documentation: “It is better to place a certificate file with several names and its private key file at the http level of configuration to inherit their single memory copy in all servers”.

This is what I did, simply specifying ssl_dhparam, ssl_certificate and ssl_certificate_key in the file /etc/nginx/conf.d/ssl-cert.conf. I added arguments for the ssl files (dhparam, certificate and certificate key) to my webserver Ansible role and had it create the file, using those values, if they are defined.
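The resulting file is tiny - something like this (the paths are illustrative placeholders for the role’s arguments):

```nginx
# /etc/nginx/conf.d/ssl-cert.conf - sketch with placeholder paths
ssl_certificate     /etc/ssl/certs/host-full-chain.pem;
ssl_certificate_key /etc/ssl/private/host.key;
ssl_dhparam         /etc/ssl/dhparam-4096.pem;
```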

Monitoring server

Additional icinga2 configuration

In addition to the common configuration, applied to all servers, some extra setup is required on the master server.

The zones configuration needs cloning from git into /etc/icinga2/zones.d. In order to avoid “dubious ownership” errors from git, the directory first needs its owner changing from nagios to root (the nagios user deliberately cannot change the files in this directory), with the group owner remaining nagios, and its mode changing to 750. The configuration is then cloned, as root, into it.
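Expressed as Ansible tasks, this is roughly the following (the repository variable name is hypothetical):

```yaml
- name: Ensure zones.d is root-owned so git does not report dubious ownership
  become: true
  ansible.builtin.file:
    path: /etc/icinga2/zones.d
    state: directory
    owner: root
    group: nagios
    mode: "0750"

- name: Clone zones configuration
  become: true
  ansible.builtin.git:
    repo: "{{ icinga2_zones_repository }}"  # hypothetical variable name
    dest: /etc/icinga2/zones.d
```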

I added a constant called InternalDomain to /etc/icinga2/constants.conf and altered my “master” zone’s hosts.conf to use this constant instead of hardcoded fully-qualified names. This is so that my configuration will work (hopefully) unmodified in my live network and the lab (which I gave a different sub-domain).
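For illustration, the constant and its use look something like this (the domain value and host name are made up):

```
/* /etc/icinga2/constants.conf - illustrative value */
const InternalDomain = "internal.example.org"

/* zones.d/master/hosts.conf - using the constant instead of a
   hardcoded fully-qualified name */
object Host "monitor" {
  import "generic-host"
  address = "monitor." + InternalDomain
}
```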

PostgreSQL needed installing, along with the icinga2-ido-pgsql package (on Debian), and the ido-pgsql icinga feature needed enabling.

The icinga CA certificate and key were copied to /var/lib/icinga2/ca.

My recently updated NextCloud version check plugin also needed to be installed on the server.

Icinga Web

Additional webserver setup

In addition to nginx, I need to setup php (for Icinga Web 2). I started by adding a php entry point (tasks/php.yaml and associated entry in meta/argument_specs.yaml) to add php support to the webserver (trying to keep the webserver role as agnostic as possible to the actual server being used, i.e. nginx could easily be replaced with lighttpd or apache).

For nginx, this meant installing php-fpm and configuring an upstream group to proxy to it, which I called php-handler. I also configured the values of upload_max_filesize, post_max_size, memory_limit and date.timezone in php-fpm’s php.ini (which is in /etc/php/<php version number>/fpm/php.ini on Debian). I did this now, although it is predominantly for other hosts that are hosting php-based email or sync-and-share software.

The only slightly complicated bit was determining the version of php-fpm, in order to control the service (which has the version in its name) and to locate the relevant php.ini. I did this by using Ansible’s [ansible.builtin.package_facts](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/package_facts_module.html) module, checking there is only one php-fpm version installed (on Debian, this is a meta-package depending on the appropriate versioned package), then parsing the version, stripping off the Debian epoch and everything after the plus in the upstream_version (there is no debian_revision on my test system). Although, as you will see from the comment, I did later wonder if I should just retrieve the version number from the target of /etc/alternatives/php-fpm.sock instead:

    # Or should this be determined from the target of
    # /etc/alternatives/php-fpm.sock (which /run/php/php-fpm.sock is a
    # symlink to)?
    - name: Find the php-fpm version (required for service name and configuration path)
      ansible.builtin.package_facts:
    - name: Ensure one unique version of php-fpm was found
      ansible.builtin.assert:
        that: ansible_facts.packages['php-fpm'] | length == 1
        fail_msg: More than one package version of php-fpm is installed
    - name: Extract version information from installed package
      ansible.builtin.set_fact:
        # Version of the form (e.g.) "2:7.4+76" - extract the
        # major.minor (discard epoch and +update, if present).
        php_fpm_version: >-
            {{ ansible_facts.packages['php-fpm'][0].version
            | regex_search('^(?:[0-9]+:)?([0-9\.]+)', '\1')
            | first }}

This fact I then used for the service and configuring PHP, e.g.:

    - name: Ensure service is enabled and running
      become: true
      ansible.builtin.service:
        name: php{{ php_fpm_version }}-fpm
        enabled: true
        state: started
    - name: Configure maximum upload size
      become: true
      ansible.builtin.replace:
        path: /etc/php/{{ php_fpm_version }}/fpm/php.ini
        regexp: ^;?upload_max_filesize = .*$
        replace: upload_max_filesize = {{ php_max_upload_size }}
      notify: Restart php-fpm

Installing Icinga Web

I just used the Debian packages to install icingaweb2, icingacli (to configure its modules), php-pgsql (to connect to the icinga database) and php-curl (to connect to icinga’s API).

Configuring Icinga Web monitoring module

Firstly the monitoring module needs enabling with icingacli module enable monitoring. In my Salt state, I grepped the output of icingacli module list for ^monitoring but this time I decided to use the existence of /etc/icingaweb2/enabledModules/monitoring to determine if it was already enabled or not.
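A sketch of that check, using ansible.builtin.stat so icingacli is only run when the enabledModules link is absent (the register name is hypothetical):

```yaml
- name: Check whether the monitoring module is already enabled
  ansible.builtin.stat:
    path: /etc/icingaweb2/enabledModules/monitoring
  register: icingaweb2_monitoring_module  # hypothetical register name

- name: Enable the monitoring module
  become: true
  ansible.builtin.command: icingacli module enable monitoring
  when: not icingaweb2_monitoring_module.stat.exists
```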

Some configuration files need to be added in /etc/icingaweb2/modules/monitoring too (so that directory needs creating, if it does not already exist). The files need to be readable by the icingaweb2 group, so I made them root:icingaweb2 owned with mode 440 (on directories I set the SGID bit, mode 2750, so the contents will default to the same group ownership).

The configuration files needed are:

  • config.ini - set security settings (protected variables, whose values are not displayed)

      [security]
      protected_customvars = "*pw*,*pass*,community"
  • backends.ini - connects Icinga Web to the Icinga database

      [icinga2]
      type = "ido"
      resource = "icinga2-db"
  • commandtransports.ini - connects Icinga Web to Icinga’s API

      [icinga2]
      transport = "api"
      host = ""
      port = "5665"
      username = "icingaweb2"
      password = "{{ icingaweb2_api_password }}"

icingaweb2_api_password I added as an argument to the role and populated it (outside the role) via a HashiCorp Vault lookup (I actually set it as a group variable on the monitoring_servers group in my Ansible setup). I used a quick and dirty hack to generate the password (yes, I really am still working on this post, started on 12 March, on 22nd May). For simplicity’s sake I excluded " from the password, as it will be quoted in the ini file, and \ (which has to be escaped twice, once for bash and a second time for tr) as that caused problems in the Icinga configuration file:

umask 077 # Ensure only we can read the file
# Using a temporary file so the password never appears on the command line (which anyone can see)
tr -dc '[:print:]' < /dev/urandom | tr -d '"\\\\' | head -c64 > /tmp/icingawebpass
vault kv put -mount=kv /icinga/api/users/icingaweb2 password=@/tmp/icingawebpass
rm /tmp/icingawebpass

To separate the password (now environment specific) from the configuration (which also keeps it out of revision control), I added a new constant like I did earlier for the internal domain called Icingaweb2ApiPassword to the Icinga contants.conf configuration (keeping it outside of the Icinga Web tasks, so that Icinga Web could be deployed on a separate host to Icinga itself, easily, in the future) and updated master/api-users.conf in my zones.d to use this constant instead of having a hardcoded password for the icingaweb2 user (with a comment saying where the constant is defined).
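The result looks roughly like this (the permissions list for the ApiUser is an assumption):

```
/* /etc/icinga2/constants.conf */
/* Icinga Web's API password - deployed by Ansible from Vault */
const Icingaweb2ApiPassword = "..."

/* zones.d/master/api-users.conf */
object ApiUser "icingaweb2" {
  /* Password comes from Icingaweb2ApiPassword in constants.conf */
  password = Icingaweb2ApiPassword
  permissions = [ "status/query", "actions/*", "objects/query/*" ]
}
```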

These files I created by pushing them out with ansible.builtin.copy. The ini-file-specific module, community.general.ini_file, only modifies individual settings in an ini file; its stated purpose is to “Manage (add, remove, change) individual settings in an INI-style file without having to manage the file as a whole with, say, ansible.builtin.template or ansible.builtin.assemble.”, which implies that whole ini files should be managed with other modules. To reduce duplication, I put the ini file content into a data structure alongside the filename and looped over each file. This could be further abstracted into a generic ini-file generator:

- name: Monitoring config
  become: true
  ansible.builtin.copy:
    mode: 00640
    owner: root
    group: icingaweb2
    dest: /etc/icingaweb2/modules/monitoring/{{ item.filename }}.ini
    content: |
      {% for (section, data) in item['values'].items() %}
      [{{ section }}]
      {%    for (key, value) in data.items() -%}
      {{ key }} = "{{ value }}"
      {%    endfor +%}
      {% endfor %}
  loop:
    - filename: config
      values:
        security:
          protected_customvars: "*pw*,*pass*,community"
    - filename: backends
      values:
        icinga2:
          type: ido
          resource: icinga2-db
    - filename: commandtransports
      values:
        icinga2:
          transport: api
          host: ""
          port: 5665
          username: icingaweb2
          password: "{{ icingaweb2_api_password }}"
  loop_control:
    label: "{{ item.filename }}"

Note the use of the + in {% endfor +%} to override, for that tag, Ansible’s default trim_blocks Jinja setting, preserving a blank line after each complete section (so the file is more easily human-readable).

Configuring Icinga Web

This is very similar to configuring the monitoring module, it is just a collection of ini files - the only “special” bit is the database password. This I slurped from the Icinga configuration file - a random one is automatically generated on install - although I could have imported it into my HashiCorp Vault (taking control of managing it) and/or replaced it with something more secure. I may want to do this in the future to enable periodic rotation of the password.

The configuration files (in /etc/icingaweb2) are:

  • resources.ini - contains the icinga2-db resource used by the monitoring plugin

      [icinga2-db]
      type = "db"
      db = "pgsql"
      username = "icinga2"
      password = "{{ icingaweb2_icinga_database_password }}"
      dbname = "icinga2"
      host = ""
  • config.ini - core configuration

      [logging]
      log = "syslog"
      level = "ERROR"
      application = "icingaweb2"

      [global]
      config_backend = "ini"
  • authentication.ini - delegate authentication to the webserver

      [autologin]
      backend = "external"
  • roles.ini - defines roles and which users are in those roles

      [Administrators]
      users = "icingaadmin"
      permissions = "*"

      [Monitoring]
      users = "laurence"
      permissions = "module/monitoring,monitoring/command/*"

    I considered linking users to roles in HashiCorp Vault and populating that information from there, however I decided separating role definitions from user credentials was better as two places have to be compromised to get authentication tokens (i.e. password) and elevate a user’s access level.

Creating these was done with a carbon copy of the generator used to configure the monitoring module, just with a different path and loop data. I am considering making this a role, or some other means of templating it, as a generic “ini-file maker”.

The only “interesting” bit is extracting the password from the Icinga configuration file. The first step is to “slurp” the file on the remote host, which returns the contents Base64 encoded:

- name: Retrieve Icinga 2 database configuration
  become: true
  become_user: nagios
  ansible.builtin.slurp:
    src: /etc/icinga2/features-enabled/ido-pgsql.conf
  register: icinga2_database_config

From this configuration, I extracted the password like this, which is then used to populate the variable for the configuration loop:

  {{ icinga2_database_config.content
  | b64decode
  | regex_search('password = "([^"]+)"', '\1', multiline=true)
  | first }}

Icinga Web webserver (and users)

One final piece of the Icinga Web puzzle is the users. Access is controlled by a htpasswd file, which is used by the webserver to authenticate users. In order to minimise the effort required to add/remove users, I altered the Ansible HashiCorp Vault role to be able to list the users (specifically, entries in /kv/icingaweb2/users, where I created entries named for each user with a key password containing the user’s password).

path "kv/*" {
  capabilities = ["read"]
}

path "kv/icingaweb2/users/" {
  capabilities = ["list"]
}
To get the users from the vault, I used the community.hashi_vault.vault_list lookup and populated the list in a loop:

- name: Populate icingaweb2_users variable from vault
  ansible.builtin.set_fact:
    icingaweb2_users: >-
      {{ icingaweb2_users
      + [{
        'username': item,
        'password': lookup('community.hashi_vault.vault_read', '/kv/icingaweb2/users/' + item).data.password,
      }] }}
  vars:
    icingaweb2_users: [] # Initial value - fact will take precedence once defined
  loop: "{{ query('community.hashi_vault.vault_list', '/kv/icingaweb2/users')[0].data['keys'] }}"
  no_log: true

To get this working I had to install the collection - I added a download role (which does a straight fetch from a URL to a local file) to my air-gapped lab’s mirror-syncing playbook, to download the collection’s tarball from Ansible Galaxy. Once in the mirror, I could install it directly with the ansible-galaxy command. It seems manually identifying and fetching the tarball is the supported way to install offline, although there is another page that talks about a download sub-command for this - I will need to investigate further, as it does not seem to be supported by the version of ansible-galaxy I have installed.

Once downloaded and copied to the network’s mirror, it can be installed directly from there:

ansible-galaxy collection install http://mirror/mirrors/ansible-galaxy/community-hashi_vault-5.0.0.tar.gz

I considered revisiting the Icinga api users configuration and applying this pattern there too but decided this was better left for another day.

With the usernames and passwords in a fact, putting them into the file was easy (I did need to install the passlib python module on the remote system for the htpasswd Ansible module):

- name: Icinga web users
  block:
    - name: Ensure passlib is installed to be able to manage htpasswd files
      become: true
      ansible.builtin.package:
        name: python3-passlib
        state: present
    - name: Add/update users in htpasswd file
      become: true
      # On newer versions of Ansible
      community.general.htpasswd:
        path: /etc/icingaweb2/htpasswd
        mode: 00440
        owner: root
        group: icingaweb2
        name: "{{ item.username }}"
        password: "{{ item.password }}"
      loop: "{{ icingaweb2_users }}"
      loop_control:
        label: "{{ item.username }}" # Don't show the password!
    # Remove invalid users from htpasswd file
    - name: Slurp htpasswd file
      become: true
      ansible.builtin.slurp:
        src: /etc/icingaweb2/htpasswd
      register: icingaweb2_htpasswd
    - name: Remove non-existent users from htpasswd
      become: true
      # On newer versions of Ansible
      community.general.htpasswd:
        path: /etc/icingaweb2/htpasswd
        mode: 00440
        owner: root
        group: icingaweb2
        name: "{{ item }}"
        state: absent
      # Can't fold this as you can't escape (and therefore split on newline) in folded text.
      loop: "{{ icingaweb2_htpasswd.content | b64decode | trim | split('\n') | map('split', ':') | map('first') }}"
      when: item not in icingaweb2_users | map(attribute='username')

The last thing to do is to add the site to the webserver. This is quite specific to nginx, and I am in two minds about whether to pursue generalising it (so that it does not matter which webserver is used): I would end up with a meta-language to describe everything that I need to configure in each webserver, which feels like adding to the “each webserver has its own configuration language” problem rather than solving it.

To actually add the site, I use my existing webserver role:

- name: Configure webserver - redirect http to https
  ansible.builtin.include_role:
    name: webserver
    tasks_from: add_site
  vars:
    site_name: icingaweb2
    server_name: icinga.{{ ansible_facts.domain }}
    return: '301 https://$host$request_uri'
- name: Configure webserver
  ansible.builtin.include_role:
    name: webserver
    tasks_from: add_site
  vars:
    listen:
      - '443 ssl http2'
      - '[::]:443 ssl http2'
    site_name: icingaweb2-ssl
    server_name: icinga.{{ ansible_facts.domain }}
    root: /usr/share/icingaweb2/public
    index: index.php
    extra_configuration: |
      auth_basic "Icinga";
      auth_basic_user_file /etc/icingaweb2/htpasswd;
      try_files $1 $uri $uri/ /index.php$is_args$args;
    locations:
      - location: '/index.php'
        configuration: |
          fastcgi_pass php-handler;
          fastcgi_index index.php;
          include fastcgi_params;
          fastcgi_param SCRIPT_FILENAME /usr/share/icingaweb2/public/index.php;
          fastcgi_param ICINGAWEB_CONFIGDIR /etc/icingaweb2;
          fastcgi_param REMOTE_USER $remote_user;

Munin server

The final piece of the monitoring jigsaw is the Munin server (munin-node was previously set up on the clients back in January - it is now 29th May, as I finally come to the end of this post).

Firstly, I installed the munin package.

I then replaced the bundled configuration, which mixes an import of the configurations in /etc/munin/munin-conf.d with other settings and an example localhost.localdomain host, with one that just imports the modular configuration:

# (Exactly one) directory to include all files from
includedir /etc/munin/munin-conf.d

The first module file I created had the timeouts from the bundled master configuration; I called it /etc/munin/munin-conf.d/timeouts.conf:

# The maximum time the munin-update may take to get updates from all nodes,
# this might be interesting when using munin-async in case of large transactions and/or backlog.
# When using the munin protocol to connect to a node, then this value shouldn't be set higher than 240.
# In case it's higher, gaps might be seen in the graphs
timeout_fetch_all_nodes 240

# The maximum amount of time in seconds we may work on 1 node.
# The value will be limited with timeout_fetch_all_nodes.
timeout_fetch_one_node 180

For the client configuration, the node names must match those advertised by munin-node on the clients. I installed munin-node on all systems in the servers group so that is what I used to select the clients - I might want to revisit this decision to make adding systems to munin’s monitoring more flexible (e.g. monitor desktops too). To make the role as portable as possible, I built the list outside the role and passed it in as an argument.

I made the list a list of dictionaries, currently with keys name (name of the client, as advertised - also used for the filename) and address (network address - on my networks this is the same as the name, except for the munin server itself, for which I use localhost as the address). This gives scope for adding other settings per-host in the future. The task to do this, currently in site.yaml, is:

- name: Build list of munin clients
  ansible.builtin.set_fact:
    munin_clients: >-
      {{ munin_clients + [{
        'name': client_name,
        'address': client_address,
      }] }}
  vars:
    munin_clients: [] # Initial value, fact will take precedence once set
    client_name: >-
      {%
        if 'ansible_facts' in hostvars[item]
        and 'fqdn' in hostvars[item]['ansible_facts']
      %}{{ hostvars[item]['ansible_facts']['fqdn'] }}{%
        else
      %}{{ item }}{%
        endif
      %}
    client_address: >-
      {%
        if item == inventory_hostname
      %}localhost{%
        else
      %}{{ client_name }}{%
        endif
      %}
  loop: "{{ hostvars.keys() | list }}"
  when: "'servers' in hostvars[item].group_names"

The block to actually deploy the client configuration files to the munin server is in the monitoring role, which has this additional argument to support it:

  munin_clients:
    description: List of dictionaries of client information
    type: list
    elements: dict
    options:
      name:
        description: Client's name (as advertised by the client)
        type: str
      address:
        description: Client's network address (can be resolvable DNS name or IP)
        type: str

The block to create the configurations, and remove any defunct client configurations, is:

- name: Munin client configurations
  block:
    - name: Make client configuration files
      become: true
      ansible.builtin.copy:
        dest: /etc/munin/munin-conf.d/client_{{ item.name }}.conf
        owner: root
        group: root
        mode: 00444
        content: |
          [{{ item.name }}]
          address {{ item.address }}
      loop: "{{ munin_clients }}"
    - name: Get list of configured clients
      ansible.builtin.find:
        paths:
          - /etc/munin/munin-conf.d
        patterns:
          - client_*.conf
      register: munin_client_configurations
    - name: Remove clients that no longer exist
      become: true
      ansible.builtin.file:
        path: "{{ item }}"
        state: absent
      loop: "{{ munin_client_configurations.files | map(attribute='path') | list }}"
      # strip off "client_" and ".conf"
      when: (item | basename)[7:-5] not in munin_clients | map(attribute='name')

The very last bit is to configure the webserver. Munin does have various CGI components, for dynamic graph generation (although only an Apache example is provided) and zooming, and can provide its own webserver daemon to be proxied. However, my existing configuration only serves the static HTML, so I reproduced that (to get back to building the bastion - the whole point of this migrate-the-monitoring-server-to-Ansible exercise) using my existing webserver role:

- name: Webserver configuration
  # XXX Very nginx specific!
  block:
    - name: Configure webserver - http to https redirect
      ansible.builtin.include_role:
        name: webserver
        tasks_from: add_site
      vars:
        site_name: munin
        server_name: munin.{{ ansible_facts.domain }}
        return: '301 https://$host$request_uri'
    - name: Configure webserver
      ansible.builtin.include_role:
        name: webserver
        tasks_from: add_site
      vars:
        listen:
          - '443 ssl http2'
          - '[::]:443 ssl http2'
        site_name: munin-ssl
        server_name: munin.{{ ansible_facts.domain }}
        root: /var/cache/munin/www


Finally, after 6 months (it is now 31st May 2023), I have completed the migration of the monitoring server from SaltStack to Ansible (completed roles are crossed through below) and can resume building the bastion host. After that, I will be able to return to using Ansible to configure iPXE, and then use that to bootstrap Proxmox nodes - which I now also want to do in my live network, somewhat urgently:

  • server
    • remotely-accessible (installs and configures fail2ban)
    • monitoring.client (installs and configures icinga2 client)
      • monitoring.common (installs monitoring-plugins and nagios-plugins-contrib packages, installs & configures munin client and nagios’ kernel & raid checks.)
  • monitoring.server (installs and configures icinga2 server, php support for nginx, icingaweb2 and munin server)
    • webserver (installs and configures nginx)
    • monitoring.common (see above)