This post is the fourth in a chain of posts that began with getting started with Ansible to manage my own infrastructure in October, then headed down a rabbit-hole trying to work around Ansible not playing nicely with 2-factor sudo authentication. It is the last of the three posts I split out from the second in the series on the 2nd January 2023, and is the content I added yesterday covering taking the monitoring role from SaltStack to Ansible.

As a recap, these are the Salt roles from my states tree (nesting indicates where one role includes another) that I am migrating to Ansible, with the completed ones crossed through:

  • server
    • remotely-accessible (installs and configures fail2ban)
    • monitoring.client (installs and configures icinga2 client)
      • monitoring.common (installs monitoring-plugins and nagios-plugins-contrib packages, installs & configures munin client and nagios’ kernel & raid checks.)
  • monitoring.server (installs and configures icinga2 server, php support for nginx, icingaweb2 and munin server)
    • webserver (installs and configures nginx)
    • monitoring.common (see above)

Migrating monitoring

I merged monitoring.client, monitoring.server and monitoring.common into a single monitoring role which sets up icinga2 and munin for either client or server, depending on which component is selected via the role’s arguments (exactly as I did for the ssh role). I also gave it a second entry point, munin-node-plugins (by creating tasks/munin-node-plugins.yaml and adding it to meta/argument_specs.yaml), to make it easy to deploy and configure additional plugins without creating a separate munin role. Depending on how this role grows, though, I may end up creating a monitoring collection with separate roles for, e.g., Icinga and Munin, but at the moment combining them in one role seems manageable.
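
For illustration, a minimal sketch of what the two entry points might look like in meta/argument_specs.yaml (the option names and choices here are indicative only, not the role’s actual specification):

---
argument_specs:
  main:
    short_description: Install and configure icinga2 and munin (client or server)
    options:
      component:
        description: Which monitoring component to configure on this host
        type: str
        required: true
        choices:
          - client
          - server
  munin-node-plugins:
    short_description: Deploy and configure additional munin-node plugins
    options:
      plugins:
        description: List of additional plugins to enable
        type: list
        elements: str
        required: true
...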

sudo role

Because the community.general.sudoers Ansible module does not support host-based restrictions, I created my own role that does. It is very simple and behaves exactly like the module, overwriting the file entirely and using the same argument names, to make swapping back easy when future releases add this support (host and nopassword default to ALL and true respectively via the role’s defaults/main.yaml, to match the module):

---
- name: Ensure sudo is installed
  become: yes
  ansible.builtin.package:
    name: sudo
- name: Create sudoers file
  become: yes
  ansible.builtin.copy:
    dest: /etc/sudoers.d/{{ name }}
    content: |
      {{ user }} {{ host }}={% if runas | default(false) %}({{ runas }}){% endif %}{% if nopassword %}NOPASSWD: {% endif %} {{ commands | join(', ') }}
    owner: root
    group: root
    mode: 0400
...

I then used this role to grant the access required by the munin and icinga checks that need to run commands to do things their respective users cannot do themselves.
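
As an illustrative example of calling the role (the sudoers file name, user and command are placeholders rather than my real configuration), granting the Icinga user permission to run a single check as root might look something like:

- name: Allow the Icinga user to run the RAID check as root
  ansible.builtin.include_role:
    name: sudo
  vars:
    # 'name' becomes the file name under /etc/sudoers.d/
    name: icinga2-check-raid
    user: nagios
    commands:
      - /usr/lib/nagios/plugins/check_raid

host and nopassword are omitted here, so they fall back to the ALL and true defaults from the role’s defaults/main.yaml.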

munin-node

In order to use Ansible’s community.general.dig lookup plugin to look up the monitoring server’s IP address in DNS (to put it in the munin-node daemon’s cidr_allow list), I had to install the dnspython library.
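
For example, something along these lines (the monitoring_server_fqdn variable and the lineinfile approach are placeholders to show the lookup, not necessarily how my role actually writes the configuration):

- name: Allow the monitoring server to connect to munin-node
  become: yes
  ansible.builtin.lineinfile:
    path: /etc/munin/munin-node.conf
    regexp: '^cidr_allow '
    # dig returns the A record for the given name, which needs dnspython
    line: "cidr_allow {{ lookup('community.general.dig', monitoring_server_fqdn) }}/32"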

Other than this, migrating munin-node was a straightforward translation of my existing SaltStack role into Ansible.

Icinga

Each endpoint (i.e. node/host) in Icinga needs to know about the zone it is in (including all endpoints) and any endpoints it connects to or from, which means the parent and any child zones (including all of their endpoints). This is done in the zones.conf file. This is explained in detail, with diagrams, in Icinga’s distributed monitoring documentation.

The original configuration

In my SaltStack configuration for Icinga’s zones file, I listed the endpoints, specifying their relationship in terms of which connects to which, and then the zones, including their relationship to each other, separately. This resulted in a bit of duplication (Icinga’s convention is to use the fully-qualified domain name [FQDN] for endpoint and zone name):

icinga2:
  endpoints:
    monitoring-server:
      connect_to:
        - vmhost
        - backup-server
        - command-and-control-server
        - project-hosting-server
        - vps
        - router
    vps:
      connect_to:
        # 'home' is an external address for the router, used to
        # monitor if the house's internet connection is up from the
        # vps' perspective.
        - home
  zones:
    master:
      endpoints:
        - monitoring-server
    vmhost:
      parent: master
    vps:
      parent: master
    backup-server:
      parent: master
    command-and-control:
      parent: master
    project-hosting-server:
      parent: master
    router:
      parent: master
    home:
      parent: vps

At a configuration level, the connection direction is determined by which node’s zones.conf file has the host attribute set for an endpoint (the logic is simple: if the host attribute is present, Icinga will actively try to connect, so master -> agent versus agent -> master is simply a case of whether the master has every agent’s host set or each agent has the master’s host set). For historical reasons (pre-VPN, and the master is behind a NAT router), I have a master -> agent setup and have kept this for now, but I am conscious that, with the monitoring master outside the server network, switching to agent -> master would reduce the new (intended to be more secure) server network’s attack surface, as the Icinga connections would become outbound rather than inbound.

Taking the opportunity of refreshing the configuration management to rethink how the Icinga configuration is specified, I noted:

  1. Zone names, with the exception of the master root zone, are always the FQDN of the system.
  2. As described in the Icinga2 documentation, it is important to choose one connection direction.
  3. Connection direction dictates whether endpoints connect up (to the parent zone) or down (to the child zones) - in either case, the zone relationship alone is sufficient to determine what connects to what.

The Ansible configuration

Reflecting on this, I realised I could simplify the configuration data by:

  1. Optionally specifying the zone name via an icinga_zone variable on the host, implicitly using the host’s name if it is missing
  2. Specifying the zone’s parent (which might be none) via a variable at group level, rather than duplicate a list of hostnames (as zones) elsewhere

This did mean that I had to think a little about group priority so that the parent is set correctly on the monitoring servers (no parent) compared to the other servers (parent is the master zone). My Icinga configuration in my Ansible inventory now looks a little like this:

# Hosts in the dummy group are either non-existent aliases or systems
# not managed by Ansible. The group exists to allow variables about
# those hosts to be added to the inventory so they may be consumed in
# playbooks for e.g. monitoring and backup server configuration.
dummy:
  hosts:
    home:
      icinga_zone_parent: vps
servers:
  children:
    backup_servers:
    monitoring_servers:
    ups_servers:
    bastions:
  vars:
    icinga_zone_parent: master
monitoring_servers:
  hosts:
    monitoring-server:
  vars:
    # Must be higher than servers to override icinga_zone_parent for
    # this group. Default priority is 1.
    ansible_group_priority: 10 
    icinga_zone: master
    # Monitoring servers have no parent (are in the master root zone)
    icinga_zone_parent:

With this data structure, adding a new host to the servers group (or any of its sub-groups) automatically makes it a child of the master Icinga monitoring zone, while adding a new host to the monitoring_servers group automatically places it in the master zone and clears the parent (as master is the root). In effect, provided a host is placed in the right group(s), no extra configuration is needed to set up monitoring on, and of, it.
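
For example, adding a hypothetical new backup server (the host name is made up) needs nothing more than listing it under the appropriate group:

backup_servers:
  hosts:
    new-backup-server:

It then inherits icinga_zone_parent: master from the servers group and will be picked up as a child zone of master when the monitoring server’s configuration is next generated.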

Enabling persistent fact caching

As I have started using other hosts’ inventory variables to configure the monitoring server, including facts such as ansible_facts.fqdn, I needed to turn on persistent fact caching so that facts from other hosts persist when targeting specific hosts, for example when updating the configuration of just the monitoring server. I enabled the bundled jsonfile cache plugin, although I might look at the redis one in future. jsonfile requires a path to be configured for storing the cache files; I told it to use ~/.local/cache/ansible-facts (based on a StackExchange Q&A about the .local directory). I enabled it by adding ansible.cfg to my Ansible playbooks’ root:

[defaults]
fact_caching=ansible.builtin.jsonfile
fact_caching_connection=~/.local/cache/ansible-facts
# Disable cache timeout
fact_caching_timeout=0
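
With the cache populated (e.g. by a previous full run), a play targeting only the monitoring server can still read other hosts’ facts from hostvars - a minimal sketch, using one of the hosts above as a stand-in:

- name: Show another host's cached FQDN fact
  ansible.builtin.debug:
    msg: "{{ hostvars['backup-server'].ansible_facts.fqdn }}"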

Zone endpoints filter

I initially tried to create a list of endpoints in each zone from the hostvars Ansible magic variable. I was able to do it, but logic like “use ansible_facts.fqdn if it exists or inventory_hostname if it does not” is difficult to express in Jinja and resulted in a long expression that added together lists from filtering hostvars 4 times, once for each possibility (icinga_zone present/missing, ansible_facts.fqdn present/missing). In the end, I created a filter plugin instead, which was very straightforward - I just dropped the filter’s Python file into a directory called filter_plugins and was then able to use it. I followed the Ansible standard, as recommended on the developing plugins page, including using /usr/bin/python instead of the more compatible /usr/bin/env python (although I wonder why they insist on including a shebang line in a plugin at all?):

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Copyright 2022 Laurence Alexander Hurst
# GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)

DOCUMENTATION = """
  name: icinga_zone_endpoints
  author: Laurence Alexander Hurst
  version_added: "1.0.0"
  short_description: Get a dictionary of endpoints in the specified zone
  description:
    - Returns a dictionary of endpoints mapping to configuration information for those endpoints.
  options:
    _input:
      description: hostvars (or equivalent map of hostnames -> data about those hosts)
      type: dict
      required: true
    zone_name:
      description: The zone name to find the endpoints for
      type: str
      required: true
    add_host_key:
      description: Whether to add the node's FQDN (or inventory_hostname if there is no FQDN fact) as `host` to each endpoint's configuration.
      type: bool
      default: false
    domain_name:
      description: Domain name to add to unqualified inventory_hostname values (if there is no '.' in it).
      type: str
      default: None
"""

EXAMPLES = """
  # To get the endpoints in the `master` zone
  {{ hostvars | icinga_zone_endpoints('master') }}
"""

RETURN = """
  _value:
    description: A dictionary of endpoint name to configuration for each endpoint
    type: dict
"""

class FilterModule(object):
  def filters(self):
    return {'icinga_zone_endpoints': self.icinga_zone_endpoints}

  def icinga_zone_endpoints(self, hostvars, zone_name, add_host_key=False, domain_name=None):
    result = {}
    for host in hostvars.values():
      host_node_name = None  # Work out the host's Icinga endpoint name
      host_zone_name = None  # Work out the host's zone

      # Do we have a FQDN fact for the node's name?
      if 'ansible_facts' in host and 'fqdn' in host['ansible_facts']:
        host_node_name = host['ansible_facts']['fqdn']
      # if not, fall back to the inventory_hostname - should only apply
      # to dummy hosts in the inventory, real hosts should have facts (
      # if persistent fact cache is enabled).
      elif '.' in host['inventory_hostname'] or domain_name is None:
        host_node_name = host['inventory_hostname']
      else:
        host_node_name = "%s.%s" % (host['inventory_hostname'], domain_name)

      # Is there an explicit zone set on this host?
      if 'icinga_zone' in host:
        host_zone_name = host['icinga_zone']
      # if not, use the endpoint name as zone
      else:
        host_zone_name = host_node_name
      
      if host_zone_name == zone_name:
        result[host_node_name] = {'host': host_node_name} if add_host_key else {}
    return result

I then used this to build a dictionary of zones and their endpoints, relevant to the current host, in an icinga_zones fact (not to be confused with icinga_zone, which holds the host’s zone) - the block is purely to keep the tasks logically grouped together; it serves no functional purpose:

  tasks:
    - name: Build list of icinga zones relevant to this server
      block:
        # Persistent fact caching is required, so endpoints' fqdn facts
        # are accessible even when only updating a subset of icinga
        # nodes.

        # We care about 3 groups of zones:
        #   1. This system's zone (including any other endpoints in the same
        #      zone).
        - name: Add this system's zone
          ansible.builtin.set_fact:
            # As this system's zone should be the first one set, it is
            # fine to overwrite anything already in icinga_zones.
            icinga_zones: "{{ {
              icinga_zone | default(ansible_facts.fqdn)
              :
              {
                'endpoints'
                :
                hostvars
                | icinga_zone_endpoints(
                    icinga_zone | default(ansible_facts.fqdn)
                    ,
                    domain_name=ansible_facts.domain
                  )
              }
            } }}"

        #   2. This system's zone's parent zone (might be none if this is
        #      the master zone).
        - name: Parent zone
          block:
          - name: Add this system's parent zone to its zone definition
            ansible.builtin.set_fact:
              # As this system's zone should be the first one set, it is
              # fine to overwrite anything already in icinga_zones.
              icinga_zones: "{{ {
                icinga_zone | default(ansible_facts.fqdn)
                :
                icinga_zones[icinga_zone | default(ansible_facts.fqdn)]
                | combine({'parent': icinga_zone_parent})
              } }}"
          - name: Add this system's parent zone
            ansible.builtin.set_fact:
              # Combine with existing zone(s)
              icinga_zones: "{{
                icinga_zones
                | combine({
                    icinga_zone_parent: {
                      'endpoints'
                      :
                      hostvars
                      | icinga_zone_endpoints(
                        icinga_zone_parent
                        ,
                        domain_name=ansible_facts.domain
                      )
                    }
                  })
              }}"
          when: ( icinga_zone_parent | default(None) ) is not none

        #   3. This system's child zones (might be none if this is not a
        #      master or satellite system).
        #
        # Currently doing a top -> bottom connection direction, so this
        # system should attempt to connect to all children which means
        # they should have their 'host' attribute set.
        - name: Add the system's child zones
          ansible.builtin.set_fact:
            # Combine with existing zone(s)
            icinga_zones: "{{
              icinga_zones
              | combine({
                  this_endpoint_zone
                  :
                  {
                    'endpoints'
                    :
                    hostvars
                    | icinga_zone_endpoints(
                        this_endpoint_zone
                        ,
                        add_host_key=True
                        ,
                        domain_name=ansible_facts.domain
                    )
                    ,
                    'parent': endpoint.icinga_zone_parent
                  }
              })
            }}"
          loop: "{{ hostvars.values() }}"
          loop_control:
            label: "{{ endpoint.inventory_hostname }}"
            loop_var: endpoint
          vars:
            this_endpoint_zone: "{% if 'icinga_zone' in endpoint %}{{ endpoint.icinga_zone }}{% elif 'ansible_facts' in endpoint and 'fqdn' in endpoint.ansible_facts %}{{ endpoint.ansible_facts.fqdn }}{% elif '.' in endpoint.inventory_hostname %}{{ endpoint.inventory_hostname }}{% else %}{{ endpoint.inventory_hostname }}.{{ ansible_facts.domain }}{% endif %}"
          when: >
            'icinga_zone_parent' in endpoint
            and
            endpoint.icinga_zone_parent == (
              icinga_zone | default(ansible_facts.fqdn)
            )

Using this to generate zones.conf is then a trivial case of looping over the structure to list each zone and its endpoints - those with a host attribute are the ones that are connected to (so top-down or bottom-up can be toggled by changing which list of endpoints has add_host_key set to True).
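
As a rough sketch of what that loop could look like (this is illustrative rather than my actual task - the destination, ownership, mode and the use of an inline template via ansible.builtin.copy, in the same style as the sudo role above, are all assumptions):

- name: Generate Icinga2 zones.conf from the icinga_zones fact
  become: yes
  ansible.builtin.copy:
    dest: /etc/icinga2/zones.conf
    owner: root
    group: root
    mode: 0644
    content: |
      {% for zone, zone_config in icinga_zones.items() %}
      {% for endpoint, endpoint_config in zone_config.endpoints.items() %}
      object Endpoint "{{ endpoint }}" {
      {% if 'host' in endpoint_config %}
        // A host attribute means this node actively connects to the endpoint
        host = "{{ endpoint_config.host }}"
      {% endif %}
      }
      {% endfor %}
      object Zone "{{ zone }}" {
        endpoints = [ {{ zone_config.endpoints.keys() | map('to_json') | join(', ') }} ]
      {% if 'parent' in zone_config and zone_config.parent %}
        parent = "{{ zone_config.parent }}"
      {% endif %}
      }
      {% endfor %}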

Enabling the API feature

In my SaltStack configuration, I checked if the api feature was enabled with icinga2 feature list | grep '^Enabled features:.* api':

icinga2-api-enable:
  cmd.run:
    - name: "icinga2 feature enable api"
    - unless: "icinga2 feature list | grep '^Enabled features:.* api'"
    - require:
      - file: icinga2-api-config
    - watch_in:
      - service: icinga2-svc

In Ansible, this would be two steps - first run the command and store its output (via register:), then run the enable command if required. On Debian, at least, the enable command creates a symlink at /etc/icinga2/features-enabled/api.conf, so I used that instead. It remains to be seen whether this is more brittle than using the output of the “proper” command to check for enabled features:

- name: Enable API feature
  ansible.builtin.command:
    cmd: /usr/sbin/icinga2 feature enable api
    creates: /etc/icinga2/features-enabled/api.conf

If I need to revert to using the command, I imagine some variant of registering the output of icinga2 feature list | grep '^Enabled features:' (run as the nagios user) and when: "'api' in stdout_lines[0].split(':')[1].split(' ')" should work.
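
A hedged, untested sketch along those lines (rather than grep, this filters for the 'Enabled features:' line in Jinja; the task and variable names are my own, and it assumes that line is always present in the output):

# Untested sketch - errors if no 'Enabled features:' line is present.
- name: List Icinga2 features
  become: yes
  become_user: nagios
  ansible.builtin.command:
    cmd: /usr/sbin/icinga2 feature list
  register: icinga2_feature_list
  changed_when: false
- name: Enable API feature
  ansible.builtin.command:
    cmd: /usr/sbin/icinga2 feature enable api
  when: >-
    'api' not in (
      icinga2_feature_list.stdout_lines
      | select('match', '^Enabled features:')
      | first
    ).split(':')[1].split()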

Certificates

Icinga uses PKI (“TLS certificates are mandatory for communication between nodes”) to confirm identity and to communicate securely between nodes. My historic secret management has been poor (secrets stored directly, usually unencrypted, in the SaltStack pillar data in a private Git repository); not wanting to repeat this with Ansible, I am looking at HashiCorp Vault as a solution.

Another option I considered was a KeePass password safe, but this would still require storing the safe itself somewhere, as well as managing the secrets to unlock it - despite historic bad decisions, I am now of the opinion that putting secrets, even encrypted, in a Git repository is generally a bad idea. I also dismissed Ansible’s own vault for the same reason.

Deploying Vault is a topic that I finally picked up on 23rd January.