Adding a bastion host - migrating monitoring client roles to Ansible
This post is the fourth in a chain that began in October with my attempts to get started with Ansible for managing my own infrastructure, before heading down a rabbit-hole trying to work around Ansible not playing nicely with two-factor sudo authentication. It is the last of three posts that I split out from the second in the series on 2nd January 2023, and is the blog content I added yesterday explaining how I took the monitoring role from SaltStack to Ansible.
As a recap, these are the salt roles from my states tree (nesting indicates where one role includes another) that I am migrating to Ansible, with the completed ones crossed through:
- `server`
  - `remotely-accessible` (installs and configures fail2ban)
  - `monitoring.client` (installs and configures icinga2 client)
    - `monitoring.common` (installs monitoring-plugins and nagios-plugins-contrib packages, installs & configures munin client and nagios' kernel & raid checks)
- `monitoring.server` (installs and configures icinga2 server, php support for nginx, icingaweb2 and munin server)
  - `webserver` (installs and configures nginx)
  - `monitoring.common` (see above)
Migrating monitoring
I merged `monitoring.client`, `monitoring.server` and `monitoring.common` into a single `monitoring` role which sets up icinga2 and munin for either client or server, depending on which component is selected via the role's arguments (exactly as I did for the `ssh` role). I also gave it a second entry point, `munin-node-plugins` (by creating `tasks/munin-node-plugins.yaml` and adding it to `meta/argument_specs.yaml`), to make it easy to deploy/configure additional plugins without creating a separate `munin` role. Depending on how this role grows, though, I may end up creating a `monitoring` collection with separate roles for, e.g., Icinga and Munin, but at the moment combining them in one role seems manageable.
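As a sketch of how the two entry points fit together, the role's `meta/argument_specs.yaml` might look something like this - the entry point names match the post, but the option names and choices here are my illustrative assumptions, not the role's actual arguments:

```yaml
# Hypothetical meta/argument_specs.yaml sketch - option names are assumed
# for illustration only.
argument_specs:
  main:
    short_description: Configure icinga2 and munin, client or server
    options:
      monitoring_component:
        type: str
        required: true
        choices:
          - client
          - server
  munin-node-plugins:
    short_description: Deploy/configure additional munin-node plugins
    options:
      plugins:
        type: list
        elements: str
        required: true
```

Each entry point corresponds to a `tasks/<entry point>.yaml` file, which is what makes the `tasks/munin-node-plugins.yaml` approach work.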
sudo role
Because the `community.general.sudoers` Ansible module does not support host-based restrictions, I created my own role that does. It is very simple and behaves exactly like the module, overwriting the file entirely and using the same argument names, to make swapping back easy when a future release adds this support (`host` and `nopassword` default to `ALL` and `true` respectively via the role's `defaults/main.yaml`, to match the module):
```yaml
---
- name: Ensure sudo is installed
  become: yes
  ansible.builtin.package:
    name: sudo
- name: Create sudoers file
  become: yes
  ansible.builtin.copy:
    dest: /etc/sudoers.d/{{ name }}
    content: |
      {{ user }} {{ host }}={% if runas | default(false) %}({{ runas }}){% endif %}{% if nopassword %}NOPASSWD: {% endif %} {{ commands | join(', ') }}
    owner: root
    group: root
    # Quoted so YAML does not reinterpret the leading-zero octal
    mode: "0400"
...
```
I then used this role to grant the access required for the munin and icinga checks that need to run commands their respective users cannot otherwise run.
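As an illustration, invoking the role might look something like the following - the grant name, user and command are hypothetical examples, not my actual configuration:

```yaml
- name: Allow munin to read SMART data as root
  ansible.builtin.include_role:
    name: sudo
  vars:
    name: munin-smart   # becomes /etc/sudoers.d/munin-smart
    user: munin         # who may run the commands
    commands:           # host and nopassword fall back to the role defaults
      - /usr/sbin/smartctl
```

Because the role mirrors the `community.general.sudoers` argument names, swapping back to the module later should only require changing the task, not the variables.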
munin-node
In order to use Ansible's `community.general.dig` DNS lookup plugin to look up the monitoring server's IP from DNS (to put it in the munin-node daemon's `cidr_allow` list), I had to install the `dnspython` library.
Other than this, migrating munin-node was a straight duplication of my existing SaltStack role into Ansible.
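For example (with an assumed, illustrative hostname), the lookup can be used directly when building the munin-node configuration data:

```yaml
# Hypothetical group_vars fragment: resolve the monitoring server's
# IPv4 address and allow it to connect to munin-node.
munin_node_cidr_allow:
  - "{{ lookup('community.general.dig', 'monitoring-server.example.org.') }}/32"
```

The lookup runs on the controller at template time, which is why `dnspython` has to be installed there rather than on the managed hosts.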
Icinga
Each endpoint (i.e. node/host) in Icinga needs to know about the zone it is in (including all of its endpoints) and any endpoints it connects to or from, which means the parent and any child zones (including all of their endpoints). This is done in the `zones.conf` file, and is explained in detail, with diagrams, in Icinga's distributed monitoring documentation.
The original configuration
In my SaltStack configuration for Icinga's zones file, I listed the endpoints, specifying their relationship in terms of which connects to which, and then, separately, the zones, including their relationships to each other. This resulted in a bit of duplication (Icinga's convention is to use the fully-qualified domain name [FQDN] for endpoint and zone names):
```yaml
icinga2:
  endpoints:
    monitoring-server:
      connect_to:
        - vmhost
        - backup-server
        - command-and-control-server
        - project-hosting-server
        - vps
        - router
    vps:
      connect_to:
        # 'home' is an external address for the router, used to
        # monitor if the house's internet connection is up from the
        # vps' perspective.
        - home
  zones:
    master:
      endpoints:
        - monitoring-server
    vmhost:
      parent: master
    vps:
      parent: master
    backup-server:
      parent: master
    command-and-control:
      parent: master
    project-hosting-server:
      parent: master
    router:
      parent: master
    home:
      parent: vps
```
At a configuration level, the connection direction is determined by which agent's `zones.conf` file has the `host` attribute for an endpoint (the logic is simple - if the `host` attribute is present, icinga will actively try to connect, so master -> agent or agent -> master is simply a case of whether the master has all the agents' `host` set or the agents have the master's `host` set). For historical reasons (pre-VPN, and the master is behind a NAT router), I have a master -> agent setup and have kept this for now, but I am conscious that with the monitoring master outside the server network, switching to agent -> master would reduce the new (intended to be more secure) server network's attack surface, as the icinga connections would become outbound rather than inbound.
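To illustrate the direction logic, here is a sketch (with placeholder hostnames, not my real ones) of what the master's `zones.conf` might contain in a master -> agent setup - because the master's file sets the agent's `host`, the master initiates the connection:

```
// Sketch of the master's zones.conf (master -> agent direction).
object Endpoint "monitoring-server.example.org" {
}
object Endpoint "agent1.example.org" {
  // host present: this node will actively connect to agent1
  host = "agent1.example.org"
}
object Zone "master" {
  endpoints = [ "monitoring-server.example.org" ]
}
object Zone "agent1.example.org" {
  endpoints = [ "agent1.example.org" ]
  parent = "master"
}
```

The agent's own `zones.conf` would contain the same objects but without the `host` attribute, so it simply waits for the master to connect.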
Taking the opportunity of refreshing the configuration management to have a rethink about how the icinga configuration is specified, I noted:

- Zone names, with the exception of the `master` root zone, are always the FQDN of the system.
- As described in the Icinga2 documentation, it is important to choose one connection direction.
- Connection direction dictates whether endpoints connect up (to the parent zone) or down (to the child zones) - in either case, the zone relationship alone is sufficient to determine what connects to what.
The Ansible configuration
Reflecting on this, I realised I can simplify the configuration data by:

- Optionally specifying the zone name via an `icinga_zone` variable on the hosts, implicitly using the host's name if it is missing
- Specifying the zone's parent (which might be none) via a variable at group level, rather than duplicating a list of hostnames (as zones) elsewhere
This did mean that I had to think a little about the groups' priorities so that the parent is set correctly on the monitoring servers (no parent) compared to the other servers (parent is the master zone). My icinga configuration in my Ansible inventory now looks a little like this:
```yaml
# Hosts in the dummy group are either non-existent aliases or systems
# not managed by Ansible. The group exists to allow variables about
# those hosts to be added to the inventory so they may be consumed in
# Playbooks for e.g. monitoring and backup server configuration.
dummy:
  hosts:
    home:
      icinga_zone_parent: vps
servers:
  children:
    backup_servers:
    monitoring_servers:
    ups_servers:
    bastions:
  vars:
    icinga_zone_parent: master
monitoring_servers:
  hosts:
    monitoring-server:
  vars:
    # Must be higher than servers to override icinga_zone_parent for
    # this group. Default priority is 1.
    ansible_group_priority: 10
    icinga_zone: master
    # Monitoring servers have no parent (are in the master root zone)
    icinga_zone_parent:
```
With this data structure, adding a new host to the servers group (or any of its sub-groups) now automatically adds it as a child of the `master` Icinga monitoring zone, and adding a new host to the monitoring_servers group will automatically add it to the `master` zone and clear the parent (as `master` is the root). In effect, provided a new host is placed in the right group(s), no further configuration is needed to set up monitoring on and of it.
Enabling persistent fact caching
As I have started using other hosts' inventory variables to configure the monitoring server, including facts such as `ansible_facts.fqdn`, I needed to turn on persistent fact caching so that the facts from other hosts persist when targeting specific hosts for configuration - for example, when updating the configuration of just the monitoring server. I enabled the bundled `jsonfile` cache plugin, although I might look at the `redis` one in future. `jsonfile` requires a path to be configured to store the cache files in; I told it to use `~/.local/cache/ansible-facts` (based on a StackExchange Q&A about the `.local` directory). I enabled it by adding an `ansible.cfg` to my Ansible playbooks' root:
```ini
[defaults]
fact_caching=ansible.builtin.jsonfile
fact_caching_connection=~/.local/cache/ansible-facts
# Disable cache timeout
fact_caching_timeout=0
```
Zone endpoints filter
I initially tried to create a list of endpoints in each zone from the `hostvars` Ansible magic variable, which I was able to do, but logic like "use `ansible_facts.fqdn` if it exists or `inventory_hostname` if it does not" is difficult to express in Jinja and resulted in a long expression that added together lists from filtering `hostvars` four times, once for each possibility (`icinga_zone` present/missing, `ansible_facts.fqdn` present/missing). In the end, I created a filter plugin instead, which was very straightforward - I just dropped the filter's python file into a directory called `filter_plugins` and was then able to use it. I followed the Ansible standard, as recommended on the developing plugins page, including using `/usr/bin/python` instead of the more compatible `/usr/bin/env python` (although I wonder why they insist on including a shebang line in a plugin at all?):
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright 2022 Laurence Alexander Hurst
# GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)

DOCUMENTATION = """
  name: icinga_zone_endpoints
  author: Laurence Alexander Hurst
  version_added: "1.0.0"
  short_description: Get a dictionary of endpoints in the specified zone
  description:
    - Returns a dictionary of endpoints mapping to configuration information for those endpoints.
  options:
    _input:
      description: hostvars (or equivalent map of hostnames -> data about those hosts)
      type: dict
      required: true
    zone_name:
      description: The zone name to find the endpoints for
      type: str
      required: true
    add_host_key:
      description: Whether to add the node's FQDN (or inventory_hostname if there is no FQDN fact) as `host` to each endpoint's configuration.
      type: bool
      default: false
    domain_name:
      description: Domain name to add to unqualified inventory_hostname values (if there is no '.' in them).
      type: str
"""

EXAMPLES = """
# To get the endpoints in the `master` zone
{{ hostvars | icinga_zone_endpoints('master') }}
"""

RETURN = """
  _value:
    description: A dictionary of endpoint name to configuration for each endpoint
    type: dict
"""


class FilterModule(object):
    def filters(self):
        return {'icinga_zone_endpoints': self.icinga_zone_endpoints}

    def icinga_zone_endpoints(self, hostvars, zone_name, add_host_key=False, domain_name=None):
        result = {}
        for host in hostvars.values():
            host_node_name = None  # Work out the host's Icinga endpoint name
            host_zone_name = None  # Work out the host's zone
            # Do we have a FQDN fact for the node's name?
            if 'ansible_facts' in host and 'fqdn' in host['ansible_facts']:
                host_node_name = host['ansible_facts']['fqdn']
            # If not, fall back to the inventory_hostname - should only apply
            # to dummy hosts in the inventory; real hosts should have facts
            # (if persistent fact caching is enabled).
            elif '.' in host['inventory_hostname'] or domain_name is None:
                host_node_name = host['inventory_hostname']
            else:
                host_node_name = "%s.%s" % (host['inventory_hostname'], domain_name)
            # Is there an explicit zone set on this host?
            if 'icinga_zone' in host:
                host_zone_name = host['icinga_zone']
            # If not, use the endpoint name as the zone name
            else:
                host_zone_name = host_node_name
            if host_zone_name == zone_name:
                result[host_node_name] = {'host': host_node_name} if add_host_key else {}
        return result
```
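To sanity-check the filter's rules outside Ansible, the core logic can be exercised as a plain function against made-up hostvars data - this standalone copy and the sample data are mine, for illustration, not part of the plugin:

```python
# Standalone copy of the filter's core logic, mirroring the plugin above,
# with hypothetical hostvars data to illustrate the resolution rules.
def icinga_zone_endpoints(hostvars, zone_name, add_host_key=False, domain_name=None):
    result = {}
    for host in hostvars.values():
        # Endpoint name: FQDN fact if present, else (qualified) inventory_hostname
        if 'ansible_facts' in host and 'fqdn' in host['ansible_facts']:
            node = host['ansible_facts']['fqdn']
        elif '.' in host['inventory_hostname'] or domain_name is None:
            node = host['inventory_hostname']
        else:
            node = "%s.%s" % (host['inventory_hostname'], domain_name)
        # Zone name: explicit icinga_zone if set, else the endpoint name
        if 'icinga_zone' in host:
            zone = host['icinga_zone']
        else:
            zone = node
        if zone == zone_name:
            result[node] = {'host': node} if add_host_key else {}
    return result

hostvars = {
    'monitoring-server': {
        'inventory_hostname': 'monitoring-server',
        'ansible_facts': {'fqdn': 'monitoring-server.example.org'},
        'icinga_zone': 'master',
    },
    'vmhost': {
        'inventory_hostname': 'vmhost',
    },
}

# Explicit icinga_zone puts monitoring-server in the master zone.
print(icinga_zone_endpoints(hostvars, 'master'))
# -> {'monitoring-server.example.org': {}}

# vmhost has no fqdn fact, so inventory_hostname is qualified with domain_name
# and used as both endpoint and zone name; add_host_key adds the 'host' key.
print(icinga_zone_endpoints(hostvars, 'vmhost.example.org',
                            add_host_key=True, domain_name='example.org'))
# -> {'vmhost.example.org': {'host': 'vmhost.example.org'}}
```

This also demonstrates why the explicit-zone check must preserve the zone's name rather than a boolean: the zone name is what gets compared against `zone_name`.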
I then used this to build a dictionary of zones and their endpoints, relevant to the current host, in an `icinga_zones` fact (not to be confused with `icinga_zone`, which holds the host's zone) - the block is purely to logically keep the tasks together, it serves no functional purpose:
```yaml
tasks:
  - name: Build list of icinga zones relevant to this server
    block:
      # Persistent fact caching is required, so endpoints' fqdn facts
      # are accessible even when only updating a subset of icinga
      # nodes.
      # We care about 3 groups of zones:
      # 1. This system's zone (including any other endpoints in the
      #    same zone).
      - name: Add this system's zone
        ansible.builtin.set_fact:
          # As this system's zone should be the first one set, it is
          # fine to overwrite anything already in icinga_zones.
          icinga_zones: "{{ {
              icinga_zone | default(ansible_facts.fqdn): {
                'endpoints': hostvars | icinga_zone_endpoints(
                  icinga_zone | default(ansible_facts.fqdn),
                  domain_name=ansible_facts.domain
                )
              }
            } }}"
      # 2. This system's zone's parent zone (might be none if this is
      #    the master zone).
      - name: Parent zone
        block:
          - name: Add this system's parent zone to its zone definition
            ansible.builtin.set_fact:
              # As this system's zone should be the first one set, it is
              # fine to overwrite anything already in icinga_zones.
              icinga_zones: "{{ {
                  icinga_zone | default(ansible_facts.fqdn):
                    icinga_zones[icinga_zone | default(ansible_facts.fqdn)]
                    | combine({'parent': icinga_zone_parent})
                } }}"
          - name: Add this system's parent zone
            ansible.builtin.set_fact:
              # Combine with existing zone(s)
              icinga_zones: "{{
                  icinga_zones | combine({
                    icinga_zone_parent: {
                      'endpoints': hostvars | icinga_zone_endpoints(
                        icinga_zone_parent,
                        domain_name=ansible_facts.domain
                      )
                    }
                  })
                }}"
        when: ( icinga_zone_parent | default(None) ) is not none
      # 3. This system's child zones (might be none if this is not a
      #    master or satellite system).
      #
      # Currently doing a top -> bottom connection direction, so this
      # system should attempt to connect to all children which means
      # they should have their 'host' attribute set.
      - name: Add the system's child zones
        ansible.builtin.set_fact:
          # Combine with existing zone(s)
          icinga_zones: "{{
              icinga_zones | combine({
                this_endpoint_zone: {
                  'endpoints': hostvars | icinga_zone_endpoints(
                    this_endpoint_zone,
                    add_host_key=True,
                    domain_name=ansible_facts.domain
                  ),
                  'parent': endpoint.icinga_zone_parent
                }
              })
            }}"
        loop: "{{ hostvars.values() }}"
        loop_control:
          label: "{{ endpoint.inventory_hostname }}"
          loop_var: endpoint
        vars:
          this_endpoint_zone: "{% if 'icinga_zone' in endpoint %}{{ endpoint.icinga_zone }}{% elif 'ansible_facts' in endpoint and 'fqdn' in endpoint.ansible_facts %}{{ endpoint.ansible_facts.fqdn }}{% elif '.' in endpoint.inventory_hostname %}{{ endpoint.inventory_hostname }}{% else %}{{ endpoint.inventory_hostname }}.{{ ansible_facts.domain }}{% endif %}"
        when: >
          'icinga_zone_parent' in endpoint
          and
          endpoint.icinga_zone_parent == (
            icinga_zone | default(ansible_facts.fqdn)
          )
```
Using this to generate the `zones.conf` is then a trivial case of looping over the structure to list each zone and endpoint - those with a `host` option are the ones that are connected to (so top-down or bottom-up can be toggled by changing which list of endpoints the `add_host_key` option is set to `True` for).
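A sketch of what such a loop could look like as a Jinja2 template - this is an illustrative `zones.conf.j2`, not my actual template:

```
{# Hypothetical zones.conf.j2: render the icinga_zones fact. #}
{% for zone_name, zone in icinga_zones.items() %}
{%   for endpoint_name, endpoint in zone.endpoints.items() %}
object Endpoint "{{ endpoint_name }}" {
{%     if 'host' in endpoint %}
  host = "{{ endpoint.host }}"
{%     endif %}
}
{%   endfor %}
object Zone "{{ zone_name }}" {
  endpoints = [ {{ zone.endpoints.keys() | map('to_json') | join(', ') }} ]
{%   if 'parent' in zone %}
  parent = "{{ zone.parent }}"
{%   endif %}
}
{% endfor %}
```

Only endpoints that the filter gave a `host` key end up with the `host` attribute, which is exactly what controls the connection direction.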
Enabling the API feature
In my SaltStack configuration, I checked if the `api` feature was enabled with `icinga2 feature list | grep '^Enabled features:.* api'`:
```yaml
icinga2-api-enable:
  cmd.run:
    - name: "icinga2 feature enable api"
    - unless: "icinga2 feature list | grep '^Enabled features:.* api'"
    - require:
      - file: icinga2-api-config
    - watch_in:
      - service: icinga2-svc
```
In Ansible, this would be two stages - first run the command and store the output (via `register:`), then run the enable command if required. On Debian, at least, the enable command creates a symlink `/etc/icinga2/features-enabled/api.conf`, so I used that instead. It remains to be seen if this is more brittle than using the output of the "proper" command to check for enabled features:
```yaml
- name: Enable API feature
  ansible.builtin.command:
    cmd: /usr/sbin/icinga2 feature enable api
    creates: /etc/icinga2/features-enabled/api.conf
```
If I need to revert to using the command, I imagine some variant of registering the output of `icinga2 feature list | grep '^Enabled features:'` (run as the `nagios` user) and `when: "'api' in stdout_lines[0].split(':')[1].split(' ')"` should work.
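That two-stage variant might look something like the sketch below - the register variable name and the exact parsing are my guesses at what "some variant" would be, untested against real `icinga2 feature list` output:

```yaml
# Hypothetical two-stage check using the command's own output instead
# of the features-enabled symlink.
- name: List icinga2 features
  become: yes
  become_user: nagios
  ansible.builtin.command:
    cmd: /usr/sbin/icinga2 feature list
  register: icinga2_feature_list   # assumed variable name
  changed_when: false

- name: Enable API feature
  become: yes
  ansible.builtin.command:
    cmd: /usr/sbin/icinga2 feature enable api
  # Assumes the enabled features appear on a line like
  # "Enabled features: api checker ...".
  when: >
    icinga2_feature_list.stdout_lines
    | select('match', '^Enabled features:')
    | join
    | regex_search('\\bapi\\b') is none
```

The `changed_when: false` on the first task keeps the read-only listing from reporting a change on every run.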
Certificates
Icinga uses PKI (“TLS certificates are mandatory for communication between nodes”) to confirm identity and securely communicate between nodes. My historic secret management has been poor (directly stored, usually unencrypted, in the SaltStack pillar data in a private Git repository), not wanting to repeat this with Ansible I am looking at HashiCorp Vault as a solution.
Another option I considered was to use a KeePass password safe but this would still require managing storing the safe itself, as well as the secrets to unlock it - despite historic bad decisions, I am now of the opinion that putting secrets, even encrypted, in a Git repository is generally bad idea. I also dismissed Ansible’s own vault for the same reason.
Deploying Vault is a topic that I finally picked up on 23rd January.