Orchestrating Debian install and automated post-install configuration with Ansible
In the exciting* conclusion to the process of using Ansible to find dynamic client IPs and solving the PXE/iPXE/Debian installer/dropbear/OS identity crisis, I use that work to automate the post-install configuration of systems installed with my generic Debian install preseed. In this post I go through the install Ansible playbook (including some tweaks to the Debian installer preseed file), the post-install configuration (bootstrap) playbook, the remote unlocking playbook and the reinstall playbook.
(*other opinions are available)
Install playbook
This playbook (which I called install.yaml) installs a new host. Currently it expects the host to be manually turned on. It presumes that the install is fully automated and that, once complete, the host will boot into the new OS with SSH listening. In the future, I intend to make the PXE configuration more dynamic (setting the specific host to PXE boot into the automated install, for example) which will result in some changes being needed to this.
For now I have manually set my development host to PXE boot directly into the pre-seeded Debian install via a MAC-specific override, per my previous posts on targeted PXE booting and improved iPXE configuration. As this is destructive, and the host-specific change alters the DHCP server's normal behaviour, I have crafted the playbook to only run if the special variable INSTALL_HOSTS is defined (it is used as the target host or hosts for the playbook). I have not set this variable anywhere in the inventory; it has to be explicitly specified via the -e ansible-playbook command line option.
As the install is automatic, if the host boots the pre-seeded installer option, essentially all this playbook does is configure the DHCP server and wait for the install to complete.
It configures the DHCP server to ignore the client id for the MAC of the host being installed, which at present must already be known via a variable that can be set on the host in the inventory. I have not explored whether there are alternative ways to discover the MAC of the host (e.g. if the switch port of the host is known) that would still make a workable process. It then waits for the install to complete by waiting for SSH to be listening on the target host. Once the OS is installed, it reverts the changes to the DHCP server to restore the previous (standard) behaviour. I chose the approach of reconfiguring the DHCP server as I believe it is the most robust: it supports installing via multiple methods (PXE, as I am doing here, boot from USB or CD etc.), each of which causes a different number of DHCP requests to occur, and it copes if different NICs behave differently regarding the client id they report and/or the number of times they DHCP before iPXE gets loaded.
Once the DHCP server is configured to ignore the client id, the IP address will not change (once one is assigned) because none of the network (PXE) boot loader (at least on the machines I tested on; this is part of the NIC's firmware so others could behave differently), iPXE or the Debian installer issues a DHCP release - so each new DHCP request just gets issued the (from the server's perspective) existing lease. The only scenario I foresee this not working in is a race condition that would only occur if the DHCP lease time was extremely short (a few seconds). In that case, an existing lease might expire in one of the transitions between the parts (PXE/iPXE/installer/installed OS) of the process (within any one of them, the lease should be renewed by the DHCP client once ½ of the lease time has elapsed).
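The dhcp role's host-ignore-client-id tasks are not shown in this post, but as a rough, hedged sketch of the idea (assuming an ISC dhcpd server with its configuration in /etc/dhcp/dhcpd.conf, Debian's isc-dhcp-server service name, and that ignore-client-uids is honoured at host scope - none of which is taken from my actual role), they might boil down to something like this standalone play:

- hosts: '{{ INSTALL_HOSTS }}'
  gather_facts: false
  tasks:
    - name: Host override ignoring the client id is present
      delegate_to: '{{ dhcp_server_host }}'
      become: true
      ansible.builtin.blockinfile:
        path: /etc/dhcp/dhcpd.conf
        marker: '# {mark} install override for {{ inventory_hostname }}'
        block: |
          host install-{{ inventory_hostname }} {
            hardware ethernet {{ mac_address }};
            # Track the lease by MAC only, ignoring any client id sent
            ignore-client-uids true;
          }
      notify: Restart dhcpd
  handlers:
    - name: Restart dhcpd
      delegate_to: '{{ dhcp_server_host }}'
      become: true
      ansible.builtin.service:
        name: isc-dhcp-server
        state: restarted

The install playbook itself: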
---
- hosts: localhost
# Don't use facts, so save some time by not bothering to gather them.
gather_facts: false
any_errors_fatal: true
tasks:
# If this task fails, Ansible will abort the whole playbook and not
# run subsequent plays on any host.
- ansible.builtin.assert:
that: INSTALL_HOSTS is defined
fail_msg: Set host to be installed in variable INSTALL_HOSTS - note this action may be destructive!
- hosts: '{{ INSTALL_HOSTS }}'
# Cannot guarantee host is up yet.
gather_facts: false
tasks:
# How else can the client be identified? Could we look up the MAC from a switch or existing DHCP lease, perhaps? - probably one for the reinstall script rather than install?
- name: MAC is known for the client
ansible.builtin.assert:
that: mac_address is defined
fail_msg: No mac_address fact/variable set for {{ inventory_hostname }}
# All installs going to be done via DHCP (as the installer only DHCPs) so need to configure DHCP server in all cases.
- name: DHCP server ignores client id (so IP won't change during install if using DHCP)
# * uses host's `mac_address` variable
# * uses `all` group `dhcp_server_host` variable
ansible.builtin.include_role:
name: dhcp
tasks_from: host-ignore-client-id
- name: Configuration changes have been applied
ansible.builtin.meta: flush_handlers
# XXX TODO - use pdu to power on? What if already on (e.g. as part of reinstall)?
# XXX Probably need to check it's off as a precondition if using PDU to trigger boot?
- name: DHCP IP address is known
# * uses host's `mac_address` variable
# * uses `all` group `dhcp_server_host` variable
ansible.builtin.include_role:
name: dhcp
tasks_from: lookup-host
vars:
wait_for_lease: true
- name: DHCP IP address is being used for ansible_host
block:
- ansible.builtin.set_fact:
ansible_host: "{{ dhcp_lease.ip }}"
- ansible.builtin.debug:
msg: After host discovery, ansible_host is set to '{{ ansible_host }}'.
- name: SSH daemon is up (which means auto-install has finished)
delegate_to: localhost
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 22
# This could take a long time - wait up to 30 minutes (1800 seconds)
timeout: 1800
- name: New SSH host key is known
delegate_to: localhost
ansible.builtin.command: /usr/bin/ssh-keyscan -H -T10 -tecdsa "{{ hostvars[inventory_hostname]['ansible_host'] | default(inventory_hostname) }}"
register: new_ssh_hostkey
changed_when: false # Always a read operation, never changes anything
- name: No old keys are in local known_hosts
delegate_to: localhost
# Only allow one thread to update the known_hosts file at a time
throttle: 1
ansible.builtin.known_hosts:
name: '{{ hostvars[inventory_hostname].ansible_host | default(inventory_hostname) }}'
state: absent
- name: No old keys are in local known_hosts
delegate_to: localhost
# Only allow one thread to update the known_hosts file at a time
throttle: 1
ansible.builtin.known_hosts:
name: '{{ dhcp_lease.ip }}'
state: absent
- name: New host key is in local known_hosts
delegate_to: localhost
# Only allow one thread to update the known_hosts file at a time
throttle: 1
ansible.builtin.known_hosts:
name: '{{ hostvars[inventory_hostname].ansible_host | default(inventory_hostname) }}'
key: '{{ new_ssh_hostkey.stdout }}'
- name: New host key is in local known_hosts
delegate_to: localhost
# Only allow one thread to update the known_hosts file at a time
throttle: 1
ansible.builtin.known_hosts:
name: '{{ dhcp_lease.ip }}'
key: '{{ new_ssh_hostkey.stdout }}'
- name: DHCP server no longer ignores client id
# * uses host's `mac_address` variable
# * uses `all` group `dhcp_server_host` variable
ansible.builtin.include_role:
name: dhcp
tasks_from: host-unignore-client-id
# Proceed to bootstrap new host
- name: New host is bootstrapped
ansible.builtin.import_playbook: bootstrap.yaml
vars:
BOOTSTRAP_HOSTS: "{{ INSTALL_HOSTS }}"
- name: Dynamic IPs are removed from local known hosts
hosts: '{{ INSTALL_HOSTS }}'
gather_facts: false # Won't need them
tasks:
- name: No old keys are in local known_hosts
delegate_to: localhost
# Only allow one thread to update the known_hosts file at a time
throttle: 1
# dhcp_lease.ip was found earlier
ansible.builtin.known_hosts:
name: '{{ dhcp_lease.ip }}'
state: absent
...
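For context, the mac_address host variable and the dhcp_server_host group variable that the playbook's comments refer to live in my inventory. A hypothetical YAML inventory snippet (the DHCP server name and MAC value here are invented purely for illustration) might look like:

all:
  vars:
    # Inventory hostname of the machine running the DHCP server
    dhcp_server_host: dhcp1.dev.internal
  hosts:
    test:
      # MAC of the NIC the target host will PXE boot from
      mac_address: '52:54:00:aa:bb:cc'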
Preseed improvements
Initial root password
Deciding on an initial root password
Although I did not specify it in my previous post, I was using a very simple (equivalent to the common r00tme example) root password in my initial preseed, which was convenient for developing up to this point.
I considered a number of options to improve this initial position, all set via the preseed file:
- Use a complex but static initial root password
- Not specify a root password (setting d-i passwd/root-login to false), but instead pre-create an ansible user with a complex static initial password (the Debian installer will give the user sudo rights)
- Generate a unique complex root password per install
- Not specify a root password but pre-create an ansible user with a unique complex password per install
At the moment, the preseed file is static and changing that increases the complexity of the solution (generating anything per-install requires generating a preseed file as part of the install process). So far, this setup is designed to keep the initial install generic and do customisation with a configuration management tool (currently Ansible); generating a custom, per-install preseed file blurs that boundary. For these reasons I decided not to do any per-install password setting in the preseed, despite it being the more secure posture.
Having ruled out per-install passwords, I considered whether to configure an initial root password or to disable root login and pre-create an ansible user. As the installer will give that user unrestricted sudo access, out of the box (password login, no ssh key, so single-factor login) this provides limited security advantage over using root directly (the superuser being ansible rather than root is an example of security through obscurity).
I also considered that using a static password in the preseed does not preclude rotating that password periodically to improve security (the only constraint being that no installs can be in progress at the time of rotation).
Generating the initial root password
I generated a password (using my previous quick and dirty hack to generate the password) and stored it in /kv/install/initial_root_password in the vault:
PASS_FILE=$(mktemp)
chmod 600 ${PASS_FILE} # Be sure only we can read it
tr -dc '[:print:]' < /dev/urandom | head -c32 > ${PASS_FILE}
vault kv put -mount=kv /install/initial_root_password password=@${PASS_FILE}
rm -f ${PASS_FILE}
To generate the crypted hash (for the preseed file), with bash (to set it statically for now):
vault read -field=password /kv/install/initial_root_password | mkpasswd -m sha512crypt -R 656000 -s
The -R 656000 just matches Ansible's ansible.builtin.password_hash filter's default.
or, with Ansible (which could be useful if the preseed file were templated, e.g. for password rotation) - note that I improved on this when I refreshed the PXE configuration after writing this bit but before publishing this post:
{{ lookup('community.hashi_vault.vault_read', 'kv/install/initial_root_password').data.password | ansible.builtin.password_hash }}
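To sketch how that rotation might eventually be wired up (hypothetical: my preseed file is not currently templated, and the template name, web-server host and destination path below are invented):

- name: Rotate the shared initial root password
  hosts: localhost
  gather_facts: false
  tasks:
    - name: New password is generated and stored in the vault
      community.hashi_vault.vault_write:
        path: kv/install/initial_root_password
        data:
          password: "{{ lookup('ansible.builtin.password', '/dev/null', chars=['ascii_letters', 'digits', 'punctuation'], length=32) }}"
    - name: Preseed file is re-templated with the new crypted hash
      delegate_to: preseed-webserver        # hypothetical inventory name
      become: true
      ansible.builtin.template:
        src: preseed.cfg.j2                 # hypothetical template containing the lookup/password_hash expression above
        dest: /var/www/html/d-i/preseed.cfg # hypothetical path served to the installer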
Not sending a client id with DHCP requests
Post configuration, there is a discrepancy between the DHCP request issued by the initramfs environment for remote unlocking and the one issued by the OS after the encryption has been successfully unlocked. Specifically, the OS client (dhclient by default) sends a client id with its requests and the initramfs client does not, which causes an RFC2131-compliant server to treat them as different clients and therefore issue a different dynamic address in response to each request.
The OS client can be configured to not send a client id by adding client no to the interface's configuration in /etc/network/interfaces, which means both DHCP clients are treated as the same machine by the server. This results in a single IP address being issued to both, so the address does not change between the encryption unlock and the OS being fully loaded, and the system does not consume an extra address from the pool (until the lease on the first expires - having already proven empirically that a DHCP release is not issued on the first address).
Setting configuration with preseed
Using the same pattern I used before, I created a script called preseed-no-dhcp-client-id that adds commands to the preseed/late_command script that gets built up in /tmp (very meta) and is run after the install is complete but before the system is rebooted:
#!/bin/sh
# ^^ no bash in installer environment, only BusyBox
# Die on error
set -e
cat - >>/tmp/late-command-script <<EOF
## BEGIN ADDED BY preseed-no-dhcp-client-id preseed/include_command
in-target /usr/bin/sed -i '/^iface .* inet dhcp$/a # Do NOT send client-id, to match initramfs behaviour so IP does not\n# change between initramfs and OS.\nclient no' /etc/network/interfaces
## END ADDED BY preseed-no-dhcp-client-id preseed/include_command
EOF
This is copied to the preseed web-server by the existing Ansible playbook (just another filename in the loop) and then added to the preseed/include_command loop in the preseed configuration file itself:
d-i preseed/include_command string \
for file in \
preseed-script-headers \
preseed-crypto-key \
preseed-ssh-setup \
preseed-remove-dummy-lv \
preseed-no-dhcp-client-id \
; do \
wget -P /tmp $( dirname $( debconf-get preseed/url ) )/$file && \
chmod 500 /tmp/$file && \
/tmp/$file; \
done;
For the future, this (the list of files) should be turned into a single variable that is used for both the files to push out to the web-server and the loop in the preseed configuration file (so there is one list to maintain).
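As a hedged sketch of that future change (the group name, document root and template are invented for illustration):

- name: Deploy preseed helper scripts from a single list
  hosts: preseed_webserver            # hypothetical group
  vars:
    # One list to maintain - used for the copy loop below and iterated by a
    # Jinja2 loop inside the (hypothetical) preseed.cfg.j2 template to build
    # the preseed/include_command wget loop.
    preseed_helper_scripts:
      - preseed-script-headers
      - preseed-crypto-key
      - preseed-ssh-setup
      - preseed-remove-dummy-lv
      - preseed-no-dhcp-client-id
  tasks:
    - name: Helper scripts are on the web-server
      become: true
      ansible.builtin.copy:
        src: '{{ item }}'
        dest: /var/www/html/d-i/      # hypothetical document root
        mode: '0444'
      loop: '{{ preseed_helper_scripts }}'
    - name: Preseed configuration is templated from the same list
      become: true
      ansible.builtin.template:
        src: preseed.cfg.j2
        dest: /var/www/html/d-i/preseed.cfg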
Post install configuration
The post-install Ansible playbook is what does all of the configuration; the base OS install is entirely generic. The playbook consists of a number of plays, which I will go through step by step.
The preseed deploys some ssh keys, from the HashiCorp Vault, so that initially Ansible can log in as root without a password to do the bootstrap.
1. Check that BOOTSTRAP_HOSTS is set
Unlike the install playbook, it should be relatively safe (if not thoroughly tested) to (re)run the bootstrap against an already provisioned host, so requiring that a specific host is targeted (via the BOOTSTRAP_HOSTS variable, similar to the INSTALL_HOSTS used before) would not strictly be necessary. However, several tasks that run before the ansible user is created are hardcoded to use the root user, and these will fail once SSH has been secured to prevent remote root logins. Those tasks are tagged root, so they can be skipped, but that requires intimate knowledge of the process and is intended for debugging/troubleshooting purposes - e.g. if the play does not complete. The install playbook sets BOOTSTRAP_HOSTS to INSTALL_HOSTS when including the bootstrap playbook.
- hosts: localhost
# Don't use facts, so save some time by not bothering to gather them.
gather_facts: false
any_errors_fatal: true
tasks:
# If this task fails, Ansible will abort the whole playbook and not
# run subsequent plays on any host.
- ansible.builtin.assert:
that: BOOTSTRAP_HOSTS is defined
fail_msg: Set host to be bootstrapped in variable BOOTSTRAP_HOSTS - note this playbook will fail on already bootstrapped hosts!
2. Install python
In order for Ansible modules to work, Python needs to be installed. This is not present by default on a minimal Debian install, such as the one done by my preseed, so installing this (using the ansible.builtin.raw module that does not require Python) is the first thing that needs to happen:
- name: Python is available
hosts: '{{ BOOTSTRAP_HOSTS }}'
# Fact gathering will fail if no python on remote host
gather_facts: false
tags: root
vars:
ansible_become_method: su
# Keys should let us in as root - ssh password auth will be
# disabled by default.
ansible_user: root
tasks:
- name: Python is installed
become: true
# XXX This assumes Debian - need to be cleverer for other OSs
# Only the last line of stdout ('Present', 'Installed' or 'Failed')
# is used to decide the result; the rest of the output is left in
# place so it can be inspected if a failure happens
ansible.builtin.raw: bash -c '(which python3 && echo "Present") || (apt-get -y install python3 && echo "Installed") || echo "Failed"'
register: python_install_output
# Changed when we had to install python
changed_when: python_install_output.stdout_lines[-1] == 'Installed'
# Failed if python wasn't there and (in the logical sense) it
# didn't install.
failed_when: python_install_output.stdout_lines[-1] not in ['Installed', 'Present']
3. Set the hostname
The Debian installer configures the LVM volume group to be named according to the hostname. This is a design decision that I like: it means that, e.g. for recovery, a disk can easily be placed into another Linux system without worrying about clashing volume group names - a situation that means more hoops have to be jumped through (renaming a volume group) to make the logical volumes accessible, which I have experienced with RedHat family systems. However, it also means that when I change the hostname I want to rename the volume group to match (the preseed builds the host with the hostname unconfigured-hostname).
Before writing the playbook, I manually worked out the process required:
- Change hostname: hostnamectl hostname test
- Update /etc/hosts with the new local hostname: sed -i 's/127.0.1.1.*$/127.0.1.1 test.dev.internal test/' /etc/hosts
- Reboot, for the correct (new) hostname to be logged in /etc/lvm/backup when the VG is renamed
- Rename the volume group, based on https://wiki.debian.org/LVM#Renaming_a_volume_group. vgrename needs to be done while logged directly in as root as it causes /home to unmount; this is why it is done at this stage of the process, before securing ssh (below), as that disables remote root:
  - Rename the volume group: vgrename unconfigured-hostname-vg vg_$( hostname -s )
  - Create symlinks to the old logical volume names (or the system will not be able to reboot):
    cd /dev/mapper
    for lv in /dev/mapper/vg_$( hostname -s | sed 's/-/--/g' )-*
    do
      ln -s "${lv##*/}" "/dev/mapper/unconfigured--hostname--vg-${lv##*-}"
    done
  - Update paths of filesystems in /etc/fstab: sed -i "s#unconfigured--hostname--vg#vg_$( hostname -s | sed 's/-/--/g' )#g" /etc/fstab
  - Update the path to the resume partition in /etc/initramfs-tools/conf.d/resume: sed -i "s#unconfigured--hostname--vg#vg_$( hostname -s | sed 's/-/--/g' )#g" /etc/initramfs-tools/conf.d/resume
  - Update paths to filesystems in /boot/grub/grub.cfg (I think this is just the root filesystem?): sed -i "s#unconfigured--hostname--vg#vg_$( hostname -s | sed 's/-/--/g' )#g" /boot/grub/grub.cfg
  - Update the initramfs with the new paths: update-initramfs -c -k all
  - Reboot with the new paths: reboot
  - Update grub (which will fail before the reboot because it uses the mounted rather than the configured filesystems): update-grub
- Finally, just for neatness, update the hostname in the comments in the host's ssh keys: sed -i "s#unconfigured-hostname#$( hostname -s )#g" /etc/ssh/ssh_host_*.pub
The Ansible play follows this exact same process:
- name: Hostname is correct
hosts: '{{ BOOTSTRAP_HOSTS }}'
tags: root
vars:
ansible_become_method: su
# Keys should let us in as root - ssh password auth will be
# disabled by default.
ansible_user: root
handlers:
- name: Reboot
become: true
ansible.builtin.reboot:
tasks:
- name: Hostname is set
become: true
ansible.builtin.hostname:
name: '{{ inventory_hostname }}'
notify: Reboot
- name: Hostname is correct in /etc/hosts
become: true
ansible.builtin.lineinfile:
# Use dns.domain which is the domain from the DHCP server.
# I do not know how reliable this is to guarantee the domain can be found?
line: 127.0.1.1 {{ inventory_hostname }}.{{ ansible_facts.dns.domain }} {{ inventory_hostname }}
path: /etc/hosts
regexp: '^127.0.1.1\s'
state: present
notify: Reboot
- name: Hostname changes have applied
ansible.builtin.meta: flush_handlers
# LVM rename process based on https://wiki.debian.org/LVM#Renaming_a_volume_group
- name: LVM Volume group is named correctly
become: true
# Should look at community.general.lvg_rename but not in the
# version of community.general (5.8.0) installed.
ansible.builtin.command:
# N.B. vgrename needs to be done while logged directly in as
# root as it causes /home to unmount (so pre-securing ssh).
argv:
- /usr/sbin/vgrename
- unconfigured-hostname-vg
- vg_{{ inventory_hostname }}
when: "'unconfigured-hostname-vg' in ansible_facts.lvm.vgs"
register: lvm_rename
notify: Reboot
- name: Update LVM VG dependent configuration
block:
- name: List of logical volumes is known
ansible.builtin.find:
paths: /dev/mapper/
file_type: any # Default 'file' doesn't match symlinks
patterns: "vg_{{ inventory_hostname | replace('-', '--') }}-*"
register: lv_names
- name: Links for old volume group names exist
become: true
ansible.builtin.file:
path: "{{ item.path | replace('/vg_' + (inventory_hostname | replace('-', '--')) + '-', '/unconfigured--hostname--vg-') }}"
src: '{{ item.path }}'
state: link
loop: '{{ lv_names.files }}'
- name: Configuration files are correct
become: true
ansible.builtin.replace:
path: '{{ item }}'
regexp: unconfigured--hostname--vg
replace: vg_{{ inventory_hostname | replace('-', '--') }}
loop:
- /etc/fstab
- /etc/initramfs-tools/conf.d/resume
- /boot/grub/grub.cfg
notify: Reboot
- name: Initramfs is correct
become: true
ansible.builtin.command:
argv:
- /usr/sbin/update-initramfs
- -c
- -k
- all
notify: Reboot
- name: Logical volume changes have applied
ansible.builtin.meta: flush_handlers
- name: Grub configuration is up to date
become: true
ansible.builtin.command: /usr/sbin/update-grub
when: lvm_rename.changed
- name: List of ssh host public key files is known
ansible.builtin.find:
paths: /etc/ssh
file_type: any # Shouldn't be symlinks but why risk it?
patterns: ssh_host_*.pub
register: ssh_host_keys
- name: Hostname is correct in host key comments
become: true
ansible.builtin.replace:
path: '{{ item.path }}'
regexp: unconfigured-hostname
replace: '{{ inventory_hostname }}'
loop: '{{ ssh_host_keys.files }}'
4. Secure user accounts
This next play installs sudo, adds an ansible user account with permission to use sudo, sets up some ssh keys with access to the ansible user and generates, stores and sets both the ansible and root account passwords to random, host-specific, values. The passwords are set using the recipe from my previous post, which uses a random seed based on the hostname to make it idempotent (otherwise a new encrypted value is generated on each run so the password is always updated even though the password itself is unchanged).
- name: User accounts are secured
hosts: '{{ BOOTSTRAP_HOSTS }}'
tags: root
vars:
ansible_become_method: su
# Keys from install process should let us in as root - ssh
# for root will be `prohibit-password` by default.
ansible_user: root
tasks:
- name: Sudo is installed
become: true
ansible.builtin.package:
name: sudo
state: present
- name: New passwords are stored in vault
delegate_to: localhost
community.hashi_vault.vault_write:
path: kv/hosts/{{ inventory_hostname }}/users/{{ item }}
data:
password: "{{ lookup('ansible.builtin.password', '/dev/null', chars=['ascii_letters', 'digits', 'punctuation']) }}"
loop:
- root
- ansible
- name: New passwords are set on remote host
become: true
ansible.builtin.user:
name: "{{ item }}"
password: "{{ lookup('community.hashi_vault.vault_read', 'kv/hosts/' + inventory_hostname + '/users/' + item).data.password | ansible.builtin.password_hash('sha512', 65534 | random(seed=inventory_hostname) | string) }}"
loop:
- root
- ansible
- name: Ansible user can sudo
become: true
ansible.builtin.user:
name: ansible
append: true
groups:
- sudo
- name: Ansible user ssh keys are correct
ansible.posix.authorized_key:
user: ansible
# Use the same keys that allow root in for now - probably need to revisit this?
key: "{{ lookup('community.hashi_vault.vault_read', 'kv/install/initial_root_keys').data.ssh_keys }}"
state: present
5. Secure SSH
Until this point, my playbook has been hard-coded to log in as root and use su as the escalation method. Once the ansible user is created, the default settings in my inventory should begin working - to be sure, this play begins with a sanity check that Ansible can become root without any play-specific settings. This is to try to avoid accidentally locking myself out of the system by locking it down when all is not well. Provided that it can become root, the play then disables logging in directly as root, as well as interactive login methods, so only key-based access will work.
- name: Lockdown SSH
hosts: '{{ BOOTSTRAP_HOSTS }}'
handlers:
- name: Restart sshd
become: true
ansible.builtin.service:
name: sshd
state: restarted
tasks:
- name: Can login and escalate using default (should be `ansible` user) credentials (sanity check)
become: true
ansible.builtin.command: /usr/bin/whoami
changed_when: false
register: sanity_check_output
- name: Assert successfully became root (sanity check - belt and braces)
ansible.builtin.assert:
that:
- sanity_check_output.stdout == 'root'
# Can definitely now login and escalate to root - proceed to
# disable root login and prohibit password ssh logins.
- name: Root ssh login is denied
become: true
ansible.builtin.lineinfile:
line: PermitRootLogin no
path: /etc/ssh/sshd_config
regexp: '^#?PermitRootLogin\s'
state: present
notify: Restart sshd
- name: Password authentication is denied
become: true
ansible.builtin.lineinfile:
line: PasswordAuthentication no
path: /etc/ssh/sshd_config
regexp: '^#?PasswordAuthentication\s'
state: present
notify: Restart sshd
- name: Challenge response authentication is denied (Debian <12)
become: true
ansible.builtin.lineinfile:
line: ChallengeResponseAuthentication no
path: /etc/ssh/sshd_config
regexp: '^#?ChallengeResponseAuthentication\s'
state: present
notify: Restart sshd
# Changed to KbdInteractiveAuthentication in Bookworm (12)
when: ansible_facts.distribution == 'Debian' and (ansible_facts.distribution_major_version | int) < 12
- name: Challenge response authentication is denied (Debian 12+)
become: true
ansible.builtin.lineinfile:
line: KbdInteractiveAuthentication no
path: /etc/ssh/sshd_config
regexp: '^#?KbdInteractiveAuthentication\s'
state: present
notify: Restart sshd
# Changed from ChallengeResponseAuthentication in Bookworm (12)
when: ansible_facts.distribution == 'Debian' and (ansible_facts.distribution_major_version | int) >= 12
6. Setup LUKS encryption passphrase and remote unlocking
During the preseeded install, LUKS encryption is set up with a dynamically generated key embedded in the unencrypted initial ram disk. Before any data ends up on the system, this key needs to be replaced with a per-system key that is not stored on the system, and the existing key revoked, so that neither the unencrypted boot partition nor anything else on the system is sufficient to unlock the encryption.
In addition to replacing the encryption key, the play installs the dropbear ssh server and its initramfs integration, which provides an early-boot SSH service that can be used to remotely unlock the encryption during the boot process.
I initially set up dropbear with a late_command in the preseed file that configured an ssh key to allow remote unlocking:
# Dropbear options:
# -I 600 - Disconnect session after 600 seconds (10 minutes) of inactivity
# -j - Disable local port forwarding
# -k - Disable remote port forwarding
# -p 2222 - Listen on port 2222
# -s - Do not allow password authentication (won't work anyway in the initramfs environment)
# /etc/dropbear/initramfs/dropbear.conf was /etc/dropbear-initramfs/config in Bullseye - moved in Bookworm
d-i preseed/late_command string \
in-target sed -i 's/^#*DROPBEAR_OPTIONS=.*$/DROPBEAR_OPTIONS="-I 600 -j -k -p 2222 -s"/' /etc/dropbear/initramfs/dropbear.conf && \
in-target mkdir -m 700 -p /root/.ssh && \
in-target /usr/bin/sh -c 'echo \'ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBGELav9hG7S1Kohs5QyEsrBIXLbT18tdTZCFg5rUITwxXg1JDKlzuR7v+8zLmbzWCBs0IR8QA9EBw0099h8QW3A= laurence@core\' > /root/.ssh/authorized_keys' && \
in-target cp /root/.ssh/authorized_keys /etc/dropbear/initramfs/authorized_keys && \
in-target update-initramfs -u;
This command:
- Configures dropbear-initramfs's options for the early initramfs ssh daemon (to allow remote unlocking of the encrypted volume)
- Creates root's .ssh/authorized_keys file (to allow root to log in with that key to perform initial configuration of the system)
- Copies root's .ssh/authorized_keys file to dropbear-initramfs's authorized_keys, so that the same key can also log in to unlock
- Runs update-initramfs to ensure the dropbear changes are incorporated into the initial ram disk image
Making dropbear listen on a different port serves a couple of purposes:
- For automation, being on a different port allows (e.g.) Ansible to easily tell whether the system is waiting to be unlocked or the “normal” ssh daemon is listening post boot (see the sketch after this list)
- The dropbear service presents different host keys from the main OS ssh daemon, and having it on a different port makes this easy for clients to manage
- It allows the dropbear ssh service, used for remote unlocking, to be firewalled differently - being on a different port makes it easy for external (to this system) firewalls to be more restrictive about which remote systems can access it
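As an illustration of the first point, a couple of hedged example tasks could work out what state a host is in purely by probing the two ports (the timeout and the reporting here are arbitrary choices of mine, not something from my playbooks):

- name: Unlock (2222) and normal (22) ssh ports are probed
  delegate_to: localhost
  ansible.builtin.wait_for:
    host: '{{ ansible_host }}'
    port: '{{ item }}'
    timeout: 5
  loop: [2222, 22]
  register: ssh_probe
  # A closed port is not an error here - it just tells us the state
  ignore_errors: true
- name: Host state is reported
  ansible.builtin.debug:
    msg: "{{ 'waiting for unlock' if not ssh_probe.results[0].failed else ('booted' if not ssh_probe.results[1].failed else 'unreachable') }}"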
My play configures dropbear the same way. It uses community.crypto.luks_device, so the community.crypto collection needs adding to requirements.yaml (a sketch of that file follows the play below) and installing with ansible-galaxy collection install -r requirements.yaml, if not already present:
- name: Disk encryption is setup for remote unlocking
hosts: '{{ BOOTSTRAP_HOSTS }}'
handlers:
- name: Current initramfs is updated
become: true
ansible.builtin.command: /usr/sbin/update-initramfs -u
- name: Old dropbear host key is deleted
delegate_to: localhost
# Only allow one thread to update the known_hosts file at a time
throttle: 1
ansible.builtin.known_hosts:
name: '[{{ hostvars[inventory_hostname].ansible_host | default(inventory_hostname) }}]:2222'
state: absent
- name: Existing dhclient lease is cleared
become: true
ansible.builtin.file:
path: /var/lib/dhcp/dhclient.{{ ansible_facts.default_ipv4.interface }}.leases
state: absent
tasks:
- name: dropbear is installed
become: true
ansible.builtin.package:
name: dropbear-initramfs
state: present
- name: Unlock ssh keys are deployed
become: true
ansible.posix.authorized_key:
user: root
path: /etc/dropbear/initramfs/authorized_keys
manage_dir: false
# Use the same keys that allow root in for now - probably need to revisit this?
key: "{{ lookup('community.hashi_vault.vault_read', 'kv/install/initial_root_keys').data.ssh_keys }}"
state: present
notify: Current initramfs is updated
- name: Dropbear configuration is correct
become: true
ansible.builtin.lineinfile:
line: DROPBEAR_OPTIONS="-I 600 -j -k -p 2222 -s"
path: /etc/dropbear/initramfs/dropbear.conf
regexp: '^#?DROPBEAR_OPTIONS='
state: present
notify:
- Current initramfs is updated
- Old dropbear host key is deleted
- name: Installer LUKS unlock key file is `stat`ed
become: true
ansible.builtin.stat:
path: /etc/keys/luks-lvm.key
register: luks_key_stat
- name: LUKS passphrase is setup
block:
- name: Block device and filesystem types are known
ansible.builtin.command: /usr/bin/lsblk -o PATH,FSTYPE -J
register: block_path_type_json
# Read only operation - never changes anything
changed_when: false
- name: Encrypted block devices are known
ansible.builtin.set_fact:
encrypted_block_devices: >-
{{
(block_path_type_json.stdout | from_json).blockdevices
|
selectattr('fstype', 'eq', 'crypto_LUKS')
|
map(attribute='path')
}}
- name: Only one encrypted device exists
ansible.builtin.assert:
that:
- encrypted_block_devices | length == 1
- name: Encrypted device name is stored
ansible.builtin.set_fact:
encrypted_block_device: "{{ encrypted_block_devices | first }}"
- name: New passphrase is stored in vault
delegate_to: localhost
community.hashi_vault.vault_write:
path: kv/hosts/{{ inventory_hostname }}/luks/passphrase
data:
passphrase: "{{ lookup('ansible.builtin.password', '/dev/null', chars=['ascii_letters', 'digits', 'punctuation'], length=40) }}"
- name: New passphrase is set
become: true
community.crypto.luks_device:
new_passphrase: "{{ lookup('community.hashi_vault.vault_read', 'kv/hosts/' + inventory_hostname + '/luks/passphrase').data.passphrase }}"
keyfile: /etc/keys/luks-lvm.key
device: "{{ encrypted_block_device }}"
- name: Installer generated key file is removed for unlocking
become: true
community.crypto.luks_device:
remove_keyfile: /etc/keys/luks-lvm.key
passphrase: "{{ lookup('community.hashi_vault.vault_read', 'kv/hosts/' + inventory_hostname + '/luks/passphrase').data.passphrase }}"
device: "{{ encrypted_block_device }}"
- name: Installer generated key file is removed from crypttab
become: true
ansible.builtin.replace:
path: /etc/crypttab
regexp: '/etc/keys/luks-lvm\.key'
replace: 'none'
# Needs to be correct in initramfs or cryptroot-unlock will not prompt for passphrase
notify: Current initramfs is updated
- name: Installer generated key file is removed from disk
become: true
ansible.builtin.file:
path: /etc/keys/luks-lvm.key
state: absent
# Key file will need removing from initramfs
notify: Current initramfs is updated
- name: No keys are copied to initramfs
become: true
ansible.builtin.lineinfile:
line: '#KEYFILE_PATTERN='
path: /etc/cryptsetup-initramfs/conf-hook
regexp: '^#?KEYFILE_PATTERN='
state: present
notify: Current initramfs is updated
when: luks_key_stat.stat.exists
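For reference, the collections these playbooks lean on could all be captured in requirements.yaml along these lines (a sketch - I have deliberately not pinned versions here):

---
collections:
  - name: ansible.posix           # authorized_key
  - name: ansible.utils           # resolvable test (used in the unlock playbook)
  - name: community.crypto        # luks_device
  - name: community.general       # lvg_rename in newer releases
  - name: community.hashi_vault   # vault_read lookup and vault_write module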
7. Test remote unlocking
Finally, before the playbook ends, I reboot the target and test that the remote unlock works - this verifies that the unlock passphrase in the vault is correct and that the whole process works, before any data that might possibly be important is written inside the encrypted filesystem (at this point everything could be recreated by re-running the install and bootstrap process).
At the moment the encryption unlocking is a playbook in its own right, which has to be included as its own play (ansible.builtin.import_playbook can only be used at the top/play level). It (and a number of other things, like creating/refreshing the luks passphrase) would be better as tasks in a role, but that is left as an exercise for the future.
- name: Host is rebooted to test unlock
hosts: '{{ BOOTSTRAP_HOSTS }}'
tasks:
- name: Host is rebooted
become: true
# Ansible's reboot command waits, and checks, for the host to
# come back, which will never happen. Even with async (see below)
# there is a race condition if the ssh connection gets closed (by
# the shutdown process) before Ansible has disconnected so it is
# necessary to delay the shutdown command by longer than the
# async value, in order to avoid this problem.
ansible.builtin.shell: 'sleep 2 && /usr/bin/systemctl reboot --message="Ansible triggered reboot for LUKS unlock sanity check."'
# Run in background, waiting 1 second before closing connection
async: 1
# Launch in fire-and-forget mode - with a poll of 0 Ansible skips
# polling entirely and moves to the next task, which is precisely
# what we need.
poll: 0
# XXX turn this into either tasks or role, so it can be part of the above play.
- name: Check remote unlock is working
ansible.builtin.import_playbook: unlock-crypt.yaml
vars:
UNLOCK_HOSTS: '{{ BOOTSTRAP_HOSTS }}'
Remotely unlocking the encrypted root filesystem
Once I had dropbear set up, and before I finished the bootstrap playbook, I developed a playbook to unlock systems remotely using their passphrase from the vault. The dropbear environment does not have a python interpreter, so most Ansible modules will not work directly.
My first attempt was to use the ansible.builtin.raw module to run the unlock command on the remote system; however, that module does not support passing a value via stdin, so some other means has to be found to get the unlock key to the unlock command. Other modules, including ansible.builtin.copy, require python on the remote system (which is not available inside the initramfs), which makes it harder.
---
- hosts: all
gather_facts: false
tasks:
- name: Wait for dropbear ssh daemon to be up
delegate_to: localhost
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 2222
- name: Unlock encrypted disk
# No python in the initramfs. Not ideal - briefly exposes
# password via proc within initramfs environment.
# ansible.builtin.raw doesn't support passing directly to stdin.
ansible.builtin.raw: echo -n {{ unlock_key | quote }} | cryptroot-unlock
vars:
# Pulled this out so it can be replaced with a lookup in
# the future.
unlock_key: supersecretkey
ansible_user: root # Temporarily login directly as root
# Temporarily login to the dropbear initramfs daemon that
# I configured on a different port.
ansible_port: 2222
- name: Wait for host to be properly up
delegate_to: localhost
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 22
...
After getting this functioning, I went looking for a better solution and found a feature request for stdin support; however, it has been closed due to its age. It did lead me (via https://github.com/gsauthof/dracut-sshd/issues/32) to https://github.com/gsauthof/playbook/blob/master/fedora/initramfs/ansible/unlock_tasks.yml, which uses a command task delegated to localhost to run ssh locally and issue the unlock command, with the key advantage of being able to pass the password in on stdin. While this would be ugly in most contexts, it is little worse than using the raw module, and it is reasonable to expect the ssh client to be installed where the playbook is being run (even on Windows). Whether this last assumption (that an ssh client is installed) holds true when using containers to execute playbooks (as I presume Ansible Automation Platform does), I do not yet know.
Taking this method, to reduce the risk of exposing the unlock key, I ended up with this:
---
- hosts: "{{ UNLOCK_HOSTS | default('all') }}"
# Host will not be in a state to gather_facts if waiting to be unlocked
gather_facts: false
tasks:
- name: Attempt to find connection details if needed
# * uses host's `mac_address` variable
# * uses `all` group `dhcp_server_host` variable
ansible.builtin.include_role:
name: dhcp
tasks_from: lookup-host
# Cannot be part of the block or Ansible applies the when
# to all the included tasks, including those that are
# delegated (and hence the test evaluated against the
# delegated host rather than the current host).
when: >-
ansible_host == inventory_hostname
and
inventory_hostname is not ansible.utils.resolvable
- name: Wait for dropbear ssh daemon to be up
delegate_to: localhost
ansible.builtin.wait_for:
host: "{{ dhcp_lease.ip | default(hostvars[inventory_hostname]['ansible_host']) | default(inventory_hostname) }}"
port: 2222
- name: Unlock encrypted disk
# No python in the initramfs. Work around ansible.builtin.raw
# not supporting stdin (https://github.com/ansible/ansible/issues/34556)
delegate_to: localhost
# Accept new (but not changed) host keys - so first connection after
# install works (provided old key has been removed)
ansible.builtin.command:
cmd: >
ssh
-o StrictHostKeyChecking=accept-new
-p 2222
-l root
{{ dhcp_lease.ip | default(hostvars[inventory_hostname]['ansible_host']) | default(inventory_hostname) }}
cryptroot-unlock
stdin: "{{ unlock_key }}"
stdin_add_newline: false
vars:
unlock_key: >-
{{
lookup(
'community.hashi_vault.vault_read',
'kv/hosts/' + inventory_hostname + '/luks/passphrase'
).data.passphrase
}}
- name: Wait for host to be properly up
delegate_to: localhost
ansible.builtin.wait_for:
host: "{{ dhcp_lease.ip | default(hostvars[inventory_hostname]['ansible_host']) | default(inventory_hostname) }}"
port: 22
...
Reinstall playbook
To round this off, I wrote a reinstall playbook based on my previous work doing the same with Rocky Linux. It just destroys the partition table and reboots the host, before launching the install playbook above:
---
- hosts: localhost
# Don't use facts, so save some time by not bothering to gather them.
gather_facts: false
any_errors_fatal: true
tasks:
# If this task fails, Ansible will abort the whole playbook and not
# run subsequent plays on any host.
- ansible.builtin.assert:
that: REDEPLOY_HOSTS is defined
fail_msg: Set host to be deployed in variable REDEPLOY_HOSTS - note this action is destructive!
- hosts: '{{ REDEPLOY_HOSTS }}'
tasks:
# Required to blow away GPT partition table later. Do this early so
# if there's a problem installing it will fail early (before any
# destructive action has been taken)
- name: Ensure gdisk is installed
become: true
ansible.builtin.package:
name: gdisk
state: present
# Ironically, have to install this new package just to immediately
# destroy the machine with Ansible's ansible.builtin.expect module.
- name: Install pexpect python module
become: true
ansible.builtin.package:
name: python3-pexpect # For Rocky 8 - maybe different on others?
state: present
- name: Get list of partitions
ansible.builtin.set_fact:
disks_with_partitions: "{{ disks_with_partitions + [item.key] }}"
loop: "{{ ansible_facts.devices | dict2items }}"
vars:
disks_with_partitions: []
when: item.value.removable == '0' and item.value.partitions | length > 0
loop_control:
label: "{{ item.key }}"
- name: Destroy host's disk partition table(s) (to enable fall through to other boot methods for auto-reinstall)
become: true
ansible.builtin.expect:
command: gdisk /dev/{{ item }}
# Although this is a map (and therefore unordered), each prompt
# will only appear once so I am not worried about multiple
# matches happening.
responses:
# x == Enter expert mode
'Command \(\? for help\):': x
# z == zap (destroy) GPT partition table
'Expert command \(\? for help\):': z
# Ansible doesn't seem to substitute `{{ item }}` in a key,
# so have to do a looser match. Will always be on a disk,
# never a partition, so should not end with a digit. On my
# systems `[a-z]+` seems sufficient.
'About to wipe out GPT on /dev/[a-z0-9]+. Proceed\? \(Y/N\):': Y
'Blank out MBR\? \(Y/N\):': Y
loop: "{{ disks_with_partitions }}"
- name: Reboot host
become: true
# Ansible's reboot command waits, and checks, for the host to
# come back, which will never happen. Even with async (see below)
# there is a race condition if the ssh connection gets closed (by
# the shutdown process) before Ansible has disconnected so it is
# necessary to delay the shutdown command by longer than the
# async value, in order to avoid this problem.
ansible.builtin.shell: 'sleep 2 && /usr/bin/systemctl reboot --message="Ansible triggered reboot for system redeployment."'
# Run in background, waiting 1 second before closing connection
async: 1
# Launch in fire-and-forget mode - with a poll of 0 Ansible skips
# polling entirely and moves to the next task, which is precisely
# what we need.
poll: 0
- name: Old keys are removed from local known_hosts
delegate_to: localhost
# Only allow one thread to update the known_hosts file at a time
throttle: 1
ansible.builtin.known_hosts:
name: '{{ hostvars[inventory_hostname].ansible_host | default(inventory_hostname) }}'
state: absent
- name: Old crypt unlock keys are removed from local known_hosts
delegate_to: localhost
# Only allow one thread to update the known_hosts file at a time
throttle: 1
ansible.builtin.known_hosts:
name: '[{{ hostvars[inventory_hostname].ansible_host | default(inventory_hostname) }}]:2222'
state: absent
- name: Begin install process
ansible.builtin.import_playbook: install.yaml
vars:
INSTALL_HOSTS: '{{ REDEPLOY_HOSTS }}'
...
Ansible defaults
Hosts deployed this way should default to logging in with the ansible user, using local keys (as ssh password login is entirely disabled), and using the system-specific password from the vault to escalate privileges:
ansible_user: ansible
ansible_become_password: "{{ lookup('community.hashi_vault.vault_read', 'kv/hosts/' + inventory_hostname + '/users/ansible').data.password }}"
In the longer term, these settings should become the default for all of my systems.
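In practice that will probably end up as a group-wide variables file; a sketch (I have not made this change yet, and the file location is only my guess at where it would naturally live) might be:

# group_vars/all.yaml (hypothetical location)
ansible_user: ansible
ansible_become_password: "{{ lookup('community.hashi_vault.vault_read', 'kv/hosts/' + inventory_hostname + '/users/ansible').data.password }}"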
Future actions
This post ends with a list of actions that need to be done at some point:
- Bring host-specific pxe config for automated preseeded (re)install under Ansible control (as opposed to manual)
- The list of scripts deployed to the preseed web-server and then fetched by the preseed configuration file should be turned into a single variable that is used for both (so there is one list to maintain)
- Modify the DHCP lease lookup to “wait until a lease newer than x appears”, rather than relying on the client id being consistent between, for example, dropbear and the OS (then drop changing the network configuration from the preseed install process)
- Make the encryption unlock a role instead of a playbook
- Move replacing the generated luks key with a passphrase into the same role, and make it general enough that it can also be used to rotate the passphrases periodically