This post is about automating, end-to-end, reinstalling a booted Rocky Linux system. The automation is done using Ansible, but Puppet is used to configure the hosts, so Ansible orchestrates Puppet too. This is closely related to my recent posts on kickstart graphical feedback (tentatively related to the Puppet piece) and generating custom install ISOs (for complete hands-off reinstalls). All of the systems involved are bound to a Microsoft Active Directory [AD], so common credentials can be used, and the computer that is being reinstalled will need its computer object resetting to rebind automatically. The local ssh host key cache is also updated after the reinstall is finished, to avoid “the host key has changed” warnings post-reinstall.

A future enhancement would be to integrate with out-of-band systems (physical and virtual - e.g. HP’s iLO, Dell’s iDRAC, Proxmox, vCenter etc.) to dynamically change the boot order and/or attach (host-specific) ISOs prior to rebooting to do the install. This would remove the reliance on the boot order being correct for the installation to start.
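
For Redfish-capable BMCs, a one-time boot override task could slot in just before the reboot step. This is only a sketch under assumptions - the bmc_host, bmc_username and bmc_password variables are hypothetical inventory additions, and it needs the community.general collection installed:

    - name: Set one-time boot override to PXE via the BMC (sketch only)
      delegate_to: localhost
      community.general.redfish_command:
        category: Systems
        command: SetOneTimeBoot
        bootdevice: Pxe
        baseuri: '{{ bmc_host }}'      # hypothetical variable - the BMC's address
        username: '{{ bmc_username }}' # hypothetical variable
        password: '{{ bmc_password }}' # hypothetical variable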

I will step through the playbook in sections and explain what it is doing; however, you can skip to the complete playbook at the end of this post.

Preliminaries

The VMs in my test labs are set to boot from the internal hard disk then (depending on the lab) either a mounted ISO image (with an embedded host-specific kickstart) or PXE boot (with per-host targeted boot). This results in them doing a fully hands-off install if they are unable to boot from the local hard disk and fall through to the 2nd boot option. This playbook relies on that behaviour for this to be a “one command” (ansible-playbook ...) process.

These variables need to be set (e.g. in the inventory):

  • puppet_host - hostname of the puppet master server where the certificates for this host are held.
  • domain_controller - hostname of a domain controller to be used to reset the target computer’s AD object.
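
For illustration, a minimal YAML inventory carrying these variables might look like this (all hostnames are placeholders):

all:
  hosts:
    host1.example.com:
  vars:
    puppet_host: puppet.example.com
    domain_controller: dc1.example.com

Note that in a real inventory the domain controller will also need its WinRM connection variables setting for win_shell to work.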

You may notice that I have tagged the various tasks - this helps with running subsets in order to recover from partial failures, either of the playbook or the install processes. It is very useful during development.
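
For example, if the install itself succeeded but the playbook failed before signing the new certificate, just the Puppet-related tasks can be re-run with something like:

ansible-playbook -i inventory.yaml -k -K -e REDEPLOY_HOST=host_to_reinstall --tags puppet redeploy-host.yaml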

Step 1 - Protecting us from ourselves

As a safety feature, the playbook refuses to do anything unless the magic variable REDEPLOY_HOST is set. This can be passed on the command line using ansible-playbook’s -e (extra vars) option - in fact I strongly recommend only doing it this way, so you are forced to expressly say on the command line which host(s) you want to reinstall. The playbook will, by design, irrecoverably destroy the operating system on the hosts it targets without prompting - you have been warned!

The first play in the playbook (there are two) is just to verify this variable is set and abort early if not:

- hosts: localhost
  # Don't use facts, so save some time by not bothering to gather them.
  gather_facts: false
  any_errors_fatal: true
  tasks:
    # If this task fails, Ansible will abort the whole playbook and not
    # run subsequent plays on any host.
    - ansible.builtin.assert:
        that: REDEPLOY_HOST is defined
        fail_msg: Set host to be deployed in variable REDEPLOY_HOST - note this action is destructive!

Step 2 - Target the host(s) to be reinstalled

This is a very standard start to a play - select the target. None of what we do requires facts, so we do not bother to gather them.

- hosts: '{{ REDEPLOY_HOST }}'
  # Don't use facts, so save some time by not bothering to gather them.
  gather_facts: false
  tasks:

Step 3 - Install required packages on the target to be reinstalled

Slightly ironically, in order to destroy the system two additional packages are required:

  • gdisk - provides an efficient mechanism to wipe the partition table, which both causes the next hard disk boot to fail (triggering the fall through that launches the automated installer) and means the installer will see an apparently blank disk to reinstall to.
  • pexpect python module - allows Ansible to “drive” gdisk.

The first tasks of the play are installing these:

    # Required to blow away GPT partition table later. Do this early so
    # if there's a problem installing it will fail early (before any
    # destructive action has been taken)
    - name: Ensure gdisk is installed
      become: true
      ansible.builtin.package:
        name: gdisk
        state: present
    # Ironically, have to install this new package just to immediately
    # destroy the machine with Ansible's ansible.builtin.expect module.
    - name: Install pexpect python module
      become: true
      ansible.builtin.package:
        name: python3-pexpect # For Rocky 8 - maybe different on others?
        state: present

Step 4 - Clean the target’s Puppet certificate

The next step is to remove the host’s existing certificate on the Puppet master - this must be done before the new Puppet agent is installed and attempts to submit a new certificate, or it will not be easy to accept the new one post reinstall.

This could be done later, e.g. with the AD computer object reset, provided it happens before the reboot, in order to guarantee that the old certificate is gone before the new Puppet agent is set up. It is done first for historic reasons: originally the AD object reset was also done here, so that if cleaning the old credentials failed the play failed before destroying the old installation. That created a problem, though, as the computer could not be logged into (to do the destruction) with AD credentials once its object in the AD had been reset.

I do first check that the certificate exists; if the previous install failed in some interesting way it might not, and removing the certificate is a safe thing to skip in that case.

    - name: Check if Puppet certificate exists
      delegate_to: '{{ puppet_host }}'
      become: true
      ansible.builtin.command: /opt/puppetlabs/bin/puppet cert list {{ REDEPLOY_HOST | lower }}
      register: cert_check_output
      changed_when: false  # Always a read-only action, never a change
      failed_when: cert_check_output.rc not in [0, 24] # 24 == cert missing (already removed?)
      tags: ['puppet']
    - name: Remove current Puppet certificate
      delegate_to: '{{ puppet_host }}'
      become: true
      ansible.builtin.command: /opt/puppetlabs/bin/puppet cert clean {{ REDEPLOY_HOST | lower }}
      when: cert_check_output.rc == 0
      tags: ['puppet']
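
As an aside, the puppet cert subcommands used above were removed in Puppet 6; if your Puppet master is newer, the equivalents are (I believe) provided by the puppetserver ca tool, e.g.:

/opt/puppetlabs/bin/puppetserver ca list --all
/opt/puppetlabs/bin/puppetserver ca clean --certname host_to_reinstall.example.com
/opt/puppetlabs/bin/puppetserver ca sign --certname host_to_reinstall.example.com

The CA’s on-disk layout also changed in Puppet 6, which would affect the certificate request path waited on in step 9.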

Step 5 - Blow away the target’s partition table

This is where we destroy the target host’s partition table. I have hardcoded this to destroy both /dev/sda and /dev/sdb - which is highly environment specific but works in my current lab environments. Using ansible_facts to look up what disks actually exist would be better (a sketch of that approach follows the task below), in which case facts need gathering - either remove gather_facts: false from this play’s preamble or run the setup module as a task.

    - name: Destroy host's disk partition table(s) (to enable fall through to other boot methods for auto-reinstall)
      become: true
      ansible.builtin.expect:
         command: gdisk {{ item }}
         # Although this is a map (and therefore unordered), each prompt
         # will only appear once so I am not worried about multiple
         # matches happening.
         responses:
           # x == Enter expert mode
           'Command \(\? for help\):': x
           # z == zap (destroy) GPT partition table
           'Expert command \(\? for help\):': z
           # Ansible doesn't seem to substitute `{{ item }}` in a key,
           # so have to do a looser match. Will always be on a disk,
           # never a partition, so should not end with a digit. On my
           # systems `[a-z]+` seems sufficient.
           'About to wipe out GPT in /dev/[a-z]+. Proceed\? \(Y/N\):': Y
           'Blank out MBR\? \(Y/N\):': Y
      loop:
        - /dev/sda
        - /dev/sdb
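
For completeness, here is a sketch of the facts-based alternative. It is untested in my labs and assumes every /dev/sd* device should be wiped, which may be too aggressive if the host has data disks; running the setup module explicitly means gather_facts: false can stay in the play’s preamble:

    - name: Gather hardware facts to discover the disks
      ansible.builtin.setup:
        gather_subset: ['hardware']
    - name: Destroy partition table on every detected sd* disk
      become: true
      ansible.builtin.expect:
        command: gdisk /dev/{{ item }}
        responses:
          'Command \(\? for help\):': x
          'Expert command \(\? for help\):': z
          'About to wipe out GPT in /dev/[a-z]+. Proceed\? \(Y/N\):': Y
          'Blank out MBR\? \(Y/N\):': Y
      # ansible_facts.devices is a dict keyed on device name, so
      # iterating it yields the names (sda, sdb, ...) to filter.
      loop: "{{ ansible_facts.devices | select('match', '^sd[a-z]+$') | list }}"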

Step 6 - Reboot the target

Here we reboot the target, which (now that it has no partition table) should fall through to the next boot method and begin a new installation. Note that Ansible’s ansible.builtin.reboot module cannot be used because it expects the host to come back after successfully rebooting - this will not happen because we have destroyed the old install.

    - name: Reboot host
      become: true
      # Ansible's reboot command waits, and checks, for the host to
      # come back, which will never happen. Even with async (see below)
      # there is a race condition if the ssh connection gets closed (by
      # the shutdown process) before Ansible has disconnected so it is
      # necessary to delay the shutdown command by longer than the
      # async value, in order to avoid this problem.
      ansible.builtin.shell: 'sleep 2 && /usr/bin/systemctl reboot --message="Ansible triggered reboot for system redeployment."'
      # Run in background, waiting 1 second before closing connection
      async: 1
      # Launch in fire-and-forget mode - with a poll of 0 Ansible skips
      # polling entirely and moves to the next task, which is precisely
      # what we need.
      poll: 0

Step 7 - Reset the target’s AD object

This step resets the computer’s AD object’s password to the host’s lowercased short hostname followed by a $ - which is exactly what clicking “Reset Account” on the computer object does. Microsoft’s own documentation on resetting accounts confirms this when it describes resetting via Active Directory Users and Computers (DSA) and via a Microsoft Visual Basic script.

Between steps 6 and 7 we theoretically have a race condition, if the computer gets as far as trying to join the AD before the object is reset; in reality, however, this is never going to happen. The AD join (whether in %post or via kickstart’s realm join command) occurs towards the end of the install, which takes several minutes (even with the smallest possible package selection), so by the time it happens (barring a failure of the command) this step will always have been completed.

    - name: Reset computer object password
      # Note this must be done after we have finished using the AD to
      # authenticate to the computer (e.g. logging in, using sudo etc.)
      delegate_to: '{{ domain_controller }}'
      # On newer versions of Ansible:
      #ansible.builtin.win_shell: |
      win_shell: |
        Get-ADComputer {{ REDEPLOY_HOST.split('.')[0] }} | Set-ADAccountPassword -NewPassword:$( ConvertTo-SecureString {{ REDEPLOY_HOST.split('.')[0] | lower }}\$ -asPlainText -Force ) -Reset:$true
      tags: ['ad']

Step 8 - Remove the target’s old host key from local known_hosts

This step could happen later, but by this point the target’s old OS has been irrecoverably destroyed, so it is logical to remove its stale host key now.

    - name: Remove old key from local known_hosts
      delegate_to: localhost
      ansible.builtin.known_hosts:
        name: "{{ hostvars[inventory_hostname]['ansible_host'] | default(inventory_hostname) }}"
        state: absent
      tags: ['ssh']

Step 9 - Wait for, then sign, the new certificate signing request on the Puppet master

By now the target will be installing a fresh OS, so we just need Ansible to sit and wait for the certificate request to appear, which happens when the kickstart runs the Puppet agent for the first time in its %post section. As soon as the request appears it can be signed, allowing the Puppet agent to continue configuring the new install. I mentioned in my earlier post on providing progress feedback when launching Puppet from the kickstart’s %post how this created a race condition: Ansible signed the new certificate before the earlier version of my progress reporting script had a chance to display the message that it was ready to be signed.

    - name: Wait for reinstalled puppet agent to submit certificate request
      delegate_to: '{{ puppet_host }}'
      become: true
      ansible.builtin.wait_for:
        path: /etc/puppetlabs/puppet/ssl/ca/requests/{{ REDEPLOY_HOST | lower }}.pem
        state: present
        # 3 days (259,200 seconds) seems like a good value - allows time
        # even if this is being done over a weekend (but not Bank
        # Holiday?) to look at it if there's a problem with the
        # reinstall.
        timeout: 259200
      tags: ['puppet']
    - name: Sign the new Puppet certificate
      delegate_to: '{{ puppet_host }}'
      become: true
      ansible.builtin.command: /opt/puppetlabs/bin/puppet cert sign {{ REDEPLOY_HOST | lower }}
      tags: ['puppet']

Step 10 - Wait for ssh to come up on the new install

As the installer does not start ssh, this is essentially a wait for the install to completely finish and the host to reboot into the new OS. I separated this out as a distinct step from updating the target’s ssh host key (step 11) because testing that the install has finished is a separately useful step. In the future more things may need to be done after it, prior to (or as well as) updating the host key.

    # SSH server will start once puppet has finished and the host rebooted
    - name: Wait for ssh to come up on new install
      delegate_to: localhost
      ansible.builtin.wait_for:
        host: "{{ hostvars[inventory_hostname]['ansible_host'] | default(inventory_hostname) }}"
        port: 22
        timeout: 3600 # Wait up to 1 hour for puppet to finish and node to reboot
      tags: ['ssh']

Step 11 - Update the local cached host key of the target for the new install

This is the final step: update the local known_hosts file with the new install’s ssh host key. This is both a convenience (you will not have to manually accept it on the next connection) and good for security: we store the host key as soon as the install finishes, so we will know if it changes between this point and any future connection attempt - significantly reducing the window for compromise.

    - name: Fetch new SSH host key
      delegate_to: localhost
      ansible.builtin.command: /usr/bin/ssh-keyscan -H -T10 -tecdsa "{{ hostvars[inventory_hostname]['ansible_host'] | default(inventory_hostname) }}"
      register: new_ssh_hostkey
      changed_when: false  # Always a read operation, never changes anything
      tags: ['ssh']
    - name: Add new host key to known_hosts
      delegate_to: localhost
      ansible.builtin.known_hosts:
        name: "{{ hostvars[inventory_hostname]['ansible_host'] | default(inventory_hostname) }}'
        key: '{{ new_ssh_hostkey.stdout }}'
      tags: ['ssh']

The full playbook

This is the full redeploy-host.yaml playbook, which can be run with ansible-playbook -i inventory.yaml -k -K -e REDEPLOY_HOST=host_to_reinstall redeploy-host.yaml

---
- hosts: localhost
  # Don't use facts, so save some time by not bothering to gather them.
  gather_facts: false
  any_errors_fatal: true
  tasks:
    # If this task fails, Ansible will abort the whole playbook and not
    # run subsequent plays on any host.
    - ansible.builtin.assert:
        that: REDEPLOY_HOST is defined
        fail_msg: Set host to be deployed in variable REDEPLOY_HOST - note this action is destructive!
- hosts: '{{ REDEPLOY_HOST }}'
  # Don't use facts, so save some time by not bothering to gather them.
  gather_facts: false
  tasks:
    # Required to blow away GPT partition table later. Do this early so
    # if there's a problem installing it will fail early (before any
    # destructive action has been taken)
    - name: Ensure gdisk is installed
      become: true
      ansible.builtin.package:
        name: gdisk
        state: present
    # Ironically, have to install this new package just to immediately
    # destroy the machine with Ansible's ansible.builtin.expect module.
    - name: Install pexpect python module
      become: true
      ansible.builtin.package:
        name: python3-pexpect # For Rocky 8 - maybe different on others?
        state: present
    - name: Check if Puppet certificate exists
      delegate_to: '{{ puppet_host }}'
      become: true
      ansible.builtin.command: /opt/puppetlabs/bin/puppet cert list {{ REDEPLOY_HOST | lower }}
      register: cert_check_output
      changed_when: false  # Always a read-only action, never a change
      failed_when: cert_check_output.rc not in [0, 24] # 24 == cert missing (already removed?)
      tags: ['puppet']
    - name: Remove current Puppet certificate
      delegate_to: '{{ puppet_host }}'
      become: true
      ansible.builtin.command: /opt/puppetlabs/bin/puppet cert clean {{ REDEPLOY_HOST | lower }}
      when: cert_check_output.rc == 0
      tags: ['puppet']
    - name: Destroy host's disk partition table(s) (to enable fall through to other boot methods for auto-reinstall)
      become: true
      ansible.builtin.expect:
         command: gdisk {{ item }}
         # Although this is a map (and therefore unordered), each prompt
         # will only appear once so I am not worried about multiple
         # matches happening.
         responses:
           # x == Enter expert mode
           'Command \(\? for help\):': x
           # z == zap (destroy) GPT partition table
           'Expert command \(\? for help\):': z
           # Ansible doesn't seem to substitute `{{ item }}` in a key,
           # so have to do a looser match. Will always be on a disk,
           # never a partition, so should not end with a digit. On my
           # systems `[a-z]+` seems sufficient.
           'About to wipe out GPT in /dev/[a-z]+. Proceed\? \(Y/N\):': Y
           'Blank out MBR\? \(Y/N\):': Y
      loop:
        - /dev/sda
        - /dev/sdb
    - name: Reboot host
      become: true
      # Ansible's reboot command waits, and checks, for the host to
      # come back, which will never happen. Even with async (see below)
      # there is a race condition if the ssh connection gets closed (by
      # the shutdown process) before Ansible has disconnected so it is
      # necessary to delay the shutdown command by longer than the
      # async value, in order to avoid this problem.
      ansible.builtin.shell: 'sleep 2 && /usr/bin/systemctl reboot --message="Ansible triggered reboot for system redeployment."'
      # Run in background, waiting 1 second before closing connection
      async: 1
      # Launch in fire-and-forget mode - with a poll of 0 Ansible skips
      # polling entirely and moves to the next task, which is precisely
      # what we need.
      poll: 0
    - name: Reset computer object password
      # Note this must be done after we have finished using the AD to
      # authenticate to the computer (e.g. logging in, using sudo etc.)
      delegate_to: '{{ domain_controller }}'
      # On newer versions of Ansible:
      #ansible.builtin.win_shell: |
      win_shell: |
        Get-ADComputer {{ REDEPLOY_HOST.split('.')[0] }} | Set-ADAccountPassword -NewPassword:$( ConvertTo-SecureString {{ REDEPLOY_HOST.split('.')[0] | lower }}\$ -asPlainText -Force ) -Reset:$true
      tags: ['ad']
    - name: Remove old key from local known_hosts
      delegate_to: localhost
      ansible.builtin.known_hosts:
        name: "{{ hostvars[inventory_hostname]['ansible_host'] | default(inventory_hostname) }}"
        state: absent
      tags: ['ssh']
    - name: Wait for reinstalled puppet agent to submit certificate request
      delegate_to: '{{ puppet_host }}'
      become: true
      ansible.builtin.wait_for:
        path: /etc/puppetlabs/puppet/ssl/ca/requests/{{ REDEPLOY_HOST | lower }}.pem
        state: present
        # 3 days (259,200 seconds) seems like a good value - allows time
        # even if this is being done over a weekend (but not Bank
        # Holiday?) to look at it if there's a problem with the
        # reinstall.
        timeout: 259200
      tags: ['puppet']
    - name: Sign the new Puppet certificate
      delegate_to: '{{ puppet_host }}'
      become: true
      ansible.builtin.command: /opt/puppetlabs/bin/puppet cert sign {{ REDEPLOY_HOST | lower }}
      tags: ['puppet']
    # SSH server will start once puppet has finished and the host rebooted
    - name: Wait for ssh to come up on new install
      delegate_to: localhost
      ansible.builtin.wait_for:
        host: "{{ hostvars[inventory_hostname]['ansible_host'] | default(inventory_hostname) }}"
        port: 22
        timeout: 3600 # Wait up to 1 hour for puppet to finish and node to reboot
      tags: ['ssh']
    - name: Fetch new SSH host key
      delegate_to: localhost
      ansible.builtin.command: /usr/bin/ssh-keyscan -H -T10 -tecdsa "{{ hostvars[inventory_hostname]['ansible_host'] | default(inventory_hostname) }}"
      register: new_ssh_hostkey
      changed_when: false  # Always a read operation, never changes anything
      tags: ['ssh']
    - name: Add new host key to known_hosts
      delegate_to: localhost
      ansible.builtin.known_hosts:
        name: "{{ hostvars[inventory_hostname]['ansible_host'] | default(inventory_hostname) }}"
        key: '{{ new_ssh_hostkey.stdout }}'
      tags: ['ssh']
...