Linux healthcheck script

On my systems at home I use Icinga2 to monitor health, adding new checks as and when I identify something I think needs checking or when a failure occurs that was not detected. Sometimes it is necessary to run checks by other means, such as SLURM’s healthcheck program, so it can be useful to have checks in script form. On previous systems, we have used the Nagios plugins that Icinga uses, to minimise the maintenance overhead of having duplicated tests. The script will be written in bash and will minimise dependencies on non-Coreutils programs to try to keep it portable across distributions.

Skeleton

The basic layout of the test script’s directory is going to be:

run - Main script that runs the tests, reports success/failure of each test, provides a failure summary and exits with status 2 if any test failed (to distinguish test failure from other types of failure, which will use the generic exit status 1).
build-single-script - Creates a self-extracting and running version of the script (including tests) as a single file, to make it easier to copy to a remote system and run.
lib/ - Used to modularise the code, contains files sourced by the main script.
test.d/ - Contains the tests.
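
The exit status convention described for run can be consumed by a caller along these lines. This is only a sketch: describe_run_status is a hypothetical helper, not part of the healthcheck script.

```shell
#!/bin/bash
# Hypothetical helper (not part of the healthcheck script): translate the
# exit status of 'run' into a short message.
describe_run_status() {
  local status="$1"
  case "${status}" in
    0) echo "healthy" ;;
    2) echo "one or more tests failed" ;;
    *) echo "healthcheck itself errored (status ${status})" ;;
  esac
}

# Example usage, assuming ./run exists in the current directory:
#   ./run --no-colour; describe_run_status $?
```

Keeping 2 separate from the generic 1 lets a scheduler or monitoring wrapper tell "the system is unhealthy" apart from "the healthcheck itself is broken".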

Finding itself

In order to include files from lib and run tests in test.d, the script needs to work out where it is. The easiest way to do this is with realpath:

# Work out where the script is located
my_path="$( dirname "$( realpath "$0" )" )"
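
To see why realpath matters here, the following sketch runs a small script through a symlink and compares plain dirname "$0" with the realpath version. All directory and file names are temporary and purely illustrative.

```shell
#!/bin/bash
# Sketch: show why realpath is used rather than plain dirname "$0".
# All paths here are temporary and purely illustrative.
demo_dir="$( mktemp -d )"
mkdir "${demo_dir}/real" "${demo_dir}/links"

# A tiny script that reports where it thinks it lives, both ways.
cat >"${demo_dir}/real/whereami" <<'EOF'
#!/bin/bash
echo "dirname:  $( basename "$( dirname "$0" )" )"
echo "realpath: $( basename "$( dirname "$( realpath "$0" )" )" )"
EOF
ln -s "${demo_dir}/real/whereami" "${demo_dir}/links/whereami"

# Run via the symlink: only the realpath version finds the real directory.
whereami_output="$( bash "${demo_dir}/links/whereami" )"
echo "${whereami_output}"

rm -rf "${demo_dir}"
```

If run were symlinked into, say, a directory on the PATH, the plain dirname result would point at the symlink's directory, where lib and test.d do not exist.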

Process the command-line options

The first thing the script does is to process the command-line options in lib/command-line.bash (after sourcing the functions file, which provides the usage function). Note that the configuration variable, CONFIG, is not exported, as it is only consumed by the healthcheck script (which sources this file), not by the test scripts launched from it.

# Relies on GNU enhanced getopt (not posix compliant)
pre_processing_count=$#
eval set -- \
  "$( \
    getopt \
    -l unicode,no-unicode,colour,no-colour,fatal-warnings,no-fatal-warnings,help \
    -o uUcCfFh \
    -- "$@" \
  )"
# If getopt processed all arguments (no errors) then the post-processing
# argument count should be the pre-processing count plus one (for the
# '--' end marker).
if [[ $(( $pre_processing_count + 1 )) -ne $# ]]
then
  # Presumes getopt has output an error message (but could have been
  # given positional argument instead of parameter).
  echo "Invalid usage." >&2
  usage >&2
  exit 1
fi
# Get rid of temporary variable to avoid polluting script environment.
unset pre_processing_count

# Defaults - also consider this the authoritative list of options set
# by this scriptlet.
declare -A CONFIG=([unicode]=true [colour]=auto [fatal_warnings]=false)

while [ $# -gt 0 ]
do
  case "$1" in
    -u | --unicode)
      CONFIG[unicode]=true
      ;;
    -U | --no-unicode)
      CONFIG[unicode]=false
      ;;
    -c | --colour)
      CONFIG[colour]=true
      ;;
    -C | --no-colour)
      CONFIG[colour]=false
      ;;
    -f | --fatal-warnings)
      CONFIG[fatal_warnings]=true
      ;;
    -F | --no-fatal-warnings)
      CONFIG[fatal_warnings]=false
      ;;
    -h | --help)
      usage
      exit 0
      ;;
    --)
      # End of arguments marker
      shift
      break
      ;;
  esac
  shift
done
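
One limitation of the argument-count check above is that it also rejects bundled short options such as -uc, because getopt splits them into separate arguments and the count no longer matches. An alternative is to rely on getopt's exit status instead. The sketch below assumes GNU enhanced getopt, uses a reduced option list, and parse_demo is a hypothetical name used only for illustration.

```shell
#!/bin/bash
# Sketch: detect bad options via getopt's exit status rather than by
# counting arguments. Requires GNU enhanced getopt (util-linux).
# parse_demo is a hypothetical name, used only for this illustration.
parse_demo() {
  local parsed
  # getopt exits non-zero (and prints its own error) on unknown options.
  if ! parsed="$( getopt -l unicode,no-unicode,help -o uUh -- "$@" 2>/dev/null )"
  then
    echo "invalid"
    return 1
  fi
  # Quoting keeps getopt's own quoting intact for eval.
  eval set -- "${parsed}"
  local seen=""
  while [ $# -gt 0 ]
  do
    case "$1" in
      -u | --unicode) seen+="u" ;;
      -U | --no-unicode) seen+="U" ;;
      -h | --help) seen+="h" ;;
      --) shift ; break ;;
    esac
    shift
  done
  # Anything left over is a positional argument.
  echo "options=${seen} positional=$#"
}

parse_demo -uU             # bundled short options are split by getopt
parse_demo --unicode x     # positional argument survives after '--'
parse_demo --bogus || true # unknown option
```

With this approach positional arguments are no longer rejected automatically, so a check on the remaining argument count would still be needed if they should be an error.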

Useful variables

To avoid unnecessary duplication of the calculation of generally useful information, some variables are exported by the top-level script for tests to use. Booleans are set to the lowercase strings true or false. The variables are:

IS_ROOT - is the script being run as the root user.
CAN_SUDO - can the current user run sudo (n.b. this does not determine which commands, or as which user, they are permitted to run).
COLOUR_SUPPORT - does the current terminal claim to support colour?
COLOURS - an associative array of convenience with named variables for each of the properties and 8 basic colours:
- COLOURS[reset] - the code to clear the currently set colours/properties back to default.
- COLOURS[bold] - the code for bold (bright)
- COLOURS[dim] - the code for faint (decreased brightness)
- COLOURS[blink] - the code for blinking
- COLOURS[underlined] - the code for underlined
- COLOURS[fg_*] - the code for foreground colours (for each of black, red, green, yellow, blue, magenta, cyan and white)
- COLOURS[bg_*] - the code for background colours (for each of black, red, green, yellow, blue, magenta, cyan and white)
DIST_FAMILY - distribution family (in lowercase)
DIST_DISTRIBUTION - exact distribution name (in lowercase)
DIST_VERSION - exact version number of distribution
DIST_VERSION_MAJOR - just the major part of the version number

If COLOUR_SUPPORT is false then COLOURS will be populated with empty strings, so that tests can blindly use them without needing to worry about supporting non-terminal or non-colour output. I.e. echo "${COLOURS[fg_red]}Hello world!${COLOURS[reset]}" will work regardless of whether colour is supported; the variables will simply be empty if it is not.
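
A minimal sketch of that fallback, with only two keys populated for brevity. build_colours is a hypothetical helper for this demonstration; the real array is built in the colour library shown later.

```shell
#!/bin/bash
# Sketch: the same echo works whether or not colour is enabled.
# build_colours is a hypothetical helper for this demonstration only.
declare -A COLOURS
build_colours() {
  local enabled="$1"
  if [[ ${enabled} = "true" ]]
  then
    COLOURS=([reset]=$'\x1b[0m' [fg_red]=$'\x1b[31m')
  else
    # Same keys, empty values: callers need no special cases.
    COLOURS=([reset]="" [fg_red]="")
  fi
}

build_colours false
plain="${COLOURS[fg_red]}Hello world!${COLOURS[reset]}"
build_colours true
coloured="${COLOURS[fg_red]}Hello world!${COLOURS[reset]}"

echo "${plain}"
echo "${coloured}"
```

Because the keys always exist, the lookup is also safe under set -u, which would otherwise abort on a missing key.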

Distribution

The DIST_… variables are populated by taking my existing detection script and using it to create lib/distribution-detection.bash. This is then sourced by the main script:

# DIST_ variables
source "${my_path}/lib/distribution-detection.bash"

Privileges

IS_ROOT and CAN_SUDO are populated by lib/privilege-escalation.bash, which consists of two simple tests:

# Is the current UID the superuser (UID zero)?
if [[ $UID -eq 0 ]]
then
  IS_ROOT=true
else
  IS_ROOT=false
fi

# Can the current user in the current environment sudo with no password?
# Will succeed if either:
#   * configured for no password.
#   * user has already entered password and sudo has not timed out the
#     authentication yet.
if sudo -n -l &>/dev/null
then
  CAN_SUDO=true
else
  CAN_SUDO=false
fi

# Keep variable exports collated so it is easy to refer to, in
# order to see what variables are exported.
export IS_ROOT CAN_SUDO

Colours

Colour detection, in lib/colour-support.bash, is done simply (possibly naïvely?) by checking whether the terminal reports colour capability and whether STDOUT (file descriptor 1) is connected to a terminal:

declare -A colour_map=([black]=0 [red]=1 [green]=2 [yellow]=3 [blue]=4 \
  [magenta]=5 [cyan]=6 [white]=7)

# Non-colour values
declare -A COLOURS=([reset]=$'\x1b[0m' [bold]=$'\x1b[1m' [dim]=$'\x1b[2m' \
  [underlined]=$'\x1b[4m' [blink]=$'\x1b[5m')
# Foreground and background colours
for key in "${!colour_map[@]}"
do
  # $'...' is placed outside the double quotes, where bash expands it
  COLOURS["fg_${key}"]=$'\x1b'"[3${colour_map[$key]}m"
  COLOURS["bg_${key}"]=$'\x1b'"[4${colour_map[$key]}m"
done

# If colour mode is forced, or automatic and the terminal reports colour
# support and STDOUT(fd 1) is connected to a terminal (e.g. as opposed
# to a pipe).
if [[ ${CONFIG[colour]} = "true" ]] || \
  ( \
    [[ ${CONFIG[colour]} = "auto" ]] && \
    [[ "$(tput colors)" -gt 0 ]] && \
    [ -t 1 ] \
  )
then
    COLOUR_SUPPORT=true
else
    COLOUR_SUPPORT=false
    # Set the colour array values to empty string, so they can be used
    # without the author worrying about whether colour works.
    for key in "${!COLOURS[@]}"
    do
      COLOURS[$key]=""
    done
fi

# unset temporary variables to avoid environment pollution when sourced.
unset colour_map

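# Keep exports together to easily see what this script intentionally exports.
# N.B. bash does not pass arrays to child processes, so COLOURS is only
# usable by this script and the files it sources.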
export COLOUR_SUPPORT COLOURS
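
One caveat worth checking on your own bash version: bash marks arrays for export but does not place them in the environment of child processes, which matters for tests run as separate programs. The sketch below is a quick way to see what a child actually receives; the DEMO_ variable names are hypothetical.

```shell
#!/bin/bash
# Sketch: check what a child process actually receives.
export DEMO_SCALAR="visible"
declare -A DEMO_ARRAY=([key]="value")
export DEMO_ARRAY

# ${VAR-unset} prints "unset" only when the variable does not exist.
child_view="$( bash -c 'echo "scalar=${DEMO_SCALAR-unset} array=${DEMO_ARRAY[key]-unset}"' )"
echo "${child_view}"
```

If the array shows as unset, tests that want colours would need either to source a library that rebuilds them or to be given scalar variables instead.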

Functions

Two main functions are in lib/functions.bash - usage(), which displays a help message, and run_test(), which will run a single test and return a pass (0) or fail (test exit status) status code.

usage

This deliberately prints Unicode characters in the help message, to help users see whether those characters are supported by their current font.

usage() {
  local unicode_tick=$'\u2713'
  local unicode_cross=$'\u2717'
  cat - <<EOF
Usage:
  $0 [-u|-U|--unicode|--no-unicode] [-c|-C|--colour|--no-colour] [-f|-F|--fatal-warnings|--no-fatal-warnings] [-h|--help]

-u|--unicode: use characters ${unicode_tick}, ${unicode_cross} and ! to
              report pass/fail/warnings (default)
-U|--no-unicode: use words [ ok ], [FAIL] and ([WARN])[warn] to report
                  pass/fail/(fatal)warnings

-c|--colour: force colour output
-C|--no-colour: force non-colour output
(default is to use colour if output is a terminal and reports colour support,
 not otherwise)

-f|--fatal-warnings: warnings (e.g. test is not executable) are considered
                     failures
-F|--no-fatal-warnings: warnings (e.g. test is not executable) are printed
                        but not considered failures (default)

-h|--help: display this message and exit

EOF
}

run_test

run_test() {
  local script="$1"
  local test_name="$( basename "${script}" )"

  # Support bash/python and c/php style comments
  local comment_regex="\(#\|//\)"
  local comment_marker="TEST_DESCRIPTION:"
  local test_description=""

  if grep -q "${comment_regex}${comment_marker}" "${script}"
  then
    # Challenge here is finding a delimiter for sed's substitution that
    # is not likely to be used in the comment regex or marker (ruling
    # out the usual candidates of /, # and ^). % seemed the best choice
    # (although it rules out MATLAB and LaTeX comment markers in the regex).
    test_description="$( \
      grep -o "${comment_regex}${comment_marker}.*" "${script}" | \
      sed "s%${comment_regex}${comment_marker}\\s*%%" \
    )"
  fi

  # Make sure to update the list with any new output options, to keep
  # this declaration the authoritative list of all that need setting.
  local -A output=([good]="" [bad]="" [warn]="" [warn_fatal]="")
  if [[ ${CONFIG[unicode]} = "true" ]]
  then
    output[good]="${COLOURS[fg_green]}"$'\u2713'"${COLOURS[reset]}"
    output[bad]="${COLOURS[bold]}${COLOURS[fg_red]}"$'\u2717'"${COLOURS[reset]}"
    output[warn]="${COLOURS[fg_yellow]}!${COLOURS[reset]}"
    output[warn_fatal]="${COLOURS[bold]}${COLOURS[fg_red]}!${COLOURS[reset]}"
  else
    # Non-unicode feedback messages
    output[good]="[ ${COLOURS[fg_green]}ok${COLOURS[reset]} ]"
    output[bad]="[${COLOURS[bold]}${COLOURS[fg_red]}FAIL${COLOURS[reset]}]"
    output[warn]="[${COLOURS[fg_yellow]}warn${COLOURS[reset]}]"
    output[warn_fatal]="[${COLOURS[bold]}${COLOURS[fg_red]}WARN${COLOURS[reset]}]"
  fi

  # Only displays description (in brackets) after the test name if it is
  # set.
  echo "${COLOURS[fg_cyan]}>>${COLOURS[reset]} Running test" \
    "${test_name}${test_description:+" (${test_description})"}..." >&2

  if [ -x "${script}" ]
  then
    # Run the test script
    "${script}"
    local test_result=$?
    if [[ $test_result -eq 0 ]]
    then
      echo "${output[good]} test passed." >&2
    else
      echo "${output[bad]} test failed with status ${test_result}." >&2
    fi
  else
    local warn_msg="test is not executable. Unable to run."
    if [[ ${CONFIG[fatal_warnings]} = "true" ]]
    then
      echo "${output[warn_fatal]} ${warn_msg}" >&2
      local test_result=1
    else
      echo "${output[warn]} ${warn_msg}" >&2
      local test_result=0
    fi
  fi

  return ${test_result}
}

Tests

The tests in test.d can be any executable file (e.g. script, binary program, etc.) that exits with a status of zero (0) if the test passes and non-zero if it fails.

They can contain a comment (# or // style comments are supported) beginning with the text TEST_DESCRIPTION: that provides a brief description. This is printed in the output alongside the test name, which is always the filename (so there can be no confusion about which test in test.d it is). There must be no space between the comment marker and TEST_DESCRIPTION:, but there can be optional spaces, which will be removed, between that and the description. For example, a bash or Python script might contain #TEST_DESCRIPTION: this test's description.
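
The extraction can be tried in isolation with the same grep and sed expressions that run_test uses. In this sketch extract_description is a hypothetical wrapper and the sample file is temporary; GNU grep and sed are assumed.

```shell
#!/bin/bash
# Sketch: extract a TEST_DESCRIPTION comment the same way run_test does.
# extract_description is a hypothetical wrapper for this demonstration.
extract_description() {
  local script="$1"
  local comment_regex="\(#\|//\)"
  local comment_marker="TEST_DESCRIPTION:"
  grep -o "${comment_regex}${comment_marker}.*" "${script}" | \
    sed "s%${comment_regex}${comment_marker}\\s*%%"
}

demo_file="$( mktemp )"
printf '%s\n' '#!/bin/bash' '#TEST_DESCRIPTION:   check the widgets' 'true' \
  >"${demo_file}"
description="$( extract_description "${demo_file}" )"
echo "${description}"
rm -f "${demo_file}"
```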

For commands known to require elevated privileges, or to need running as a different user, the tests should use the variables IS_ROOT and CAN_SUDO to determine the appropriate mechanism (i.e. run the command directly or wrap it in sudo, respectively, and whether to use su or sudo to become another user). It should not be assumed that sudo is even installed on a system. Tests should bypass anything requiring elevated privileges (printing a warning message to STDERR) if no elevation route is available. That is, all tests should be runnable (even in limited form) by an unprivileged user, and the test should pass in the absence of any failures.
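
One way a test might follow that guidance is a small wrapper like the one below. It is a sketch only: run_privileged is a hypothetical helper, and it treats a skipped command as a pass, in line with the rule that tests pass in the absence of failures.

```shell
#!/bin/bash
# Sketch: run a command with elevated privileges if a route exists,
# otherwise warn and skip. run_privileged is a hypothetical helper;
# IS_ROOT and CAN_SUDO are the variables exported by the main script.
run_privileged() {
  if [[ ${IS_ROOT:-false} = "true" ]]
  then
    # Already root: run the command directly.
    "$@"
  elif [[ ${CAN_SUDO:-false} = "true" ]]
  then
    # -n: never prompt for a password, fail instead.
    sudo -n "$@"
  else
    echo "Skipping (needs privileges): $*" >&2
    return 0  # skipped checks do not count as failures
  fi
}
```

The :-false defaults mean the helper still behaves sensibly if a test is run by hand, outside the main script, with neither variable set.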

distribution

Example test which performs no checks but prints the detected distribution:

#!/bin/bash
#TEST_DESCRIPTION: display detected distribution information

# Just display the distribution - never fails (we could check if it's recognised?)
echo "Detected ${DIST_DISTRIBUTION} (${DIST_FAMILY} family) version" \
  "${DIST_VERSION} (major version number ${DIST_VERSION_MAJOR})."

privileges

Example test which performs no checks but prints the detected capabilities of the current user:

#!/bin/bash
#TEST_DESCRIPTION: summary of the current user's detected privileges

if [[ ${IS_ROOT} = "true" ]]
then
  am_root="is"
else
  am_root="is not"
fi

# XXX bad practice - reusing variable name in different case.
if [[ ${CAN_SUDO} = "true" ]]
then
  can_sudo="can"
else
  can_sudo="cannot"
fi

echo "Current user ${am_root} root and ${can_sudo} sudo."

fail

Example test that always fails:

#!/bin/bash
#TEST_DESCRIPTION: always fail - for testing

echo "This test deliberately fails!" >&2
exit 2
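
For contrast with the examples above, here is a sketch of a test that performs a real check: whether some commands are installed. The command list is illustrative only and missing_commands is a hypothetical helper.

```shell
#!/bin/bash
#TEST_DESCRIPTION: commands used by the healthcheck are installed

# Print each named command that cannot be found on the PATH.
missing_commands() {
  local cmd
  for cmd in "$@"
  do
    command -v "${cmd}" >/dev/null 2>&1 || echo "${cmd}"
  done
}

# Illustrative list only; adjust to whatever your tests depend on.
missing="$( missing_commands realpath getopt tar sed grep )"

if [[ -n ${missing} ]]
then
  # Join the one-per-line list with spaces for a compact message.
  echo "Missing commands: ${missing//$'\n'/ }" >&2
  exit 1
fi
echo "All required commands found."
```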

Main script

In full, the main script that runs each test and summarises the failures:

#!/bin/bash

# Standard bash safety - disable accidental globbing, no uninitialised
# variables, errors are fatal, errors in pipes cause pipe to error
set -fueo pipefail

# Work out where the script is located
my_path="$( dirname "$( realpath "$0" )" )"

# run_test() and usage() functions
source "${my_path}/lib/functions.bash"

# Process command line options and populate the `CONFIG` variable
source "${my_path}/lib/command-line.bash"

# Colour support detection
source "${my_path}/lib/colour-support.bash"

# Privilege escalation detection
source "${my_path}/lib/privilege-escalation.bash"

# DIST_ variables
source "${my_path}/lib/distribution-detection.bash"

# Do tests
declare -a failed_tests=() # Array to keep a list of failing tests
#enable globbing
set +f
# Allow the '*' to match nothing (in case no tests exist)
shopt -s nullglob
# Globs are expanded alphabetically (see Bash manual), so no need to
# do anything special to run them in sequence.
for test_ in "${my_path}"/test.d/*
do
  # Disable null-globbing (default) and turn off accidental globbing
  # again.
  shopt -u nullglob ; set -f
  if ! run_test "${test_}"
  then
    failed_tests+=("$( basename "${test_}" )")
  fi
done
# Just in case the loop didn't get entered - reinforce disabling
# accidental globbing.
shopt -u nullglob ; set -f

if [[ "${#failed_tests[@]}" -eq 0 ]]
then
  echo "${COLOURS[underlined]}All tests passed.${COLOURS[reset]}"
  exit 0
else
  echo "${COLOURS[underlined]}Some tests failed.${COLOURS[reset]}"
  echo "List of failed tests:"
  for failure in "${failed_tests[@]}"
  do
    echo "  * ${failure}"
  done
  # Use 2 to distinguish "some test failed" from "some unintended error
  # occurred"
  exit 2
fi
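
The effect of nullglob in the test loop above can be seen with an empty directory. This sketch assumes globbing is enabled (no set -f) and failglob is off; count_matches is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: effect of nullglob on a glob that matches nothing.
empty_dir="$( mktemp -d )"

# Count how many times a loop over the directory's contents runs.
count_matches() {
  local count=0 entry
  for entry in "${empty_dir}"/*
  do
    count=$(( count + 1 ))
  done
  echo "${count}"
}

shopt -u nullglob
without_nullglob="$( count_matches )"  # loop runs once with the literal '*'
shopt -s nullglob
with_nullglob="$( count_matches )"     # loop body never runs
shopt -u nullglob

echo "without nullglob: ${without_nullglob}, with nullglob: ${with_nullglob}"
rmdir "${empty_dir}"
```

Without nullglob, an empty test.d would make the main loop call run_test on a file literally named '*', which would be reported as a non-executable test.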

build-single-script

This script takes the main script, lib and all tests and bundles them into an archive, which is appended to a bash stub to make a self-extracting and self-running script. It takes one mandatory argument: the name of the script to create.

#!/bin/bash

# Standard bash safety - disable accidental globbing, no uninitialised
# variables, errors are fatal, errors in pipes cause pipe to error
set -fueo pipefail

# Work out where the script is located
my_path="$( dirname "$( realpath "$0" )" )"

# Helper to print usage message
usage() {
  cat - <<EOF
Usage:
  $0 [-h|--help|script_name_to_create]

script_name_to_create: name of the script that will be created (must not
                       exist)

-h|--help: display this message and exit

EOF
}

if [[ $# -ne 1 ]]
then
  # Wrong number of arguments
  echo "Incorrect number of arguments: $#" >&2
  usage >&2
  exit 1
elif [[ $1 = '-h' ]] || [[ $1 = '--help' ]]
then
  # Explicit help request
  usage
  exit 0
fi

script_name="$1"

if [[ -e ${script_name} ]]
then
  echo "Error: ${script_name} already exists." >&2
  exit 1
fi

# Script preamble - quoting the heredoc tag disables interpolations
cat - >"${script_name}" <<'END'
#!/bin/bash

# Standard bash safety - disable accidental globbing, no uninitialised
# variables, errors are fatal, errors in pipes cause pipe to error
set -fueo pipefail

# Helper to print usage message
usage() {
  cat - <<EOF
Usage:
  $0 [-h|--help] [-ttempdir|--tmpdir=tempdir] [--] [run_arguments]

-t|--tmpdir: Specify (as an argument to this option) the temporary
             directory to extract to - must allow execution (i.e.
             not be mounted "noexec"). Defaults to TMPDIR
             environment variable, if set, or /tmp if not.

--: anything after this marker will be passed to the extracted
    healthcheck script's command line options.

-h|--help: display this message and exit

EOF
}

# Process commandline
eval set -- "$(getopt -l tmpdir:,help -o t:h -- "$@")"

temp_dir="${TMPDIR:-/tmp}"

while [ $# -gt 0 ]
do
  case "$1" in
    -t | --tmpdir)
      temp_dir="$2"
      shift # need an extra shift - processing 2 arguments here
      ;;
    -h | --help)
      usage
      exit 0
      ;;
    --)
      # End of arguments marker
      shift
      break
      ;;
  esac
  shift
done
# Processed our arguments and shifted the end of arguments marker -
# everything left is for the 'run' script

# Make a temporary directory for the files
out_dir="$( mktemp -d -p "${temp_dir}")"

# Extract the archive at the end of this script
echo "Extracting healthcheck to ${out_dir}..."
sed -e '1,/^__END__$/d' "$0" | tar -C "${out_dir}" -zx

# Run the run script
set +e # Want to tidy up, even if run fails
# Run the script with all remaining (not consumed by our options
# processing) arguments.
echo "Running healthcheck..."
"${out_dir}/run" "$@"
run_status=$? # Remember the result so it can be passed on to the caller
set -e # Die if this script errors, though.

# Tidy up the temporary files
echo "Deleting ${out_dir}..."
rm -rf "${out_dir}"

# End of script - archive follows
echo "All done."
exit "${run_status}"

__END__
END

# Create the archive at the end of the file
tar -C "${my_path}" -zc run lib/ test.d/ >> "${script_name}"

echo "Created ${script_name}."
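
The self-extracting technique can be seen in miniature below: a text header ending in a marker line, an archive appended after it, and sed stripping the header before tar unpacks the rest. This is a sketch with temporary, illustrative file names, and it passes -f - to tar explicitly rather than relying on the default archive being standard input/output.

```shell
#!/bin/bash
# Sketch: the self-extracting technique in miniature. All file names
# here are temporary and illustrative.
work_dir="$( mktemp -d )"
mkdir "${work_dir}/payload" "${work_dir}/out"
echo "hello from the archive" >"${work_dir}/payload/message.txt"

# 1. Write a text header that ends with the marker line.
printf '%s\n' '#!/bin/bash' 'echo "stub"' 'exit' '__END__' \
  >"${work_dir}/bundle"

# 2. Append a gzipped tar archive straight after the marker.
tar -C "${work_dir}/payload" -zcf - message.txt >>"${work_dir}/bundle"

# 3. Delete everything up to and including the marker; what remains is
#    the archive, which tar can unpack from standard input.
sed -e '1,/^__END__$/d' "${work_dir}/bundle" | \
  tar -C "${work_dir}/out" -zxf -

extracted="$( cat "${work_dir}/out/message.txt" )"
echo "${extracted}"
rm -rf "${work_dir}"
```

The explicit exit before the marker matters in the real script: without it, bash would carry on reading and try to interpret the binary archive as commands.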

Here is what asking for the healthcheck script’s help with the self-extracting script looks like:

$ bash test -- -h
Extracting healthcheck to /tmp/tmp.UWWlnK7f8Y...
Running healthcheck...
Usage:
  /tmp/tmp.UWWlnK7f8Y/run [-u|-U|--unicode|--no-unicode] [-c|-C|--colour|--no-colour] [-f|-F|--fatal-warnings|--no-fatal-warnings] [-h|--help]

-u|--unicode: use characters ✓, ✗ and ! to
              report pass/fail/warnings (default)
-U|--no-unicode: use words [ ok ], [FAIL] and ([WARN])[warn] to report
                 pass/fail/(fatal)warnings

-c|--colour: force colour output
-C|--no-colour: force non-colour output
(default is to use colour if output is a terminal and reports colour support,
 not otherwise)

-f|--fatal-warnings: warnings (e.g. test is not executable) are considered
                     failures
-F|--no-fatal-warnings: warnings (e.g. test is not executable) are printed
                        but not considered failures (default)

-h|--help: display this message and exit

Deleting /tmp/tmp.UWWlnK7f8Y...
All done.