Linux healthcheck script
On my systems at home I use Icinga2 to monitor health, adding new checks as and when I identify something I think needs checking, or when a failure occurs that was not detected. Sometimes it is necessary to run checks via other means, such as SLURM’s healthcheck program, so it can be useful to have checks in script form. On previous systems, we have used the Nagios plugins that Icinga uses, to minimise the maintenance overhead of having duplicated tests. The script will be written in bash and will minimise dependencies on non-Coreutils files, to try to keep it portable across distributions.
Skeleton
The basic layout of the test script’s directory is going to be:
run
- Main script that runs the tests, reports success/failure of each script, provides a failure summary and exits with status 2 if any check failed (to distinguish test failure from other types of failure, which will use the generic exit status 1).
build-single-script
- Creates a self-extracting and running version of the script (including tests) as a single file, to make it easier to copy to a remote system and run.
lib/
- Used to modularise the code; contains files sourced by the main script.
test.d/
- Contains the tests.
Finding itself
In order to include files from lib and run tests in test.d, the script needs to work out where it is. This is most easily done using realpath:
# Work out where the script is located
my_path="$( dirname "$( realpath "$0" )" )"
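To see why realpath matters, consider the script being invoked through a symlink (say, from a bin directory): dirname "$0" on its own would then give the symlink's directory, where lib and test.d do not exist. The following is a small sketch of that situation; the show-path script and the temporary directory layout are invented purely for the demonstration.

```shell
#!/bin/bash
# Demonstration only: 'show-path' and the directory names are hypothetical.
demo_symlink_resolution() {
    local work
    work="$( mktemp -d )"
    mkdir "${work}/real" "${work}/bin"

    # A tiny script that reports both ways of locating itself.
    cat > "${work}/real/show-path" <<'EOF'
echo "naive:    $( dirname "$0" )"
echo "resolved: $( dirname "$( realpath "$0" )" )"
EOF
    ln -s "${work}/real/show-path" "${work}/bin/show-path"

    # Run it via the symlink: only the resolved form ends in /real.
    bash "${work}/bin/show-path"
    rm -rf "${work}"
}

demo_symlink_resolution
```

Only the "resolved" line points at the directory that actually holds the script, which is where lib and test.d would live.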
Process the command-line options
The first thing the script does is to process the command-line options in lib/command-line.bash (after including the functions, which provide the usage function). Note that the configuration variable, CONFIG, is not exported as it is only to be consumed in the healthcheck script (which sources this file), not by test scripts launched within it.
# Relies on GNU enhanced getopt (not POSIX compliant)
pre_processing_count=$#
# Quote getopt's output so eval sees it exactly as getopt quoted it.
eval set -- \
"$( \
getopt \
-l unicode,no-unicode,colour,no-colour,fatal-warnings,no-fatal-warnings,help \
-o uUcCfFh \
-- "$@" \
)"
# If getopt processed all arguments (no errors) then the post-processing
# argument count should be the pre-processing count plus one (for the
# '--' end marker).
if [[ $(( $pre_processing_count + 1 )) -ne $# ]]
then
# Presumes getopt has output an error message (but could have been
# given positional argument instead of parameter).
echo "Invalid usage." >&2
usage >&2
exit 1
fi
# Get rid of temporary variable to avoid polluting script environment.
unset pre_processing_count
# Defaults - also consider this the authoritative list of options set
# by this scriptlet.
declare -A CONFIG=([unicode]=true [colour]=auto [fatal_warnings]=false)
while [ $# -gt 0 ]
do
case "$1" in
-u | --unicode)
CONFIG[unicode]=true
;;
-U | --no-unicode)
CONFIG[unicode]=false
;;
-c | --colour)
CONFIG[colour]=true
;;
-C | --no-colour)
CONFIG[colour]=false
;;
-f | --fatal-warnings)
CONFIG[fatal_warnings]=true
;;
-F | --no-fatal-warnings)
CONFIG[fatal_warnings]=false
;;
-h | --help)
usage
exit 0
;;
--)
# End of arguments marker
shift
break
;;
esac
shift
done
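One consequence of comparing argument counts is that bundled short options such as -uc, which getopt expands into -u -c, change the count and so are reported as invalid usage. A possible alternative, sketched below on the assumption that GNU getopt is in use, is to test getopt's exit status and then check separately for stray positional arguments. The function name normalise_options and its reduced option list are illustrative, not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: normalise arguments with GNU getopt, rejecting
# unknown options and stray positional arguments.
normalise_options() {
    local parsed
    # getopt exits non-zero (and prints its own message) on unknown options.
    if ! parsed="$( getopt -n normalise_options -o uUh \
        -l unicode,no-unicode,help -- "$@" )"
    then
        return 1
    fi
    # Re-split getopt's shell-quoted output into positional parameters.
    eval set -- "${parsed}"
    # Skip over the options; whatever follows '--' is positional.
    while [[ $1 != "--" ]]
    do
        shift
    done
    shift
    if [[ $# -gt 0 ]]
    then
        echo "Unexpected positional argument(s): $*" >&2
        return 1
    fi
    # Hand the normalised option string back to the caller.
    echo "${parsed}"
}

# Bundled short options are accepted and split apart by getopt.
normalise_options -uU --no-unicode
```

The caller could then eval set -- the returned string and run the same case loop as above.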
Useful variables
To avoid unnecessary duplication of the calculation of generally useful information, some variables are exported by the top-level script for tests to use. Booleans are set to the lowercase strings true or false. The variables are:
IS_ROOT
- is the script being run as the root user?
CAN_SUDO
- can the current user run sudo? (n.b. this does not determine which commands, or as which users, are permitted).
COLOUR_SUPPORT
- does the current terminal claim to support colour?
COLOURS
- a convenience associative array with named entries for each of the properties and the 8 basic colours:
  COLOURS[reset]
  - the code to clear the currently set colours/properties back to default.
  COLOURS[bold]
  - the code for bold (bright)
  COLOURS[dim]
  - the code for faint (decreased brightness)
  COLOURS[blink]
  - the code for blinking
  COLOURS[underlined]
  - the code for underlined
  COLOURS[fg_*]
  - the code for foreground colours (for each of black, red, green, yellow, blue, magenta, cyan and white)
  COLOURS[bg_*]
  - the code for background colours (for each of black, red, green, yellow, blue, magenta, cyan and white)
DIST_FAMILY
- distribution family (in lowercase)
DIST_DISTRIBUTION
- exact distribution name (in lowercase)
DIST_VERSION
- exact version number of distribution
DIST_VERSION_MAJOR
- just the major part of the version number
If COLOUR_SUPPORT is false then COLOURS will be populated with empty strings, so that tests can blindly use them without needing to worry about supporting non-terminal or non-colour output. I.e. echo "${COLOURS[fg_red]}Hello world!${COLOURS[reset]}" will work regardless of whether colour is supported; the variables will simply be empty if it is not.
Distribution
The DIST_… variables are populated by taking my existing detection script and using it to create lib/distribution-detection.bash. This is then sourced by the main script:
# DIST_ variables
source "${my_path}/lib/distribution-detection.bash"
Privileges
IS_ROOT and CAN_SUDO are populated by lib/privilege-escalation.bash, which consists of two simple tests:
# Is the current UID the superuser (UID zero)?
if [[ $UID -eq 0 ]]
then
IS_ROOT=true
else
IS_ROOT=false
fi
# Can the current user in the current environment sudo with no password?
# Will succeed if either:
# * configured for no password.
# * user has already entered password and sudo has not timed out the
# authentication yet.
if sudo -n -l &>/dev/null
then
CAN_SUDO=true
else
CAN_SUDO=false
fi
# Keep variable exports collated so it is easy to refer to, in
# order to see what variables are exported.
export IS_ROOT CAN_SUDO
Colours
Colour detection is done simply (possibly naïvely?) in lib/colour-support.bash, by checking if the terminal reports colour capability and testing if STDOUT (file descriptor 1) is connected to a terminal:
declare -A colour_map=([black]=0 [red]=1 [green]=2 [yellow]=3 [blue]=4 \
[magenta]=5 [cyan]=6 [white]=7)
# Non-colour values
declare -A COLOURS=([reset]=$'\x1b[0m' [bold]=$'\x1b[1m' [dim]=$'\x1b[2m' \
[underlined]=$'\x1b[4m' [blink]=$'\x1b[5m')
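# Foreground and background colours. The escape character is written
# with ANSI-C quoting outside the double quotes (where it is expanded)
# and joined to the rest of the code by simple concatenation.
for key in "${!colour_map[@]}"
do
COLOURS["fg_${key}"]=$'\x1b'"[3${colour_map[$key]}m"
COLOURS["bg_${key}"]=$'\x1b'"[4${colour_map[$key]}m"
done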
# If colour mode is forced, or automatic and the terminal reports colour
# support and STDOUT(fd 1) is connected to a terminal (e.g. as opposed
# to a pipe).
if [[ ${CONFIG[colour]} = "true" ]] || \
( \
[[ ${CONFIG[colour]} = "auto" ]] && \
[[ "$(tput colors)" -gt 0 ]] && \
[ -t 1 ] \
)
then
COLOUR_SUPPORT=true
else
COLOUR_SUPPORT=false
# Set the colour array values to empty string, so they can be used
# without the author worrying about whether colour works.
for key in "${!COLOURS[@]}"
do
COLOURS[$key]=""
done
fi
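# unset temporary variables to avoid environment pollution when sourced.
unset colour_map key
# Keep exports together to easily see what this script intentionally exports.
# N.B. bash cannot export arrays, so COLOURS is visible to code running in
# this shell (such as run_test) but not to tests run as separate processes.
export COLOUR_SUPPORT COLOURS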
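One caveat: bash does not pass arrays through the environment, so export COLOURS marks the array for export but a test launched as a separate process will not see it. A possible workaround, sketched here, is to serialise the array with declare -p into an ordinary exported string that bash-based tests can eval. COLOURS_DECL is a hypothetical name and is not used by the scripts in this post.

```shell
#!/bin/bash
declare -A COLOURS=([reset]=$'\x1b[0m' [fg_red]=$'\x1b[31m')

# Exporting the array is accepted, but nothing reaches child processes.
export COLOURS
bash -c 'declare -p COLOURS 2>/dev/null || echo "COLOURS is not set in the child"'

# Serialise the array into a plain string (hypothetical variable name),
# which can be exported like any other scalar.
COLOURS_DECL="$( declare -p COLOURS )"
export COLOURS_DECL

# A bash test can rebuild the array before using it.
bash -c '
    eval "${COLOURS_DECL}"
    printf "%sred text%s\n" "${COLOURS[fg_red]}" "${COLOURS[reset]}"
'
```

Tests written in other languages would need a different encoding, for example one exported scalar per colour.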
Functions
Two main functions are in lib/functions.bash: usage(), which displays a help message, and run_test(), which will run a single test and return a pass (0) or fail (test exit status) status code.
usage
This deliberately prints unicode characters in the help message, to aid users in seeing if the unicode characters are supported by their present font.
usage() {
local unicode_tick=$'\u2713'
local unicode_cross=$'\u2717'
cat - <<EOF
Usage:
$0 [-u|-U|--unicode|--no-unicode] [-c|-C|--colour|--no-colour] [-f|-F|--fatal-warnings|--no-fatal-warnings] [-h|--help]
-u|--unicode: use characters ${unicode_tick}, ${unicode_cross} and ! to
report pass/fail/warnings (default)
-U|--no-unicode: use words [ ok ], [FAIL] and ([WARN])[warn] to report
pass/fail/(fatal)warnings
-c|--colour: force colour output
-C|--no-colour: force non-colour output
(default is to use colour if output is a terminal and reports colour support,
not otherwise)
-f|--fatal-warnings: warnings (e.g. test is not executable) are considered
failures
-F|--no-fatal-warnings: warnings (e.g. test is not executable) are printed
but not considered failures (default)
-h|--help: display this message and exit
EOF
}
run_test
run_test() {
local script="$1"
local test_name="$( basename "${script}" )"
# Support bash/python and c/php style comments
local comment_regex="\(#\|//\)"
local comment_marker="TEST_DESCRIPTION:"
local test_description=""
if grep -q "${comment_regex}${comment_marker}" "${script}"
then
# Challenge here is finding a delimiter for sed's substitution that
# is not likely to be used in the comment regex or marker (ruling
# out the usual candidates of /, # and ^). % seemed the best choice
# (it rules out using MATLAB and LaTeX comment markers in the regex).
test_description="$( \
grep -o "${comment_regex}${comment_marker}.*" "${script}" | \
sed "s%${comment_regex}${comment_marker}\\s*%%" \
)"
fi
# Make sure to update the list with any new output options, to keep
# this declaration the authoritative list of all that need setting.
local -A output=([good]="" [bad]="" [warn]="" [warn_fatal]="")
if [[ ${CONFIG[unicode]} = "true" ]]
then
output[good]="${COLOURS[fg_green]}"$'\u2713'"${COLOURS[reset]}"
output[bad]="${COLOURS[bold]}${COLOURS[fg_red]}"$'\u2717'"${COLOURS[reset]}"
output[warn]="${COLOURS[fg_yellow]}!${COLOURS[reset]}"
output[warn_fatal]="${COLOURS[bold]}${COLOURS[fg_red]}!${COLOURS[reset]}"
else
# Non-unicode feedback messages
output[good]="[ ${COLOURS[fg_green]}ok${COLOURS[reset]} ]"
output[bad]="[${COLOURS[bold]}${COLOURS[fg_red]}FAIL${COLOURS[reset]}]"
output[warn]="[${COLOURS[fg_yellow]}warn${COLOURS[reset]}]"
output[warn_fatal]="[${COLOURS[bold]}${COLOURS[fg_red]}WARN${COLOURS[reset]}]"
fi
# Only displays description (in brackets) after the test name if it is
# set.
echo "${COLOURS[fg_cyan]}>>${COLOURS[reset]} Running test" \
"${test_name}${test_description:+" (${test_description})"}..." >&2
if [ -x "${script}" ]
then
# Run the test script (quoted, so that paths containing spaces work)
"${script}"
local test_result=$?
if [[ $test_result -eq 0 ]]
then
echo "${output[good]} test passed." >&2
else
echo "${output[bad]} test failed with status ${test_result}." >&2
fi
else
local warn_msg="test is not executable. Unable to run."
if [[ ${CONFIG[fatal_warnings]} = "true" ]]
then
echo "${output[warn_fatal]} ${warn_msg}" >&2
local test_result=1
else
echo "${output[warn]} ${warn_msg}" >&2
local test_result=0
fi
fi
return ${test_result}
}
Tests
The tests in test.d can be any executable file (e.g. script, binary program, etc.) that exits with a status of zero (0) if the test passes and non-zero if it fails.
They can contain a comment (# or // style comments are supported) beginning with the text TEST_DESCRIPTION: that provides a brief description, to be printed in the output alongside the test name. The test name is always the filename, so there can be no confusion about which test in test.d it is. There must be no space between the comment marker and TEST_DESCRIPTION:, but there can be optional spaces between that and the description, which will be removed. For example, a bash or Python script might contain #TEST_DESCRIPTION: this test's description.
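To illustrate the marker format, the sketch below mirrors the grep and sed extraction from run_test() against a throwaway file. It assumes GNU grep and sed (for \| and \s); the helper name extract_description and the sample description are made up for the example.

```shell
#!/bin/bash
# Hypothetical helper mirroring the extraction done inside run_test().
extract_description() {
    local script="$1"
    local marker='\(#\|//\)TEST_DESCRIPTION:'
    # Pull out the marker and everything after it, then strip the marker
    # and any spaces that follow it.
    grep -o "${marker}.*" "${script}" | sed "s%${marker}\\s*%%"
}

# Throwaway test file with extra spaces before the description.
sample="$( mktemp )"
printf '%s\n' '#!/bin/bash' '#TEST_DESCRIPTION:   check free disk space' \
    'exit 0' > "${sample}"

extract_description "${sample}"   # prints: check free disk space
rm -f "${sample}"
```

A line such as "# TEST_DESCRIPTION: ..." (with a space after the comment marker) would not match, so that test would simply be listed without a description.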
For commands known to require elevated privileges, or to run as a different user, the tests should use the variables IS_ROOT and CAN_SUDO to determine the appropriate mechanism for running those commands (i.e. whether the command can be run directly or must be wrapped in sudo, and whether to use su or sudo to become another user). It should not be assumed that sudo is even installed on a system. Tests should bypass anything requiring elevated privileges (printing a warning message to STDERR) if no elevation route is available. That is, all tests should be runnable (even in limited form) by an unprivileged user, and should pass in the absence of any failures.
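As a rough illustration of that guidance, a test might wrap privileged commands in a small helper along the following lines. This is only a sketch: run_privileged is a hypothetical function, not something provided by the healthcheck script, and it treats unset variables as false so that the test also behaves sensibly when run outside the main script.

```shell
#!/bin/bash
# Hypothetical helper for use inside a test. Relies on the IS_ROOT and
# CAN_SUDO variables exported by the main script (defaulting to false).
run_privileged() {
    if [[ ${IS_ROOT:-false} = "true" ]]
    then
        # Already root: run the command directly.
        "$@"
    elif [[ ${CAN_SUDO:-false} = "true" ]]
    then
        # Non-interactive sudo: never prompts for a password.
        sudo -n "$@"
    else
        # No elevation route: warn and bypass, without failing the test.
        echo "Skipping '$*': no route to elevated privileges." >&2
        return 0
    fi
}

# Example use: prints the effective UID if elevation is possible,
# otherwise only the warning on STDERR.
run_privileged id -u
```

Returning zero in the bypass case follows the rule that an unprivileged run should pass in the absence of real failures; a test that needs to know whether the command actually ran would need a different convention.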
distribution
Example test which performs no checks but prints the detected distribution:
#!/bin/bash
#TEST_DESCRIPTION: display detected distribution information
# Just display the distribution - never fails (we could check if it's recognised?)
echo "Detected ${DIST_DISTRIBUTION} (${DIST_FAMILY} family) version" \
"${DIST_VERSION} (major version number ${DIST_VERSION_MAJOR})."
privileges
Example test which performs no checks but prints the detected capabilities of the current user:
#!/bin/bash
#TEST_DESCRIPTION: summary of the current user's detected privileges
if [[ ${IS_ROOT} = "true" ]]
then
am_root="is"
else
am_root="is not"
fi
# XXX bad practice - reusing variable name in different case.
if [[ ${CAN_SUDO} = "true" ]]
then
can_sudo="can"
else
can_sudo="cannot"
fi
echo "Current user ${am_root} root and ${can_sudo} sudo."
fail
Example test that always fails:
#!/bin/bash
#TEST_DESCRIPTION: always fail - for testing
echo "This test deliberately fails!" >&2
exit 2
Main script
In full, the main script that runs each test and summarises the failures:
#!/bin/bash
# Standard bash safety - disable accidental globbing, no uninitialised
# variables, errors are fatal, errors in pipes cause pipe to error
set -fueo pipefail
# Work out where the script is located
my_path="$( dirname "$( realpath "$0" )" )"
# run_test() and usage() functions
source "${my_path}/lib/functions.bash"
# Process command line options and populate the CONFIG variable
source "${my_path}/lib/command-line.bash"
# Colour support detection
source "${my_path}/lib/colour-support.bash"
# Privilege escalation detection
source "${my_path}/lib/privilege-escalation.bash"
# DIST_ variables
source "${my_path}/lib/distribution-detection.bash"
# Do tests
declare -a failed_tests=() # Array to keep a list of failing tests (initialised empty so it is safe under 'set -u')
#enable globbing
set +f
# Allow the '*' to match nothing (in the case of no tests exist)
shopt -s nullglob
# Globs are expanded alphabetically (see Bash manual), so no need to
# do anything special to run them in sequence.
for test_ in "${my_path}"/test.d/*
do
# Disable null-globbing (default) and turn off accidental globbing
# again.
shopt -u nullglob ; set -f
if ! run_test "${test_}"
then
failed_tests+=("$( basename "${test_}" )")
fi
done
# Just in case the loop didn't get entered - reinforce disabling
# accidental globbing.
shopt -u nullglob ; set -f
if [[ "${#failed_tests[@]}" -eq 0 ]]
then
echo "${COLOURS[underlined]}All tests passed.${COLOURS[reset]}"
exit 0
else
echo "${COLOURS[underlined]}Some tests failed.${COLOURS[reset]}"
echo "List of failed tests:"
for failure in "${failed_tests[@]}"
do
echo " * ${failure}"
done
# Use 2 to distinguish "some test failed" from "some unintended error
# occurred"
exit 2
fi
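A caller such as a cron job or scheduler wrapper could use that exit status convention to tell the three outcomes apart. The following is only a sketch: describe_status is a hypothetical helper, and the wording of the messages is arbitrary.

```shell
#!/bin/bash
# Hypothetical helper: translate the healthcheck's exit status into a
# human-readable outcome.
describe_status() {
    case "$1" in
        0) echo "all tests passed" ;;
        2) echo "one or more tests failed" ;;
        *) echo "healthcheck itself errored (status $1)" ;;
    esac
}

# Typical use (the path to 'run' is an assumption):
#   ./run --no-colour --no-unicode; describe_status $?
describe_status 2
```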
build-single-script
This script takes the main script, lib and all tests and bundles them into an archive, which is appended to a bash stub that makes the result a self-extracting and self-running script. It takes one mandatory argument: the name of the script to create.
#!/bin/bash
# Standard bash safety - disable accidental globbing, no uninitialised
# variables, errors are fatal, errors in pipes cause pipe to error
set -fueo pipefail
# Work out where the script is located
my_path="$( dirname "$( realpath "$0" )" )"
# Helper to print usage message
usage() {
cat - <<EOF
Usage:
$0 [-h|--help|script_name_to_create]
script_name_to_create: name of the script that will be created (must not
exist)
-h|--help: display this message and exit
EOF
}
if [[ $# -ne 1 ]]
then
# Wrong number of arguments
echo "Incorrect number of arguments: $#" >&2
usage >&2
exit 1
elif [[ $1 = '-h' ]] || [[ $1 = '--help' ]]
then
# Explicit help request
usage
exit 0
fi
script_name="$1"
if [[ -e ${script_name} ]]
then
echo "Error: ${script_name} already exists." >&2
exit 1
fi
# Script preamble - quoting the heredoc tag disables interpolations
cat - >"${script_name}" <<'END'
#!/bin/bash
# Standard bash safety - disable accidental globbing, no uninitialised
# variables, errors are fatal, errors in pipes cause pipe to error
set -fueo pipefail
# Helper to print usage message
usage() {
cat - <<EOF
Usage:
$0 [-h|--help] [-ttempdir|--tmpdir=tempdir] [--] [run_arguments]
-t|--tmpdir: Specify (as an argument to this option) the temporary
directory to extract to - must allow execution (i.e.
not be mounted "noexec"). Defaults to TMPDIR
environment variable, if set, or /tmp if not.
--: anything after this marker will be passed to the extracted
healthcheck script's command line options.
-h|--help: display this message and exit
EOF
}
# Process commandline
eval set -- "$(getopt -l tmpdir:,help -o t:h -- "$@")"
temp_dir="${TMPDIR:-/tmp}"
while [ $# -gt 0 ]
do
case "$1" in
-t | --tmpdir)
temp_dir="$2"
shift # need an extra shift - processing 2 arguments here
;;
-h | --help)
usage
exit 0
;;
--)
# End of arguments marker
shift
break
;;
esac
shift
done
# Processed our arguments and shifted the end of arguments marker -
# everything left is for the 'run' script
# Make a temporary directory for the files
out_dir="$( mktemp -d -p "${temp_dir}")"
# Extract the archive at the end of this script
echo "Extracting healthcheck to ${out_dir}..."
sed -e '1,/^__END__$/d' "$0" | tar -C "${out_dir}" -zx
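# Run the run script
set +e # Want to tidy up, even if run fails
# Run the script with all remaining (not consumed by our options
# processing) arguments.
echo "Running healthcheck..."
"${out_dir}/run" "$@"
run_status=$? # Remember the result so it can be passed on to the caller
set -e # Die if this script errors, though.
# Tidy up the temporary files
echo "Deleting ${out_dir}..."
rm -rf "${out_dir}"
echo "All done."
# End of script - archive follows. Exit explicitly so that bash never
# tries to interpret the archive, passing on the healthcheck's status.
exit "${run_status}"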
__END__
END
# Create the archive at the end of the file
tar -C "${my_path}" -zc run lib/ test.d/ >> "${script_name}"
echo "Created ${script_name}."
Here is what asking for the healthcheck script’s help with the self-extracting script looks like:
$ bash test -- -h
Extracting healthcheck to /tmp/tmp.UWWlnK7f8Y...
Running healthcheck...
Usage:
/tmp/tmp.UWWlnK7f8Y/run [-u|-U|--unicode|--no-unicode] [-c|-C|--colour|--no-colour] [-f|-F|--fatal-warnings|--no-fatal-warnings] [-h|--help]
-u|--unicode: use characters ✓, ✗ and ! to
report pass/fail/warnings (default)
-U|--no-unicode: use words [ ok ], [FAIL] and ([WARN])[warn] to report
pass/fail/(fatal)warnings
-c|--colour: force colour output
-C|--no-colour: force non-colour output
(default is to use colour if output is a terminal and reports colour support,
not otherwise)
-f|--fatal-warnings: warnings (e.g. test is not executable) are considered
failures
-F|--no-fatal-warnings: warnings (e.g. test is not executable) are printed
but not considered failures (default)
-h|--help: display this message and exit
Deleting /tmp/tmp.UWWlnK7f8Y...
All done.
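The self-extraction technique used by build-single-script can be seen in miniature in the sketch below: a stub script ends with exit and an __END__ marker, a gzipped tar archive is appended after the marker, and at run time sed strips everything up to the marker before handing the rest to tar. The file names (bundle, payload, message) are invented for the demonstration, GNU sed and tar are assumed, and -f - is spelled out rather than relying on tar's default archive.

```shell
#!/bin/bash
# Demonstration only: builds and runs a tiny self-extracting script.
demo_self_extract() {
    local work
    work="$( mktemp -d )"
    mkdir "${work}/payload"
    echo "hello from the payload" > "${work}/payload/message"

    # The stub: everything after the __END__ line is the archive. The
    # quoted heredoc tag stops $0 and $( ) being expanded here.
    cat > "${work}/bundle" <<'STUB'
#!/bin/bash
out_dir="$( mktemp -d )"
sed -e '1,/^__END__$/d' "$0" | tar -C "${out_dir}" -zxf -
cat "${out_dir}/message"
rm -rf "${out_dir}"
exit
__END__
STUB
    # Append the archive after the marker.
    tar -C "${work}/payload" -zcf - message >> "${work}/bundle"

    # Running the bundle extracts the payload and prints its contents.
    bash "${work}/bundle"
    rm -rf "${work}"
}

demo_self_extract
```

The explicit exit before the marker is what stops bash from reading on into the binary archive.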