Automatically shutting down server and NAS

At home we have a UPS (an APC Back-UPS BX1400UI) which protects our internet connectivity equipment, core network switch, one server and a network-attached storage (NAS) device. The server runs the backup tool (BackupPC) that backs up all of my local and cloud systems, with the backups stored on an iSCSI volume on the NAS (with monthly “off-site” copies).

In the event of a power outage, the server and NAS are protected to stop them going down hard - the theory being that if one of the other systems does not survive a sudden power failure, the latest backups should be safe. As the iSCSI connection goes over the network, I also protected the switch (which keeps our PoE wireless access points alive as well), so just adding the router to the UPS gives us internet access during a power outage, at least for the 30 minutes or so before the UPS reaches 10% charge and everything attached shuts down.

The biggest power drains are the server and NAS, so shutting these down early will both ensure they are shut down cleanly and extend the battery runtime. Network UPS Tools (NUT) allows customising the threshold at which devices are signalled to shut down. However, this is a server setting, so all clients are signalled at the same time. This is not ideal for me - I want to shut down some clients early (the server and NAS) and the others late (router, network switch), and to shut down some clients in sequence (server then NAS). Other than a custom script, I did not find a clear way to set this up.

Shutting down the server

The first step is to script shutting down the server. There are several ways to do this; one is to log in to the server and issue a poweroff command (which typically requires root privileges). The alternative, which I decided to go for, is to use the out-of-band management system to send an ACPI shutdown signal to the operating system.

The server is one of my HP MicroServers. I have previously set up automation to update the iLO’s SSL certificate, so I already have experience with the tools involved, which helps when using them for further automation.

Setting up tools to support powering off

As when I set up the SSL certificate automation, I created a new iLO user (I called it UPS) - this time with “Virtual Power and Reset” as its only privilege.
On my router, which is where the NUT server daemon runs, I installed the python-hpilo Debian package.
I created a configuration file for the HP iLO, with the username and password, following the same process as for updating the iLO’s SSL certificate (I put mine in /etc/nut/hp-ilo/, named after the iLO hostname and given permissions so only the nut user can read it).
I tested that we can get the power state of the server with the configuration file: ILO_HOST=my-ilo.home.domain.tld ; hpilo_cli -c /etc/nut/hp-ilo/${ILO_HOST}.ini $ILO_HOST get_host_power_status
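
The configuration file step above could be scripted roughly as follows. This is a minimal sketch, not the exact process I followed: the function name write_ilo_config is made up for illustration, and it assumes the [ilo] section with login and password keys that python-hpilo's hpilo_cli reads from its configuration file.

```shell
#!/bin/bash

# write_ilo_config DIR ILO_HOST LOGIN PASSWORD   (hypothetical helper)
# Writes DIR/ILO_HOST.ini in the format assumed for hpilo_cli, readable
# only by the file's owner.
write_ilo_config() {
    local dir="$1" host="$2" login="$3" password="$4"
    local file="$dir/$host.ini"

    # Create the file under a restrictive umask so the password is never
    # briefly world-readable, then force the mode in case it already existed.
    ( umask 077 && printf '[ilo]\nlogin = %s\npassword = %s\n' "$login" "$password" > "$file" ) || return 1
    chmod 600 "$file"
}

# Example (as root, then hand the file to the nut user):
#   write_ilo_config /etc/nut/hp-ilo my-ilo.home.domain.tld UPS 'the_password'
#   chown nut /etc/nut/hp-ilo/my-ilo.home.domain.tld.ini
```

The chown is left as a separate, commented step because it needs root and depends on the account NUT runs as on your system.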

A script to shut down the server

Once the above was set up and tested, I wrote a little script (/usr/local/sbin/nut-shutdown-hp-microserver) which shuts down the server and then waits a period for it to go down (or errors if the shutdown fails):

#!/bin/bash

ILO_CONFIG_PATH=/etc/nut/hp-ilo

usage() {
    cat - <<EOF
Usage: $0 ilo_hostname [...]

ilo_hostname: The DNS hostname or IP address of the iLO to shutdown.  Must have a corresponding configuration file in $ILO_CONFIG_PATH
EOF
}

if [[ -z $1 ]]
then
    usage
    exit 1
fi

# No error, unless we find one...
exit_status=0

while [[ -n $1 ]]
do
    ILO_HOST="$1"
    # Make sure we remove the current one from the list - we do not
    # want an infinite loop.
    shift

    # Check for a configuration file
    if [[ ! -e "$ILO_CONFIG_PATH/$ILO_HOST.ini" ]]
    then
        echo "No configuration file for $ILO_HOST!" >&2
        exit_status=1
        continue  # Skip to next host
    fi

    # Check it is not already off...
    POWER_STATE=$( hpilo_cli -c "$ILO_CONFIG_PATH/${ILO_HOST}.ini" "$ILO_HOST" get_host_power_status | tail -1 )
    if [[ $POWER_STATE == "OFF" ]]
    then
        # Host being off already is not an error, so not changing exit status
        echo "Host $ILO_HOST is not on, not shutting down." >&2
        continue  # Skip to next host
    elif [[ $POWER_STATE != "ON" ]]
    then
        echo "Host $ILO_HOST is in unrecognised power state $POWER_STATE!" >&2
        exit_status=1
        continue  # Skip to next host
    fi

    # Trigger the shutdown
    echo "Shutting down $ILO_HOST..."
    hpilo_cli -c "$ILO_CONFIG_PATH/${ILO_HOST}.ini" "$ILO_HOST" press_pwr_btn

    # Wait for shutdown to complete
    echo "Waiting for $ILO_HOST to power off..."
    counter=1
    okay=0
    # In my testing, it took around 24s to shutdown so 90s is a generous margin
    while [[ $counter -lt 90 ]]
    do
        sleep 1
        POWER_STATE=$( hpilo_cli -c "$ILO_CONFIG_PATH/${ILO_HOST}.ini" "$ILO_HOST" get_host_power_status | tail -1 )
        if [[ $POWER_STATE == "ON" ]]
        then
            echo -n .
        elif [[ $POWER_STATE == "OFF" ]]
        then
            echo
            echo "$ILO_HOST powered off after $counter seconds."
            okay=1
            break
        else
            echo
            echo "$ILO_HOST entered unknown power state $POWER_STATE after $counter seconds." >&2
            exit_status=1
            break
        fi
        counter=$(( $counter + 1 ))
    done
    if [[ $okay -ne 1 ]]
    then
        echo "Shutdown failed!" >&2
        exit_status=1
    fi
done
exit $exit_status
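
One thing to be aware of in the waiting loop above is that it counts polls rather than seconds: each hpilo_cli call takes time of its own, so 90 iterations can last noticeably longer than 90s. A minimal sketch of a wall-clock based alternative is below. The helper name wait_until_off is hypothetical, and it treats any probe failure as "off", so it does not distinguish the unknown power state case that the script handles.

```shell
#!/bin/bash

# wait_until_off TIMEOUT PROBE [ARGS...]   (hypothetical helper)
# Runs PROBE about once a second. Returns 0 as soon as PROBE fails (the
# host no longer looks "on"), or 1 if PROBE still succeeds once TIMEOUT
# seconds of wall-clock time have passed.
wait_until_off() {
    local timeout="$1"
    shift
    local start now
    start=$(date +%s)
    while "$@"
    do
        now=$(date +%s)
        if [ $(( now - start )) -ge "$timeout" ]
        then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Example with a probe that succeeds only while the iLO reports ON
# (ilo_is_on is an illustrative name, not part of python-hpilo):
#   ilo_is_on() {
#       [ "$( hpilo_cli -c "$ILO_CONFIG_PATH/$1.ini" "$1" get_host_power_status | tail -1 )" = "ON" ]
#   }
#   wait_until_off 90 ilo_is_on my-ilo.home.domain.tld || echo "Shutdown failed!" >&2
```

The same helper would also fit a port check, for example wait_until_off 180 nc -z -w 5 some_host 443.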

Shutting down the NAS

This is more complicated, due to the introduction of CSRF tokens in the ReadyNAS’s web interface and the lack of a programmatic method of access.

An alternative method is to enable SSH access, after which (according to the internet) you can run rnutil rn_shutdown to shut it down. However, if you try to enable SSH in the web configuration you get a stark warning that Netgear may refuse to provide warranty support if you do so.

A CSRF token can be obtained from /admin/csrf.html, for example with curl:

curl -u "admin:admin_password" -k https://readynas_hostname/admin/csrf.html

Somewhere in the webpage that is returned will be a piece of JavaScript with a CSRF token that can be used in subsequent requests:

<script type="text/javascript">
<!--
csrfInsert("csrfpId", "some_token_here");
//-->
</script>

My first instinct was to reach for Python at this point, and use the requests and Beautiful Soup libraries to parse the file, but it is very straightforward to get the token with a bit of sed:

curl -u "admin:admin_password" -k https://readynas_hostname/admin/csrf.html | sed -n 's/csrfInsert("csrfpId", "\([^"]\+\)");/\1/p'  
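
The sed expression can be tried offline against the sample page fragment shown earlier, without contacting the NAS. This is a small sketch: the function name extract_csrf_token is my own, and the pattern is a slight variation that also strips anything else on the matching line and avoids the GNU-only \+.

```shell
#!/bin/bash

# Reads HTML on stdin and prints the token passed to csrfInsert(), if any.
extract_csrf_token() {
    sed -n 's/.*csrfInsert("csrfpId", "\([^"]*\)");.*/\1/p'
}

# The sample fragment from above, standing in for the real csrf.html page.
sample='<script type="text/javascript">
<!--
csrfInsert("csrfpId", "some_token_here");
//-->
</script>'

token=$(printf '%s\n' "$sample" | extract_csrf_token)
if [ -z "$token" ]
then
    echo "No CSRF token found" >&2
else
    echo "Token: $token"   # prints: Token: some_token_here
fi
```

Checking for an empty token is worthwhile, as a failed login or a changed page layout would otherwise only show up as a rejected request later on.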

It can then be shut down with this call (modified from a post in the Netgear community forum):

csrf_token=$( curl -Ss -u "admin:admin_password" -k https://readynas_hostname/admin/csrf.html | sed -n 's/csrfInsert("csrfpId", "\([^"]\+\)");/\1/p' )
curl -Ss -u "admin:admin_password" -k https://readynas_hostname/dbbroker -H "csrfpid: $csrf_token" -H "X-Requested-With: XMLHttpRequest" --data '<?xml version="1.0" encoding="UTF-8"?><xs:nml xmlns:xs="http://www.netgear.com/protocol/transaction/NMLSchema-0.9" xmlns="urn:netgear:nas:readynasd" src="dpv_1445852944000" dst="nas"><xs:transaction id="njl_id_2269"><xs:custom id="njl_id_2268" name="Halt" resource-id="Shutdown" resource-type="System"><Shutdown halt="true" fsck="false"/></xs:custom></xs:transaction></xs:nml>'

A script to shut down the ReadyNAS

As with the server shutdown script, I created a configuration file containing the username and password, this time a netrc file called /etc/nut/readynas.netrc. As this is a netrc file, it inherently supports storing multiple hosts’ credentials in a single file, so there was no need to follow the per-host style used for the server. The basic format of each line of the netrc file is machine <hostname> login <username> password <password>, e.g. for the examples above it would be machine readynas_hostname login admin password admin_password.
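
Because all hosts share one netrc file, it can be handy to check that a host actually has an entry before calling curl; a missing entry would most likely only show up as an authentication failure. A minimal sketch, assuming the one-entry-per-line format described above (the function name netrc_has_host is hypothetical):

```shell
#!/bin/bash

# netrc_has_host FILE HOST   (hypothetical helper)
# Succeeds if FILE contains a line starting "machine HOST ...".
# Only understands the single-line entry format described above.
netrc_has_host() {
    local file="$1" host="$2"
    awk -v h="$host" '$1 == "machine" && $2 == h { found = 1 } END { exit !found }' "$file"
}

# Example:
#   netrc_has_host /etc/nut/readynas.netrc readynas_hostname \
#       || echo "No credentials for readynas_hostname" >&2
```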

I also created a dedicated user, again called ‘UPS’, for the purpose of shutting down the system, but as the ReadyNAS has no fine-grained access control my only option was to make this user an admin.

The script (/usr/local/sbin/nut-shutdown-readynas) is then relatively simple:

#!/bin/bash

NETRC_FILE=/etc/nut/readynas.netrc

usage() {
    cat - <<EOF
Usage: $0 readynas_hostname [...]

readynas_hostname: The DNS hostname or IP address of the ReadyNas to shutdown.
EOF
}

if [[ -z $1 ]]
then
    usage
    exit 1
fi

# No error, unless we find one...
exit_status=0

while [[ -n $1 ]]
do
    READYNAS_HOST="$1"
    # Make sure we remove the current one from the list - we do not
    # want an infinite loop.
    shift

    # Check it is on
    if ! nc -z -w 5 "$READYNAS_HOST" 443
    then
        echo "$READYNAS_HOST appears to already be off (or not exist?)." >&2
        continue  # Skip to next host
    fi

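    # Obtain csrf token
    csrf_token=$( curl -Ss --netrc-file "$NETRC_FILE" -k "https://$READYNAS_HOST/admin/csrf.html" | sed -n 's/csrfInsert("csrfpId", "\([^"]\+\)");/\1/p' )
    if [[ -z $csrf_token ]]
    then
        # Without a token the shutdown request would be rejected anyway
        echo "Could not obtain a CSRF token from $READYNAS_HOST." >&2
        exit_status=1
        continue  # Skip to next host
    fi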

    # Issue shutdown
    result=$( curl -Ss --netrc-file "$NETRC_FILE" -k https://$READYNAS_HOST/dbbroker -H "csrfpid: $csrf_token" -H "X-Requested-With: XMLHttpRequest" --data '<?xml version="1.0" encoding="UTF-8"?><xs:nml xmlns:xs="http://www.netgear.com/protocol/transaction/NMLSchema-0.9" xmlns="urn:netgear:nas:readynasd" src="dpv_1445852944000" dst="nas"><xs:transaction id="njl_id_2269"><xs:custom id="njl_id_2268" name="Halt" resource-id="Shutdown" resource-type="System"><Shutdown halt="true" fsck="false"/></xs:custom></xs:transaction></xs:nml>' )
    echo "$result" | grep -q '<xs:response ref-id="njl_id_2268" status="success">'
    if [[ $? -ne 0 ]]
    then
        echo "An error occurred shutting down the ReadyNas, the response was: $result" >&2
        exit_status=1
        continue # Skip to next host
    fi

    # Wait for shutdown to complete
    echo "Waiting for ReadyNas to shutdown..."
    counter=1
    okay=0
    # In my testing, it took around 51s to shutdown so 180s is a generous margin
    while [[ $counter -lt 180 ]]
    do
        sleep 1
        if nc -z -w 5 "$READYNAS_HOST" 443
        then
            echo -n .
        else
            echo
            echo "$READYNAS_HOST appears to have shutdown after $counter seconds (is no longer contactable on web interface)."
            okay=1
            break
        fi
        counter=$(( $counter + 1 ))
    done
    if [[ $okay -ne 1 ]]
    then
        echo "Shutdown failed!" >&2
        exit_status=1
    fi

done
exit $exit_status

Automatic shutdown with NUT

Having scripted shutting down the server and NAS, the final piece of the puzzle is to shut them down automatically when the power fails. NUT has quite a few configuration options, including running a script on certain events. Running the shutdown directly on the event would, however, also trigger it for a brown-out or very short-lived interruption. NUT bundles a tool, upssched, which smooths these out and can be configured to trigger a handler only after a certain period without power. Doing this is described in section 7.2 “The advanced approach, using upssched” of the NUT user manual.

My first step was to configure NUT to run upssched when an event occurs. This is done by setting NOTIFYCMD to the upssched program and setting the EXEC flag on each event in /etc/upsmon.conf on the master (although I presume it would work anywhere, the master is guaranteed to remain up longest - or the clients would not be able to continue monitoring it):

NOTIFYCMD /sbin/upssched
NOTIFYFLAG ONLINE SYSLOG+EXEC
NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
NOTIFYFLAG LOWBATT SYSLOG+WALL+EXEC
NOTIFYFLAG COMMOK SYSLOG+EXEC
NOTIFYFLAG COMMBAD SYSLOG+WALL+EXEC
NOTIFYFLAG REPLBATT SYSLOG+WALL+EXEC
NOTIFYFLAG NOCOMM SYSLOG+WALL+EXEC

Next I configured upssched, firstly creating a handler script (/usr/local/sbin/nut-upssched-handler) which will shut down my servers:

#!/bin/bash

case $1 in

  onbatt)
    logger -t nut-upssched-handler "The UPS has been on battery for a while - shutting down servers and NAS devices"
    # Do the shutdown...

    # Shutdown servers
    /usr/local/sbin/nut-shutdown-hp-microserver my-servers-ilo.home.domain.tld
    # Check shutdown succeeded before shutting down NAS
    if [[ $? -ne 0 ]]
    then
      logger -p user.err -t nut-upssched-handler "Shutdown of servers failed, not proceeding to shutdown NAS"
      exit 1
    fi

    # Shutdown NAS
    /usr/local/sbin/nut-shutdown-readynas ready_nas_hostname.home.domain.tld
    ;;

  notify-*)
    event=$( echo $1 | sed 's/^notify-//; s/-/ /g' )
    logger -t nut-upssched-handler "Notification for event $event triggered."
    echo "UPS notification - event $event has been triggered." | mail -s "UPS $event" root@localhost
    ;;

  *)
    logger -t nut-upssched-handler "Unrecognized command: $1"
    ;;

esac
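
The ordering logic in the onbatt branch (NAS only after the server has gone down successfully) can be exercised without touching any real hardware. The sketch below is illustrative only: shutdown_in_sequence and the fake_* functions are hypothetical stand-ins for the two real shutdown scripts.

```shell
#!/bin/bash

# shutdown_in_sequence SERVER_CMD NAS_CMD   (hypothetical helper)
# Runs SERVER_CMD first and runs NAS_CMD only if SERVER_CMD succeeded.
shutdown_in_sequence() {
    local server_cmd="$1" nas_cmd="$2"
    if ! "$server_cmd"
    then
        echo "Shutdown of servers failed, not proceeding to shutdown NAS" >&2
        return 1
    fi
    "$nas_cmd"
}

# Stubs standing in for nut-shutdown-hp-microserver and nut-shutdown-readynas.
fake_server_ok()   { echo "server: off"; }
fake_server_fail() { echo "server: still on"; return 1; }
fake_nas()         { echo "nas: off"; }

# Server goes down, so the NAS follows.
shutdown_in_sequence fake_server_ok fake_nas

# Server fails to go down, so the NAS is left alone.
shutdown_in_sequence fake_server_fail fake_nas || echo "sequence aborted"
```

Leaving the NAS up when the server shutdown fails matters because the server may still have the iSCSI volume in use.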

Then I created a secure directory for the lock and pipe files - note the comments in the manual and upssched.conf man pages regarding the need to secure these:

mkdir -p /var/run/nut/upssched
chown -R nut:nut /var/run/nut
chmod 750 /var/run/nut/upssched

And finally created the configuration file for upssched:

CMDSCRIPT /usr/local/sbin/nut-upssched-handler

# This sets the file name of the FIFO that will pass communications between
# processes to start and stop timers.  This should be set to some path where
# normal users can't create the file, due to the possibility of symlinking
# and other evil.
PIPEFN /var/run/nut/upssched/upssched.pipe
LOCKFN /var/run/nut/upssched/upssched.lock

# Trigger early shutdowns after being on battery for 15s
AT ONBATT * START-TIMER onbatt 15
AT ONLINE * CANCEL-TIMER onbatt

# Notifications
AT ONLINE * EXECUTE notify-online
AT ONBATT * EXECUTE notify-on-battery
AT LOWBATT * EXECUTE notify-low-battery
AT COMMOK * EXECUTE notify-communication-restored
AT COMMBAD * EXECUTE notify-communication-lost
AT REPLBATT * EXECUTE notify-replace-battery
AT NOCOMM * EXECUTE notify-unavailable

Testing

The final step is to test this. I tested the server and NAS shutdown scripts as I wrote them, so the only test remaining was the power-failure test. This I did by simply turning off the UPS plug at the wall, monitoring the system and mail logs to ensure the notifications were logged and sent and then, after 15s, watching the server and NAS shut down in sequence before I turned the power back on.