ZFS Monitoring and Alerts on Proxmox

I’m running ZFS on Proxmox, and no monitoring or alerting comes out of the box. For now I run this script from cron to keep up with the health of my ZFS pools. It’s located at ryanburnette/zfs_alerts.sh and was adapted from petervanderdoes/zfs_health.sh.

One valuable item gathered here is a list of what to check for:

  1. Condition
  2. Capacity
  3. Errors
  4. Scrub Expiration
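
As an illustration of the capacity item, here is a minimal sketch that reports which pool is over the limit rather than only that some pool is. The function name over_capacity is made up for this example, and the input is canned text in the tab-separated format that `zpool list -H -o name,capacity` prints, so it runs without a live pool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "name<TAB>capacity%" lines on stdin (the format
# of `zpool list -H -o name,capacity`) and prints every pool whose usage is
# at or above the given limit.
over_capacity() {
  local limit="$1" name cap
  while IFS=$'\t' read -r name cap; do
    cap="${cap%\%}"                 # strip the trailing percent sign
    if [ "$cap" -ge "$limit" ]; then
      printf '%s %s%%\n' "$name" "$cap"
    fi
  done
}

# Canned input standing in for a live pool listing:
printf 'rpool\t42%%\ntank\t85%%\n' | over_capacity 80
# prints: tank 85%
```

Against a real system the input would presumably come from `/sbin/zpool list -H -o name,capacity | over_capacity 80`.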

The original script stops checking as soon as it finds a problem. Mine runs every check separately, so one problem doesn’t hide another. I’m also using curl to send emails via the Mailgun API.
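
The idea of checking everything separately can be sketched roughly like this. The check_* functions are hypothetical stand-ins with hard-coded results, not the real checks; the point is only that the loop keeps going after a check reports a problem.

```bash
#!/usr/bin/env bash
# Sketch: run every check even when an earlier one reports a problem.
# Each hypothetical check prints a message when something is wrong and
# prints nothing when all is well.
check_condition() { echo "tank is DEGRADED"; }
check_capacity()  { :; }                      # no output: nothing to report
check_scrub()     { echo "tank needs scrub"; }

alerts=()
for check in check_condition check_capacity check_scrub; do
  output="$("$check")"
  if [ -n "$output" ]; then
    # Record the finding and continue with the next check.
    alerts+=("${check#check_}: ${output}")
  fi
done

printf '%s\n' "${alerts[@]}"
```

Here both the condition and the scrub findings end up in the alerts array, where an early exit would have reported only the first.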

I plan to rewrite this in Go.

#!/usr/bin/env bash

hostname=$(hostname)
emaildomain=''
alertemail=''
mailgunapikey=''

# max capacity % before getting capacity alert
maxCapacity=80

# age of scrub (in seconds) before getting scrub alert
scrubExpire=1209600

# Send an alert through the Mailgun messages API.
# $1 = subject, $2 = message body
function email {
  curl -s --user "api:${mailgunapikey}" \
    "https://api.mailgun.net/v3/${emaildomain}/messages" \
    -F from="zfs alerts <no-reply@${emaildomain}>" \
    -F to="${alertemail}" \
    -F subject="$1" \
    -F text="$2"
  printf "\n"
}

currentDate=$(date +%s)
zfsVolumes=$(/sbin/zpool list -H -o name)
capacity=$(/sbin/zpool list -H -o capacity)
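# Keep device lines whose READ, WRITE, or CKSUM count is non-zero. Comparing each
# column avoids missing counts such as 10/0/0, which concatenate to "1000".
errors=$(/sbin/zpool status | grep ONLINE | grep -v state | awk '$3 != 0 || $4 != 0 || $5 != 0')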

for volume in ${zfsVolumes}
do
  condition=$(/sbin/zpool status "${volume}" | grep -E -i '(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover)')
  if [ "${condition}" ]; then
    email "zfs alerts ${hostname} check condition of ${volume}" "${condition}"
  fi
done

capacityExceeded=false
for line in ${capacity//%/}
do
  if [ "$line" -ge "$maxCapacity" ]; then
    capacityExceeded=true
  fi
done

if [ "$capacityExceeded" = true ] ; then
  email "zfs alerts ${hostname} capacity exceeded" "${capacity}"
fi

if [ "${errors}" ]; then
  email "zfs alerts ${hostname} drive errors" "${errors}"
fi

for volume in ${zfsVolumes}
do
  # Use continue, not break, so the remaining volumes are still checked.
  if [ "$(/sbin/zpool status "$volume" | grep -E -c "none requested")" -ge 1 ]; then
    echo "ERROR: You need to run \"zpool scrub $volume\" before this script can monitor the scrub expiration time."
    continue
  fi
  if [ "$(/sbin/zpool status "$volume" | grep -E -c "scrub in progress|resilver")" -ge 1 ]; then
    continue
  fi

  ### FreeBSD with *nix supported date format
  #scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $15 $12 $13}')
  #scrubDate=$(date -j -f '%Y%b%e-%H%M%S' $scrubRawDate'-000000' +%s)

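  ### Linux with GNU date
  # The completion date is the last five fields of the scrub line. Counting from
  # the end avoids depending on how the ZFS version words the scrub duration.
  scrubRawDate=$(/sbin/zpool status "$volume" | grep scrub | awk '{print $(NF-4)" "$(NF-3)" "$(NF-2)" "$(NF-1)" "$NF}')
  scrubDate=$(date -d "$scrubRawDate" +%s)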

  if [ $(($currentDate - $scrubDate)) -ge $scrubExpire ]; then
    email "zfs alerts ${hostname} volume needs scrub" "${volume} needs scrub"
  fi
done
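
The scrub expiration arithmetic is easier to test when it is pulled out of the loop. Below is a minimal sketch under a few assumptions: the function name scrub_expired is invented for this example, the timestamps are made up, and it relies on GNU date (the `-d` option), as the script above does.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the last scrub is at least
# max_age seconds old. Returns 2 if GNU date cannot parse the timestamp.
scrub_expired() {
  local last_scrub="$1" now="$2" max_age="$3" last_epoch
  last_epoch=$(date -d "$last_scrub" +%s) || return 2
  [ $(( now - last_epoch )) -ge "$max_age" ]
}

export TZ=UTC   # fixed timezone so the example is reproducible
now=$(date -d "2021-01-24 00:25:24" +%s)

# 1209600 seconds = 14 days, the same value as scrubExpire above.
if scrub_expired "Sun Jan 10 00:25:24 2021" "$now" 1209600; then
  echo "scrub is 14 or more days old"
fi
```

Passing the current time in as an argument, instead of calling date inside the function, is what lets the example run against fixed dates.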