I’m running ZFS on Proxmox, and no monitoring or alerting comes out of the box. Right now I run this script from cron to keep up with the health of my ZFS pools. It’s located at ryanburnette/zfs_alerts.sh and was adapted from petervanderdoes/zfs_health.sh.
One valuable thing gathered here is the list of what to check for:
- Condition
- Capacity
- Errors
- Scrub Expiration
The original script stops checking as soon as it finds a problem. Mine runs every check separately, so one alert doesn’t hide another. I’m also using curl to send emails via the Mailgun API.
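To illustrate the difference, here is a minimal sketch of the run-every-check idea. The function names (`check_condition`, `check_capacity`, `run_checks`) are hypothetical and are not part of my script; they take canned text instead of calling `zpool`, so the pattern can be tried on any machine.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers fed with canned text, not real zpool output.

check_condition() {
  # $1: pool name, $2: status text to scan for trouble keywords
  if printf '%s\n' "$2" | grep -Eiq 'DEGRADED|FAULTED|OFFLINE|UNAVAIL'; then
    printf 'condition: %s\n' "$1"
  fi
}

check_capacity() {
  # $1: pool name, $2: capacity such as "85%", $3: threshold in percent
  local used=${2%\%}
  if [ "$used" -ge "$3" ]; then
    printf 'capacity: %s at %s%%\n' "$1" "$used"
  fi
}

run_checks() {
  # Every check runs, whatever the previous one reported.
  check_condition "$1" "$2"
  check_capacity "$1" "$3" "$4"
}

run_checks tank "state: DEGRADED" "85%" 80
```

With the sample arguments, both findings are reported in one run rather than only the first.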
I plan to rewrite this in Go.
#!/usr/bin/env bash
hostname=$(hostname)
emaildomain=''
alertemail=''
mailgunapikey=''
# max capacity % before getting capacity alert
maxCapacity=80
# age of scrub (in seconds) before getting scrub alert
scrubExpire=1209600
# send an alert through the Mailgun API; $1 is the subject, $2 is the body
function email {
  curl -s --user "api:${mailgunapikey}" \
    "https://api.mailgun.net/v3/${emaildomain}/messages" \
    -F from="zfs alerts <no-reply@${emaildomain}>" \
    -F to="${alertemail}" \
    -F subject="$1" \
    -F text="$2"
  printf "\n"
}
currentDate=$(date +%s)
zfsVolumes=$(/sbin/zpool list -H -o name)
capacity=$(/sbin/zpool list -H -o capacity)
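# concatenated READ/WRITE/CKSUM counters of ONLINE devices; drop rows that are exactly 0 0 0
# (anchoring the pattern keeps a row like 10 0 0 from being mistaken for a clean one)
errors=$(/sbin/zpool status | grep ONLINE | grep -v state | awk '{print $3 $4 $5}' | grep -v '^000$')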
# condition: alert for each pool whose status mentions a trouble keyword
for volume in ${zfsVolumes}
do
  condition=$(/sbin/zpool status "${volume}" | grep -Ei '(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover)')
  if [ "${condition}" ]; then
    email "zfs alerts ${hostname} check condition of ${volume}" "${condition}"
  fi
done
# capacity: alert once if any pool is at or above maxCapacity
capacityExceeded=false
for line in ${capacity//%/}
do
  if [ "$line" -ge "$maxCapacity" ]; then
    capacityExceeded=true
  fi
done
if [ "$capacityExceeded" = true ]; then
  email "zfs alerts ${hostname} capacity exceeded" "${capacity}"
fi
# errors: alert if any ONLINE device reports non-zero error counters
if [ "${errors}" ]; then
  email "zfs alerts ${hostname} drive errors" "${errors}"
fi
# scrub expiration: alert for each pool whose last scrub is older than scrubExpire
for volume in ${zfsVolumes}
do
  if [ "$(/sbin/zpool status "$volume" | grep -Ec "none requested")" -ge 1 ]; then
    echo "ERROR: You need to run \"zpool scrub $volume\" before this script can monitor the scrub expiration time."
    # skip this pool only, so the remaining pools are still checked
    continue
  fi
  if [ "$(/sbin/zpool status "$volume" | grep -Ec "scrub in progress|resilver")" -ge 1 ]; then
    # a scrub or resilver is running; skip this pool only
    continue
  fi
  ### FreeBSD with *nix supported date format
  #scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $15 $12 $13}')
  #scrubDate=$(date -j -f '%Y%b%e-%H%M%S' $scrubRawDate'-000000' +%s)
  ### Ubuntu with GNU supported date format
  scrubRawDate=$(/sbin/zpool status "$volume" | grep scrub | awk '{print $13" "$14" " $15" " $16" "$17}')
  scrubDate=$(date -d "$scrubRawDate" +%s)
  if [ $((currentDate - scrubDate)) -ge "$scrubExpire" ]; then
    email "zfs alerts ${hostname} volume needs scrub" "${volume} needs scrub"
  fi
done
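The scrub check is the part most tied to the exact `zpool status` output, so here is a small sketch for trying the date parsing on its own. It assumes GNU `date` and a `scan:` line shaped like the sample below (the sample line and the helper names `scrub_epoch` and `scrub_expired` are assumptions for illustration); other ZFS versions may put the date in different fields.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, assuming GNU date and this scan-line layout.

scrub_epoch() {
  # $1: a "scan:" line; fields 13-17 hold the completion date in this layout
  local raw
  raw=$(printf '%s\n' "$1" | awk '{print $13" "$14" "$15" "$16" "$17}')
  date -d "$raw" +%s
}

scrub_expired() {
  # $1: current time, $2: scrub time (both epoch seconds), $3: max age in seconds
  [ $(( $1 - $2 )) -ge "$3" ]
}

# Assumed sample line, not captured from a real pool.
sample='  scan: scrub repaired 0B in 0 days 00:00:01 with 0 errors on Sun Mar 10 00:24:02 2019'

last=$(scrub_epoch "$sample")
if scrub_expired "$(date +%s)" "$last" 1209600; then
  echo "sample scrub is older than 14 days"
fi
```

If `scrub_epoch` prints nothing or `date` complains, the field numbers in the `awk` call probably need adjusting for that system's output.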