/proc/mdstat is another of the files that the node exporter exposes as metrics.

The Linux software RAID metrics are among the more intricate to parse, as /proc/mdstat is better suited to humans than machines, which you can get a sense of from the unit test fixture. For one of my home RAID1 arrays the node exporter produces:
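To give a feel for the parsing involved, here is a minimal sketch in Python. The sample text is an illustrative healthy two-disk RAID1 entry, not the fixture or my actual file, and parse_mdstat is a hypothetical helper rather than how the node exporter (which is written in Go) does it. It only handles the simple case; real files can also carry recovery progress lines, bitmaps and other RAID levels.

```python
import re

# Illustrative /proc/mdstat content for a healthy two-disk RAID1 array (assumed sample).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      33538048 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

# Matches e.g. "md0 : active raid1 sdb1[1] sda1[0]"
DEVICE_RE = re.compile(r"^(md\S+) : (\S+)")
# Matches e.g. "      33538048 blocks super 1.2 [2/2] [UU]"
# The [n/m] pair is disks required / disks currently active.
BLOCKS_RE = re.compile(r"^\s+(\d+) blocks .*\[(\d+)/(\d+)\]")


def parse_mdstat(text):
    """Return {device: {...}} for each md array found in text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            current = m.group(1)
            arrays[current] = {"state": m.group(2)}
            continue
        m = BLOCKS_RE.match(line)
        if m and current is not None:
            arrays[current].update(
                blocks=int(m.group(1)),
                disks_required=int(m.group(2)),
                disks_active=int(m.group(3)),
            )
    return arrays


if __name__ == "__main__":
    print(parse_mdstat(SAMPLE))
```

Note how the values for one array are spread over two lines with different layouts, so the parser has to remember which device it is currently inside.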

# HELP node_md_blocks Total number of blocks on device.
# TYPE node_md_blocks gauge
node_md_blocks{device="md0"} 3.3538048e+07
# HELP node_md_blocks_synced Number of blocks synced on device.
# TYPE node_md_blocks_synced gauge
node_md_blocks_synced{device="md0"} 3.3538048e+07
# HELP node_md_disks Number of active/failed/spare disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md0",state="active"} 2
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0

node_md_disks is the primary metric of interest, as you will want to know if there are any failed disks. Depending on your setup you may also know that there must be a minimum number of spares and/or active devices. A RAID1 array with only one active disk is not exactly in prime health, after all.
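The checks described above can be sketched in a few lines of Python. This is only an illustration of the logic, assuming you already have the sample values to hand as plain dictionaries; find_md_problems and min_spares are hypothetical names, and in practice you would more likely express the same conditions as alerting rules.

```python
def find_md_problems(disks, required, min_spares=0):
    """List health problems per md device.

    disks:    {(device, state): count}, as in node_md_disks
    required: {device: count}, as in node_md_disks_required
    """
    problems = []
    for device, need in sorted(required.items()):
        failed = disks.get((device, "failed"), 0)
        active = disks.get((device, "active"), 0)
        spare = disks.get((device, "spare"), 0)
        if failed > 0:
            problems.append(f"{device}: {failed} failed disk(s)")
        if active < need:
            problems.append(f"{device}: {active} of {need} required disks active")
        if spare < min_spares:
            problems.append(f"{device}: {spare} spare(s), want at least {min_spares}")
    return problems


if __name__ == "__main__":
    # Values taken from the md0 output shown earlier.
    disks = {("md0", "active"): 2, ("md0", "failed"): 0, ("md0", "spare"): 0}
    print(find_md_problems(disks, {"md0": 2}))
```

With the md0 values from earlier this reports no problems, while a RAID1 array down to one active disk would be flagged as having fewer active disks than required.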

node_md_state indicates whether the array is recovering using a spare, resyncing after an event such as an unclean shutdown, active and healthy, or disabled and inactive. While the array is recovering or resyncing, node_md_blocks and node_md_blocks_synced can tell you how the sync is progressing (and whether /proc/sys/dev/raid/speed_limit_max may need a tweak). These are not something to alert on, but could be useful for graphing rather than having to eyeball mdadm --detail, and also to get some of the history of the array.
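As a rough sketch of what those two block metrics let you work out, the Python below computes the synced fraction and a naive time remaining from two samples taken some seconds apart. The function names are hypothetical, and the estimate assumes the sync rate stays constant, which it often will not.

```python
def sync_progress(blocks_synced, blocks):
    """Fraction of the array that is in sync, from 0.0 to 1.0."""
    if blocks <= 0:
        return None  # no meaningful size reported
    return blocks_synced / blocks


def eta_seconds(synced_then, synced_now, blocks, interval_seconds):
    """Naive time remaining, from two node_md_blocks_synced samples.

    Returns None if the sync made no progress over the interval.
    """
    rate = (synced_now - synced_then) / interval_seconds  # blocks per second
    if rate <= 0:
        return None
    return (blocks - synced_now) / rate


if __name__ == "__main__":
    # md0 from the output above is fully synced.
    print(sync_progress(3.3538048e+07, 3.3538048e+07))
```

A sync rate that sits well below what the disks can manage is the sort of thing that might prompt a look at speed_limit_max.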

Need help monitoring hardware? Contact us.