The simplest (insecure) monitoring system you can build in 41 lines of code

Any distributed system is fated to become a monitoring system of itself at first.
-- unknown


What are a monitoring system and a distributed system ?

A monitoring system is a system that detects faults. A distributed system is a fault-tolerant system that aims at being decentralized (which is pretty much a design aesthetic).

A fault-tolerant system has, by nature, to deal with faults, sometimes proactively, so it must be able to monitor them. Hence, a distributed system always begins its life as a trunk of monitoring onto which you hook watchdogs that help your system adapt.
Cloudflare's anti-DoS infrastructure is basically one of these systems: both monitoring and distributed fault tolerance.
I strongly suspect the Netflix infrastructure was designed around a monitoring system.

Anyway, our task here is to design the foundation of a distributed system where probes and consumers can live without caring about each other.

The probe

The probe reads the CPU load (the load averages) on both Linux and FreeBSD.

The Linux probe

$cat probe.sh 
#!/usr/bin/env bash
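# /proc/loadavg starts with the 1-, 5- and 15-minute load averages;
# load_5, load_10 and load_15 are only the data source names used here.
for (( i=0 ; i<3; i++ )); do
    echo $( hostname ):load_$(( 5 * ( i + 1) )):$( cat /proc/loadavg | cut -d " " -f $(( i + 1 )) ):GAUGE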
done
01:29:14 11616 jul@badazz:~/src/pubsub
$./probe.sh 
badazz:load_5:0.97:GAUGE
badazz:load_10:1.17:GAUGE
badazz:load_15:1.11:GAUGE
The probe output has the following «:»-delimited format:
  • the id of the probe
  • the name of the data source
  • its value
  • its intended data source type for RRD (GAUGE here)
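
To make the format concrete, here is a minimal sketch of how a consumer could split one probe line with the shell's own field splitting. The helper name parse_probe_line is hypothetical (it is not part of the scripts in this post), and the sketch assumes that neither the hostname nor the value contains a «:».

```shell
#!/usr/bin/env bash
# parse_probe_line is a hypothetical helper, not part of the original scripts.
# It splits one ":"-delimited probe line into its four fields.
parse_probe_line() {
    local source ds value dst
    IFS=: read -r source ds value dst <<< "$1"
    # Reject lines that do not carry all four fields.
    [ -n "$source" ] && [ -n "$ds" ] && [ -n "$value" ] && [ -n "$dst" ] || return 1
    printf 'source=%s ds=%s value=%s type=%s\n' "$source" "$ds" "$value" "$dst"
}

# One line taken from the Linux probe output shown above.
parse_probe_line "badazz:load_5:0.97:GAUGE"
```

A line with fewer than four fields makes the function return a non-zero status, which is a cheap way to drop garbage before it reaches a database.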

The FreeBSD probe

[user@freebsd-14 ~]$ cat probe.sh 
#!/usr/bin/env bash
for (( i=0 ; i<3;i++)); do
        echo $( hostname ):load_$(( 5 * ( i + 1) )):$( sysctl vm.loadavg | cut -d " " -f $(( i + 3 )) ):GAUGE
done
[user@freebsd-14 ~]$ ./probe.sh 
freebsd-14.0_RELEASE:load_5:0.79:GAUGE
freebsd-14.0_RELEASE:load_10:0.81:GAUGE
freebsd-14.0_RELEASE:load_15:0.83:GAUGE

The monitor

The monitor takes the output of the probe and consumes it. Here we make a simple dumper that writes the input to its destinations (CSV files and RRD).
#!/usr/bin/env bash
echo "$$" > mastermind.pid
trap "echo ping" SIGUSR1
function create_rrd() {
    # $1: probe id, $2: data source name, $3: RRD data source type
    local SOURCE=$1
    local DS=$2
    local DST=$3
    rrdtool create "$SOURCE.$DS.rrd" \
        --step 1 \
        "DS:x:$DST:1:0:99" \
        RRA:AVERAGE:0.5:1:864000 \
        RRA:AVERAGE:0.5:60:129600 \
        RRA:AVERAGE:0.5:3600:13392 \
        RRA:AVERAGE:0.5:86400:3660
}
while read -r p; do
    # split the «:»-delimited probe line (no space is allowed after "=")
    SOURCE="$( echo "$p" | cut -d':' -f1 )"
    DS="$( echo "$p" | cut -d':' -f2 )"
    VALUE="$( echo "$p" | cut -d':' -f3 )"
    DST="$( echo "$p" | cut -d':' -f4 )"
    OFB="${SOURCE}.${DS}"
    [ -e "$OFB.rrd" ] || create_rrd "$SOURCE" "$DS" "$DST"
    NOW="$( date +'%s' )"
    # rrd insert
    rrdtool update "$OFB.rrd" "$NOW:$VALUE"
    # csv insert
    echo "$NOW $VALUE" >> "$OFB.csv"
done
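
The CSV files written by the monitor hold one "epoch value" pair per line, so they are easy to consume with standard tools. As a rough sketch, here is a way to average a data source with awk. The helper name csv_average is hypothetical and the sample values in the demo are made up, not real measurements.

```shell
#!/usr/bin/env bash
# csv_average is a hypothetical helper, not part of the original scripts.
# It expects the "epoch value" lines that the monitor appends to its .csv files.
csv_average() {
    LC_ALL=C awk 'NF >= 2 { sum += $2; n++ }
                  END { if (n) printf "%.2f\n", sum / n; else exit 1 }' "$1"
}

# Demo on a throwaway file with made-up samples (not real measurements).
demo=$(mktemp)
printf '%s\n' "1700000000 1.0" "1700000001 2.0" > "$demo"
csv_average "$demo"
rm -f "$demo"
```

On a real run you would point it at a file such as badazz.load_5.csv.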

The local monitoring solution

You can create a local measurement loop by running:
while [ 1 ]; do
    ./probe.sh | ./monitor.sh ;
    sleep .9;
done
We sleep just under one second (0.9 s) because my round robin archive (RRA) is set for a measure down to the second.
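
To see what the RRA lines of the monitor actually keep, the retention of each archive can be computed as step × steps per row × rows, using the RRA:AVERAGE:xff:steps:rows layout. This is only an arithmetic sketch; the helper name rra_retention_seconds is hypothetical.

```shell
#!/usr/bin/env bash
# Retention of an RRA = step (seconds) * steps per row * number of rows.
STEP=1   # matches "--step 1" in create_rrd

# rra_retention_seconds is a hypothetical helper, not part of the original scripts.
rra_retention_seconds() {
    local steps_per_row=$1 rows=$2
    echo $(( STEP * steps_per_row * rows ))
}

# The four steps:rows pairs used by the monitor.
for rra in 1:864000 60:129600 3600:13392 86400:3660; do
    IFS=: read -r per_row rows <<< "$rra"
    secs=$( rra_retention_seconds "$per_row" "$rows" )
    echo "steps_per_row=$per_row rows=$rows -> $(( secs / 86400 )) days"
done
```

With these settings the per-second archive covers 10 days, the per-minute one 90 days, the hourly one 558 days and the daily one 3660 days.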

Distributing the measure


What if the probes were using UDP4 broadcasting and the monitor was just ... lurking?

There is a recipe for exactly this in the socat documentation:

Example 2: Broadcast client and servers

Broadcast server:
socat UDP4-RECVFROM:6666,broadcast,fork EXEC:hostname
This command receives packets addressed to a local broadcast address and forks a child process for each. The child processes may each send one or more reply packets back to the particular sender. Run this command on a number of hosts, and they will all respond in parallel.

Broadcast client:
socat STDIO UDP4-DATAGRAM:192.168.10.255:6666,broadcast,range=192.168.10.0/24
This process transfers data from stdin to the broadcast address, and transfers packets received from the local network to stdout.

It's almost ready: on the computers where you want to deploy a probe, just do:
socat UDP4-RECVFROM:6666,broadcast,fork EXEC:"./probe.sh"
To deploy a monitor do :
socat EXEC:"./monitor.sh" UDP4-DATAGRAM:192.168.1.255:6666,broadcast,range=192.168.1.255/24 
Now, every time the monitor writes something on stdout, all the probes on the broadcast address will spill their data.

Oops, the monitor does NOT write ... Ah? Really? Except if you send it a SIGUSR1 with kill (trap "echo ping" SIGUSR1)... But wait ... we can actually launch as many monitors as we have available IPs on the network ... But if ALL the monitors talk, we are going to spam ourselves ...

Well, while we wait for a clean adaptive solution that monitors potential faults (a net split, for instance) to decide when to trigger the clock signal, we will stick with a dumb one:
$ cat clock.sh 
#!/usr/bin/env bash
while [ 1 ]; do
    sleep 1
    MASTERID=$( cat mastermind.pid )
    kill -s SIGUSR1 $MASTERID
done
# clock.sh
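
One fragile spot of this dumb clock is that it trusts mastermind.pid blindly. As a hedged sketch (send_tick is a hypothetical helper, not part of the scripts above), the clock could check that the PID file points at a live process before sending the signal:

```shell
#!/usr/bin/env bash
# send_tick is a hypothetical helper: it signals the monitor only if the
# PID file is readable and points at a process we are allowed to signal.
send_tick() {
    local pidfile=$1 pid
    [ -r "$pidfile" ] || return 1
    pid=$( cat "$pidfile" )
    # kill -0 sends no signal; it only checks that the process can be signalled.
    kill -0 "$pid" 2>/dev/null || return 1
    kill -s USR1 "$pid"
}

# Demo: this shell plays the monitor, with its own trap and a temporary PID file.
ticks=0
trap 'ticks=$(( ticks + 1 ))' USR1
pidfile=$( mktemp )
echo "$$" > "$pidfile"
send_tick "$pidfile" && echo "tick sent to $$"
rm -f "$pidfile"
```

In clock.sh the body of the loop would then become something like: send_tick mastermind.pid || echo "monitor not running" >&2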
And ... that's finished :D
