Any distributed system is fated to become a monitoring system of itself at first.
-- unknown
What are a monitoring system and a distributed system?
One is a system that detects faults; the other is a fault-tolerant system that aims at being decentralized (which is pretty much a design aesthetic).
A fault-tolerant system, having by nature to deal with faults, sometimes proactively, needs to be able to monitor them. Hence, a distributed system always begins its life as a trunk of monitoring onto which you hook watchdogs that help the system adapt.
Cloudflare's anti-DoS infrastructure is basically one of these systems: both monitoring and distributed fault tolerance.
I strongly suspect Netflix's infrastructure was designed around a monitoring system.
Anyway, our task here is to design the foundation of a distributed system in which probes and consumers can live without caring about each other.
The probe
The probe reads the CPU load. We write one for a Linux host and one for a FreeBSD host. Here is the Linux one:

    $ cat probe.sh
    #!/usr/bin/env bash
    # /proc/loadavg starts with the 1, 5 and 15 minute load averages;
    # load_5, load_10 and load_15 are only the labels used in this post.
    for (( i=0 ; i<3 ; i++ )); do
        echo $( hostname ):load_$(( 5 * ( i + 1) )):$( cat /proc/loadavg | cut -d " " -f $(( i + 1 )) ):GAUGE
    done
    $ ./probe.sh
    badazz:load_5:0.97:GAUGE
    badazz:load_10:1.17:GAUGE
    badazz:load_15:1.11:GAUGE

The probe has the following «:» delimited format:
- the id of the probe
- the name of the data source
- its value
- its intended data source type for RRD use
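The four fields can be pulled apart in one step with the shell's own field splitting instead of one cut per field. A minimal sketch, assuming no field ever contains a «:» itself; parse_probe_line is a hypothetical helper name, not part of the scripts in this post:

```shell
#!/usr/bin/env bash
# parse_probe_line is a hypothetical helper (not from the original scripts).
# It splits one probe line on ':' and prints the four fields with labels.
# Assumption: no field contains a ':' itself.
parse_probe_line() {
    local id ds value dst
    IFS=: read -r id ds value dst <<< "$1"
    printf 'id=%s ds=%s value=%s type=%s\n' "$id" "$ds" "$value" "$dst"
}

parse_probe_line "badazz:load_5:0.97:GAUGE"
```

Setting IFS only for the read keeps the rest of the script on the default word splitting.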
The FreeBSD probe

    [user@freebsd-14 ~]$ cat probe.sh
    #!/usr/bin/env bash
    for (( i=0 ; i<3;i++)); do
        echo $( hostname ):load_$(( 5 * ( i + 1) )):$( sysctl vm.loadavg | cut -d " " -f $(( i + 3 )) ):GAUGE
    done
    [user@freebsd-14 ~]$ ./probe.sh
    freebsd-14.0_RELEASE:load_5:0.79:GAUGE
    freebsd-14.0_RELEASE:load_10:0.81:GAUGE
    freebsd-14.0_RELEASE:load_15:0.83:GAUGE
The monitor
The monitor takes the output of the probe and consumes it. Here it is a simple dumper of its input into its destinations (CSV files and RRD):

    #!/usr/bin/env bash
    # publish our PID so that a clock can signal us
    echo "$$" > mastermind.pid
    # on SIGUSR1, write something on stdout
    trap "echo ping" SIGUSR1

    # create one RRD per source and data source, with a 1 second step
    function create_rrd() {
        local SOURCE=$1
        local DS=$2
        local DST=$3
        rrdtool create "$SOURCE.$DS.rrd" \
            --step 1 \
            DS:x:$DST:1:0:99 \
            RRA:AVERAGE:0.5:1:864000 \
            RRA:AVERAGE:0.5:60:129600 \
            RRA:AVERAGE:0.5:3600:13392 \
            RRA:AVERAGE:0.5:86400:3660
    }

    while read p; do
        SOURCE="$( echo $p | cut -d':' -f1 )"
        DS="$( echo $p | cut -d':' -f2 )"
        VALUE="$( echo $p | cut -d':' -f3 )"
        DST="$( echo $p | cut -d':' -f4 )"
        OFB="${SOURCE}.${DS}"
        [ -e "$OFB.rrd" ] || create_rrd $SOURCE $DS $DST
        NOW="$( date +'%s' )"
        # rrd insert
        rrdtool update "$OFB.rrd" $NOW:${VALUE}
        # csv insert
        echo $NOW $VALUE >> $OFB.csv
    done
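The RRA lines are easier to read once converted into retention times: with --step 1, an archive declared as RRA:AVERAGE:0.5:steps:rows keeps rows points, each one covering steps seconds. A small sketch of that arithmetic; rra_days is a hypothetical helper name, it only does integer math and never calls rrdtool:

```shell
#!/usr/bin/env bash
# rra_days is a hypothetical helper: retention, in whole days, of an archive
# holding <rows> points that each cover <steps> base steps of <step> seconds.
rra_days() {
    local step=$1 steps=$2 rows=$3
    echo $(( step * steps * rows / 86400 ))
}

# the four steps:rows pairs used by create_rrd in monitor.sh
for rra in 1:864000 60:129600 3600:13392 86400:3660; do
    IFS=: read -r steps rows <<< "$rra"
    echo "one point per ${steps}s, kept for $(rra_days 1 "$steps" "$rows") days"
done
```

So the per-second data is kept for 10 days, the per-minute averages for 90 days, the hourly ones for 558 days and the daily ones for 3660 days.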
The local monitoring solution
You can run a local measurement with:

    while [ 1 ]; do ./probe.sh | ./monitor.sh ; sleep .9; done

We sleep just under one second because my round robin archive (RRA) is set for measurements down to the second.
Distributing the measure
What if the probes were using UDP4 broadcasting and the monitor were just ... lurking?
There is a recipe for exactly this in the socat documentation:
Example 2: Broadcast client and servers

Broadcast server:

    socat UDP4-RECVFROM:6666,broadcast,fork EXEC:hostname

This command receives packets addressed to a local broadcast address and forks a child process for each. The child processes may each send one or more reply packets back to the particular sender. Run this command on a number of hosts, and they will all respond in parallel.

Broadcast client:

    socat STDIO UDP4-DATAGRAM:192.168.10.255:6666,broadcast,range=192.168.10.0/24

This process transfers data from stdin to the broadcast address, and transfers packets received from the local network to stdout.

It's almost ready: on the computers where you want to deploy a probe, just do:

    socat UDP4-RECVFROM:6666,broadcast,fork EXEC:"./probe.sh"

To deploy a monitor, do:

    socat EXEC:"./monitor.sh" UDP4-DATAGRAM:192.168.1.255:6666,broadcast,range=192.168.1.0/24

Now every time the monitor writes something on stdout, all the probes on the broadcast address will spill their data.
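The broadcast address given to socat is simply the host's address with every host bit set to 1. A sketch of deriving it from an IPv4 address and a prefix length in pure bash; broadcast_of is a hypothetical helper name and 192.168.1.42 is just an example host:

```shell
#!/usr/bin/env bash
# broadcast_of is a hypothetical helper: print the IPv4 broadcast address
# of the network that <ip> belongs to, given its prefix length <bits>.
broadcast_of() {
    local ip=$1 bits=$2 a b c d
    IFS=. read -r a b c d <<< "$ip"
    # pack the four octets into one integer
    local addr=$(( (a << 24) | (b << 16) | (c << 8) | d ))
    # host mask: the low (32 - bits) bits set to 1
    local hostmask=$(( (1 << (32 - bits)) - 1 ))
    local bc=$(( addr | hostmask ))
    echo "$(( (bc >> 24) & 255 )).$(( (bc >> 16) & 255 )).$(( (bc >> 8) & 255 )).$(( bc & 255 ))"
}

broadcast_of 192.168.1.42 24
```

On a /24 this gives the .255 address used in the monitor command.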
Oops, the monitor does NOT write anything ... Ah? Really? Unless you send it a SIGUSR1 with kill (trap "echo ping" SIGUSR1) ... But wait ... we can actually launch as many monitors as we have available IPs on the network ... But if ALL the monitors talk, we are going to spam ourselves ...
Well, while we wait for a clean adaptive solution that monitors potential faults (a net split, for instance) to decide when to trigger the clock signal, we will stick with a dumb one:
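The trick relies on how bash handles traps: a process that registered a handler for SIGUSR1 runs it when the signal arrives, and whatever the handler echoes goes to stdout, which is what socat sends out. A self-contained sketch of that mechanism alone, assuming bash 4 or later for BASHPID; demo_ping is a hypothetical name and no socat is involved:

```shell
#!/usr/bin/env bash
# demo_ping is a hypothetical helper: inside a subshell, install the same
# trap as monitor.sh, then send SIGUSR1 to that very subshell.
demo_ping() {
    (
        trap 'echo ping' SIGUSR1
        kill -s SIGUSR1 "$BASHPID"
        : # the handler runs once the kill command above has finished
    )
}

demo_ping
```

In the real setup the kill comes from another process, which only needs to know the PID stored in mastermind.pid.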
    $ cat clock.sh
    #!/usr/bin/env bash
    while [ 1 ]; do
        sleep 1
        MASTERID=$( cat mastermind.pid )
        kill -s SIGUSR1 $MASTERID
    done
    # clock.sh

And ... that's finished :D