Socat, netcat, nc, tcpserver and open source « moulé sous les aisselles » (hand-moulded artisanal software).

In my last post, I explored Laurent Bercot's premise that, to make a pipe-based server, a script only needs to know how to talk on stdin/stdout and tcpserver does the rest : we don't need no systemd.

I experimented with tcpserver, and it worked fine.

I also « stofled » (went my way through stack overflow's answers) the topic and re-discovered nc, netcat and socat as possible alternatives to tcpserver.

Well, first of all, these tools are all great. But one conquered my heart : socat.

First, all honest persons should state their bias : I have been a linux user since 1993 (slackware, debian) and, since systemd, I have got some freeBSD, openBSD and devuan boxes. I long for systems that are idiot friendly. A special kind of idiot : the kind that is not afraid to read the documentation and to go to the software's web page to get the upstream documentation.

For family reasons, my main battle-station is a 12-year-old core i3 running linux mint, debian so to say ; hence the software rarely conforms to upstream vanilla, since debian's political commissars have weird ideas about how to package software.

But, before this long digression about a certain idea of free software goes any further, let's restate the core of the problem :

Making a shell script act as a server by reading from stdin, writing to stdout, and having some magic turn the whole shebang into a multi-connection server.

An echo server would probably look like :
while true; do
    read -r a || break
    echo "$a"
    export -p # maybe the server can see our IP address in the environment variables set ?
done
And if you connect with telnet to the port this server listens on, everything you write is echoed back.

Let's review the « There Is More Than One Way To Do It » candidates (Perl's (in)famous TIMTOWTDI versus python's « one best way » state of mind).
 tcpserver 127.0.0.1 1234 ./echo.sh 
https://cr.yp.to/ucspi-tcp/tcpserver.html . A tool smelling of BSD's spirit of doing one thing and only one thing well.
Netcat's way. Well. There are 3 netcats !
  • the root of all nc, « hobbit »'s netcat, which is not available anymore ;
  • the openBSD fork (also called nc), providing more protocols ;
  • the NMAP fork (also called ncat), providing the -e option that I need.
As a result you can only do the same as tcpserver with NMAP's netcat (aka ncat) :
ncat -nlvp 1234 -e echo.sh 
Where l means listen, p port, n stands for numeric dotted-quad IP address, v for verbose, and e for execve. Trust me, explainshell.com does an awesome job at explaining command line arguments in a readable way, better than I would.

And then, there is socat's way :
socat        TCP4-LISTEN:1234,bind=127.0.0.1,fork,reuseaddr  EXEC:"./echo.sh" 
All of these servers are accessed by doing :
  • telnet localhost 1234
  • nc localhost 1234
  • /usr/local/bin/socat READLINE TCP:localhost:1234
  • and, as in the previous post, you can also roll your own client with tcpclient to get history


So far, all these tools DO what I want, and they all transmit the IP/PORT of the client through environment variables with a similar convention, (TOOL)_PEERADDR, (TOOL)_PEERPORT, where TOOL can be SOCAT, NCAT, TCPSERVER (check yourself, the exact spellings vary).
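Which means the echo script can greet its client ; for instance with socat's spelling (ncat and tcpserver use their own variable names, hence the « check yourself ») :
# inside echo.sh, when run under socat
echo "hello ${SOCAT_PEERADDR:-?}:${SOCAT_PEERPORT:-?}"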

So, all these tools are equal, benchmarking over, choosing the right tool is a question of taste and religion. Over.

Or is it ? ... my religion


I may love concision, but concision sometimes does not help memory. Socat rises above all the others by being user friendly, but not idiot friendly.

Socat has an awesome PATTERN :
 socat [opt] SOURCE_FAMILY:args,opts DEST_FAMILY:args,opts
This makes me hate this tool for being overly verbose, but love it for saving my precious muscle memory. For instance, READLINE is a source/dest family that wraps stdin/stdout to provide history and line editing out of the box, which is AWESOME when you test a server repeatedly.

Since debian is still so special about not providing upstream packages in their vanilla flavour, when you try the READLINE address on a debian based distro you will get this cryptic error :
$socat    READLINE   TCP:localhost:1234
2024/05/16 10:14:10 socat[5947] E unknown device/address "READLINE"
Requiring you to visit the upstream provider of socat and enjoy a free software « moulé sous les aisselles à l'ancienne » (hand-moulded the old way).

I mean : no git, no reactive website, nothing fancy ; first the changelog, a tarball with checksums you can download, and a man page with examples.

Ah ! It reminds me so much of my early days on linux. And there is a golden nugget hidden in the not-that-obvious link to the repos :
curated examples commented by the author (you also have them in the man page).

I think it even beats openBSD's amish style by being feature rich and consistent.

So, because I WANTED my socat with READLINE, I had to compile it, and it was a delight : a ./configure for portability, and even though linux may not be the primary target I had very few warnings, and few of them scary.

Oh, and there is a test suite in bash (that I read, of course), and it ... was nice to check the compiled software against the expectations of the author. I found a bug in the interaction between READLINE and openSSL.

And you also see a hidden nugget : a broker + fan-in/fan-out network pattern example in shell that makes me question my usage of ZMQ (I basically use ZMQ only for this).

At this level, I think my will for benchmarking left the technical path to become much more of a fandom.

I mean, it even has CHROOTING out of the box, pf integration, (very) basic ingress/egress control (CDB), SSL tunneling, cork-screwing (very specific options to get through ill-configured firewalls with UDP), STUN-like features (another way of piercing firewalls), all explained in a concise but funny way.
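For instance, chrooting the handler is just address options away (a sketch from memory of the man page, so double-check the option spellings ; chroot/su require root, and the script must exist inside the chroot) :
sudo socat TCP4-LISTEN:1234,fork,reuseaddr \
     EXEC:"/echo.sh",chroot=/var/jail,su=nobody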

It's like christmas before christmas, with nice gifts falling from the sky and mnemonics strengthening my sysadmin usage of sockets (nodelay, lingering, address reuse and other options that are named the same as everywhere else).

socat is my new favourite tool because it has a learning curve that is totally worth it. It fits my brain topology nicely ; maybe not yours.

/me <3 gerhard at dest-unreach.org

Is systemd bloated ? A pubsub server in 200 lines of bash, talking on stdin/stdout, that does not require systemd

Two days ago I was reading this piece about a « new vendor lock-in » made in Lennart Poettering (systemd), and it gave me the will to see how true this assertion of Laurent Bercot was.
The new vendor lock-in is about an old technique used in inetd.
It took me one day to decide what to code, and I began today using the ucspi tool suite.

To be honest, I am a little bit of a fanboy of D. J. Bernstein. He is from the school of keep-it-simple-stupid and ROCK HARD idiot proof. And ... I am an idiot of my kind : I just want to do server logic by talking to stdin/stdout, and I am pretty glad to see there exists a correct tool to connect stdin/stdout to a TCP socket.

I decided to add insult to injury for the « I need the latest shibazam* » team and to code in bash.

* shibazam being any of « memory safe », « cryptographically proven », « portable » ...

Doing a pubsub server in bash



A publish/subscribe server is a very simple concept : you have channels on which you write and concurrent readers who can read. It was first used for stock exchange tickers.

A channel can be bijected to a file, and as long as you handle concurrent writes with an exclusive lock, nothing wrong can happen.

A TCP server is basically a REQ/REP pattern : while true, you listen to what the user says, you answer, and you come back to listening.

The perfect job for a DISPATCH TABLE such as the one I use to bashize my makefiles ; hence, code I could basically steal from myself.
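To make this concrete before the annexe, here is a minimal sketch of the idea (my illustration, not the actual 200-line server ; assumes util-linux flock and one file per channel) :
#!/usr/bin/env bash
# one file per channel, exclusive lock on writes, tail -f for readers
CHANDIR=${CHANDIR:-/tmp/pubsub}
mkdir -p "$CHANDIR"
while read -r cmd chan msg; do
    case "$cmd" in                        # the dispatch table
        PUB) { flock -x 9                 # serialize concurrent writers
               printf '%s\n' "$msg" >&9
             } 9>>"$CHANDIR/$chan" ;;
        SUB) tail -f "$CHANDIR/$chan" &   # stream the channel to this client
             ;;
        *)   echo "usage : PUB|SUB <channel> [message]" ;;
    esac
done
Run it under tcpserver or socat and each client gets its own copy of this loop ; the lock is the only shared state.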

I don't know what to say, because it actually worked as a skeleton (saying what it would do but actually doing nothing) under 2 hours flat from the beginning of the coding, so I dare say using bash was neat and fun.

Sometimes, I have the feeling we over-engineer our solutions.

Securing the server ?



DJB gives some nice input out of the box with the possibility to add ingress and egress network rules in a portable way (CDB). And, if you are (rightly) paranoid and want to deploy it on the internet, I strongly advise using the beauty of TCP/IP tunneling : stunnel will add both private key authentication and ciphering, without you having to roll your own cryptography, by tunnelling your clear text protocol inside an SSL-ciphered IP socket.
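For the record, a sketch of what that looks like (hypothetical file names and ports ; check stunnel's documentation for the exact directives) :
# wrap the clear-text server (127.0.0.1:1234) in TLS on port 4433
cat > pubsub-stunnel.conf <<'EOF'
[pubsub]
accept  = 0.0.0.0:4433
connect = 127.0.0.1:1234
cert    = /etc/stunnel/server.pem
key     = /etc/stunnel/server.key
EOF
stunnel pubsub-stunnel.conf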

I really have the feeling modern days are about over-complexifying designs to make money and justify our bullshit jobs. I know, from having tried, that this kind of solution doesn't make it to the production line because « it looks too simple and doesn't give enough warranty of being secure ».

Annexe

The code, with a tad of explanation on how to use it in the "?)" case, which took me 6 hours to code, SLOWLY. VERY SLOWLY.

Let's percolate ! Numerical simulation with a hand from the real world, in 200 lines of code.

I am sure you never wondered how many routers you can savagely unplug from the internet before packets can no longer be routed. Or what the optimal grind size is (the size you grind coffee to) to extract the maximum aroma in a french press without blowing up the pot (literally and figuratively).

This class of problems is called percolation.

It is a domain where exact mathematical solutions are hard to compute and where running simulations helps. We are going to look at a little simulation of my own making and touch, along the way, on : Laue, lattice Boltzmann, Monte Carlo, and Galton.

In the beginning was a simple problem : this impression that coding is disconnected from reality. So I felt like turning the pinnacle of computing self-indulgence into something useful : the game of life.

I have a python module to play with (gof) and a few gists, among them a game of life on a hexagonal lattice.

Why a hexagonal lattice and not a square one ?



The advantage of a hexagonal lattice is that it is geometrically more regular and less distorted than the square lattice used for the game of life. According to Laue (a mathematician famous in crystallography for having studied the impact of lattice symmetries on causality), the more symmetry groups you have, the better.

A square lattice is L2, L4. A hexagonal lattice has symmetries of order 2 (mirror), 3 (third of a turn) and 6. The more symmetries you have, the less you « diverge » from the real world by introducing moiré.



The state machine



For our simulation we will introduce a simple state machine whose pseudo-behaviour is :
For every x from right to left:
    for every y from top to bottom:
        Am I empty ?
            Is one of the cells above me (2 choices) full ?
                if yes, randomly take the content of one of the cells above and put it in mine
It is even simpler than the game of life.

Once we have that, we use python-tk and display pseudo-particles that we inject at the top, and we watch how they fall.

If the pseudo-physics is right, if we plant virtual nails at the outlet of a flux of particles and collect them « at the bottom » in « bins » (literally the physical boxes of Galton's experiment, before the word became a statistics term for histograms), THEN I should get a nice gaussian taking shape.

This is covered in this video :


Result 1 : bolshevik physics and Monte Carlo



I hope you will forgive me for planting my virtual nails as badly as the real ones, but let's look at THE first result I got, for comparison.


The verdict is beyond appeal : I am a CRYPTO-BOLSHEVIK. I code abstractions whose gaussians lean to the left ! Woe is me, the DGSI will come knocking if I don't become a republican centrist.

Before fixing it, let's understand the geometry of the canvas : x grows from left to right, y grows from top to bottom.

My simulation is supposed to be biased top/bottom only, but by scanning sequentially from right to left, I favour falls to the left.

In physics simulation this problem is well known in the labs ; it is the raison d'être of the so-called Monte Carlo method : you randomize to break orderings that do not exist.

Fixed physics

Given that python's random.randrange does absolutely not conform to the range API, let's recode it sanely :
from random import shuffle

def randrange(x):
    # like range(x), but yields the indices in a random order
    c = list(range(x))
    shuffle(c)
    return c
And let's replace the for i in range(x) with for i in randrange(x), a shuffle that removes the left-right anisotropy (which does not exist in the real world), and let's run the simulation again.
Hey, in « good enough » methodology, we could stop here. OK, the gaussian is a bit ... too phallic for my taste ; that's my Jul-in-Shape side, doing core work at boxing training to remain an eternal adonis.





LET'S PERCOLATE



At last, we can turn to coffee grinding and the internet.

Some physics problems are a REAL PAIN to compute, therefore you want to
  1. be able to spare yourself the computation
  2. find simple ways to check the computations
And that's where you simulate. This imperfect simulation will let us start to PERCOLATE.

For a fluid, percolating means passing through a random network of obstacles. Like, for instance, water through coffee grounds in a french press.

If your beans are milled too fine (ground ? grinded ? whatever), the thing blows up in your face. If the grounds are too coarse, the water passes through without extracting any coffee. The art of determining the right grain size and the right water speed/pressure is thus the art of percolating. That is why a coffee machine is called a percolator.

But it also touches the design of the internet.

Imagine a hexagonal network of routers passing packets randomly from right to left, but deterministically along the shortest path from top to bottom : this simulation would simulate a special kind of internet.

Nevertheless, it lets you build an intuition for questions like : what percentage of degraded nodes can the network survive ? What are the early warning signs of too strong a degradation (latency, throughput, packet loss) ? What is the optimal topology to keep working in the most degraded mode possible ?

This science of percolation pushed the internet to be structured in a way heretical to the X/Ponts/Télécom engineers : loosely partitioned (very connected) and partially stochastic at its borders (read the BGP RFC).

The internet is designed, both topologically and in its choice of border routing algorithms, to resist network degradation. It is claimed it was designed to survive a global nuclear attack.

Here is what a simulation « run » looks like :

What remains to be done (if I were not lazy)



If you are not lazy, you make histograms over thousands of runs. The more simulations, the more the uncertainty shrinks (and fast) (Student's t-test).

For each perturbation value you record : the induced latency, the lost packets, the total or partial decrease of throughput, and you make histograms.

Normally, you will see an abrupt rupture value appear where, statistically, the network goes from almost 100% probability of passing traffic to almost 0%, which is called the percolation threshold. And around this value, weighting the metrics, you will see a beautiful sigma-shaped curve (a ... sigmoid) called a transition curve. Does it look like a transistor's response curve ? Well yes : a transistor is an application of the abrupt percolation transition, except that instead of macroscopic grains of matter it involves electrons, and the amplification range is the transition range.
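In formula form (my own parametrization, just to fix the idea, not a fit to any data) : if p is the fraction of degraded nodes and p_c the percolation threshold, the transition curve looks like

P_pass(p) ≈ 1 / (1 + exp( k (p - p_c) ))

with k growing with the size of the lattice : the bigger the network, the sharper the cliff.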

There you go. Sometimes, when you do too much computing, you get fed up with never approaching the real world, and a little physics simulation is relaxing.

Annexe : final code and screenshots

200 lines of code ! That's peanuts.

Why make a makefile? Can reproducible builds ever be achieved again?

When I code, I like to pride myself on writing code that works everywhere. If unit testing is a good way to go, the foundation of testing that a tool chain works the same everywhere is checking that the artefacts are the same.

I recently wrote some code for a sociogram and added, at the top of my « make » (a bash script that does all the assembling in a supposedly deterministic way), a claim that with the same input you would get the same output.

Let's see if I lied, by taking a snapshot of the last frame of the videos built on 2 different computers :




If the graph is the same, the topology is not. For something about building geometrical shapes, this may raise some questions.

First : why can't computer academics build graphviz, while applied physicists do ?

Graphviz breaks an unspoken standard of academic computer programming : it is based on a probabilistic simulation with a lot of randomness in it. Basically, you first lay out the nodes randomly, then you randomly swap two nodes, count the number of crossing edges, and keep the swap if fewer edges cross than before. The kind of dumb algorithm that works but INVOLVES RANDOMNESS.

Well, computer scientists DO reproducible builds ! Don't they ? And RANDOMNESS is non reproducible, isn't it ?

Hard scientists (which excludes the literature lovers called computer scientists) use PRNGs when working, so can we reproduce our builds ?




Seems better :D But not perfect, once we set the SEED of the random generator in the graphviz output (the only obvious source of randomness in the code I control).
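For what it's worth, here is the kind of pinning involved (my sketch ; dot proper is mostly deterministic, the random ones are the force-directed engines, which read the seed from the start graph attribute) :
# same seed, same layout, for the engines that honour -Gstart
sfdp -Gstart=42 -Tpng graph.dot > graph.png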

Let's wonder what kind of non-deterministic element I introduced into my build.

multi-processing/threading is non deterministic


Enough experience in coding will make you smell where danger is. But I love danger, so I knowingly wanted to be a GOOD hard scientist and make my CPU BURN at 100%. It is a thing I learned to embrace in physics labs. So, I parallelized my code (see the bash snippet below on how to do it) :
NB_CORE=$(getconf _NPROCESSORS_ONLN)
function pwait() {
    local nb_core=${1:-$NB_CORE}
    # throttle : wait while as many jobs as cores are already running
    while [ "$(jobs -p | wc -l)" -ge "$nb_core" ]; do
        sleep 1
    done
}
for i in *.dot; do
    $DOT -Tjpg "$i" > "$(basename "$i" .dot).jpg" &
    ppids+=( "$!" )
    echo -n .
    pwait
done
wait "${ppids[@]}" # and wait for the stragglers before going on
It does the same thing as python multiprocessing's async apply : fork processes in the background and wait for all the processes to finish before going on. And it is clear, by exploration of both videos, that I miss frames ; well, ffmpeg (involved in the process) is quite explicit :

[swscaler @ 0x5580a8337840] [swscaler @ 0x5580a8375fc0] deprecated pixel format used, make sure you did set range correctly
[swscaler @ 0x5580a8337840] [swscaler @ 0x5580a8b9dd00] deprecated pixel format used, make sure you did set range correctly
So I put in a sleep 10, and it helped ... but not enough, because, well, modern computing on linux is chaotic :
  • signal delivery is unreliable : signals may be missed or can be « acausal », a core can say « I finished » while the kernel is still waiting for a file to be written ;
  • versions of software may differ even on 2 debian based distros (one is debian testing (hence outdated), the other is linuxmint current (less outdated)).
So ... I did everything in my power to get the same result given the same parameters, by pinning all I controlled, including the seeds of the PRNGs, and suffice to say that, at the end of the day, EVEN FOR A SIMPLE SCRIPT, deterministic « same results » are impossible.



It is fucking neither the same topology NOR the same chronology. For a dynamic picture of a topology it kind of sucks terribly and should not be considered « good enough ».

It is good enough for a « side project », but not for a production ready project.

Famous last words


When I code in python, I have a freeBSD station, because BSDs are more boring than linuxes. However I can't play my favourite games there with wine (Need for Speed, Quake III, Urban Terror, various pinballs), hence when I code for fun it's on my linux computers (check my script to build a tailored freebsd qemu image on linux). But I dare say modern coding is not the « boring » activity I grew up with, hence my manic way of trying as hard as I can to ensure « maximum reproducibility in my power ». My power is bounded by rationality : I give up when it comes to the enshittification of linux distributions that clearly went down the path of not caring. After all, you just need to spin up a docker alpine to make stuff reproducible, don't you ?

I'm a single man army that wants to code, not maintain a kubernetes cluster just for the sake of creating a « snowflake » of reproducibility that negates the purpose of coding. Back in 1982 I could hand my basic source code on a floppy and be sure that another C64 would yield the same result given the same input. Nowadays, this is wishful thinking.

The state of modern computing (especially on linux) is BAD. I even think of converting to COBOL on OS/360 to get back a minimum of sanity regarding my expectations.

I strongly advise devs to put more effort into their assembling code (docker build, makefile, your own tool) than into their « muscle code » (unit tests included), because it's not in the code you control that you will find the most surprising factor of loss of control, but in the lack of trust you should reasonably have in your hardware, your OSes and your distributions. And it's a shift of normality, a new NORMAL, not an absolute NORMAL state.



Annexe

Here is the command line involved


SHOW=1 EDGE_SCALE=1 MIN_MAIL=20 \
  PERS_EDGE_SCALE=.2 BY_DAYS=50 SHOW=1 \
  THRESHOLD_ILOT=1 DOT="dot"  ./make very_clean movie
the « make » involved. And the main script. Both together DO NOT EVEN MAKE 500 lines of code. It is small by any standard, even by the standards of a « BASIC » program of the 1980s. I AM PISSED ! (╯°Д°)╯彡┻━┻

Addendum : and that's how I discovered that the too-silent fan had stopped working, silently, without a thing in the logs or the sensors complaining about heat on my laptop. Even hardware is becoming shitty. Addendum 2 : after some tinkering I discovered I nearly burnt my CPU thanks to debian removing fan control/cpufreqd ... AGAIN (╯°Д°)╯彡┻━┻ Now I must underclock this computer to make it compute anything. Fuck debian !

Finishing a scientific programming project with the « la rache » method (à l'arrache, as in : slapdash)

La RACHE, a global software engineering solution, is a set of techniques, methods and best practices describing - from specification to maintenance - how to produce software under roughly satisfactory and approximately optimal conditions.

-- (h)IL(l)AR(E)
THE « la rache » method is not aimed at coders from 42 or from the universities ; they have seen it all, they know how to manage complexity and they already know everything. It is rather aimed at applied-science lab coders whose curricula include laplacians, stats and how to do science, but leave out Giordano Bruno's monastic ramblings on the Monads and « liquid programming ».

We are going to introduce a sub-school of la rache : manual programming, its variants and its results :
  • programming à la pogne, with the full grip (aka the boar's rache) ;
  • programming with the nails (aka the cat's rache) ;
  • programming with the fingertips (called digital, by reference to the french word for numeric), all gentleness.
First of all, coders in science, even with a PhD, always land in labs where their career is at the mercy of a mandarin whose social and financial credit depends on the success of the precarized PhD student : failure is not an option if the student wants a career, and programming according to the best practices of an ITIL/ISO 9001:2037/ISO 14000/ISO 27000 certified private company is not an option either : merely filling in the blank forms of each of these methods would consume the whole budget allocated to coding. But let's be honest : the same results are expected nonetheless.

What result is expected ? That it spits out, reproducibly, and within a humanly reasonable time (less than infinity), a result that wows the punters.

We are going to conclude my series on sociograms with an example of final code and its usage (see the annexes for the code and the assembly file).

lesson #0 : if your code does not spit out a marvellous artefact, you have not coded. The more it blinks and shines, the better.

Here is the artefact we want : a video that sums up 100 000 mails of the 2017 campaign as a movie, with squares sometimes sky blue (the anonymous) and sometimes coloured (the persons of interest), and which renders the mail flows with colours.
The la rache developer always starts with the muscle : the code that does everything. But it is often the least important part of the production : production is muscular in essence, but it is accidentally always chaotic, owing to the hidden love developers have for making your life miserable.

lesson #1 : REAL computer scientists are rarely your friends, for they prefer to keep the knowledge to themselves and make a fortune out of it.

See Annexe I : « the muscle ». You will therefore have to assemble consumables into your final render, passing under the caudine forks of the absurd, alone, with no assistance. For that, it is advised to master as much of the production chain as possible.

Don't do like the REAL computer scientists : reduce your need for external tools to the minimum ; better a gruik-coded but robust solution that does the job than a perfect external one you understand little.

lesson #2 : academic defensive programming is worth less than « la rache » coding in paranoid mode.

Let's start with the output of « the assembler », the magic code hidden at the back of the kitchen that does EVERYTHING your friend with 12 years of higher education tells you not to do. To tame him, it is important to camouflage it under a name that will inspire his respect : « make » (as in makefile). It works like a makefile, except you don't need 12 years of university to write it and modify it, because you made it with something simple that everyone masters (DOS, basic, visual basic, powershell, bash, shell) ...

lesson #3 : camouflage your filthy part behind a nice façade and call it a design pattern.

Never mind if the colours are ugly : you need colours and, at the very top, the command line to hand to your boss so he can do a product demo and say « I made this ; look how well I type command lines ». Pampering the one who exploits you is looking after your own backside.

Lesson : blind your interlocutors with colours so that they stand like deer in the headlights



Here, sorry, but we must get technical with the boar's rache, whose slogan is : coding like boars does not mean coding like pigs. To code like a boar you must be a pragmatist and violate all the taboos of programming like a priest at catechism. (A sketch of such a script header follows the list.)

  1. a unix shell kitchen-assembly script ALWAYS starts with set -e : it would be dumb to wait for hours on an assembly that failed right at the beginning ;
  2. between storing 264MB of archives for later and recalling the command line you just used for your cooking, knowing your boss has the big beefy computer and you the small one, the second is the good idea ;
  3. ENVIRONMENT VARIABLES CANNOT BE SEEN : DO NOT HESITATE TO PUT THEM IN UPPER CASE EVERYWHERE AND TO SHOW EXPLICITLY WHERE THEY ARE USED ;
  4. your enemy is the other, not you : make code that says what it does and does what it says, it saves writing comments ; keep the ability to hide all that so your know-how does not get stolen by prying eyes, and say « I removed the debug messages, it looks more pro » ;
  5. know where you lose time ;
  6. your code may have a part that can potentially loop forever : do not hesitate to give small signs of life (I am looking at you, graphviz) ;
  7. you are like Jon Snow, you know nothing : parametrize as many USEFUL things as possible ;
  8. if you can listen to music or open firefox while you assemble, you still have exploitable resources left.
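As promised, here is what such a header can look like (a hypothetical sketch of mine, not the actual file from the annexes) :
#!/usr/bin/env bash
# la rache assembly script header : die early, shout the knobs, show life signs
set -e                                    # rule 1 : abort at the first failed step

# rules 3 and 7 : UPPERCASE knobs, overridable from the environment
NB_CORE=${NB_CORE:-$(getconf _NPROCESSORS_ONLN)}
DOT=${DOT:-dot}
SHOW=${SHOW:-0}

# rule 2 : recall the exact command line instead of archiving everything
echo "# rerun me with : NB_CORE=$NB_CORE DOT=$DOT SHOW=$SHOW $0 $*" >&2

for stage in still_images movie; do
    start=$SECONDS
    echo -n "$stage "                     # rule 6 : signs of life
    # ... the stage's actual work goes here ...
    echo "($(( SECONDS - start ))s)"      # rule 5 : know where you lose time
done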
You will notice in the output that the muscle of the project takes less than one second out of 20 minutes : the fun part of the code is rarely the longest ; you spend your time fighting the absurd.

For example, dot (graphviz's default command) runs a physics simulation to place the nodes in the graph ; but if you put in too many nodes (overlap), the program will spin its wheels without warning you. Whereas if you use sfdp, it will zoom out to resolve the conflicts efficiently, and in doing so will make convert (which loads the image's pixmap into memory) perform an allocation too costly for your 4GB of RAM if you do not limit the number of parallel instances, and will burn a maximum of CPU. The world is made of optima sniffed from the air du temps, requiring as much flexibility as auto-fellatio. I repeat : PARAMETRIZE, you will thank me.

Having reproducible assemblies is your Ariadne's thread in a chaotic and delirious world : repeat after me, coding like boars does not mean coding like pigs. I cannot hear you : say it louder ! There : I think that with this assembly file you understand what programming with the nails is. To remove the muck from the floor when I clean the kitchen, I could go fetch a sharpened knife to scrape off the encrusted stains, but no. I claim to use my nails, when in fact I go fetch a solvent : oil for grease (yes, really), soap for oil, alcohol for inks ... but it remains a nice image : sometimes the nails are slower at doing the job, but they are the tool you have at hand, and you scratch where it tickles.

Let's move on to programming à la pogne (the 2nd file you need once the mail database is already loaded). A few tips :
  1. ENVIRONMENT VARIABLES IN UPPER CASE AT THE TOP OF THE FILE : easier to find ;
  2. NO MAGIC NUMBERS ;
  3. it is easier to manage one file that includes its data than n files of code and data ;
  4. if you can run the typography tools to format the code like in a company, it means you have time for philosophical debates : STOP ! You have a life, and a savate boxing training not to miss ;
  5. do everything by hand if you can (like the generation of the graphviz files) ;
  6. a simple try/catch is sometimes worth more than over-elaborate code ;
  7. a measurement script MUST ALWAYS INCLUDE BAND-PASS FILTER LOGIC : that is often where the intelligence hides.
Speaking of band-pass : here is what the same data looks like when you keep only the VIPs. You stop showing a forest and, as Shannon demands, you show the most relevant information (H = k ln(S)) ; above all, it assembles faster.

Now, let's move on to programming with the fingertips, the finest part of coding, the one that lets you shine in companies and earn a maximum of dough.

The most important thing is not know-how but making it known. You see, a house painter is paid €10/hr ; but an artist who does the same thing, if he writes a dissertation around his gesture, THEN suddenly it is all worth 1000 times more.

What really matters is not coding, but knowing how to talk about it.

There you go. I hope you too are now convinced by the « la rache » method and will contact me privately so that I train you in it (for a maximum of dough).

Article guaranteed 1000% written with the « la rache » method

Nail programming : reinvented make in bash : you show your code, but if you don't tell how to build, you are a scammer

I have always had a beef with so-called open source projects that ship their code but not their tooling for building it.

You have the code, but nothing is said about building the initial database, generating the doc, or how to fetch/build the assets.
And that's exactly what I did in my previous blog posts about building a chronosociogram. But me, I have an excuse : it's not ethical open source, it's wtf-public-license open source : MY FUN FIRST, and no unpaid work hours for the benefit of parasites.

However, you can be as rebel as you want, you still need to build the code. But I do not build tools ; I let them appear from « scratch programming ».

digression : the 50 shades of « hand programming »

I belong to the sect of the « la rache » methodology. « Programmer à la rache » (close to rush) is the main methodology applied in France. Translated in the US as « le » rush. Rush programming is an art that deserves a french article to distinguish it from the more common gruik programming (aka programming with your feet). In this category there are the engineers who love « assisted generative programming », helped by IDEs, « frameworks » or now the almighty Artificial Intelligence, and the rebels who prefer « hand crafted code ».

I live in Toulouse, where rain comes in one flavor : raining or not. I come from the Vexin, where rain comes as pluviote, crachin, drache, bruine, averse, giboulées ... Having shades of perception for a simple thing is useful when you don't want to be drenched by the drache or overwhelmed by lack of defensive measures ; and sometimes, even in « la rache », we need to do things the right way so as not to drown in the complexity of our own coding.

First is POGNE programming

It's how prototyping begins. With a clear view and a firm grasp of fewer than 100 sloc in a single NICE monolith.

Second is NAIL programming

Well, on the path to delivery you encounter difficulties you had not expected ... And instead of fetching your knife from the garage to scratch the spot off the ground, you use your nails. Your code base gets disgusting, but YOU HAVE TOOLS to ease your job. But it hurts a little.

Last is mimine programming

Mimine is the gentle hand that comes to the rescue and makes you spiral into building simple tooling to make your code manageable again.
As a craftsman, you clear your workshop, sharpen your tools, clear the air and make your place ready to begin another day of POGNE programming.

There is l'art (the academic way of building code) and la manière : your own personal sensitivity of « it works for me »®©, packed in a few lines of code, so that you have time to go to the savate boxing training (breaking knees the french, elegant way). La « Manière » is a pillar of « la rache » : don't go re-climb the learning curve of another tool you hate when you can brutally hack one up.

Why do you always need a makefile ?

Lol, if your code does one thing simply (like listing) I'm not sure you need it. :D

I needed a makefile because I introduced a 2-pass build to ease the life of my corei3 with 4GB of RAM, which ... as a factor of serendipity, helps building nice tools.

Let me explain : using a dot file as a template is less expensive than REBUILDING the same dot file on every iteration ... But since f-strings/regexps are not my favourites in python, I used perl to build the dot template generated from python, to reinject it into the python script. An « n » stage build requires assets (artefacts), and sometimes your history forgot how you made them, and so did you. So you need to create a makefile.
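A hedged guess at the shape of that perl trick (my reconstruction, not the actual one-liners from the annexes) : double dot's literal braces so the laid-out file survives as a python f-string, then punch holes where the per-frame variables go :
# 1. brace-escape the generated dot file ({ becomes {{, } becomes }})
# 2. reopen f-string holes on the sentinel label
perl -pe 's/([{}])/$1$1/g' out/rec.0001.dot |
  perl -pe 's/Sociogramme de [^"]*/Sociogramme de {date} à {date_end}/' > template.dot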

Good makefiles tell you what they are doing, not only so you understand when it fails, but also to help you build an intuition of where time is spent. It is as much an informative tool as a structuring tool.

What you want from a trivial makefile

Helping you the day after, when, following a long focus on code, your brain has discarded every piece of info that is touchy to remember : like how your excited brain worked. It includes :
  • in which order to do stuff
  • dependencies
  • parameters and API
  • which stages can be resumed after an error in a stage
  • how to skip a lot of useless stages when you lack time
  • where you spend the most time when something changes
  • bash completion, of course !

When you are an adept of « la rache », global states/variables are embraced like poison : at low doses they help, too much of them kills. In LA RACHE you ALWAYS use global variables to avoid MAGIC NUMBERS, BUT they MUST BE AT THE TOP of the code, every time.

In LA RACHE we embrace universal key-value passing from perl to python to bash to ffmpeg : for this we use a secret tool, undocumented environment variables, so that we can later choose what to expose.
It is a simple dispatch table, called kind of recursively if you consider stack-based recursion noble enough, and it checks for artefact presence to call the dependencies if required.
I am pretty proud of a « racherie » - the 3 perl one-liners there - that actually transforms the dot file into an f-string to use as a template in python. As a source of wisdom I dare say : free text written by an algorithm is always easier to parse with a regexp :D
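To fix ideas, here is a skeleton of that dispatch table (a hypothetical reduction of mine, not the actual « make » from the annexes ; sociogram.py is an assumed producer) :
#!/usr/bin/env bash
# poor man's make : a dispatch table that recurses on its dependencies
# and skips a stage when its artefact is already there
set -e
DOT=${DOT:-dot}

make() {
    case "${1:-all}" in
        all)    make movie ;;
        dots)   [ -e out/rec.0001.dot ] || ./sociogram.py ;;  # hypothetical producer
        images) make dots
                for i in out/*.dot; do
                    [ -e "${i%.dot}.png" ] || "$DOT" -Tpng "$i" > "${i%.dot}.png"
                done ;;
        movie)  make images
                ffmpeg -y -framerate 1 -i out/rec.%04d.png movie.mp4 ;;
        clean)  rm -f out/*.png movie.mp4 ;;
        *)      echo "usage : $0 {all|dots|images|movie|clean}" >&2; exit 1 ;;
    esac
}
for target in "${@:-all}"; do make "$target"; done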

Why does the making have more code than the main code ?



For the same reason that, in real life, you often spend more time cleaning your environment before (and normally after) a task. The kind of stuff a manager never wants you to put in your timesheets because « it is non productive ». Well, here at home, I have no managers to tell me how to spend my time wisely. So ... I do whatever pleases me.

What is the interest of this ?

Practically, the debug messages give me a clear intuition of where my computer spends its time. An intuition I refined by printing dots where interesting. Thanks to this, for instance, I noticed that dot and convert were 100% mono-core, without which I would not have parallelized part of the code according to my number of cores.
Also, having an intuition of where time goes can help you see counter-intuitive results. Like with sfdp. Not « super fils de p... », but the graphviz engine intended for speed. Except the algorithm scales the drawing out to avoid overlaps efficiently, making the converting waaaay slower. So, all in all, sfdp + convert is slower and more prone to OOM than dot + convert. That is something you cannot guess if you mute all feedback. Call it visual profiling :D

Also, putting code inside a makefile is a major pain. When making an almost cascading model of tasks involving few dependencies, coding the logic myself (2 hours) was totally worth the fun of the process.

Having no code review demanding that I begin my prototype with clever CLI libs is also nice. Environment variables have regained traction for argument passing (docker finally has a good side) : I can pass variables without the craziness of getopt syntax to handle, which makes everything easy, including the back and forth between the make and the code.

Bash completion is fairly easy as a one shot :
complete -W "all still_images muscle backbone movie clean very_clean" ./make
Of course I could give make a real completion function, but it is not worth the pain.
Makefiles, and whatever flavour of snake oil you are drinking, are above all about focus : separating the code that is complex, and that you want to focus on, from the rest. Without this, juggling with your short-term memory while building and debugging becomes hell. Of course, locally I use a version control software à la git, because regressions bite us all hard when they do. Especially the idiot who forgets to use a VCS.

Bref, I don't say makefiles are shitty. I say : I don't compile C here ; my global needs are the same (reproducible builds) but my path is different, since I use bash/perl/python CLIs that are better invoked from a shell.

Being home is really where coding is nice and relaxing. I don't trust pro coders who refuse to touch a keyboard outside work : how can they know their actual beliefs about style, dos and don'ts, languages and frameworks are on par with reality ?

Building a Better chrono-sociogram : a radar picture of social interactions

Humans are like a flock of birds of the same feather that have a natural tendency to change their feathers a lot.

-- Coco Chanel while french kissing a gestapo officer ~1943 in Paris
What is wrong with sociograms ? They don't catch the way people change their loyalty. I think an election is especially interesting when it was the most massive turncoat event ever seen in french politics.

So, to fill this gap in the tooling, there is a tool : the chrono-sociogram, aka the social movie.

First was the persistent map

By feature, dot will correctly guess the galaxies of connections that have existed over a long period of time. Hence we begin with a persistent map, which is equivalent to all links set to 1 over a long period.

Why ? Because, the same way radar maps are useless if cluttered with too much useless information, you will not care about WHO talked to WHOM, but much more about where the storms happen, and when.





Unreadable ? It is here in SVG.

Well, since my corei3 (sorry for being poor and unable to afford an M1 mac and a GPU for mining bitcoin) takes 10 hours to generate 243 dot files, imagine how much time it takes to generate the following : a movie made by keeping the persistent image and only changing the color/size of the edges by templating THIS exact picture :D


What is it useful for ?


The first time I worked on this was in 1997 at ENS, as an intern who knew linux, in a complex systems lab, with a grant on influencing citizens for the « greater good » of the Common Agricultural Policy.

The EU had failed miserably at convincing peasants (rednecks) that their interest was in accepting EU money and being enslaved to debt for the rest of their lives. With this grant, the network of peasantry and its exchanges was mapped, a sociogram was built and the « highly connected nodes » were identified. Nowadays we call them « influencers ».

The EU just flipped the opinion of a few selected ones (ignoring the background of why THEY were selected) and it cascaded into convincing the farmers.

This video shows, like storms, both the exchanges in social networks and when/how people turn coats. It is ACTUALLY the very heart of social media.

Hence, when I claimed some originality for this work, I point-blank lied. However, try to find people sharing their tooling and methodology, and you will discover that your taxes fund public research labs whose work you can't read and whose tools you can't use on the topic.

Graph problems are NP ; exploring graphs requires a lot OF CPU. How do meta/FB handle the network analysis I just did, with a fraction of the CPU ?

Well, they don't just observe the network, they also model it, by rewarding people who ARE ALREADY INFLUENCERS in real life thanks to « side channels ». E.g. : publishing videos at the current quality standards of video and sound processing requires WAY MORE MONEY and better equipment than writing a blog post, which can be done with a 12-year-old computer. There is a selection bias in just using the mainstream tools.

How can we avoid being influenced ?


Don't : learning about foreign cultures is nice. I love manhwas, mangas, folklore and foreign languages as much as any influencer does. Unlike them, though, I have no interest in fame.

What traps you in the web of influence is the reinforcement of your own biases. Trusting hyped talkers is a bias.

Have you tried talking with friends about a topic on which you disagree, stating that you want to explore influence ? Life is polemic. You are better prepared for it by training like a boxer : every day, in a pleasant, non-aggressive mindset, like when going to boxing training. (Yes, I love boxing.)

Ivy-league-style education prepares for this with the « concours d'éloquence » (eloquence contests) ; it is also, if you notice, the core of how slang culture treats the others.

Now that I have finished a project I had wanted to do for a long time, I think my sociogram time is over, and I may do a last post on the making of (and thus introduce my bash sort-of-makefile, and how, with perl one-liners, I templatized a dot file to create a usable template for python, because this trick is severely disgusting (I mean FUN)).

Real life stuff


All the spokespersons, even though they were not the ones spied on, did have a very active involvement while being paid with taxpayers' money. It may be legal ; it does not look moral when the same persons advocate for better spending of the tax money. When you receive public money to support your expenditures, as either a public servant or a representative of the people, you don't work for private interests on the side. But well, we neither call the french state l'Assiette au beurre (the plate of butter) for nothing, nor Paris Paname without a reason : it's this lingering kind of corruption scandal that doesn't exactly motivate you to vote.

Next episode

The poor man's makefile in bash.

Annexe

Building a CHRONO SOCIOGRAM from real world data

My movies are but the capture of fleeting moments of life that take place on a 1:1 map of human relationships. The map is a useful simplification of the overwhelming complexity of a territory in which I strive to explore human emotions.

-- Ingmar Bergman
In the previous episodes (the last 2 posts) I explored HOW TO represent links between social animals, to try to infer who knows who and how these animals organize themselves as a herd : it is a sociogram.

It's a map, and a bad one at that. You see, as in H2G2 (the Hitchhiker's Guide to the Galaxy), it's not the map that makes the story, but the construction work that changes the map. To « prove » my point I will study real life data for which (ethical/legal disclaimer) the victim of the data breach and THE CREAM OF ALL JOURNALISTS admitted there was nothing to see. The hacked team said it was a honeypot (which will prove quite true) and that nothing of interest was in there (which is quite false). Hence, if you want scandals or REVELATIONS, sorry to disappoint you : sociograms don't unfold secret conspiracies.

I must also say I am a sysadmin first. My soul burns when exposed to private email contents, so ... I love sociograms, which live in a grey area : you look at the envelope without reading the mail body.

In my database structure you saw I DO populate my database with text_mail, and you guessed right : I have been using real data since day 1. Coding from abstractions is nice, but nothing beats real data to actually stress your abstractions and torture your tools ; especially mails, which are, per RFC 2822, the most annoying archive format ever to parse (and which I love : I began as a sysadmin specialized in mail).

Without further ado, I will show you the result and explain how it was built thanks to postgres FTS (word counting does not count as reading email bodies, does it ?).

Legal disclaimer

It is a sociogram of Persons Of Interest from the macronleaks. I will define a POI as : either someone hacked, or (my own definition) any public servant who is subject to article 14 :
Article 14 : Tous les citoyens ont le droit de constater, par eux-mêmes (...) la nécessité de la contribution publique, (...), d'en suivre l'emploi. (All citizens have the right to ascertain, by themselves, the necessity of the public contribution and to follow its use.)
The payroll of public servants being public contribution, and public servants being subject to the « devoir de réserve » (duty of discretion, hence NOT INFLUENCING elections), I am entitled to check whether public money was indeed used with my constitutional rights at heart.

Aristotle said elective systems are the opposite of democracy : democracy is the ruling of the people by ANY of the people ; elections introduce a selection bias among candidates, by choosing who has access to the tribune, hence making the vote a biased choice within a biased subset of the people. I will add that Aristotle was an idiot.
Democracy is not a STATE function, it is a PATH function. Democracy is a path, a journey made of polemics and improvements. Elective systems represent a f**d up state of the situation, but that doesn't prevent democracy from existing as long as, as per « a good huguenot's belief », everyone can discuss as an equal. I BELIEVE IN REFORMS, but I don't go to the temple (protestants bore me to death) (cf. the Frankfurt school//Habermas).

So let's see whether my constitutional right to choose from the largest democratically possible subset of the population was respected, or whether public servants entrusted with enforcing the spirit of the Constitution meddled with my right to choose from a larger subset of candidates than in the previous elections. Is the delta « positive » in terms of democracy ?

TL;DR : the reform I would want after this study would be for the administration to make ALL the envelopes of public servants' mail, and the private sector's, public. I would also like to have all public communication in a database I can query, without elected members/public servants having the right to DELETE from it.

Hence, here is a real life sociogram of public servants//elected members during Macron's campaign.

Real life feedback

Unlike developers, I know when a tool sucks. This MAP SUCKS.

A good map is easily readable and does not clutter the screen with TOO MUCH INFORMATION. I find it nice and readable because I have been trained to read the functions of a chip by looking at the silicon under a microscope, and compared to a SoC this is peanuts ; it's not even as complex as a modulator/demodulator (MODEM). However, I am also human. It's really a mess.

You may criticize people with nice words ; I don't coat my words with sugar : it is plainly unusable because of H = k ln(S). There are too many choices that are not relevant.


This is a sociogram of mail collected over a 2-year span, hence it makes things that were fleeting moments persist over time. It has good sides : you can guess the galaxies of influence by looking at the constellations of contacts near the persons that were hacked : the ministry of finances, the ministry of public health, the ministries of royal importance (justice/interior), the municipality of the capital (a state within the state), and the military industry.

It has the same value as persistence of vision : a blurry map with a feeling of what is there, but no details.

Hence, we are gonna introduce another tool that complements the sociogram : ... a CHRONO-SOCIOGRAM. MOVING PICTURES of a SOCIOGRAM. V1 is ugly, but good enough to give an idea of what comes next : a freaking chrono-sociogram where the nodes stand still and where I only highlight the edges temporarily.

I will not sugar-coat my words : it is still shitty as hell. BUT, at any given time, a normal brain can often see what matters : who talked to whom, when ? For shit diggers with access to the content of the emails, this would be the tool that eases the inquiry job of isolating the informational window in which to scrutinize contents. BUT I WILL NOT ! I am a PROUD SYSADMIN, muttafuckas !

But how did you find the POIs in the first version, by the way ? Cheater ?

You can predict everything except the future ; hence the reason why, in applied physics, we make predictions on past data, so that we can foresee the results.

-- Niels Bohr
I don't cheat, I used science. I put in all the members of the government as POIs, plus its spokespersons, knowing it was already public that they had worked on the campaign. But I soon discovered that knowing the future was not necessary.

Let me introduce : POSTGRESQL FULL TEXT SEARCH's FEATURES, no trying to add words to a corpus based on inference.

Once you google for « regex postgres array », you have a neat operator ~!@# that is very useful, and you can fire the word count and select emails by frequency from a starting point.

Knowing that the president's real email is ALWAYS of the form e2m@, it gives :
ml=# select * from ts_stat($$select to_tsvector('french', text_plain) from mail where  '^e2m.*' ~!@# any("from") order by date ASC$$) where word ~* '.*@.*' order by nentry desc;
                  word                   │ ndoc │ nentry 
─────────────────────────────────────────┼──────┼────────
 e2m@en-marche.fr                        │   18 │    115
 barbara.frugier@gmail.com               │   11 │    104
 clement.beaune@gmail.com                │   11 │    102
 ismael.emelien@en-marche.fr             │   19 │     36
 sylvain.fort@en-marche.fr               │    8 │     16
 benjamin.griveaux@en-marche.fr          │    6 │     12
 julien.denormandie@en-marche.fr         │    6 │     12
 brigitte.macron@en-marche.fr            │    6 │     12
 sibeth.ndiaye@en-marche.fr              │    6 │     12
...
So, when the president is in the from field, these are the emails most likely to have been quoted.

Word counting does not count as reading the emails' body content, for real, especially with a filter on "@".

And, step by step, you build the sociogram by taking the most-quoted emails of the first ring, then the second ring, and then you stop, because it is already a lot ...

Nothing magical, just stupid out-of-the-box database functions used in a perverted way.

Since these leaks are old news, you can then correlate with what you know :

  • barbara was indeed the PR ;
  • clement became a spokesperson of the President or a minister ;
  • ...
  • sylvain, the ghost writer who wrote « revolution » for the president.
Being as sneaky as google/Meta does not require a brain, just building tools.

At this point, the code (embedded data not counted) for making « a movie » is still 100 sloc, and I code AT BEST 10 sloc a day. I'm a messy sloth.

Here is the code to build the movie :
pushd out
for i in *.dot; do dot -Tpng "$i" > "$(basename "$i" .dot).png"; done
for i in rec.????.png; do convert "$i" -resize 2000x1400! "re.$i"; done
rm -f output2.mp4
ffmpeg -framerate 1 -i re.rec.%04d.png -c:v libx264 -crf 30 -pix_fmt yuv420p output2.mp4 && firefox output2.mp4
popd
And here is the code for building the sociograms :
#!/usr/bin/env python3

import os
import psycopg2
from datetime import date, datetime, timedelta
from archery import mdict

def int_env_default(var, default):
    return int(os.getenv(var) or default)

MIN_MAIL = int_env_default("MIN_MAIL",6, )
MAX_MAIL = int_env_default("MAX_MAIL",100)
WL_MIN = int_env_default("WL_MIN", 3)
CUT_SIZE = int_env_default("CUT_SIZE", 20)
DATE = os.getenv("DATE") or "2016-01-01"
END_DATE = "2017-05-01"
BY_DAYS = int_env_default("BY_DAYS",4) # 13x28 = 364 ~ 365.5
NB=1

end_date = date.fromisoformat(END_DATE)
date = date.fromisoformat(DATE)
td = timedelta(days=BY_DAYS/2)
td2 = timedelta(days=BY_DAYS/2)



                
def is_ilot(node: str, edge_dict: dict) -> bool:
    """ilot == has only 1 link back and forth either in (from,) or (,to)"""
    count=0
    for edge in edge_dict.keys():
        if node == edge[1] or node == edge[0]:
            count+=1
        if count > 2:
            return False
    return True

patt_to_col =  dict({
    "e2m":"red",
    "emmanuel.macron":"red", 
    "emmanuelmacron":"red",
    "alexis.kohler" : "midnightBlue",
    "gabriel.attal" : "orange",
    "sejourne.stephane" : "grey15",
    "stephane.sejourne" : "grey15",
    "olivia.gregoire" : "darkOrange",
    "veranolivier":"green",
    "julien.denormandie" : "lightBlue",
    "sibeth.ndiaye" : "orange",
    "barbara.frugier" : "green",
    "cedric.o" : "purple",
    "gouv.fr" : "maroon",
    "snecma.fr" : "purple",
    "safran-group.fr" : "purple",
    "benjamin.griveaux":"blue",
    "laurent.bigorgne" : "darkBlue",
    "jean.pisani-ferry": "yellow",
    "luc.pisani-ferry": "yellow",
    "ismael.emelie" : "orange",
    #"jesusetgabriel.com" : "crimson",
    "gregoire.potton" : "lightGreen",
    "eric.dumas":"salmon",
    "alexandre.benalla" : "darkGreen",
    "pierre.person" : "darkBlue",
    "pierrperson" : "darkBlue",
    "quentin.lafay":"grey10",
    "fm.alaintourret" : "purple",
    "@paris.fr" : "orange",
    "langanne" :"pink",
 })
    

wl = lambda s : any(map(str.startswith, patt_to_col.keys() ,s))
def in_wl(mail : str):
    for l in patt_to_col:
        if mail.startswith(l) or mail.endswith(l):
            return l

def wl(pair: tuple):
    for l in patt_to_col:
        if in_wl(pair[0]) and in_wl(pair[1]):
            return patt_to_col[in_wl(pair[0])]
#assert wl(("jesusetgabriel.com", "jesusetgabriel.com")) == "crimson"

is_vip = lambda t:all(map(in_wl, t))
fn=0
while date < end_date:
    fn+=1
    direct=mdict()
    final = mdict()
    conn = psycopg2.connect("dbname=ml host=192.168.1.32 port=5432 user=jul  sslmode='require' ")
        
    with conn.cursor() as sql:
        sql.execute(f"""SELECT "to", "from" from mail where DATE BETWEEN '{date}' AND '{date+td}';""")
        while t := sql.fetchone():
            for fr in t[0]:
                fr=fr.strip()
                for to in t[1]:
                    to=to.strip()
                    if fr != to and fr and to:
                        direct += mdict({ (fr,to) : 1 })

    date += td2
    tk = list(direct.keys())

    def has_more_than_n_neighbour(email: str, n: int, final: dict):
        count = 0
        for k in final.keys():
            if email in k:
                count += 1
                if count > n:
                    return True
        return False
            

    for k in tk:
        # don't modify a dict you iterate over, hence the copy of keys
        if is_vip(k) and k not in final and k[::-1] not in final:
            final[k] = direct[k]
            final[k] += direct.get(k[::-1], 0)
        else:
            try:
                del direct[k]
            except KeyError:
                pass
            try:
                del direct[k[::-1]]
            except KeyError:
                pass


        
    tk = list(final.keys())

    for e in tk:
        # don't modify a dict you iterate over, hence the copy of keys
        if not has_more_than_n_neighbour(e[0], NB, final) or not has_more_than_n_neighbour(e[1], NB, final):
            try:
                del final[e]
            except KeyError:
                pass
            try:
                del final[e[::-1]]
            except KeyError:
                pass



    color = "".join([  f"""{i[1]} pour {i[0]}{[", ",chr(0x0a),][(n%4)==3]}  """  for n,i in enumerate(patt_to_col.items()) ])
        

    conn.close()
    with open("out/rec.%04d.dot" % fn, "w") as f:
        # full label with the colour legend : built here but not used below,
        # the printed graph keeps a shorter label
        title = f"""label="Sociogramme de {date} à {date + td} extrait des macron leaks orienté gouv.fr, personne d'intérêts (vert), victime du hacking (rouge), et président (bleu)\n
    entre [ {MIN_MAIL}, {MAX_MAIL} ] échangés \n
    plus gros liens au dessus de {CUT_SIZE} mails échangés entre interlocuteurs \n
    couleur par priorités selon les origines \n
    {color}
"""

        print("""
    graph Sociogramme { """ + f"""
    fontname="Comics sans MS"
    outputmode="nodesfirst"
    size=20
    label="Sociogramme de {date} à {date + td}"
    ratio=1.7
    labelloc="c"
    labelloc="t";
    """ + 
    "\n".join(['  "%s" -- "%s" [label=%d color=%s penwidth=%d ];' % (k[0],k[1],v, wl(k) or "black", 1 if v < CUT_SIZE and wl(k) != "red" else 4 ) for k,v in final.items()])
     + "}", file=f)


PS : I'm totally sure I saw, at FSM 2005 or 2004 (20 years ago), exactly this kind of movie made out of python commits during a conference. A lot of kudos to whoever can find the link and say : boooooh, it's not new, impostor ! PPS : next blog post : solving the graphviz variable problem that f*** ffmpeg with an f-string, being inventive in less than 20 sloc :D
I'm totally gonna dump the full stuff and make a template of the f"" with perl :D trololololol

Building a sociogram from mail archives in python and postgres

TL;DR : « I know that, as Jon Snow, I know nothing, and it is not a problem »

Foreword : typing in CS is WRONG



People are different. I am, you are, he/she are, it are.

I don't talk about gender, I talk about mileage, life experiences and shit.

Me, I come from being half thug, half educated (D&D bi-classed) : the local kingpin of forging fake documents for « école buissonnière » (skipping school) with my early 90s computer, cheating with pctools, and the kingpin of a ring of cracked software ranging from amiga to powerPC.

I learned coding not as an educated CS 101 student but as a dyscalculic and dyslexic student in physics, redeemed by his facility with foreign languages, especially if they have a germanic touch (french, english, german).

Foreign languages have taught me the power of composing pre/post positions consistently : dé-ménager is the opposite of em-ménager, the same way moving-out compares to moving-in.

Also, programming in physics is not the same as programming in CS. We don't have budget for programming hours, so our teachers don't care about elegance, high level programming or low memory consumption. They want easy-to-read, easy-to-debug code that works fast. Also, physics has long resolved the typing debate : computer academics are just so high they don't get it.

See : if I have a unit in meters per second and I add kilograms, we want the code to crash.

Strongly typed languages (including ADA or VHDL) don't have « symbolic consistency » but DATA TYPE consistency. I will illustrate it in the code to come with the is_ilot function, and with a small sketch right below.
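A minimal sketch of what I mean (mine, nothing official ; the Quantity class and the unit strings are made up for the demo) : both operands are plain floats, so any type checker is happy, yet physics wants the second addition to crash :

from dataclasses import dataclass

@dataclass
class Quantity:
    value: float
    unit: str  # e.g. "m/s", "kg"

    def __add__(self, other: "Quantity") -> "Quantity":
        # symbolic consistency : same unit or crash
        if self.unit != other.unit:
            raise TypeError(f"cannot add {self.unit} and {other.unit}")
        return Quantity(self.value + other.value, self.unit)

speed = Quantity(3.0, "m/s")
mass = Quantity(2.0, "kg")
print(speed + Quantity(1.0, "m/s"))  # fine : Quantity(value=4.0, unit='m/s')
print(speed + mass)                  # crashes, as physics wants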

Mindset and tools : why I did not use overwhelmingly beautiful postgres features but wasted my CPU and memory doing the sociogram in python (and why not in Perl)

My problem is simple : imagine I have, to have fun with, a sample database containing real life emails, and I want to make a graph out of it. Like an Xleaks from wikileaks ; and I want, like Meta or google, to infer who the interesting interacting persons are out of the thousands of interlocutors. So let's dump my SIMPLE sql schema :

DROP TABLE IF EXISTS public.mail;

CREATE TABLE IF NOT EXISTS public.mail
(
    filename text COLLATE pg_catalog."default" NOT NULL,
    "from" text[] COLLATE pg_catalog."default" NOT NULL,
    "to" text[] COLLATE pg_catalog."default" NOT NULL,
    subject text COLLATE pg_catalog."default",
    attachments json,
    text_plain text COLLATE pg_catalog."default",
    date date,
    message_id text COLLATE pg_catalog."default",
    thread_id text COLLATE pg_catalog."default",
    CONSTRAINT mail_pkey PRIMARY KEY (filename)
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.mail
    OWNER to jul;


A sociogram is a graph made of relationships (edges) between persons (nodes), with their strength expressed as the number of occurrences of exchanges between persons.

I basically want (sketched in python right below) :
for each mail
   for the cartesian product of to and from
       add an edge in graph
I use the language of building microchips I learned in applied physics of micro-electronics here. Not computer vocabulary I cannot visualize in my f***d up brain.
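A minimal sketch of that pseudocode with only the stdlib (itertools.product for the cartesian product, Counter as the edge bag ; the rows variable is a made-up stand-in for whatever your mail source yields) :

from collections import Counter
from itertools import product

# hypothetical rows : list of (to_list, from_list) pairs, shaped like the schema above
rows = [(["alice", "bob"], ["carol"]), (["alice"], ["bob"])]

edges = Counter()
for tos, froms in rows:
    # cartesian product of senders and recipients
    for fr, to in product(froms, tos):
        if fr != to:
            edges[(fr, to)] += 1

print(edges)  # Counter({('carol', 'alice'): 1, ('carol', 'bob'): 1, ('bob', 'alice'): 1})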

The problem is « HUMANS ». They chat with a lot of persons, but there are 2 kinds of persons : persons who link other persons, and persons who are just noise, having a 1:1 relationship.

Analysing thousands of nodes visually in a graph is doable, but tiring. So we have to weed out the useless nodes : people who are just leaves in the graph, having only a bijective relationship.

Digression on the choice of tools


I love postgres, but I am dyslexic and don't use it often. My only tools are psql to build requests, and stack overflow to answer my questions involving recursive requests on arrays to build a cross product.

Stack overflow is good at solving one problem. But composition of answers is tough, especially when I use reserved WORDS in postgres as column names. And you are gonna wonder why ?

MY OWN STRONG TYPING : name a thing by its real name, and PUT UNITS in the name if you can.

Also, physics taught me to avoid recursion at all costs because... see point #1 ; stack-based thinking does the same with fewer bugs and no stack/heap overflow (I come from an era where the stack available for call recursion was laughably small on linux, C, Perl). See the small sketch below.
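A small sketch of what I mean by stack-based thinking, on a toy tree walk (the tree is made up for the demo) : the explicit stack does exactly what the call stack would, minus the recursion depth limit :

# toy tree : adjacency dict, made up for the example
tree = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}

def walk(tree, start):
    # explicit stack instead of recursive calls : same traversal, no depth limit
    stack = [start]
    while stack:
        node = stack.pop()
        print(node)
        stack.extend(tree[node])

walk(tree, "root")  # prints root, b, a, c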

So I did my own code in python, wasting memory and CPU on something I am well aware had an elegant, faster solution in postgres ; but between early optimization and early results, it was a fast choice : I went for gruik coding without hesitation. Why did I use postgres at all ? Parsing email costs HOURS. Getting the « to »s and « from »s is time consuming. I totally amortized postgres as a more efficient DATASTORE than the filesystem. No regrets here.

The code : KISS



Building a sociogram weeding out the « ilots » (edges connected to one and only one other edge) is fairly easy once you have coded the is_ilot function. It has been 2 full hours of work for a problem I never faced before : applying academic knowledge of micro-electronics to a social graph.
import psycopg2
from archery import mdict

MIN_MAIL = 20
MAX_MAIL = 400

conn = psycopg2.connect("dbname=ml")

direct=mdict()
final = mdict()
    
with conn.cursor() as sql:
    # cartesian product done the dumbest way possible
    sql.execute("""SELECT "to", "from" from mail""")
    while t := sql.fetchone():
        for fr in t[0]:
            for to in t[1]:
                direct += mdict({ (fr, to): 1 })

                
def is_ilot(node: str, edge_dict: dict) -> bool:
    """ilot == has only 1 link back and forth, either in (from,) or (,to)"""
    count = 0
    for edge in edge_dict.keys():
        if node == edge[1] or node == edge[0]:
            count += 1
        if count > 2:
            return False
    return True
    
tk = list(direct.keys())
for k in tk:
    # weeding out, pass 1 : keep reciprocal pairs within the mail volume bounds
    # don't modify a dict you iterate over, hence the copy of keys
    if k in direct and k[::-1] in direct \
            and MAX_MAIL > direct[k] + direct[k[::-1]] > MIN_MAIL:
        final[k] = direct[k]
        final[k[::-1]] = direct[k[::-1]]
    else:
        try:
            del direct[k]
        except KeyError:
            pass
        try:
            del direct[k[::-1]]
        except KeyError:
            pass


    
tk = list(final.keys())
for e in tk:
    # weeding out, pass 2 : drop edges whose ends are ilots
    # don't modify a dict you iterate over, hence the copy of keys
    if is_ilot(e[0], final) or is_ilot(e[1], final):
        try:
            del final[e]
        except KeyError:
            pass
        try:
            del final[e[::-1]]
        except KeyError:
            pass



conn.close()
# output for graphviz
print("digraph Sociogramme {")
print("\n".join(['"%s"->"%s" [label=%d];' % (k[0],k[1],v) for k,v in final.items()]))
print("}")


You will notice I « typed » the python function. It took me 20% of the time, because I was still thinking in postgres postfix typing (::array) and not in python annotation typing (: list), and because I had... a bug to solve. At first I coded the function with one-letter names, as I usually do until I have a problem. I backtracked it, changed the names, and put the typing annotations in for fun ; but what really helped me was the only docstring in the code, and naming : remembering what my purpose was, and what edges and nodes were. As soon as I named the function and the inputs correctly and wrote the docstring, it was magically done, and I laughed at how typing was missing the point.

a_dict could be a list of keys, a sparse matrix, a mutable mapping, a string. The data structures I use to represent a node or an edge are so varied that typing does not help.

Here is a sample of the output before and after the weeding out with is_ilot :

Before

After

Final word

I want this post to be an advisory to noobs like me : do not care about the academics, and do not let the accidents of programming deter you from going the way you see fit, in a Keep It Simple manner that « works for you »©®

Datastore and database : why it is a good idea to not confuse both.


I strongly advise you to fast-watch this video, since datastores and databases, as well as hierarchical databases, are covered there.

Now let's come back to 2024 and wonder : well, is this still relevant ?

I recently had fun taking the simple use case that made one postgresql contributor famous : Julian Assange's Xleaks.

Should we put mails in the database, or keep the database as an index ?

As a foreword : Python mail parsing is infamously not on par with Perl, from which it ported its libs. We will imagine we use a thin layer on top of it known as mail-parser, and that our mails come from google, outlook, and thunderbird archives.

After all, mails are so important in our modern life that the use case of analysing mail with a database, or a nosql or .... is important.

My favourite « kind » of non-database data-bases are LDAP-like hierarchical tree datatypes. They fit very well with my aggregation lib for dicts in python (archery : a good way to shoot yourself an arrow in the knee). And of course, POSTGRES DOES IT, POSTGRES DOES EVERYTHING BETTER THAN ANYONE. (I intend it as a troll but, in the bottom of my heart, postgresql and sqlite are my 2 favourite databases).

However, mails ARE relational thanks to from/to and thread-id. If thread reconstruction may favour hierarchical databases, your unit of search will be the mail, and you will want many-to-many requests like « find all mails where X and Y are linked ».

One approach for parsing thousands of emails is relying on an SSD and brute-forcing your way through the mails every time. But, lol, I'm poor : I have a hard drive which I fear for its life, with all my precious photos of my family, so I prefer not to. Plus it's long. And that's when you remember surfing the paperspace, and the library analogy.

The database is supposed to be the cabinet in the entrance where you can find the names of all the books that are physically located in the library, and where they are. Ex : a book may be referenced in more than one drawer, like Good Omens from Pratchett filed in fantasy and in SF (yes, Pratchett wrote SF).

Fetching a book is long ; the cabinet (database) helps you find the books fast in the datastore (the shelves of the library). A database is pretty much a tool for indexing slow-retrieval data that you want to find fast.

In 2024, is it still relevant ?

I beg to argue yes, in the famous realm of « it works for me this way »©® (famous saying of linus torvalds).

Hear me out : mail parsing is shit.

Imagine : you have an email. Per RFC, you can have MORE THAN ONE BODY (a mail is an n-relationship between envelope and bodies), and BODIES can be either TXT OR HTML, and more often than not (thanks to the creativity of the artists) they may bear different messages, since SOMETIMES what you see matters ; hence the text body is literal garbage if you want to make sense of the mail.

Mail per RFC 2822 can have attachments that can embed attachments that refer to each other (often in a mixed peek-and-poke of both an array-ish and a tree-ish data structure).

Ho ! Is it perfect for postgresql XML/JSON ? NO ! Recursive descending SQL requests may exist, but joining on a non-deterministic disposition is begging to lose too many brain cells : it is not because you can do it that YOU SHOULD DO IT. Sometimes the most reliable way to access attachments is not parser(mail).attachments.to_json (which is half-baked) but parser(mail).write_attachments, which renders embedded attachments inside the documents in a more reasonably readable way. And sometimes, more often than a python guess, what you want is a heavy MAIL-READING client that embeds decades of wisdom on how to deal with broken norms.
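For the record, a minimal sketch of the two routes with mail-parser (the path is made up, and I quote the attribute names from memory : double-check against the lib's docs) :

import mailparser

# made-up path, for illustration
mail = mailparser.parse_from_file("datastore/2017/some_mail.eml")

# route 1 : attachments as metadata (a list of dicts : filename, content type...)
print(mail.attachments)

# route 2 : let the lib write the decoded attachments to disk,
# often more readable than chasing nested parts yourself
mail.write_attachments("out/attachments")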

Hence, practically, when you deal with mails, what you want to store in the database is :
  • the path to the DATA-STORE (hence the HARD DRIVE)
  • from
  • to (BCC, CC for me are tos too, since I want to draw sociograms)
  • message-id
  • subject
  • thread-id
  • DATE, which is a nightmare since it's a clear hell to PARSE
  • and, because experimenting with JSON is fun, the attachments metadata object/dict (application type, content type, filename) as JSON
  • and, because FTS (full text search that can survive typos) is fun, the FIRST text body (but you will miss it when astute individuals use more than one)
Legal archivers and geeks MIGHT want to add : the chain of mail servers, and all data concerning validation, to try to detect signs of spoofing. Which gives us almost this :
CREATE TABLE IF NOT EXISTS public.mail
(
    filename text COLLATE pg_catalog."default" NOT NULL,
    "from" text[] COLLATE pg_catalog."default" NOT NULL,
    "to" text[] COLLATE pg_catalog."default" NOT NULL,
    subject text COLLATE pg_catalog."default",
    attachments json,
    text_plain text COLLATE pg_catalog."default",
    date date,
    message_id text COLLATE pg_catalog."default",
    thread_id text COLLATE pg_catalog."default",
    CONSTRAINT mail_pkey PRIMARY KEY (filename)
)

TABLESPACE pg_default;
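And, assuming the mail-parser sketch above, loading one parsed mail into that table could look roughly like this (attribute names quoted from memory, thread-id left out, simplifications everywhere) :

import json
import mailparser
import psycopg2

# made-up path, for illustration
path = "datastore/2017/some_mail.eml"
mail = mailparser.parse_from_file(path)

conn = psycopg2.connect("dbname=ml")
with conn, conn.cursor() as cur:
    cur.execute(
        """INSERT INTO mail
           (filename, "from", "to", subject, attachments, text_plain, date, message_id)
           VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
           ON CONFLICT (filename) DO NOTHING;""",
        (
            path,
            [addr for _, addr in mail.from_],  # mail-parser yields (name, addr) pairs
            [addr for _, addr in mail.to],
            mail.subject,
            json.dumps(mail.attachments, default=str),
            mail.text_plain[0] if mail.text_plain else None,
            mail.date,                         # datetime, cast to date by postgres
            mail.message_id,
        ),
    )
conn.close()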
This ALREADY speeds up, without hammering my hard drive, all of the funny use cases that follow. This said, I already have fun using postgres and the shell to make sociograms : first ranking people in to/from relationships above a threshold, and then drawing the sociogram.

Here is a simple request to see the top recipients (it unnests « to », counting how often each address is written to) :

cat <<EOM  | psql ml jul | less
SELECT DISTINCT ON (times_seen, element) element
    ,COUNT(element) OVER (
        PARTITION BY element
        ) AS times_seen
FROM mail
    ,unnest("to") WITH ordinality AS a(element)
ORDER BY times_seen DESC, element DESC;
EOM
Making a histogram of how many « to »s you have per mail :

psql ml -c ' 
WITH len as (SELECT cardinality("to") as cto from mail) 
SELECT 
	width_bucket(cto, 1, 43, 10) as bucket, 
    count(*) as freq 
FROM len 
GROUP BY bucket ORDER BY bucket;
'
# too lazy to put the min/max functions

 bucket | freq  
--------+-------
      0 | 41083
      1 |  9965
      2 |  1096
      3 |   227
      4 |   116
      5 |    88
      6 |    12
      7 |    14
      8 |     4
      9 |     2
     10 |     2
     11 |    26
(12 rows)


Making a dot diagram of who speaks to whom in a one-to-one relationship (strong link) when it is more than 20 times :
( echo "digraph G {" ; 
	PSQL_PAGER="cat" psql ml jul -ta -c "SELECT \"from\"[1] || '->' ||  \"to\"[1]  FROM mail WHERE cardinality(\"to\") = 1 and cardinality(\"from\") = 1; " \
    | sort | uniq -c \
    | perl -ane 's/^\s+(\d+)  (.*)\-\>(.*)/"\2" -> "\3" [label="\1"];/g; print $_ if $1 > 20'  ; echo "}" ) \
    | dot -Txdot | xdot -
And that's already neat. By treating the envelope as a data space orthogonal to the datastore, you get access to a shit-ton of information normally hidden in a mail that is already (too) full of information. I stop writing this post to go back to having fun with my database, while I leave the mails in the cold of the data-store :D

Addendum : I was furious that the SQL for querying reciprocal relationships in arrays was so complex, so (since my dataset is small and I have A LOT OF MEMORY (4Gb)) I came up with a neat, more exact python/SQL/bash/xdot solution that will make any SQL/bash/python purist cringe (lol), but that is a pipe of simple operations anyone can understand.
#!/usr/bin/env bash
MIN_MAIL=40
TF=$( mktemp )
PSQL_PAGER="cat" psql ml jul -ta -c "
SELECT \"to\"[1] || '->' ||  \"from\"[1] 
FROM mail 
WHERE cardinality(\"to\") = 1 and cardinality(\"from\") = 1; " > "$TF"
python -c "from archery import mdict" || python -mpip install archery

python <<EOF | xdot -
import re
from archery import mdict
pattern = re.compile(r"""^ (\S+)\->(\S+)$""")  # expects the leading space of psql aligned output


direct=mdict()
final = mdict()
    
with open("$TF") as f:
    for l in f:
        try:
            (fr, to) = patter.search(l).group(1,2)
            direct += mdict({ (fr,to) : 1 })           
        except:
            pass

tk= list(direct.keys())

for k in tk:
# dont modify a dict you iterate hence copy of keys
    if k in direct and k[::-1] in direct and direct[k]+direct[k[::-1]]>$MIN_MAIL:
        final[k]=direct[k]
        final[k[::-1]]=direct[k[::-1]]
    else:
        try: del(direct[k]) 
        except KeyError: pass

        try: del(direct[k[::-1]]) 
        except KeyError: pass

print("digraph Sociogramme {")
print("\n".join(['"%s"->"%s" [label=%d];' % (k[0],k[1],v) for k,v in final.items()]))
print("}")


EOF



We need a more anti-clerical mindset in Information Technologies

I have recently gone down a rabbit hole into postscript, PDF, SVG, Tcl/Tk (hence modern graphical UIs), unicode, and have worked intensively in web technologies. And I can tell you something : the modern clergy must die !

Talking like an ivy league student (https://mail.python.org/archives/list/python-ideas@python.org/thread/AE2M7KOIQR37K3XSQW7FSV5KO4LMYHWX/) is the freaking norm in open source software, and it ain't no secret that expressing yourself like a « vulgar » street guy gets any of your comments thrown to the trash. I am « deftones » BORED of this, and I think this is becoming a major problem.

Clergymen are the knowledgeable persons who, from their ivory tower, classify, give names to, and structure the knowledge of the others. Charles Marche would say the « bourgeois » appropriate the « praxein » (how to) by transforming it into « doxein » (know how). But what else are the bourgeois famous for ?

Creating private property everywhere they can : on numbers (ISDN, IPv4, IPv6, DOI, OUI, PCI ids, BGP prefixes), on glyphs (emojis, fonts), on sound (music), on fucking frequencies (5G, digital radio), on time (RT GPS)... Anything that physics can define as observable : even freaking colours can be patented !

Us from the « soute à charbon » (coal bunker) have in common with the startup bosses that we began coding with the rise of low-spec computers (C64, Z80, apple II (for the rich), CPC464, HP28/48 (kind of expensive at the time)...). Why aren't we the bosses ? Because someone needs to do the job, and it won't be the bosses, the lawyers, and their ball lickers (academics). Because the Ivy League is the crucible where these ball suckers are molded into this culture that sees THEIR views as universal.

Recently I have been raising to python and firefox my concern that I have NO LINE OF DEFENSE against something that seems irrational to me from the « coal bunker » : I cannot detect or correct when the unicode bidir algorithm is presenting a < as a > :
from unicodedata import normalize
from wsgiref.simple_server import make_server

def simple_app(environ, start_response):
    status = '200 OK'
    headers = [('Content-type', 'text/plain; charset=utf-8')]
    start_response(status, headers)
    # the string below stores a > wrapped in U+202B (RIGHT-TO-LEFT EMBEDDING) :
    # in the rendered RTL run the glyph is mirrored, so you may SEE a <
    # while a > is WRITTEN ; none of the 4 normalization forms removes it
    ret = [("%s < =? %s" % (n, normalize(n, "\u202b>\u202b\n"))).encode("utf-8")
           for n in ("NFD", "NFC", "NFKD", "NFKC")]
    return ret

with make_server('', 8000, simple_app) as httpd:
    print("Serving on port 8000...")
    httpd.serve_forever()



Take a web browser, and admire that < and > both appear, and that no stdlib/filters/normalizations exist for this.
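For want of a stdlib filter, here is my crude homemade line of defense (a sketch, not any official API) : refuse any string smuggling bidi control characters before it reaches a rendering context :

# bidi control characters (LRE, RLE, PDF, LRO, RLO and the isolates), from the unicode charts
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def assert_no_bidi(text: str) -> str:
    """crude defense : crash when a string smuggles bidi overrides"""
    found = [c for c in text if c in BIDI_CONTROLS]
    if found:
        raise ValueError("bidi control chars found : %r" % found)
    return text

assert_no_bidi("a < b")          # fine
assert_no_bidi("\u202b>\u202b")  # raises ValueError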

What is a contract ? (Beware : I have huguenot origins, the holy geist of capitalism runs through my blood)

If I show you a contract on screen, what should I consider to be the stuff on which agreement has been made ? What is shown, or what is written ?

Since we automatize contract processing, what is shown does not matter : what is written, of course, prevails over what is shown. And all the firefox/python coders said : « the holy unicode norm prevails ».

I don't want to pick a fight with unicode, since only persons who have actually experienced first hand the madness of editing bidir mixed strings will grok whatever I say.

Editing bidir strings is like following Alice in Wonderland's white rabbit on LSD. The clergyman will tell you « YOU FAILED BY NOT FOLLOWING THE MYSTERIOUS ONE BEST WAY©® » (which basically boils down to : use the latest iOS on the latest Mac Mx).

You see how my double punctuation marks all have a space in front ?

It is because my typography reflexes are inherited from the french language, and I find it clearer. Typography is ONE of the many « ONE BEST WAYs », and PEP8 enforces another one that contradicts mine, making me consider « properly formatted » python code un-fucking-readable.

I have a master in physics and have stumbled upon laplacians, differentials and pretty complex equations ; I have learned through hard burns that maths are easier to read (FOR ME) when I deviate from the norm by adding MORE SPACES THAN REQUIRED.

You might rightly argue that what works for me IS NOT UNIVERSAL.

In greek there is a word for universal : catholic. It is a concept that has been heralded by the chauvinistic, genocidal embodiment of all that is wrong in our civilisation : « philosophs ».
Philosophs in Athens were the rich sons of aristocrats and hierarchs (holy persons) who thought democracy was a threat to the unique Nature of « enlightened » pieces of shit they were. They believed in the absolute power of knowledge over everything, and thought knowing maths was enough to have an opinion on every topic on earth : diet, pseudo-science, ruling a country, deciding who lives and who dies, which books to keep in libraries, how to properly format writings, and which ones to throw away. They threw away all the books of the « sophists », who had the cardinal sin of opposing the « catholic » « one best way » in favour of « πάντων χρημάτων ἄνθρωπον μέτρον εἶναι » (man (he/she in its diversity) is the measure of all things).

Hence, as a sophist, I don't claim I am right ; I claim I have the right to make a plea « as an equal human on the public place » for my case, without being disregarded for « not following the one best way ». Don't mistake me : I strictly don't want my views to be embraced by all as universal, nor to disregard the others'. I want to be talked to as an equal on the agora.

All the answers I got were : we aren't gonna do anything, because that's exactly WHAT THE NORM (the freaking humongous mammothian unicode norm, riddled with more undefined behaviour than a C++ standard) SAID.

You know what : as a coder that HAS to write the web pages where one click ties you to an immediate, non-reversible CONTRACT, I surely do advise you to discuss with me.

This attitude towards discussion, always invoking « the norm » first and refusing discussion with the heterodox, is indistinguishable from a clergy.
Fuck the norms, fuck the existing de-facto clergy of educated CS graduates, fuck the MPAA, the RIAA, unicode, and what the industry « wants » : let's talk about the freaking consequences for the « anthropos » (the animal called men and women) that must deal with the chaos of this less-than-satisfyingly baked cathedral of norms.

And unicode is just the tip of the iceberg ; PS, PDF, Tk and SVG deserve their OWN topics, because, to me, the internet as a way to print knowledge and help the « vulgus pecum » (the mass of persons not speaking ivy league english) share information is being barred from exactly this by the norms themselves.

PDF and unicode are the opposite of the Gutenberg press : neither a revolution simplifying the essence of written language (french happily lost a lot of letters and ligatures in the Gutenberg Revolution), nor an ease of printing ; they are WALLS of useless complexity that make me regret the existence of daisy-wheel printers.
Modern web technology is a dyschronia where the diffusion of information is ruled by people with the Catholic-church mindset, opposing the ideals of Gutenberg. Modern web technology and computing are an effective dysfunction of what education should strive for : emancipation of the masses, by letting people exchange information the easiest way they can and build their own tooling.