Finishing a scientific programming project with the « la rache » method (à l'arrache, i.e. in a rush)

La RACHE, a global software engineering solution, is a set of techniques, methods and best practices describing, from specifications to maintenance, how to produce software under roughly satisfactory and approximately optimal conditions.

-- (h)IL(l)AR(E)

THE « la rache » method is not aimed at coders from 42 or from universities: they have seen it all, they know how to manage complexity, and they already know everything. It is rather aimed at lab coders in applied sciences, whose curricula cover Laplacians, stats and how to do science, but skip Giordano Bruno's monastic ramblings on Monads and « liquid programming ».

We are going to introduce a sub-school of la rache: manual programming, its variants and its results:

programming à la pogne, with the fist (known as the boar's rache);
programming with the nails (known as the cat's rache);
programming with finesse (called digital, as in fingers, a nod to the French word numérique), all in gentleness.

For a start, science coders, even with a doctorate, always land in labs where their career is at the mercy of a mandarin whose social and financial credit depends on the success of the precarious PhD student. Failure is not an option if the student wants a career, and programming according to the best practices of an ITIL/ISO 9001:2037/ISO 14000/ISO 27000 certified private company is not an option either: merely churning out the blank forms of each of these methods eats the whole budget allotted to coding. But let's be honest, the same results are still expected.

What result is expected? That it spits out a result that dazzles the onlooker, reproducibly, within a humanly reasonable time (less than infinity).

Let's conclude my series on sociograms with a final code example and its usage (see the annexes for the code and the build file).

lesson #0: if your code does not spit out a marvelous artefact, you have not coded. The more it blinks and shines, the better.

Here is the artefact we want: a video that summarizes 100,000 emails from the 2017 campaign as a movie, with squares that are sometimes sky blue (the anonymous) and sometimes colored (the persons of interest), and that represents the mail flows with colors.

The la rache developer always starts with the muscle: the code that does everything. Yet it is often the least important part of production. Production is muscular in essence, but it is accidentally always chaotic, owing to developers' hidden love of making your life miserable.

lesson #1: REAL computer scientists are rarely your friends, because they prefer to keep knowledge to themselves and get filthy rich

See Annex I: « the muscle ». You will therefore have to assemble consumables into your deliverable, passing under the Caudine Forks of the absurd (« fourches codines », pun intended), alone and without assistance. For that, it is advisable to master as much of the production chain as possible.

Don't do as REAL computer scientists do: reduce your need for external tools to a minimum. A gruik-coded but robust solution that does the job is better than a perfect external solution you barely understand.

lesson #2: academic defensive programming is no match for « la rache » coding in paranoid mode.

Let's start with the output of « the assembler », the magic code hidden at the back of the kitchen that does EVERYTHING your friend with a BAC+12 (twelve years of higher education) tells you not to do. To appease him, it is important to camouflage it under a name that will inspire his respect: « make » (like a makefile). It works like a makefile, except that you don't need a BAC+12 to write and modify it, because you wrote it in something simple that everyone masters (DOS, basic, visual basic, powershell, bash, shell) ...

lesson #3: camouflage your dirty part behind a pretty facade and call it a design pattern.

It doesn't matter if the colors are ugly, you need colors, and at the very top the command line to hand to your boss so that he can run a product demo and say « I made this, look how well I type on the command line ». Pampering the one who exploits you is looking after your own backside.

Lesson: blind your audience with colors so that they are like deer in the headlights

Now, sorry, but we have to get technical with what is called the boar's rache, whose slogan is: just because we code like boars doesn't mean we have to code like pigs. To code like a boar you have to be pragmatic and violate every taboo of programming like a priest at catechism class.

a kitchen-style build script in unix shell ALWAYS starts with set -e: it would be stupid to wait around for hours on a build that failed at the start;
rather than storing 264 MB of archive for later, echoing the command line you just used for your cooking is a good idea, given that your boss has the big beefy computer and you have the small one;
ENVIRONMENT VARIABLES ARE INVISIBLE: DON'T HESITATE TO PUT THEM IN CAPITALS EVERYWHERE AND TO SHOW EXPLICITLY WHERE THEY ARE USED;
your enemy is the other person, not you: write code that says what it does and does what it says, which saves writing comments; keep the option of hiding all this so that prying eyes can't steal your know-how, and say « I'm removing the debug messages, it looks more professional »;
know where you lose time;
your code may have a part that can potentially loop forever: don't hesitate to give small signs of life (I'm looking at you, graphviz);
you are like Jon Snow, you know nothing: parametrize as many USEFUL things as possible;
if you can listen to music or open firefox while building, you still have exploitable resources.
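The first tips of this list can be sketched in a few lines of Python, for those who would rather not write shell. This is only an illustration under assumptions: run_stage is a made-up helper, not part of the build file in the annex. It echoes the command it runs, stops at the first failure (the spirit of set -e) and reports where the time went.

```python
import shlex
import subprocess
import sys
import time


def run_stage(name, argv):
    """Run one build stage: say what is run, fail fast, report elapsed time."""
    # Echo the exact command so it can be replayed on another machine.
    print(f"[{name}] {shlex.join(argv)}", file=sys.stderr)
    start = time.monotonic()
    # check=True raises CalledProcessError on a non-zero exit status,
    # which stops the whole build instead of wasting hours (like set -e).
    subprocess.run(argv, check=True)
    elapsed = time.monotonic() - start
    print(f"[{name}] done in {elapsed:.1f}s", file=sys.stderr)
    return elapsed


if __name__ == "__main__":
    # Hypothetical stage: any command line works here.
    run_stage("hello", [sys.executable, "-c", "print('muscle done')"])
```

Calling run_stage once per stage gives a build log that tells you both what was run and how long each step took.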

You will notice in the output that the project's muscle takes less than one second out of 20 minutes: the fun part of the code is rarely the longest, you spend your time fighting the absurd.

For example, dot (the default graphviz command) computes a layout to place the nodes in the graph, but if you give it too many nodes (overlaps), the program will spin its wheels without warning you. On the other hand, if you use sfdp, it will zoom out to resolve the conflicts, and in doing so it makes convert, which loads the image's pixmap into memory, perform an allocation too costly for your 4 GB of RAM if you do not limit the number of parallel instances, and burn a lot of CPU. The world is made of optima sniffed out of thin air, which require as much flexibility as auto-fellatio. I repeat: PARAMETRIZE, you will thank me.
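Limiting the number of parallel instances can be sketched as follows in Python. The MAX_JOBS variable and the run_limited helper are assumptions made up for this example; only int_env_default is borrowed from the sociogram script in the annex. The point is that the degree of parallelism is a parameter, so a small machine can lower it without touching the code.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def int_env_default(var, default):
    """Read an integer from the environment, falling back to a default."""
    return int(os.getenv(var) or default)


# Hypothetical knob: defaults to the number of cores, lower it when RAM is short.
MAX_JOBS = int_env_default("MAX_JOBS", os.cpu_count() or 1)


def run_limited(func, items, max_jobs=MAX_JOBS):
    """Apply func to every item with at most max_jobs running at once."""
    with ThreadPoolExecutor(max_workers=max(1, max_jobs)) as pool:
        # map keeps the input order in the returned results.
        return list(pool.map(func, items))
```

Threads are enough when func merely launches an external command such as dot or convert, since the heavy lifting happens in the child process.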

Having reproducible builds is your Ariadne's thread in a chaotic and delirious world: repeat after me, just because we code like boars doesn't mean we have to code like pigs. I can't hear you: say it louder! There, I think that with this build file you understand what programming with the nails is. To get the crap off the floor when I clean the kitchen, I could go and fetch a sharpened knife to remove the encrusted stains, but no. I claim to use my nails, but in fact I go and fetch a solvent: oil for grease (yes, really), soap for oil, alcohol for ink ... but it remains a nice image: sometimes nails are slower at getting the job done, but they are the tool you have right at hand, and you scratch where it itches.

Let's move on to programming à la pogne (the second file, needed once you have the database already loaded with emails). A few tips:

ENVIRONMENT VARIABLES IN CAPITALS AT THE TOP OF FILES: they are easier to find;
NO MAGIC NUMBERS;
it is easier to manage one file that includes its data than n files of code and data;
if you run typesetting tools to format your code like in a company, you have time for philosophical debates. STOP! You have a life and a savate (French boxing) training session not to miss;
do everything by hand (à la mimine) if you can (e.g. generating the graphviz files);
a simple try/catch is sometimes better than overly elaborate code;
a measurement script MUST ALWAYS INCLUDE BAND-PASS FILTER LOGIC: that is often where the intelligence is hidden.
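A band-pass filter on mail counts can be as small as the sketch below. The names MIN_MAIL and MAX_MAIL and their defaults come from the script in the annex, but how that script applies them is not shown in this section, so the band_pass function here is an assumption, not the author's code.

```python
import os


def int_env_default(var, default):
    """Read an integer from the environment, falling back to a default."""
    return int(os.getenv(var) or default)


# Parameters at the top, in capitals, no magic numbers below.
MIN_MAIL = int_env_default("MIN_MAIL", 6)
MAX_MAIL = int_env_default("MAX_MAIL", 100)


def band_pass(counts, low=MIN_MAIL, high=MAX_MAIL):
    """Keep only the links whose mail count lies within [low, high]."""
    # Below low: noise (one-off mails). Above high: mailing lists and robots.
    return {link: n for link, n in counts.items() if low <= n <= high}
```

With the default band, a link seen twice or five hundred times is dropped while a link seen ten times is kept.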

Speaking of band-pass, here is what the same data looks like when you keep only the VIPs: you stop showing a forest and, as Shannon asks, you show the most relevant information (H = k ln(S)), and above all it builds faster.

Now let's move on to programming with finesse, the finest part of coding, the one that lets you shine in a company and earn a maximum of cash.

The most important thing is not know-how but making it known. You see, a house painter is paid €10/hour, but if an artist who does the same thing writes an essay about his gesture, THEN suddenly everything is worth 1000 times more.

What matters is really not coding, but knowing how to talk about it.

There, I hope that you too are convinced by the « la rache » method and will contact me privately so that I can train you in it (for a maximum of dough).

Article guaranteed 1000% written with the « la rache » method

Nail programming: reinventing make in bash: you show your code, but if you don't tell how to build it, you are a scammer

I have always had a beef with so-called open source projects that ship their code but not their build tooling.

You get the code, but nothing is said about building the initial database, generating the docs, or how to fetch/build assets.
And that's exactly what I did in my previous blog posts about building a chronosociogram. But I have an excuse: it's not ethical open source, it's WTF public license open source: MY FUN FIRST and no unpaid work hours for the benefit of parasites.

However rebellious you want to be, you still need to build the code. But I do not build tools, I let them emerge from « scratch programming ».

digression : the 50 shades of « hand programming »

I belong to the sect of the « la rache » methodology. « Programmer à la rache » (close to rush) is the main methodology applied in France. It translates in the US as « le » rush. Rush programming is an art that deserves a French article to distinguish it from the more common gruik programming (aka programming with your feet). In this category there are the engineers who love « assisted generative programming », helped by IDEs, « frameworks » or now the almighty Artificial Intelligence, and the rebels who prefer « hand-crafted code ».

I live in Toulouse, where rain comes in one flavor: raining or not. I come from the Vexin, where rain comes as pluviote, crachin, drache, bruine, averse, giboulées ... Having shades of perception for a simple task is useful when you don't want to be drenched by drache or overwhelmed for lack of defensive measures. Sometimes, even in « la rache », we need to do things the right way so as not to drown ourselves in the complexity of our code.

First is POGNE programming

It's how prototyping begins: with a clear view and a firm grasp of fewer than 100 sloc in a single NICE monolith.

Second is NAIL programming

Well, on the path to delivery you encounter difficulties you had not expected ... And instead of fetching your knife from the garage to scratch the spot on the ground, you use your nails. Your code base gets disgusting, but YOU HAVE TOOLS to ease your job. But it hurts a little.

Last is mimine programming

Mimine is the gentle hand that comes to the rescue and makes you spiral into building simple tooling that makes your code manageable again.
As a craftsman, you clear your workshop, sharpen your tools, clear the air and get your place ready to begin another day of POGNE programming.

There is l'art (the academic way of building code) and la manière: your own personal sensitivity of « it works for me »®©, packed in a few lines of code so that you have time to go to savate training (breaking knees the elegant French boxing way). La « Manière » is a pillar of « la rache »: don't climb the learning curve of yet another tool you hate when you can brutally hack one together.

Why do you always need a makefile?

Lol, if your code does one thing simply (like listing) I'm not sure you need it. :D

I needed a makefile because I introduced a two-pass build to ease the life of my Core i3 with 4 GB of RAM, which ... by serendipity, helps build nice tools.

Let me explain: using a dot file as a template is less expensive than REBUILDING the same dot file at each iteration... But since f-strings/regexps are not my favourite in python, I used perl to turn the dot file generated from python into a template and reinject it into the python script. An « n »-stage build requires assets (artefacts), and sometimes your shell history has forgotten how you made them, and so have you. So you need to create a makefile.

Good makefiles tell you what they are doing, not only so that you understand when they fail, but also to help you build an intuition of where time is spent. A makefile is as much an informative tool as a structuring tool.
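The templating idea can be sketched in plain Python. Be warned that this is not the perl trick described above: it is a stand-in that uses string.Template, it assumes every edge carries a color="..." attribute, and the placeholder names c0, c1, ... as well as the toy BACKBONE graph are invented for the example.

```python
import itertools
import re
from string import Template


def to_template(dot_text):
    """Turn every color="..." attribute into a numbered placeholder."""
    counter = itertools.count()
    # Escape literal dollars first so Template does not choke on them.
    escaped = dot_text.replace("$", "$$")

    def placeholder(_match):
        # First edge gets ${c0}, second ${c1}, and so on.
        return 'color="${c%d}"' % next(counter)

    return Template(re.sub(r'color="[^"]*"', placeholder, escaped))


# Toy stand-in for the persistent dot file that is generated only once.
BACKBONE = '''digraph g {
  "a" -> "b" [color="grey"];
  "b" -> "c" [color="grey"];
}'''

if __name__ == "__main__":
    template = to_template(BACKBONE)
    # One substitute call per frame: only the colors change, not the graph.
    print(template.substitute(c0="red", c1="grey"))
```

Each frame of the movie then costs one substitute call instead of a full regeneration of the graph description.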

What you want from a trivial makefile

It should help you when, the day after a long coding session, your brain has discarded every detail that is hard to remember, such as how your excited brain worked. That includes:

in which order to do stuff
dependencies
parameters and API
which stages can be resumed after an error in the stage
how to skip a lot of useless stages when you lack time
where you spend the most time when something changes
bash completion of course !

When you are an adept of « la rache », global states/variables are embraced like poison. At low doses they help, too many of them kill. In LA RACHE you ALWAYS use global variables to avoid MAGIC NUMBERS, BUT they MUST BE AT THE TOP of the code every time.

In LA RACHE we embrace universal key-value passing from perl to python to bash to ffmpeg: for this we use a secret tool, undocumented environment variables, so that we can later choose what to expose.
The make script is a simple dispatch table, called more or less recursively (if you consider stack-based recursion noble enough), and it checks for the presence of artefacts to call dependencies when required.
I am pretty proud of a « racherie », the three perl one-liners there, that actually transforms the dot file into an f-string for use as a template in python. As a source of wisdom I dare say: free text written by an algorithm is always easier to parse with a regexp :D
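The dispatch table with artefact checks described above can be sketched in Python as follows. The real thing is a bash script that is not shown here, so everything in this block is an assumption: the rule layout, the make and touch helpers, and the idea of logging each decision.

```python
import os


def touch(path):
    """Create an empty artefact, standing in for a real build step."""
    with open(path, "w"):
        pass


def make(target, rules, log):
    """Build target unless its artefact exists, building dependencies first.

    rules maps a target name to (artefact_path_or_None, [dependencies], action).
    """
    artefact, deps, action = rules[target]
    if artefact and os.path.exists(artefact):
        # The artefact is already there: nothing to do for this stage.
        log.append(f"skip {target}")
        return
    for dep in deps:
        # Plain recursion: the call stack is the dependency resolver.
        make(dep, rules, log)
    log.append(f"run {target}")
    action()
```

A target with no artefact path (None) always runs, which is convenient for stages such as cleaning.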

Why does the build have more code than the main code?

For the same reason that in real life you often spend more time cleaning your environment before (and normally after) a task. A staff manager never wants you to add that to your timesheets because « it is non productive ». Well, here at home, I have no manager to tell me how to spend my time wisely. So ... I do whatever pleases me.

What is the point of this?

Practically, the debug messages give me a clear intuition of where my computer spends the most time, an intuition that I refined by printing dots where interesting. Thanks to this, for instance, I noticed that dot and convert were 100% single-core; otherwise I would not have parallelized part of the code according to my number of cores.
Also, having an intuition of where time is spent helps spot counter-intuitive results. Take sfdp (not super fils de p..., but the graphviz engine intended for speed). Its algorithm scales the layout out to avoid overlaps efficiently, which makes the conversion waaaay slower. So all in all, sfdp + convert is slower and more prone to OOM than dot + convert. It is something you cannot guess if you mute all feedback. Call it visual profiling :D
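Printing dots as a sign of life takes very little code. Here is a minimal Python sketch, assuming a made-up with_dots helper that wraps any loop; the original does this from its bash and python scripts in its own way.

```python
import sys


def with_dots(items, every=10, out=sys.stderr):
    """Yield items unchanged, writing one dot every `every` items."""
    for n, item in enumerate(items, start=1):
        yield item
        if n % every == 0:
            # Flush at once, otherwise the dots arrive in one useless burst.
            out.write(".")
            out.flush()
    out.write("\n")


if __name__ == "__main__":
    for _frame in with_dots(range(95), every=10):
        pass  # the slow step (dot, convert, ...) would go here
```

The rhythm of the dots is the visual profiler: when they slow down, you know which stage to look at.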

Also, putting code in a makefile is a major pain. For an almost cascading model of tasks involving few dependencies, coding the logic myself (2 hours) was totally worth it for the fun of the process.

Having no code review demanding that I begin my prototype with clever CLI libs is also nice. Environment variables have regained traction for argument passing (docker finally has a good side): I can pass variables without the craziness of getopt syntax, which makes everything easy, including going back and forth between the make script and the code.
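Passing parameters through the environment can be sketched from Python as well. BY_DAYS and its default of 4 come from the sociogram script in the annex; the run_with_params helper and the one-line CHILD program are assumptions written for this example.

```python
import os
import subprocess
import sys

# Child program: reads BY_DAYS the same way the sociogram script does.
CHILD = "import os; print(int(os.getenv('BY_DAYS') or 4))"


def run_with_params(argv, **params):
    """Run argv with extra environment variables and return its stdout."""
    # Environment values must be strings, whatever the caller passed.
    env = {**os.environ, **{key: str(value) for key, value in params.items()}}
    done = subprocess.run(argv, env=env, check=True,
                          capture_output=True, text=True)
    return done.stdout


if __name__ == "__main__":
    print(run_with_params([sys.executable, "-c", CHILD], BY_DAYS=7))
```

The same key and value can be read by bash, perl or python without any argument parser on either side.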

Bash completion is fairly easy as a one shot :

complete -W "all still_images muscle backbone movie clean very_clean" ./make

Of course I could give make a completion function, but it is not worth the pain.
Makefiles, and whatever flavour of snake oil you are drinking, are above all about focus: separating the complex code you want to focus on from the rest. Without this, juggling your short-term memory between building and debugging becomes hell. Of course, locally I use version control software à la git, because regressions bite us all hard when they do. Especially the idiot who forgets to use a VCS.

Bref, I don't say makefiles are shitty. I say: I don't compile C here. My global needs are the same (reproducible builds) but my path is different, since I use bash/perl/python/CLI tools that are better invoked from a shell.

Being home is really the place where coding is nice and relaxing. I don't trust pro coders who refuse to touch a keyboard outside work: how can they know whether their beliefs about style, dos and don'ts, languages and frameworks are on par with reality?

Building a Better chrono-sociogram : a radar picture of social interactions

Humans are like a flock of birds of the same feathers that have a natural tendency to change their feathers a lot.

-- Coco Chanel while french kissing a gestapo officer ~1943 in Paris

What is wrong with sociograms? They don't capture the way people change their loyalty. I think an election is interesting, especially when it was the most massive turncoat event ever seen in French politics.

So to solve the lack of tooling for this there is a tool: the chrono-sociogram, aka social movie.

First was the persistent map

By design, dot will correctly guess galaxies of connections that have existed over a long period of time. Hence we begin with a persistent map, which amounts to setting all links to 1 over a long period of time.

Why? Because, just as radar maps are useless if cluttered with too much information, you will care less about WHO talked to WHOM than about where storms happen, and when.

Unreadable ? It is here in SVG.

Well, since my Core i3 (sorry for being poor and unable to afford an M1 mac and a GPU for mining bitcoin) takes 10 hours to generate 243 dot files, imagine how much time it takes to generate the following: a movie made by keeping the persistent image and only changing the color/size of the edges, by templating THIS exact picture :D

What is it useful for ?

The first time I worked on this was in 1997 at ENS, as an intern who knew linux, in a complex systems lab that had a grant on influencing citizens for the « greater good of the Politique Agricole Commune ».

The EU had failed miserably at convincing peasants (rednecks) that their interest lay in accepting EU money and being enslaved to debt for life. With this grant, the network of the peasantry and its exchanges was mapped, a sociogram was built and « highly connected nodes » were identified. Nowadays we call these « influencers ».

The EU just flipped the opinion of a selected few (ignoring the background of why THEY were selected) and it cascaded into convincing farmers.

This video illustrates like storms, both exchanges in social networks and when/how people turn coats. It is ACTUALLY the very heart of social media.

Hence when I claimed some originality on this work, I point blank lied. However, try to find people sharing their tooling and methodology, and you will discover that your taxes fund public research labs whose work you can't read and whose tools you can't use on the topic.

Many graph problems are NP-hard and exploring graphs requires a lot OF CPU, so how do Meta/FB handle the network analysis I just did for a fraction of the CPU?

Well, they don't just observe the network, they also model it by rewarding people who ARE ALREADY INFLUENCERS in real life thanks to « side channels ». Ex: publishing videos at the current quality standards in video and sound processing requires WAY MORE MONEY and better equipment than writing a blog post, which can be done with a 12-year-old computer. There is a selection bias in just using the mainstream tools.

How can we avoid being influenced ?

Don't: learning about foreign cultures is nice. I love manhwas, mangas, folklore and foreign languages as much as any influencer does. Unlike them, though, I have no interest in fame.

What traps you in the web of influence is the reinforcement of your own biases. Trusting hyped talkers is a bias.

Have you tried talking to a friend of yours about a topic on which you disagree, stating that you want to explore influence? Life is polemic. You are better prepared for it by training like a boxer: every day, in a pleasant, non-aggressive mindset, like when going to boxing training. (Yes, I love boxing).

Ivy League education prepares for this through « concours d'éloquence » (public speaking contests); it is also, if you notice, at the core of how slang culture treats others.

Now that I have finished a project I had wanted to do for a long time, I think my sociogram time is over, and I may do a last post on the making (and thus introduce my bash sort-of makefile, and how with perl one-liners I templatized a dot file to create a usable template for python, because this trick is severely disgusting (I mean FUN)).

Real life stuff

All the spokespersons of the government, even though they were not the ones spied on, were very actively involved while being paid with taxpayers' money. It may be legal, but it does not look moral when the same people advocate for better spending of tax money. When you receive public money to support your expenditures as either a public servant or a representative of the people, you don't work for private interests on the side. But well, we don't call the French state l'Assiette au beurre (the plate of butter) for nothing, nor Paris Paname without a reason: it's because of this lingering kind of corruption scandal, which doesn't exactly motivate you to vote.

Next episode

The poor man's makefile in bash.

Annexe

Building a CHRONO SOCIOGRAM from real world data

My movies are but the capture of fleeting moments of life that take place in a 1:1 map of human relationships. The map is a useful simplification of the overwhelming complexity of a territory in which I strive to explore human emotions.

Ingmar Bergman

In previous episodes (the last 2 posts) I explored HOW TO represent links between social animals to try to infer who knows whom and how these animals organize themselves as a herd: that is a sociogram.

It's a map, and a bad one at that. You see, as in H2G2 (The Hitchhiker's Guide to the Galaxy), it's not the map that makes the story, but the construction work that changes the map. To « prove » my point I will study real-life data about which (ethical/legal disclaimer) the victims of the data breach and THE CREAM OF ALL JOURNALISTS admitted there was nothing. The hacked team said it was a honeypot (which will prove to be quite true) and that nothing of interest was there (which is quite false). Hence, if you want scandals or REVELATIONs, sorry to disappoint you: sociograms don't unfold secret conspiracies.

I must also say I am a sysadmin first. My soul burns when exposed to private email contents, so ... I love sociograms that are in the grey area: you look at the envelope without reading the mail body.

In my database structure you saw that I DO populate my database with text_mail, and you guessed that I have used real data since day 1. Coding from abstraction is nice, but nothing beats real data to actually stress your abstraction and torture your tools, especially emails, which per RFC 2822 are the most annoying archive format ever to parse (and which I love, I began as a sysadmin specialized in mail).

Without further ado I will show you the result and explain how it was built thanks to postgres FTS (word count does not count as reading email bodies, does it?).

Legal disclaimer

It is a sociogram of Persons Of Interest from the macronleaks. I will define a POI as either hacked or (my own definition) any public servant, who by being one is subject to Article 14 of the constitution.
Article 14: All citizens have the right to ascertain, by themselves (...), the necessity of the public contribution, (...), and to follow its use. The payroll of public servants being a public contribution, and public servants being subject to the « devoir de réserve » (duty of reserve, hence NOT INFLUENCING elections), I am entitled to check whether public money was indeed used with my constitutional rights at heart.

Aristotle said elective systems are the opposite of democracy: democracy is the ruling of the people by ANY of the people, while elections introduce a selection bias in the candidates by choosing who has access to the tribune, hence making the vote a biased choice within a biased subset of the people. I will add that Aristotle was an idiot.
Democracy is not a STATE function, it is a PATH function. Democracy is a path and a journey made of polemics and improvements. Elective systems represent a f**d up state of the situation, but that doesn't prevent democracy from existing as long as, per « a good huguenot's belief », everyone can discuss as an equal. I BELIEVE IN REFORMS, but I don't go to the temple (protestants bore me to death) (cf. Frankfurt school/Habermas).

So let's see whether my constitutional right to choose from the democratically largest subset of the population was respected, or whether public servants entitled to enforce the spirit of the Constitution meddled with my right to choose from a larger subset of candidates than during the last elections. Is the delta « positive » in terms of democracy?

TL;DR: the reform I would want after this study would be for the administration to make public ALL envelopes of mails between public servants and the private sector. I would also like to have all public communications in a database I can query, without elected members/public servants having the right to DELETE them.

Hence here is a real-life sociogram of public servants/elected members during Macron's campaign.

Real life feedback

Unlike developers, I know when a tool sucks. This MAP SUX.

A good map is easily readable and does not clutter the screen with TOO MUCH INFORMATION. I find this one nice and readable, because I have been trained to read the functions of a chip by looking at the silicon under a microscope, and compared to a SoC, this is peanuts. It's not even as complex as a modulator/demodulator (MODEM). However, I am also human. It's really a mess.

You may criticize people with nice words; I don't coat my words with sugar: it is plainly unusable because of H = k . ln(S). There are too many choices that are not relevant.

This is a sociogram from mails collected over a 2-year span, hence it makes things that were fleeting moments persist over time. It has good sides: you can guess the galaxies of influence by looking at the constellations of contacts near the persons who were hacked: ministry of finance, ministry of public health, ministries of sovereign (régalien) importance (justice/interior), municipality of the capital (a state within the state), and military industry.

It has the same value as persistence of vision: a blurry map giving a feeling of what is there, but no details.

Hence, we are gonna introduce another tool that complements the sociogram: a CHRONO-SOCIOGRAM, MOVING PICTURES of a SOCIOGRAM. V1 is ugly, but good enough to give an idea of what comes next: a freaking chrono-sociogram where nodes stand still and where I only highlight the edges temporarily.

I will not sugar-coat my words: it is still shitty as hell, BUT at any given time a normal brain can often see what matters: who talked to whom, and when? For shit diggers with access to the content of the emails, this would be the tool to ease the inquiry job of isolating the time window in which to scrutinize contents. BUT I WILL NOT! I am a PROUD SYSADMIN, muttafuckas!

But how did you find the POIs in the first version, btw? Cheater?

You can predict everything except the future, hence the reason in applied physics we make predictions on past data so we can foresee the results.

Niels Bohr

I don't cheat, I used science. I put in all members of the government and its spokespersons as POIs, knowing it was already public that they had worked in the campaign. But I soon discovered that knowing the future was not necessary.

Let me introduce : POSTGRESQL FULL TEXT SEARCH's FEATURES : not trying to add words in a corpus based on inference.

Once you google for regex postgres array you find a neat operator, ~!@#, that is very useful: you can fire the word count and select emails by frequency from a starting point.

Knowing that the president's real email is ALWAYS in the form e2m@, it gives:

ml=# select * from ts_stat($$select to_tsvector('french', text_plain) from mail where  '^e2m.*' ~!@# any("from") order by date ASC$$) where word ~* '.*@.*' order by nentry desc;
                  word                   │ ndoc │ nentry 
─────────────────────────────────────────┼──────┼────────
 e2m@en-marche.fr                        │   18 │    115
 barbara.frugier@gmail.com               │   11 │    104
 clement.beaune@gmail.com                │   11 │    102
 ismael.emelien@en-marche.fr             │   19 │     36
 sylvain.fort@en-marche.fr               │    8 │     16
 benjamin.griveaux@en-marche.fr          │    6 │     12
 julien.denormandie@en-marche.fr         │    6 │     12
 brigitte.macron@en-marche.fr            │    6 │     12
 sibeth.ndiaye@en-marche.fr              │    6 │     12
...

So when the president is in the from field, these are the emails most likely to have been quoted.

Word count does not count as reading the emails' body content, for real, especially with a filter on "@".

And, step by step, you build the sociogram by taking the top most-quoted emails from the first ring, then from the second ring, and then you stop because it is already a lot...

Nothing magical, just a stupid out-of-the-box function from the database, used in a perverted way.
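The ring-by-ring construction can be sketched as a bounded breadth-first walk in Python. Everything here is hypothetical: top_quoted stands in for the word-count query shown earlier (it should return the most quoted addresses for a given sender), and the rings function is written for this example, it is not the author's code.

```python
def rings(seed, top_quoted, depth=2):
    """Return one list of newly discovered addresses per ring around seed.

    top_quoted is any callable mapping an address to the addresses
    it quotes the most (a stand-in for the database query).
    """
    seen = {seed}
    frontier = [seed]
    found = []
    for _ in range(depth):
        next_ring = []
        for mail in frontier:
            for other in top_quoted(mail):
                if other not in seen:
                    # Each address belongs to the first ring that reaches it.
                    seen.add(other)
                    next_ring.append(other)
        found.append(next_ring)
        frontier = next_ring
    return found
```

Stopping at depth 2 mirrors the remark above: beyond the second ring the map is already crowded.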

Since these leaks are old news, you can then correlate with what you know.

barbara was indeed the PR
clement became a spokesperson of the President or a minister
...
sylvain the ghost writer who wrote « revolution » for the president

Being as sneaky as Google/Meta does not require brains. Just building tools.

At this point the code (embedded data not counted) for making « a movie » is still 100sloc and I code AT BEST 10 sloc a day. I'm a messy sloth.

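Here is the code for building the movie:

pushd out
# render every dot file to a PNG frame
for i in *.dot; do dot -Tpng "$i" > "$(basename "$i" .dot).png"; done
# force every frame to the same size so that ffmpeg accepts the sequence
for i in rec.????.png; do convert "$i" -resize '2000x1400!' "re.$i"; done
# -f: do not fail when there is no previous movie to remove
rm -f output2.mp4
ffmpeg -framerate 1 -i re.rec.%04d.png -c:v libx264  -crf 30 -pix_fmt yuv420p output2.mp4 && firefox output2.mp4
popd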

And here is the code for building the sociograms :

#!/usr/bin/env python3

import os
import psycopg2
from datetime import date, datetime, timedelta
from archery import mdict

def int_env_default(var, default):
    return int(os.getenv(var) or default)

MIN_MAIL = int_env_default("MIN_MAIL", 6)
MAX_MAIL = int_env_default("MAX_MAIL",100)
WL_MIN = int_env_default("WL_MIN", 3)
CUT_SIZE = int_env_default("CUT_SIZE", 20)
DATE = os.getenv("DATE") or "2016-01-01"
END_DATE = "2017-05-01"
BY_DAYS = int_env_default("BY_DAYS",4) # 13x28 = 364 ~ 365.5
NB=1

end_date = date.fromisoformat(END_DATE)
date = date.fromisoformat(DATE)
td = timedelta(days=BY_DAYS/2)
td2 = timedelta(days=BY_DAYS/2)



                
def is_ilot(node:str, edge_dict:tuple) -> bool:
    """ilot == has only 1 link back and forth either in (from,) or (,to)"""
    count=0
    for edge in edge_dict.keys():
        if node == edge[1] or node == edge[0]:
            count+=1
        if count > 2:
            return False
    return True

patt_to_col =  dict({
    "e2m":"red",
    "emmanuel.macron":"red", 
    "emmanuelmacron":"red",
    "alexis.kohler" : "midnightBlue",
    "gabriel.attal" : "orange",
    "sejourne.stephane" : "grey15",
    "stephane.sejourne" : "grey15",
    "olivia.gregoire" : "darkOrange",
    "veranolivier":"green",
    "julien.denormandie" : "lightBlue",
    "sibeth.ndiaye" : "orange",
    "barbara.frugier" : "green",
    "cedric.o" : "purple",
    "gouv.fr" : "maroon",
    "snecma.fr" : "purple",
    "safran-group.fr" : "purple",
    "benjamin.griveaux":"blue",
    "laurent.bigorgne" : "darkBlue",
    "jean.pisani-ferry": "yellow",
    "luc.pisani-ferry": "yellow",
    "ismael.emelie" : "orange",
    #"jesusetgabriel.com" : "crimson",
    "gregoire.potton" : "lightGreen",
    "eric.dumas":"salmon",
    "alexandre.benalla" : "darkGreen",
    "pierre.person" : "darkBlue",
    "pierrperson" : "darkBlue",
    "quentin.lafay":"grey10",
    "fm.alaintourret" : "purple",
    "@paris.fr" : "orange",
    "langanne" :"pink",
 })
    

def in_wl(mail : str):
    """Return the first whitelist pattern the address starts or ends with, else None."""
    for l in patt_to_col:
        if mail.startswith(l) or mail.endswith(l):
            return l

def wl(pair: tuple):
    """Return the colour of the first address when both ends are whitelisted, else None."""
    if in_wl(pair[0]) and in_wl(pair[1]):
        return patt_to_col[in_wl(pair[0])]
#assert wl(("jesusetgabriel.com", "jesusetgabriel.com")) == "crimson"

is_vip = lambda t:all(map(in_wl, t))
fn=0
while date < end_date:
    fn+=1
    direct=mdict()
    final = mdict()
    conn = psycopg2.connect("dbname=ml host=192.168.1.32 port=5432 user=jul  sslmode='require' ")
        
    with conn.cursor() as sql:
        # let psycopg2 bind the dates instead of formatting them into the SQL string
        sql.execute("""SELECT "to", "from" from mail where DATE BETWEEN %s AND %s;""", (date, date + td))
        while t := sql.fetchone():
            for fr in t[0]:
                fr=fr.strip()
                for to in t[1]:
                    to=to.strip()
                    if fr != to and fr and to:
                        direct += mdict({ (fr,to) : 1 })

    date += td2
    tk= list(direct.keys())
    def has_more_than_n_neighbour(email: str, n :int, final : dict):
        count = 0
        for k in final.keys():
            if len(set([email]) & set(k)):
                count+=1
                if count >n:
                    return True
        return False
            

    for k in tk:
    # dont modify a dict you iterate hence copy of keys
        if is_vip(k) and k not in final  and k[::-1] not in final: # or wl(k) and (k[1] != k[0] and k[1] and k[0] and k in direct and k not in final and k[::-1] not in final and k[::-1] in direct \
               
            final[k]=direct[k]
            final[k]+=direct.get(k[::-1],0)
        else:
            try:
                del(direct[k]) 
            except KeyError:
                pass
            try:
                del(direct[k[::-1]]) 
            except KeyError:
                pass


        
    tk= list(final.keys())

    for e in tk:
    # dont modify a dict you iterate hence copy of keys
        if not has_more_than_n_neighbour(e[0],NB,final) or not has_more_than_n_neighbour(e[1],NB,final):
            try:
                del(final[e]) 
            except KeyError:
                pass
            try:
                del(final[e[::-1]]) 
            except KeyError:
                pass



    color = "".join([  f"""{i[1]} pour {i[0]}{[", ",chr(0x0a),][(n%4)==3]}  """  for n,i in enumerate(patt_to_col.items()) ])
        

    conn.close()
    with open(f"out/rec.%04d.dot" % fn, "w") as f:
        title = f"""label="Sociogramme de {date} à {date + td} extrait des macron leaks orienté gouv.fr, personne d'intérêts (vert), victime du hacking (rouge), et président (bleu)\n
    entre [ {MIN_MAIL}, {MAX_MAIL} ] échangés \n
    plus gros liens au dessus de {CUT_SIZE} mails échangés entre interlocuteurs \n
    couleur par priorités selon les origines \n
    {color}
"""

        print("""
    graph Sociogramme { """ + f"""
    fontname="Comics sans MS"
    outputmode="nodesfirst"
    size=20
    label="Sociogramme de {date} à {date + td}"
    ratio=1.7
    labelloc="c"
    labelloc="t";
    """ + 
    "\n".join(['  "%s" -- "%s" [label=%d color=%s penwidth=%d ];' % (k[0],k[1],v, wl(k) or "black", 1 if v < CUT_SIZE and wl(k) != "red" else 4 ) for k,v in final.items()])
     + "}", file=f)
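
The time slicing done by the while loop above can be looked at in isolation. A minimal sketch, assuming the same idea of a window of width td that advances by td2 ; the generator name windows and the dates are made up for the example.

```python
from datetime import date, timedelta

def windows(start, end, width, step):
    """Yield (begin, finish) date pairs, advancing by `step` until `end` is reached."""
    current = start
    while current < end:
        yield current, current + width
        current += step

# hypothetical run : 2 day wide windows advancing by 2 days
slices = list(windows(date(2016, 1, 1), date(2016, 1, 9),
                      timedelta(days=2), timedelta(days=2)))
```

With a step smaller than the width the windows overlap, which smooths the movie ; with equal values, as here, each mail lands in one frame (boundary days excepted, since BETWEEN is inclusive).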

PS : I'm totally sure I saw during FSM 2005 or 2004 (20 years ago) exactly this about python commit during a conference. A lot of kudos to whoever can find the link and say : boooooh it's not new imposteur !

PPS : next blog post solving the graphviz variable problem that f*** ffmpeg with a fstring and being inventive in less than 20 sloc :D
I'm totally gonna dump the full stuff and use perl to make a template in f"" with perl :D trololololol

Building a sociogram from mail archives in python and postgres

TL;DR : « I know that, like Jon Snow, I know nothing, and it is not a problem »

Foreword : typing in CS is WRONG

People are different. I am, you are, he/she are, it are.

I don't talk about gender, I talk about mileage, life experiences and shit.

Me, I come from being half thug, half educated (D&D bi-classed), including being the local kingpin of forging fake documents for « école buissonnière » (skipping school) with my early 90s computer, cheating with pctools, and being the kingpin of a ring of cracked software ranging from amiga to powerPC.

I learned coding not as an educated CS 101 student but as a dyscalculic and dyslexic student in physics, redeemed by his facility in foreign languages, especially if they have a germanic touch (french, english, german).

Foreign languages have taught me the power of composing consistent pre/post positions : dé-ménager is the opposite of em-ménager, just as moving out is the opposite of moving in.

Also, programming in physics is not the same as programming in CS. We don't have a budget for programming hours, so our teachers don't care about elegance, high level programming or low memory consumption. They want easy to read and debug code that works fast. Also, physics has long resolved the typing debate : computer academics are just so high they don't get it.

See : if I have a quantity in meters per second and I add kilograms, we want the code to crash.

Strongly typed languages (including ADA or VHDL) don't have « symbolic consistency » but DATA TYPE consistency. And I will illustrate it in the code to come with the is_ilot function.
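
To make the meters per second versus kilograms point concrete, here is a minimal sketch of what « symbolic consistency » could look like in python. The Quantity class is made up for the illustration (real unit libraries exist and do much more) ; the only thing it shows is that two floats are the same data type and still must not be added.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    """A value that carries its unit around (hypothetical helper)."""
    value: float
    unit: str

    def __add__(self, other):
        if not isinstance(other, Quantity):
            return NotImplemented
        if self.unit != other.unit:
            # same data type (float), different meaning : crash loudly
            raise TypeError(f"cannot add {self.unit} to {other.unit}")
        return Quantity(self.value + other.value, self.unit)

speed = Quantity(3.0, "m/s") + Quantity(2.0, "m/s")

try:
    Quantity(3.0, "m/s") + Quantity(1.0, "kg")
    crashed = False
except TypeError:
    crashed = True
```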

Mindset and tools : why I did not use overwhelmingly beautiful postgres features but wasted my CPU and memory doing the sociogram in python (and why not in Perl)

My problem is simple : imagine I have a sample database to have fun containing real life email and I want to make a graph out of it. Like an Xleaks from wikileaks, and I want like Meta, or google to infer who are the interesting persons interacting out of the thousands of interlocutors. So let's dump my SIMPLE sql schema :


DROP TABLE IF EXISTS public.mail;

CREATE TABLE IF NOT EXISTS public.mail
(
    filename text COLLATE pg_catalog."default" NOT NULL,
    "from" text[] COLLATE pg_catalog."default" NOT NULL,
    "to" text[] COLLATE pg_catalog."default" NOT NULL,
    subject text COLLATE pg_catalog."default",
    attachments json,
    text_plain text COLLATE pg_catalog."default",
    date date,
    message_id text COLLATE pg_catalog."default",
    thread_id text COLLATE pg_catalog."default",
    CONSTRAINT mail_pkey PRIMARY KEY (filename)
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.mail
    OWNER to jul;

A sociogram is a graph made of relationships (edges) between persons (nodes) with their strength expressed in number of occurrences between persons.

I basically want

for each mail
   for the cartesian product of  to and from
       add an edge in graph
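
The pseudocode above maps almost word for word onto the standard library. A minimal sketch, assuming each mail is reduced to a (senders, recipients) pair of lists ; count_edges and the toy addresses are made up for the example, and a plain Counter stands in for the mdict used later.

```python
from collections import Counter
from itertools import product

def count_edges(mails):
    """Count one edge per (sender, recipient) pair of every mail."""
    edges = Counter()
    for senders, recipients in mails:
        for fr, to in product(senders, recipients):
            edges[(fr, to)] += 1
    return edges

# hypothetical toy data
mails = [
    (["a@example.org"], ["b@example.org", "c@example.org"]),
    (["a@example.org"], ["b@example.org"]),
]
edges = count_edges(mails)
```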

I use the language of building microchips I learned in applied physics of micro-electronic here. Not a computer stuff I cannot visualize in my f***d up brain.

The problem is « HUMANS ». They chat with a lot of persons, but there are 2 kinds of persons : persons who link other persons, and persons who are just noise, having a 1:1 relationship.

Analysing visually thousands of nodes in a graph is doable, but tiring. So we have to weed out the nodes that are useless : people who are just leaves in the graph, having only a bijective relationship.

Digression on the choice of tools

I love postgres, but I am dyslexic and don't use it often. My only tools are psql to build requests and stack overflow to answer my questions involving recursive requests on arrays to build a cross product.

Stack overflow is good at solving one problem. But composition of answers is tough, especially when I use reserved WORDS in postgres as column names. And you are gonna wonder why ?

MY OWN STRONG TYPING : name a thing by its real name and PUT UNITS in the name if you can.

Also, physics taught me to avoid recursion at all costs because .... see point #1, stack-based thinking does the same with fewer bugs and heap/stack overflows (I come from an era where the size of the heap for call recursion was laughably low on linux, C, Perl).

So I did my own code in python, wasting memory and CPU for something for which I am well aware there was an elegant, faster solution in postgres ; but between early optimization and early results the choice was fast : I went for gruik coding without hesitation. Why did I use postgres ? Parsing email costs HOURS. Getting the « to »s and « from »s is time consuming. I totally amortized postgres by using it as a DATASTORE more efficient than the filesystem. No regrets here.

The code : KISS

Building a sociogram weeding out the « ilots » (nodes connected to one and only one other node) is fairly easy once you coded the

is_ilot

function. It has been 2 full hours of work for a problem I never faced before applying academic knowledge of micro-electronic to a social graph.

import psycopg2
MIN_MAIL=20
MAX_MAIL=400
from archery import mdict

conn = psycopg2.connect("dbname=ml")

direct=mdict()
final = mdict()
    
with conn.cursor() as sql:
    # cartesian product done the dumbest way possible
    # "from" is selected first so that fr/to below match their columns
    sql.execute("""SELECT "from", "to" from mail""")
    while t := sql.fetchone():
        for fr in t[0]:
            for to in t[1]:
                direct += mdict({ (fr,to) : 1 })

                
def is_ilot(node:str, edge_dict:tuple) -> bool:
    """ilot == has only 1 link back and forth either in (from,) or (,to)"""
    count=0
    for edge in edge_dict.keys():
        if node == edge[1] or node == edge[0]:
            count+=1
        if count > 2:
            return False
    return True
    
tk= list(direct.keys())
for k in tk:
# weeding out pass 1
# dont modify a dict you iterate hence copy of keys
    if k in direct and k[::-1] in direct \
            and MAX_MAIL> direct[k]+direct[k[::-1]]>MIN_MAIL:
        final[k]=direct[k]
        final[k[::-1]]=direct[k[::-1]]
    else:
        try:
            del(direct[k]) 
        except KeyError:
            pass
        try:
            del(direct[k[::-1]]) 
        except KeyError:
            pass


    
tk= list(final.keys())
for e in tk:
# dont modify a dict you iterate hence copy of keys
# weeding out pass 2
    if is_ilot(e[0],final):
        try:
            del(final[e]) 
        except KeyError:
            pass
        try:
            del(final[e[::-1]]) 
        except KeyError:
            pass

    if is_ilot(e[1], final):
        try:
            del(final[e]) 
        except KeyError:
            pass
        try:
            del(final[e[::-1]]) 
        except KeyError:
            pass



conn.close()
# output for graphviz
print("digraph Sociogramme {")
print("\n".join(['"%s"->"%s" [label=%d];' % (k[0],k[1],v) for k,v in final.items()]))
print("}")

You will notice I « typed » the python function. It took me 20% of the time because I was still in postgres, thinking of postfix typing (::array) and not of infix typing (list :), and because I had ... a bug to solve. At first I coded the function with one letter names, as I usually do until I have a problem. I backtracked, changed the names and put the typing annotations in for fun, but what really helped me was the only docstring in the code, and naming : remembering what my purpose was and what edges and nodes were. As soon as I correctly named the function and the inputs, and wrote the docstring, it was done magically, and I laughed at how typing was missing the point.

edge_dict could be a list of keys, a sparse matrix, a mutable mapping, a string. The type of data structure I use to represent a node or an edge is so varied that typing does not help.
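
A small sketch of that claim, assuming the only thing the function really needs is « something I can iterate over that yields pairs ». This is a variation written for the illustration, not the function used above : it drops the call to keys() so that a list, a set or a dict of edges all work.

```python
def is_ilot(node, edges):
    """True when `node` appears in at most 2 edges (one link, back and forth)."""
    count = 0
    for edge in edges:          # iterating a dict yields its keys
        if node in edge:
            count += 1
            if count > 2:
                return False
    return True

# the same toy graph in three containers
pairs = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
as_dict = {pair: 1 for pair in pairs}
as_set = set(pairs)

results = [(is_ilot("a", c), is_ilot("b", c)) for c in (pairs, as_dict, as_set)]
```

The docstring and the names carry the meaning ; no annotation would say « at most one neighbour ».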

Here is a sample of the output before and after the weeding out with is_ilot :

[Figure : Before]

[Figure : After]

Final word

I want this post to be an advisory to noobs like me : don't care about academics, and don't let the accidents of programming deter you from going the way you see fit, in a Keep It Simple Manner that « works for you »©®

Datastore and database : why it is a good idea to not confuse both.

I strongly advise to fast watch this video, since datastore and database, as well as hierarchical database are covered there.

Now let's come back in 2024 and wonder, well, is this still relevant.

I recently had fun taking a simple use case that made one postgresql contributor famous : Julian Assange Xleaks.

Should we put mails in database or keep the database as an index ?

As a foreword : Python mail parsing is infamously not on par with Perl, from which it has ported its libs. We will imagine we use a thin layer on top of it known as mail parser and that our mail comes from google, outlook, and thunderbird archives.

After all, mails are so important in our modern life that the use case of analysing mail with a database, or a nosql or .... is important.

My favourite « kind » of non database data-base are the LDAP-like hierarchical tree datatypes. They fit very well with my aggregation lib for dict in python (archery, a good way to shoot yourself an arrow in the knee). And of course, POSTGRES DOES IT, POSTGRES DOES EVERYTHING BETTER THAN ANYONE. (I intend it as a troll, but, in the bottom of my heart, postgresql and sqlite are my 2 favourite databases).

However mails ARE relational thanks to from/to and thread-id. Even if thread reconstruction may favour hierarchical databases, your unit of search will be the mail, and you will want to query many-to-many relationships like « find all mails where X and Y are linked ».

One approach to parsing thousands of emails is relying on an SSD and brute forcing your way through the mails every time. But, lol, I'm poor, I have a hard drive which I fear for its life with all my precious photos of my family, so I prefer not to. Plus it's long. And that's when you remember surfing the paperspace and the library analogy.

Database is supposed to be the cabinet in the entrance where you can find the names of all the books that are physically located in the library and where they are. Ex : an author may be a co-author (like Pratchett for Good Omens) and may be shelved both in fantasy and in SF (yes, Pratchett wrote SF).

Fetching a book is long, the cabinet (database) helps you find the books fast in the datastore (the shelf of the library). A database is pretty much a tool for indexing slow retrieval data that you want to find fast.
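
The cabinet analogy fits in a few lines. A minimal sketch, using sqlite in memory instead of postgres so that it runs anywhere ; the table layout, the store/ paths and the function files_linking are made up for the example. The database only holds the envelope and the path, the mails stay on the shelf.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mail (filename TEXT PRIMARY KEY, sender TEXT, recipient TEXT)")
db.executemany(
    "INSERT INTO mail VALUES (?, ?, ?)",
    [
        ("store/0001.eml", "a@example.org", "b@example.org"),
        ("store/0002.eml", "b@example.org", "a@example.org"),
        ("store/0003.eml", "a@example.org", "c@example.org"),
    ],
)
db.execute("CREATE INDEX mail_pair ON mail (sender, recipient)")

def files_linking(x, y):
    """Return the paths of the mails exchanged between x and y, in either direction."""
    rows = db.execute(
        "SELECT filename FROM mail "
        "WHERE (sender = ? AND recipient = ?) OR (sender = ? AND recipient = ?) "
        "ORDER BY filename",
        (x, y, y, x),
    )
    return [row[0] for row in rows]

paths = files_linking("a@example.org", "b@example.org")
```

Only the files listed in paths need to be opened on the slow disk afterwards.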

In 2024, is it still relevant ?

I beg to argue yes in the famous realm of « it works for me this way »©® (famous saying of linus torvalds).

Hear me out : mail parsing is shit.

Imagine you have an email : per RFC, you can have MORE THAN ONE BODY (a mail is an n-relationship between envelope and body), and BODIES can be either TXT OR HTML and, more often than not (thanks to the creativity of the artists), may bear different messages, since SOMETIMES what you see matters ; hence the text body is literal garbage if you want to make sense of the mail.

Mail per RFC 2822 can have attachments that can embed attachments that refer to each other (often in a mixed peek and poke of both an arrayish and tree-ish data structure).

Ho ! Is it perfect for postgresql XML/JSON ? NO ! Recursive descending SQL requests may exist, but joining on a non deterministic disposition is begging for losing too many brain cells : it is not because you can do it that YOU SHOULD DO IT. Sometimes the most reliable way to access attachments is not from

parser(mail).attachments.to_json

(that is half-baked) but with

parser(mail).write_attachments

that renders embedded attachments inside the documents in a more reasonably readable way. And sometimes, more often than a python guess, what you want is a heavy client for READING MAIL that embeds decades of wisdom on how to deal with broken norms.

Hence, practically, when you deal with mails, what you want to store in the database is :

path to the DATA-STORE (hence HARD DRIVE)
from
to (BCC, CC for me are tos too, since I want to draw sociograms)
message-id
subject
thread-id
DATE which is a nightmare since it's a clear hell to PARSE
and because experimenting with JSON is fun the attachments metadata object/dict (application type, content type, filename) as JSON
and because FTS (full text search that can survive typo) is fun the FIRST text body (but you will miss when astute individuals will use more than one)

Legal archiver and geeks MIGHT want to add : the chain of mail servers, and all data concerning validation to try to detect signs of spoofing. Which gives us almost this :

CREATE TABLE IF NOT EXISTS public.mail
(
    filename text COLLATE pg_catalog."default" NOT NULL,
    "from" text[] COLLATE pg_catalog."default" NOT NULL,
    "to" text[] COLLATE pg_catalog."default" NOT NULL,
    subject text COLLATE pg_catalog."default",
    attachments json,
    text_plain text COLLATE pg_catalog."default",
    date date,
    message_id text COLLATE pg_catalog."default",
    thread_id text COLLATE pg_catalog."default",
    CONSTRAINT mail_pkey PRIMARY KEY (filename)
)

TABLESPACE pg_default;
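
To give an idea of how the envelope columns could be filled, here is a minimal sketch using only the python standard library (not the mail parser layer mentioned above). The function name envelope, the sample message and the store/ path are made up ; attachments, text body and thread id are left out, and real life dates will be nastier than this one.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

def envelope(raw, filename):
    """Extract the envelope of one raw mail as a dict shaped like the table."""
    msg = message_from_string(raw)
    senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
    recipients = [
        addr
        for header in ("To", "Cc", "Bcc")   # CC and BCC count as « to »
        for _, addr in getaddresses(msg.get_all(header, []))
    ]
    try:
        day = parsedate_to_datetime(msg["Date"]).date()
    except (TypeError, ValueError):
        day = None                          # missing or unparsable date
    return {
        "filename": filename,
        "from": senders,
        "to": recipients,
        "subject": msg["Subject"],
        "date": day,
        "message_id": msg["Message-ID"],
    }

raw = (
    "From: Alice <a@example.org>\n"
    "To: b@example.org, Carol <c@example.org>\n"
    "Cc: d@example.org\n"
    "Subject: hello\n"
    "Date: Fri, 01 Jan 2016 10:00:00 +0100\n"
    "Message-ID: <1@example.org>\n"
    "\n"
    "body\n"
)
row = envelope(raw, "store/0001.eml")
```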

This ALREADY speeds up, without hammering my hard drive, all of the following funny use cases. That said, I already have fun using postgres and shell doing sociograms by first ranking people in to/from relationships above a threshold and then drawing the sociogram.

Here is a simple request to see who the top recipients are (it unnests the « to » column)


cat <<EOM  | psql ml jul | less
SELECT DISTINCT ON (times_seen, element) element
    ,COUNT(element) OVER (
        PARTITION BY element
        ) AS times_seen
FROM mail
    ,unnest("to") WITH ordinality AS a(element)
ORDER BY times_seen DESC, element DESC;
EOM

Making histograms about how many « to » you have per mail


psql ml -c ' 
WITH len as (SELECT cardinality("to") as cto from mail) 
SELECT 
	width_bucket(cto, 1, 43, 10) as bucket, 
    count(*) as freq 
FROM len 
GROUP BY bucket ORDER BY bucket;
'
# too lazy to put the min/max functions

 bucket | freq  
--------+-------
      0 | 41083
      1 |  9965
      2 |  1096
      3 |   227
      4 |   116
      5 |    88
      6 |    12
      7 |    14
      8 |     4
      9 |     2
     10 |     2
     11 |    26
(12 rows)
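
Getting 12 rows for 10 requested buckets is expected : width_bucket puts the values below the lower bound in bucket 0 and the values at or above the upper bound in bucket count + 1. A minimal python sketch of that rule for integers (my own re-implementation for illustration, not postgres code) :

```python
def width_bucket(x, low, high, count):
    """Integer-only sketch of the bucket numbering used by width_bucket."""
    if x < low:
        return 0                  # underflow bucket
    if x >= high:
        return count + 1          # overflow bucket
    return (x - low) * count // (high - low) + 1

# same bounds as the query above : width_bucket(cto, 1, 43, 10)
buckets = [width_bucket(n, 1, 43, 10) for n in (0, 1, 5, 42, 43)]
```

So bucket 0 in the table above is the mails whose « to » array is empty, and bucket 11 the ones with 43 recipients or more.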

Making a dot diagram of who speaks to whom in a one to one relationship (strong link) when it is more than 20 times :

( echo "digraph G {" ; 
	PSQL_PAGER="cat" psql ml jul -ta -c "SELECT \"from\"[1] || '->' ||  \"to\"[1]  FROM mail WHERE cardinality(\"to\") = 1 and cardinality(\"from\") = 1; " \
    | sort | uniq -c \
    | perl -ane 's/^\s+(\d+)  (.*)\-\>(.*)/"\2" -> "\3" [label="\1"];/g; print $_ if $1 > 20'  ; echo "}" ) \
    | dot -Txdot | xdot -
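
For those who don't read perl one-liners, the sort | uniq -c | perl part of the pipe does roughly the following. A minimal python sketch, assuming the (from, to) pairs have already been fetched ; strong_links_dot and the toy data are made up for the example.

```python
from collections import Counter

def strong_links_dot(pairs, threshold=20):
    """Build a dot digraph keeping only the pairs seen more than `threshold` times."""
    counts = Counter(pairs)                       # what sort | uniq -c does
    lines = ["digraph G {"]
    for (fr, to), n in sorted(counts.items()):
        if n > threshold:                         # what « if $1 > 20 » does
            lines.append(f'"{fr}" -> "{to}" [label="{n}"];')
    lines.append("}")
    return "\n".join(lines)

# hypothetical toy data : 21 mails one way, 3 mails on another link
pairs = [("a@example.org", "b@example.org")] * 21 + [("b@example.org", "c@example.org")] * 3
dot = strong_links_dot(pairs)
```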

And that's already neat. By treating the envelope as a data space orthogonal to the datastore, you handle a new shit-ton of information that is normally hidden in the mail, which is already (too) full of information. I stop writing this post to go back to having fun with my database while I leave the mail in the cold of the data-store :D

Addendum : I was furious that the SQL was so complex for querying reciprocal relationships in arrays, so I came up (since my dataset is small and I have A LOT OF MEMORY (4Gb)) with a neater, more exact python/SQL/BASH/xdot solution that will make any SQL/bash/python purist cringe (lol) but that is a pipe of simple operations anyone can understand.

#!/usr/bin/env bash
MIN_MAIL=40
TF=$( mktemp )
PSQL_PAGER="cat" psql ml jul -ta -c "
SELECT \"from\"[1] || '->' ||  \"to\"[1]
FROM mail 
WHERE cardinality(\"to\") = 1 and cardinality(\"from\") = 1; " > $TF
python -c "from archery import mdict" || python -mpip install archery

python <<EOF | xdot -
import re
from archery import mdict
patter=re.compile(r"""^ (\S+)\->(\S+)$""")


direct=mdict()
final = mdict()
    
with open("$TF") as f:
    for l in f:
        try:
            (fr, to) = patter.search(l).group(1,2)
            direct += mdict({ (fr,to) : 1 })           
        except:
            pass

tk= list(direct.keys())

for k in tk:
# dont modify a dict you iterate hence copy of keys
    if k in direct and k[::-1] in direct and direct[k]+direct[k[::-1]]>$MIN_MAIL:
        final[k]=direct[k]
        final[k[::-1]]=direct[k[::-1]]
    else:
        try: del(direct[k]) 
        except KeyError: pass

        try: del(direct[k[::-1]]) 
        except KeyError: pass

print("digraph Sociogramme {")
print("\n".join(['"%s"->"%s" [label=%d];' % (k[0],k[1],v) for k,v in final.items()]))
print("}")


EOF

We need a more anti-clerical mindset in Information Technologies

I have recently gone down a rabbit hole of postscript, PDF, SVG, tk/Tcl (hence modern graphical UI), unicode and have worked intensively in web technologies. And I can tell you something : the modern clergy must die !

Talking like an ivy league student (https://mail.python.org/archives/list/python-ideas@python.org/thread/AE2M7KOIQR37K3XSQW7FSV5KO4LMYHWX/) is the freaking norm in open source software, and it ain't no secret that expressing yourself like a « vulgar » street guy gets any of your comments thrown in the trash. I am « deftones » BORED of this, and I think this is becoming a major problem.

Clergymen are the knowledgeable persons who, from their ivory tower, classify, give names to and structure the knowledge of others. Charles Marche would say the « bourgeois » appropriate the « praxein » (how to) by transforming it into « doxein » (know how). But what else are the bourgeois famous for ?

Creating private property everywhere they can : on numbers (ISDN, IPv4, IPv6, DOI, OUI, PCI ids, BGP prefixes), on glyphemes (emojis, fonts), on sound (musics), fucking frequencies (5G, digital radio), time (RT GPS) ... Anything that physics can define as observable : even freaking colours can be patented !

Us from the « soute à charbon » (coal bunker) have in common with the bosses of startups that we began coding with the rise of lowspec computers (C64, Z80, apple II (for the rich), CPC464, HP28/48 (kind of expensive at the time) ...). Why aren't we the bosses ? Because someone needs to do the job, and it won't be the bosses, the lawyers and their balls leakers (academic). Because the Ivy League is the crucible where these balls suckers are molded in this culture that sees THEIR views as universal.

Recently I have been raising with python and firefox my concern that I have NO LINE OF DEFENSE against something that seems irrational to me from the « coal bunker » : I cannot detect or correct when the unicode bidir algorithm is presenting a < as a >

from unicodedata import normalize
from wsgiref.simple_server import make_server

def simple_app(environ, start_response):
    status = '200 OK'
    headers = [('Content-type', 'text/plain; charset=utf-8')]
    start_response(status, headers)
    ret = [("%s < =? %s" % (n, normalize(n,"\u202b>\u202b\n"))).encode("utf-8")
           for n in ("NFD", "NFC", "NFKD", "NFKC")]
    return ret

with make_server('', 8000, simple_app) as httpd:
    print("Serving on port 8000...")
    httpd.serve_forever()

Take a web browser and admire that < and > both appear, and that no stdlib/filter/normalization exists for this.
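
The best I can hand-roll from the « coal bunker » is a filter on the explicit bidir control characters, using the bidirectional class exposed by unicodedata. This is a minimal sketch with made up function names, under the assumption that flagging or stripping these controls is acceptable for your text ; it is not a normalization, and it does nothing about text that is legitimately right to left.

```python
import unicodedata

# bidirectional classes of the explicit embedding/override/isolate controls
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def bidi_controls(text):
    """List (position, name) of every explicit bidir control found in text."""
    return [
        (i, unicodedata.name(ch, "?"))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROLS
    ]

def strip_bidi_controls(text):
    """Return text without its explicit bidir controls."""
    return "".join(ch for ch in text if unicodedata.bidirectional(ch) not in BIDI_CONTROLS)

suspect = "\u202b>\u202b\n"          # the string served by the app above
found = bidi_controls(suspect)
cleaned = strip_bidi_controls(suspect)
```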

What is a contract ? (Beware, I have huguenot origin the holy geist of capitalism runs through my blood)

If I show you a contract on screen, what should I consider the stuff on which agreement has been made ? What is shown or what is written ?

Since we automate contract processing, it does not matter : what is written of course prevails over what is shown. And all of the firefox/python coders said : « the holy unicode norm prevails ».

I don't want to pick a fight with unicode, since only persons who have actually experienced first hand the madness of editing mixed bidir unicode strings will grok what I said.

Editing bidir string is like following the white rabbit under LSD of Alice in wonderland. The clergyman will tell you « YOU FAILED BY NOT FOLLOWING THE MYSTERIOUS ONE BEST WAY©® » (which basically boils down to use the latest iOS on the latest Mac Mx).

You see how my double punctuation marks have all a space in front ?

It is because my typographic reflexes are inherited from the french language and I find it clearer. Typography is ONE of the many « ONE BEST WAY », and PEP8 enforces another one that contradicts mine. Making me consider « properly formatted » python code un-fucking-readable.

I have a master in physics and stumbled upon laplacians, differential and pretty complex equations ; I have learned the hard way that math is easier to read (FOR ME) when I deviate from the norm by adding MORE SPACES THAN REQUIRED.

You might rightly argue that what works for me IS NOT UNIVERSAL.

In greek there is a word for universal : catholic. It is a concept that has been heralded by the chauvinistic, genocidal embodiment of all that is wrong in our civilisation : « philosophers ».
Philosophers in Athens were rich sons of aristocrats and hierarchs (holy persons) who thought that democracy was a threat to their unique Nature of « enlightened » piece of shit they were. They believed in the absolute power of knowledge over everything and thought knowing maths was enough to have an opinion on every topic on earth : diet, pseudo-science, ruling a country, deciding who lives and who dies, how to properly format writings, what books to keep in libraries and which ones to throw away. They threw out all the books from the « sophists », who had the cardinal sin of opposing the « catholic » « one best way » in favor of the « πάντων χρημάτων ἄνθρωπον μέτρον εἶναι » (for all judgement man (he/she in its diversity) is the measure).

Hence, as a sophist I don't claim I am right, I claim I have the right to make a plea « as an equal human on the public place » for my case without being disregarded as « not following the one best way ». Don't mistake me, I strictly don't want my views to be embraced by all as universal, nor to disregard the others. I want to be talked to as an equal on the agora.

All the answers I had were : we aren't gonna do anything because that's exactly WHAT THE NORM (unicode, freaking humongous, mammothian, riddled with more undefined behaviour than a C++ standard) SAID.

You know what, as a coder who HAS to write the web pages involving situations in which you click and it ties you to an immediate, non reversible CONTRACT, I surely do advise you to discuss with me.

This attitude towards discussion : always invoking « the norm » first and refusing discussion with the heterodox is indistinguishable from a clergy.
Fuck the norms, fuck the existing de-facto clergy of educated CS graduates, fuck the MPAA, the RIAA, unicode, and what the industry « wants » : let's talk about the freaking consequences for the « anthropos » (the animal called men and women) that must deal with the chaos of a less than satisfyingly baked cathedral of norms.

And unicode is just the tip of the iceberg ; PS, PDF, Tk and SVG deserve their OWN topic because, to me, internet as a way to print knowledge and help the « vulgum pecus » (the mass of persons not speaking ivy league english) share information is being barred from exactly this by the norms themselves.

PDF and unicode are the opposite of the Gutenberg press : neither a revolution simplifying the essence of the written language (french happily lost a lot of letters and ligatures in the process of the Gutenberg Revolution), nor an ease of printing, but WALLS of useless complexity that make me regret the daisy wheel printers.

Modern web technology is a dyschronia where the diffusion of information is being ruled by people with the Catholic church mindset, opposing the ideals of Gutenberg. Modern web technology and computing is an effective dysfunction of what education should strive for : emancipation of the mass by letting people exchange information as easily as they can and build their own tooling.