Building a CHRONO SOCIOGRAM from real world data

My movies are but the capture of fleeting moments of life that take place in a 1:1 map of human relationships. The map is a useful simplification of the overwhelming complexity of a territory in which I strive to explore human emotions.

Ingmar Bergman
In the previous episodes (the last 2 posts) I explored HOW TO represent links between social animals, to try to infer who knows whom and how these animals organize themselves as a herd: that is a sociogram.

It's a map, and a bad one at that. You see, as in H2G2 (The Hitchhiker's Guide to the Galaxy), it's not the map that makes the story, but the construction work that changes the map. To « prove » my point I will study real life data for which (ethical/legal disclaimer) both the victim of the data breach and THE CREAM OF ALL JOURNALISTS admitted there was nothing. The hacked team said it was a honeypot (which will prove to be quite true) and that nothing of interest was there (which is quite false). Hence, if you want scandals or REVELATIONS, sorry to disappoint you: sociograms don't unfold secret conspiracies.

I must also say I am a sysadmin first. My soul burns when exposed to private email contents, so ... I love sociograms because they sit in the grey area: you look at the envelope without reading the mail body.

In my database structure you saw I DO populate my database with the text_mail, and you guessed I have used real data since day 1. Coding from abstractions is nice, but nothing beats real data to actually stress your abstractions and torture your tools, especially mails, which per RFC2822 are the most annoying archive format ever to parse (and which I love: I began as a sysadmin specialized in mail).
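Looking at the envelope only can be sketched with Python's standard library: the parser below stops at the headers and leaves the body unparsed. The sample message, the addresses and the envelope() helper are made up for illustration, and real mails from the wild are far nastier than this one.

```python
from email.parser import HeaderParser
from email.utils import getaddresses, parsedate_to_datetime

# Made-up message, for illustration only.
RAW = """\
From: Alice <alice@example.org>
To: Bob <bob@example.org>, carol@example.org
Date: Mon, 02 Jan 2017 10:00:00 +0100
Subject: hello

This body is never looked at.
"""

def envelope(raw):
    """Return (senders, recipients, date) using the headers only."""
    msg = HeaderParser().parsestr(raw)  # header-only parsing, body left alone
    senders = [a.lower() for _, a in getaddresses(msg.get_all("From", []))]
    rcpts = [a.lower() for _, a in
             getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))]
    # a missing Date header gives None instead of an exception
    when = parsedate_to_datetime(msg["Date"]) if msg["Date"] else None
    return senders, rcpts, when

print(envelope(RAW))
```

The lists of senders and recipients are what would feed the "from" and "to" columns of a table like mine.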

Without further ado I will show you the result and explain how it was built thanks to postgres FTS (a word count does not count as reading the email body, does it?).

Legal disclaimer

It is a sociogram of Persons Of Interest from the macronleaks. I will define a POI as: either someone who was hacked or (my own definition) any public servant who, by being one, is subject to Article 14 of the constitution:
All citizens have the right to ascertain, by themselves (...) the necessity of the public contribution, (...), and to follow its use.
The payroll of public servants being a public contribution, and public servants being subject to the « devoir de réserve » (duty of reserve, hence NOT INFLUENCING elections), I am entitled to check whether public money was indeed used with my constitutional rights at heart.

Aristotle said elective systems are the opposite of democracy: democracy is the ruling of the people by ANY of the people, while elections introduce a selection bias by choosing who has access to the tribune, hence making the vote a biased choice within a biased subset of the people. I will add Aristotle was an idiot.
Democracy is not a STATE function, it is a PATH function. Democracy is a path and a journey made of polemics over improvements. Elective systems represent a f**d up state of the situation, but that doesn't prevent democracy from existing as long as (« a good Huguenot's belief ») everyone can discuss as an equal. I BELIEVE IN REFORMS, but I don't go to the temple (protestants bore me to death) (cf. Frankfurt school / Habermas).

So let's see whether my constitutional right to choose from the largest possible subset of the population was respected, or whether public servants entitled to enforce the spirit of the Constitution meddled with my right to choose from a larger subset of candidates than during the last elections. Is the delta « positive » in terms of democracy?

TL;DR: the reform I would want after this study would be for the administration to make public ALL the envelopes of the mails exchanged between public servants and the private sector. I would also like to have all public communication in a database I can query, without any right for elected members/public servants to DELETE it.

Hence here is a real life sociogram of public servants//elected members during Macron's campaign.

Real life feedback

Unlike developers, I know when a tool sucks. This MAP SUX.

A good map is easily readable and does not clutter the screen with TOO MUCH INFORMATION. I find this one nice and readable, because I have been trained to read the functions of a chip by looking at the silicon with a microscope, and compared to a SoC, this is peanuts. It's not even as complex as a modulator/demodulator (MODEM). However, I am also human. It's really a mess.

You may criticize people with nice words; I don't coat my words with sugar: it is plainly unusable because of H = k . ln(S) (entropy grows with the number of possible states). There are too many choices that are not relevant.


This is a sociogram built from mail collected over a 2 year span, hence it makes things that were fleeting moments persist over time. It has good sides: you can guess the galaxy of influences by looking at the constellations of contacts near the persons that were hacked: the ministry of finance, the ministry of public health, the sovereign (« régalien ») ministries (justice/interior), the municipality of the capital (a state within the state), and the military industry.

It has the same value as persistence of vision: it makes a blurry map that gives a feeling of what is there, but no details.

Hence, we are gonna introduce another tool that complements the sociogram: a CHRONO-SOCIOGRAM. MOVING PICTURES of a SOCIOGRAM. V1 is ugly, but good enough to give an idea of what comes next: a freaking chrono sociogram where nodes stand still and where I will only highlight the edges temporarily.
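A minimal sketch of the idea, assuming the mails are already reduced to (day, sender, recipient) triples: cut the timeline into fixed windows and count who exchanged mail with whom in each one, one frame per window. The frames() helper and the sample data are hypothetical, and the windows here are half-open, which is a choice of this sketch.

```python
from collections import Counter
from datetime import date, timedelta

# Invented sample: (day, sender, recipient)
MAILS = [
    (date(2017, 1, 1), "a@example.org", "b@example.org"),
    (date(2017, 1, 2), "b@example.org", "a@example.org"),
    (date(2017, 1, 3), "a@example.org", "c@example.org"),
]

def frames(mails, start, end, step_days=2):
    """Yield (window_start, Counter of undirected edges), one item per window."""
    step = timedelta(days=step_days)
    while start < end:
        stop = start + step
        edges = Counter()
        for day, sender, rcpt in mails:
            if start <= day < stop and sender != rcpt:
                # sort the pair so that a->b and b->a land on the same edge
                edges[tuple(sorted((sender, rcpt)))] += 1
        yield start, edges
        start = stop

for window_start, edges in frames(MAILS, date(2017, 1, 1), date(2017, 1, 5)):
    print(window_start, dict(edges))
```

Each frame then becomes one picture, and the pictures played in order make the movie.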

I will not sugar coat my words: it is still shitty as hell, BUT at any given time a normal brain can often see what matters: who talked to whom, and when? Which, for shit diggers having access to the content of the emails, would be the tool to ease the inquiry job by isolating the time window in which to scrutinize contents. BUT I WILL NOT! I am a PROUD SYSADMIN, muttafuckas!

But how did you find the POI in the first version btw? Cheater?

You can predict everything except the future, hence the reason why in applied physics we make predictions on past data so we can foresee the results.

Niels Bohr
I don't cheat, I use science. I put in all the members of the government and its spokespersons as POI, knowing it was already public that they worked on the campaign. But I soon discovered that knowing the future was not necessary.

Let me introduce : POSTGRESQL FULL TEXT SEARCH's FEATURES : not trying to add words in a corpus based on inference.

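Once you google for regex postgres array you find a neat operator, ~!@#, that is very useful (it is not a built-in: it is a custom operator, ~ with its arguments swapped, so that the regex can sit on the left of ANY), and you can fire the word count and select emails by frequency, starting from one address.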

Knowing that the president's real email is ALWAYS of the form e2m@, it gives:
ml=# select * from ts_stat($$select to_tsvector('french', text_plain) from mail where  '^e2m.*' ~!@# any("from") order by date ASC$$) where word ~* '.*@.*' order by nentry desc;
                  word                   │ ndoc │ nentry 
─────────────────────────────────────────┼──────┼────────
 e2m@en-marche.fr                        │   18 │    115
 barbara.frugier@gmail.com               │   11 │    104
 clement.beaune@gmail.com                │   11 │    102
 ismael.emelien@en-marche.fr             │   19 │     36
 sylvain.fort@en-marche.fr               │    8 │     16
 benjamin.griveaux@en-marche.fr          │    6 │     12
 julien.denormandie@en-marche.fr         │    6 │     12
 brigitte.macron@en-marche.fr            │    6 │     12
 sibeth.ndiaye@en-marche.fr              │    6 │     12
...
So when the president is in the from field, these are the emails most likely to have been quoted.

Word count does not count as reading the emails' body content, for real, especially with a filter on "@".
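For those without a postgres at hand, what the query reports can be mimicked in Python: ndoc is the number of mails in which an address shows up, nentry the total number of times it shows up. This is only a sketch under assumptions: the regex for addresses is a rough one of my own (not what the PostgreSQL parser does), matches_any() is a Python spelling of the regex against ANY("from") trick, and the mails are invented.

```python
import re
from collections import Counter

# Rough "looks like an email address" pattern (an assumption of this sketch).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def matches_any(pattern, addresses):
    """True if the regex matches at least one address of the list."""
    return any(re.search(pattern, a) for a in addresses)

def email_stats(mails, from_pattern):
    """mails: iterable of (from_list, body). Returns {address: (ndoc, nentry)}."""
    ndoc, nentry = Counter(), Counter()
    for senders, body in mails:
        if not matches_any(from_pattern, senders):
            continue
        found = [m.lower() for m in EMAIL.findall(body)]
        nentry.update(found)     # every occurrence counts
        ndoc.update(set(found))  # one count per mail
    return {addr: (ndoc[addr], nentry[addr]) for addr in nentry}

def top_quoted(mails, from_pattern, n=10):
    """Addresses sorted by decreasing nentry, like ORDER BY nentry DESC."""
    stats = email_stats(mails, from_pattern)
    return sorted(stats.items(), key=lambda kv: kv[1][1], reverse=True)[:n]

# Invented mails, for illustration only.
MAILS = [
    (["e2m@example.org"],
     "alice@example.org wrote:\n> bob@example.org says hi\n> cc alice@example.org"),
    (["e2m@example.org", "x@example.org"],
     "Ping bob@example.org and alice@example.org."),
    (["someone@example.org"], "carol@example.org is not counted"),
]

for addr, (docs, entries) in top_quoted(MAILS, r"^e2m"):
    print(addr, docs, entries)
```

Only the addresses are ever extracted from the bodies, which is the whole point of the filter on "@".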

And, step by step, you build the sociogram by taking the top most quoted emails of the first ring, then of the second ring, and then you stop because it is already a lot...
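The ring by ring walk can be sketched as below. quoted_by is a hypothetical callback standing in for the word count query (one call per address), the toy data is invented, and depth and top are arbitrary knobs, not values taken from this study.

```python
def build_rings(seed, quoted_by, depth=2, top=3):
    """Return [[seed], ring 1, ring 2, ...].

    quoted_by(address) must return a list of (address, count) pairs;
    only the `top` most quoted addresses of each member are followed.
    """
    rings = [[seed]]
    seen = {seed}
    for _ in range(depth):
        ring = []
        for addr in rings[-1]:
            ranked = sorted(quoted_by(addr), key=lambda p: p[1], reverse=True)
            for other, _count in ranked[:top]:
                if other not in seen:  # an address belongs to its first ring only
                    seen.add(other)
                    ring.append(other)
        rings.append(ring)
    return rings

# Invented data: who is quoted when a given address is the sender.
TOY = {
    "a@x": [("b@x", 5), ("c@x", 3), ("d@x", 1)],
    "b@x": [("a@x", 4), ("e@x", 2)],
    "c@x": [("b@x", 9), ("f@x", 1)],
}

rings = build_rings("a@x", lambda addr: TOY.get(addr, []), depth=2, top=2)
print(rings)
```

Stopping at two rings is a manual decision here, exactly as in the text: the list grows fast.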

Nothing magical, just a stupid out of the box function of the database used in a perverted way.

Since these leaks are old news, you can then correlate with what you know.

  • barbara was indeed the PR
  • clement became a spokesperson of the President or a minister
  • ...
  • sylvain is the ghost writer who wrote « revolution » for the president
Being as sneaky as google/Meta does not require brains. Just building tools.

At this point the code (embedded data not counted) for making « a movie » is still 100 sloc and I code AT BEST 10 sloc a day. I'm a messy sloth.

Here is the code to build the movie:
pushd out
# render every graphviz frame to PNG
for i in *.dot; do dot -Tpng "$i" > "$(basename "$i" .dot).png"; done
# force every frame to exactly 2000x1400 (the ! ignores the aspect ratio)
for i in rec.????.png; do convert "$i" -resize '2000x1400!' "re.$i"; done
rm -f output2.mp4
ffmpeg -framerate 1 -i re.rec.%04d.png -c:v libx264 -crf 30 -pix_fmt yuv420p output2.mp4 && firefox output2.mp4
popd
And here is the code for building the sociograms :
#!/usr/bin/env python3

import os
import psycopg2
from datetime import date, datetime, timedelta
from archery import mdict

def int_env_default(var, default):
    return int(os.getenv(var) or default)

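MIN_MAIL = int_env_default("MIN_MAIL", 6)
MAX_MAIL = int_env_default("MAX_MAIL", 100)
WL_MIN = int_env_default("WL_MIN", 3)
CUT_SIZE = int_env_default("CUT_SIZE", 20)  # edges at or above this count are drawn thick
DATE = os.getenv("DATE") or "2016-01-01"
END_DATE = "2017-05-01"
BY_DAYS = int_env_default("BY_DAYS", 4)  # each frame covers BY_DAYS/2 days
NB = 1  # an edge is kept only if both ends appear in more than NB edges

end_date = date.fromisoformat(END_DATE)
date = date.fromisoformat(DATE)  # beware: from here on `date` is the cursor, not the class
td = timedelta(days=BY_DAYS/2)   # width of the window that is queried
td2 = timedelta(days=BY_DAYS/2)  # step between two frames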



                
def is_ilot(node: str, edge_dict: dict) -> bool:
    """ilot (islet) == node with at most 1 link back and forth, as (from,) or (,to).

    Not called in the loop below.
    """
    count = 0
    for edge in edge_dict.keys():
        if node == edge[1] or node == edge[0]:
            count += 1
        if count > 2:
            return False
    return True

patt_to_col =  dict({
    "e2m":"red",
    "emmanuel.macron":"red", 
    "emmanuelmacron":"red",
    "alexis.kohler" : "midnightBlue",
    "gabriel.attal" : "orange",
    "sejourne.stephane" : "grey15",
    "stephane.sejourne" : "grey15",
    "olivia.gregoire" : "darkOrange",
    "veranolivier":"green",
    "julien.denormandie" : "lightBlue",
    "sibeth.ndiaye" : "orange",
    "barbara.frugier" : "green",
    "cedric.o" : "purple",
    "gouv.fr" : "maroon",
    "snecma.fr" : "purple",
    "safran-group.fr" : "purple",
    "benjamin.griveaux":"blue",
    "laurent.bigorgne" : "darkBlue",
    "jean.pisani-ferry": "yellow",
    "luc.pisani-ferry": "yellow",
    "ismael.emelie" : "orange",
    #"jesusetgabriel.com" : "crimson",
    "gregoire.potton" : "lightGreen",
    "eric.dumas":"salmon",
    "alexandre.benalla" : "darkGreen",
    "pierre.person" : "darkBlue",
    "pierrperson" : "darkBlue",
    "quentin.lafay":"grey10",
    "fm.alaintourret" : "purple",
    "@paris.fr" : "orange",
    "langanne" :"pink",
 })
    

def in_wl(mail: str):
    """Return the first whitelist pattern that mail starts or ends with, else None."""
    for l in patt_to_col:
        if mail.startswith(l) or mail.endswith(l):
            return l
    return None

def wl(pair: tuple):
    """Colour of an edge: the colour of its first end if both ends are whitelisted, else None."""
    if in_wl(pair[0]) and in_wl(pair[1]):
        return patt_to_col[in_wl(pair[0])]
    return None
#assert wl(("jesusetgabriel.com", "jesusetgabriel.com")) == "crimson"

is_vip = lambda t:all(map(in_wl, t))
fn=0
while date < end_date:
    fn+=1
    direct=mdict()
    final = mdict()
    conn = psycopg2.connect("dbname=ml host=192.168.1.32 port=5432 user=jul  sslmode='require' ")

    with conn.cursor() as sql:
        # bound parameters instead of string interpolation in the query
        sql.execute('SELECT "to", "from" FROM mail WHERE date BETWEEN %s AND %s;', (date, date + td))
        while t := sql.fetchone():
            # t[0] is the "to" array, t[1] the "from" array
            for rcpt in t[0]:
                rcpt = rcpt.strip()
                for sender in t[1]:
                    sender = sender.strip()
                    if rcpt != sender and rcpt and sender:
                        direct += mdict({(rcpt, sender): 1})

    date += td2
    tk= list(direct.keys())
    def has_more_than_n_neighbour(email: str, n: int, final: dict):
        """True as soon as email appears in more than n edges of final."""
        count = 0
        for k in final.keys():
            if email in k:
                count += 1
                if count > n:
                    return True
        return False
            

    for k in tk:
        # don't modify a dict while iterating over it, hence the copy of its keys
        if is_vip(k) and k not in final and k[::-1] not in final:
            # fold both directions of the exchange into a single undirected edge
            final[k] = direct[k]
            final[k] += direct.get(k[::-1], 0)
        else:
            try:
                del(direct[k]) 
            except KeyError:
                pass
            try:
                del(direct[k[::-1]]) 
            except KeyError:
                pass


        
    tk= list(final.keys())

    for e in tk:
        # same trick: iterate over a copy of the keys
        # drop the edge when one of its ends has no more than NB neighbours
        if not has_more_than_n_neighbour(e[0],NB,final) or not has_more_than_n_neighbour(e[1],NB,final):
            try:
                del(final[e]) 
            except KeyError:
                pass
            try:
                del(final[e[::-1]]) 
            except KeyError:
                pass



    color = "".join([  f"""{i[1]} pour {i[0]}{[", ",chr(0x0a),][(n%4)==3]}  """  for n,i in enumerate(patt_to_col.items()) ])
        

    conn.close()
    with open("out/rec.%04d.dot" % fn, "w") as f:
        # date was already advanced by td2 above: step back so that the labels
        # show the window that was actually queried
        start = date - td2
        # Long legend, in French ("Sociogram from ... to ... extracted from the
        # macron leaks ..."); it is built but not written to the DOT file below.
        title = f"""label="Sociogramme de {start} à {start + td} extrait des macron leaks orienté gouv.fr, personne d'intérêts (vert), victime du hacking (rouge), et président (bleu)\n
    entre [ {MIN_MAIL}, {MAX_MAIL} ] échangés \n
    plus gros liens au dessus de {CUT_SIZE} mails échangés entre interlocuteurs \n
    couleur par priorités selon les origines \n
    {color}
"""

        # one undirected graph per frame; the label reads "Sociogram from <start> to <end>"
        print("""
    graph Sociogramme { """ + f"""
    fontname="Comic Sans MS"
    outputmode="nodesfirst"
    size=20
    label="Sociogramme de {start} à {start + td}"
    ratio=1.7
    labelloc="t";
    """ + 
    "\n".join(['  "%s" -- "%s" [label=%d color=%s penwidth=%d ];' % (k[0],k[1],v, wl(k) or "black", 1 if v < CUT_SIZE and wl(k) != "red" else 4 ) for k,v in final.items()])
     + "}", file=f)
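The filtering done in the loop above can be tried without a database and without archery. This is a simplified sketch, not the exact code: it folds both directions of an exchange into one undirected edge, then drops the edges hanging off a node that appears in a single edge. Unlike the loop above, the degrees are computed once up front instead of being re-evaluated while edges are deleted, and the whitelist test is left out. The function names and the sample counts are invented.

```python
from collections import Counter

def fold_undirected(directed):
    """Merge the counts of (a, b) and (b, a) into one undirected edge."""
    merged = Counter()
    for (a, b), n in directed.items():
        if a != b:
            merged[tuple(sorted((a, b)))] += n
    return merged

def prune_leaves(edges, min_degree=2):
    """Keep the edges whose two ends each appear in at least min_degree edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {e: n for e, n in edges.items()
            if degree[e[0]] >= min_degree and degree[e[1]] >= min_degree}

# Invented directed counts: (sender, recipient) -> number of mails
DIRECTED = {
    ("a", "b"): 2, ("b", "a"): 1,
    ("b", "c"): 4, ("c", "a"): 1,
    ("c", "d"): 7,
}

edges = fold_undirected(DIRECTED)
print(prune_leaves(edges))
```

With NB = 1 in the script, "more than NB edges" is the same threshold as min_degree=2 here.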


PS: I'm totally sure I saw exactly this, applied to python commits, during a conference at FSM 2005 or 2004 (20 years ago). A lot of kudos to whoever can find the link and say: boooooh it's not new, impostor!
PPS: the next blog post will solve the graphviz variable problem that f*** up ffmpeg, with an f-string and by being inventive in less than 20 sloc :D
I'm totally gonna dump the full stuff and use perl to make a template in f"" with perl :D trololololol
