Building a sociogram from mail archives in python and postgres

TL;DR : « I know that, like Jon Snow, I know nothing, and it is not a problem »

Foreword : typing in CS is WRONG



People are different. I am, you are, he/she are, it are.

I don't talk about gender, I talk about mileage, life experiences and shit.

Me, I come from being half thug, half educated (D&D bi-classed): the local kingpin of forging fake documents for « école buissonnière » (skipping school) with my early-90s computer, cheating with pctools, and running a ring of cracked software ranging from Amiga to PowerPC.

I learned coding not through an educated CS 101 curriculum but as a dyscalculic and dyslexic physics student, redeemed by his facility in foreign languages, especially the ones with a Germanic touch (french, english, german).

Foreign languages taught me the power of composing consistent pre/postpositions: dé-ménager is the opposite of em-ménager, the same way moving-out compares to moving-in.

Also, programming in physics is not the same as programming in CS. We don't have a budget for programming hours, so our teachers don't care about elegance, high-level abstractions or low memory consumption. They want easy-to-read, easy-to-debug code that works fast. Also, physics has long resolved the typing debate: computer academics are just so high they don't get it.

See : if I have a value in metres per second and I add kilograms, we want the code to crash.

Strongly typed languages (including Ada or VHDL) don't have « symbolic consistency » but DATA TYPE consistency. I will illustrate it in the code to come with the
is_ilot
function.
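A minimal sketch of the difference, using nothing but the standard library. The Qty class here is hypothetical, a toy and not a real units library; the point is that both quantities below are the same DATA TYPE (float), so a type checker is perfectly happy with physical nonsense:

```python
from dataclasses import dataclass

# Both are plain floats: any type checker accepts the sum,
# yet the physics is nonsense. DATA TYPE consistency, not symbolic consistency.
speed_m_per_s: float = 3.0
mass_kg: float = 70.0
nonsense = speed_m_per_s + mass_kg  # type-checks fine, crashes nowhere

# A tiny unit tag (hypothetical, for illustration only) restores the
# « symbolic consistency » physics wants: adding mismatched units raises.
@dataclass
class Qty:
    value: float
    unit: str

    def __add__(self, other: "Qty") -> "Qty":
        if self.unit != other.unit:
            raise TypeError(f"cannot add {self.unit} to {other.unit}")
        return Qty(self.value + other.value, self.unit)

Qty(3.0, "m/s") + Qty(2.0, "m/s")    # fine
# Qty(3.0, "m/s") + Qty(70.0, "kg") # raises TypeError, as physics demands
```

Naming the unit inside the value is exactly the « put units in the name » rule below, just enforced at runtime.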

Mindset and tools : why I did not use overwhelmingly beautiful postgres features but wasted my CPU and memory doing the sociogram in python (and why not in Perl)

My problem is simple : imagine I have a sample database to have fun with, containing real-life email, and I want to make a graph out of it. Like an Xleaks from wikileaks, where I want, like Meta or Google, to infer who the interesting interacting persons are out of thousands of interlocutors. So let's dump my SIMPLE sql schema :

DROP TABLE IF EXISTS public.mail;

CREATE TABLE IF NOT EXISTS public.mail
(
    filename text COLLATE pg_catalog."default" NOT NULL,
    "from" text[] COLLATE pg_catalog."default" NOT NULL,
    "to" text[] COLLATE pg_catalog."default" NOT NULL,
    subject text COLLATE pg_catalog."default",
    attachments json,
    text_plain text COLLATE pg_catalog."default",
    date date,
    message_id text COLLATE pg_catalog."default",
    thread_id text COLLATE pg_catalog."default",
    CONSTRAINT mail_pkey PRIMARY KEY (filename)
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.mail
    OWNER to jul;


A sociogram is a graph made of relationships (edges) between persons (nodes), with their strength expressed as the number of occurrences between persons.

I basically want
for each mail
   for the cartesian product of to and from
       add an edge in the graph
I use the language of building microchips I learned in applied physics of micro-electronics here, not computer stuff I cannot visualize in my f***d up brain.
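The pseudocode above can be sketched with the standard library alone; the sample rows here are made up for illustration, not from the real database:

```python
from collections import Counter
from itertools import product

# Hypothetical rows: each mail carries a list of "from" and "to" addresses,
# like the text[] columns of the schema above.
mails = [
    (["alice@x"], ["bob@x", "carol@x"]),
    (["bob@x"], ["alice@x"]),
]

edges = Counter()
for senders, recipients in mails:
    # cartesian product of from × to: one directed edge per pair
    for fr, to in product(senders, recipients):
        edges[(fr, to)] += 1
```

Counter plays the same role as the mdict in the real script: it accumulates edge weights.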

The problem is « HUMANS ». They chat with a lot of persons, but there are 2 kinds of persons : persons who link other persons, and persons who are just noise, having a 1:1 relationship.

Analysing thousands of nodes in a graph visually is doable, but tiring. So we have to weed out the useless nodes : people who are just leaves in the graph, having only a bijective relationship.
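One way to spot those leaves is to count each node's distinct neighbours; here is a sketch on made-up edges (the addresses and counts are invented for the example):

```python
from collections import defaultdict

# Hypothetical directed edges with their occurrence counts.
edges = {("alice", "bob"): 5, ("bob", "alice"): 4, ("alice", "carol"): 2}

# Collect, for each node, the set of distinct people it exchanges with.
neighbours = defaultdict(set)
for fr, to in edges:
    neighbours[fr].add(to)
    neighbours[to].add(fr)

# A leaf talks to exactly one other person: pure 1:1 noise.
leaves = {n for n, peers in neighbours.items() if len(peers) == 1}
```

Here alice links two people so she survives; bob and carol each talk to one person only and get weeded out.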

Digression on the choice of tools


I love postgres, but I am dyslexic and don't use it often. My only tools are psql to build queries and stack overflow to answer my questions involving recursive queries on arrays to build a cross product.

Stack overflow is good at solving one problem. But composing answers is tough, especially when I use reserved WORDS in postgres as column names. And you wonder why ?

MY OWN STRONG TYPING : name a thing by its real name and PUT UNITS in the name if you can.

Also, physics taught me to avoid recursion at all costs because ... see point #1 : stack-based thinking does the same for fewer bugs and no stack/heap overflow (I come from an era where the stack depth for recursive calls was laughably low on linux, C, Perl).

So I did my own code in python, wasting memory and CPU on something I am well aware has an elegant, faster solution in postgres; but between early optimization and early results, the choice was fast : I went for gruik coding without hesitation. Why did I use postgres at all ? Parsing email costs HOURS. Getting the « to »s and « from »s is time consuming. I totally amortized postgres as a more efficient DATASTORE than the filesystem. No regrets here.
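The amortized part looks roughly like this. This is a sketch only: the table matches the schema above, but the address splitting, the function names and the ON CONFLICT clause are my assumptions, not the article's actual ingestion code:

```python
import email
import email.utils

def parse_addresses(header_value):
    """Turn a From:/To: header into a list suitable for a text[] column."""
    return [addr for _, addr in email.utils.getaddresses([header_value or ""])]

def store_mail(conn, filename):
    # Parsing the mail is the expensive part, done exactly once...
    with open(filename, "rb") as f:
        msg = email.message_from_binary_file(f)
    # ...then postgres serves as a datastore more efficient than the
    # filesystem (conn is a psycopg2 connection, as in the script below).
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO mail (filename, "from", "to", subject)
               VALUES (%s, %s, %s, %s)
               ON CONFLICT DO NOTHING""",
            (filename,
             parse_addresses(msg["From"]),
             parse_addresses(msg["To"]),
             msg["Subject"]),
        )
    conn.commit()
```

Once every mail has been through this once, all the graph experiments that follow only pay for a SELECT, never for re-parsing.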

The code : KISS



Building a sociogram while weeding out the « ilots » (nodes connected to one and only one other node) is fairly easy once you have coded the
is_ilot
function. It took 2 full hours of work for a problem I had never faced before, applying academic knowledge of micro-electronics to a social graph.
import psycopg2
from archery import mdict

MIN_MAIL = 20
MAX_MAIL = 400

conn = psycopg2.connect("dbname=ml")

direct = mdict()
final = mdict()

with conn.cursor() as sql:
    # cartesian product done the dumbest way possible
    sql.execute("""SELECT "from", "to" FROM mail""")
    while t := sql.fetchone():
        for fr in t[0]:
            for to in t[1]:
                direct += mdict({(fr, to): 1})

                
def is_ilot(node: str, edge_dict: dict) -> bool:
    """ilot == has only 1 link back and forth either in (from,) or (,to)"""
    count = 0
    for edge in edge_dict.keys():
        if node == edge[1] or node == edge[0]:
            count += 1
        if count > 2:
            return False
    return True
    
# weeding out pass 1
# don't modify a dict you iterate on, hence the copy of the keys
tk = list(direct.keys())
for k in tk:
    if k in direct and k[::-1] in direct \
            and MAX_MAIL > direct[k] + direct[k[::-1]] > MIN_MAIL:
        final[k] = direct[k]
        final[k[::-1]] = direct[k[::-1]]
    else:
        for key in (k, k[::-1]):
            try:
                del direct[key]
            except KeyError:
                pass


    
# weeding out pass 2
# don't modify a dict you iterate on, hence the copy of the keys
tk = list(final.keys())
for e in tk:
    if is_ilot(e[0], final) or is_ilot(e[1], final):
        for key in (e, e[::-1]):
            try:
                del final[key]
            except KeyError:
                pass



conn.close()
# output for graphviz
print("digraph Sociogramme {")
print("\n".join(['"%s"->"%s" [label=%d];' % (k[0],k[1],v) for k,v in final.items()]))
print("}")
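The graphviz emission at the end is just string formatting; it can be checked on a hypothetical toy final dict (the names and counts are invented):

```python
# Hypothetical weeded-out result: two people left, both directions kept.
final = {("alice", "bob"): 12, ("bob", "alice"): 9}

# Same emission as the script above, built as one string instead of printed.
body = "\n".join('"%s"->"%s" [label=%d];' % (k[0], k[1], v)
                 for k, v in final.items())
dot = "digraph Sociogramme {\n%s\n}" % body
print(dot)
```

Feeding that string to dot (e.g. `dot -Tpng`) renders the sociogram.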


You will notice I « typed » the python function. It took me 20% of the time, because I was still thinking in postgres postfix typing (::array) and not in python infix typing (: list), and because I had ... a bug to solve. At first I coded the function with one-letter names, as I usually do until I have a problem. I backtracked it, changed the names and put in the typing annotations for fun, but what really helped me was the only docstring in the code, and the naming : remembering what my purpose was and what edges and nodes were. As soon as I named the function and its inputs correctly and wrote the docstring, it was magically done, and I laughed at how typing was missing the point.

edge_dict could be a list of keys, a sparse matrix, a mutable mapping, a string. The data structures I could use to represent a node or an edge are so varied that typing does not help.
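Indeed, the function only needs a .keys() yielding (from, to) tuples, so any mapping-ish structure works. A quick illustration (toy edges, is_ilot reproduced from the script above so the snippet stands alone):

```python
from collections import Counter, OrderedDict

def is_ilot(node, edge_dict):
    """Duck-typed: only needs .keys() yielding (from, to) tuples."""
    count = 0
    for edge in edge_dict.keys():
        if node in (edge[0], edge[1]):
            count += 1
        if count > 2:
            return False
    return True

edges = {("a", "b"): 3, ("b", "a"): 2, ("a", "c"): 1}
# dict, OrderedDict and Counter all answer the same duck-typed call.
for mapping in (dict(edges), OrderedDict(edges), Counter(edges)):
    assert is_ilot("b", mapping)      # b only talks to a: ilot
    assert not is_ilot("a", mapping)  # a talks to b and c: connector
```

No annotation could capture « anything with .keys() of pairs » as cheaply as the docstring does.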

Here is a sample of the output before and after the weeding out with
is_ilot

Before

After

Final word

I want this post to be an advisory to noobs like me : do not care about the academics, and don't let the accidents of programming put you off from going the way you see fit, in a Keep It Simple manner that « works for you »©®
