Datastore and database : why it is a good idea to not confuse both.


I strongly advise to fast watch this video, since datastore and database, as well as hierarchical database are covered there.

Now let's come back in 2024 and wonder, well, is this still relevant.

I recently had fun taking a simple use case that made one postgresql contributor famous : Julian Assange Xleaks.

Should we put mails in database or keep the database as an index ?

As a foreward Python mail parsing is infamously not on par with Perl from which it has ported its libs. We will imagine we use a thin layer on top of it know as mail parser and that our mail comes from google, outlook, and thunderbird archives.

After all, mails are so important in our modern life that the use case of analysing mail with a database, or a nosql or .... is important.

My favourite « kind » of non database data-base are like LDAP hierarchical tree like datatype. They fit verey well with my aggregation lib for dict in python (archery a good way to shoot yourself an arrow in the knee). And of course, POSTGRES DOES IT, POSTGRES DOES EVERYTHING BETTER THAN ANYONE. (I intend it as a troll, but, in the bottom of my heart, postgresql and sqlite are my 2 favourite databases).

However mails ARE relationals thanks to from/to, and thread-id. If thread reconstruction may favour hirearchical databases, your unit of search will be mail and wanting to mail a many-to-many relationship like « find all mail where X and Y are linked.

One approach on parsing 1000th of email is relying on an SSD and brute forcing your way through mail every time. But, lol, I'm poor, I have an hard drive wich I fear for its life with all my precious photos of my family, so I prefer not too. Plus it's long. And that's when you remember surfing the paperspace and the liibrary analogy.

Database is supposed to be the cabinet in the entrance where you can find all the name of the books that are physically located in the library and where they are. Ex : an author may be a co-author (like Good omens from pratchett in Fantasy, in SF (yes Pratchett wrote SF), and in fantasy.

Fetching a book is long, the cabinet (database) helps you find the books fast in the datastore (the shelf of the library). A database is pretty much a tool for indexing slow retrieval data that you want to find fast.

In 2024, is it still relevant ?

I beg to argue yes in the famous realm of « it works for me this way »©® (famous saying of linus torvalds).

Hear me out : mail parsing is shit.

Imagine, you have an email : per RFC, you can have MORE THAN ONE BODY (a mail is an n-relationship between enveloppe and body) and BODIES can be either TXT OR HTML and more often than not (thks to the creativity of the artists) may bear different messages since SOMETIMES what you see matters, hence, the text body is litteral garbage if you want to make sense of the mail.

Mail per RFC 2822 can have attachments that can embedd attachments that refers to each others (often in an mixed peek and poke of both an arrayish and tree-ish data structure.

Ho ! Is it perfect for postgresql XML/JSON ? NO ! Recursive descending SQL request may exists, but joining on non deterministic disposition is begging for losing too much brain cell : it is not because you can do it that YOU SHOULD DO IT. Sometimes the most reliable to access attachments is not from
parser(mail).attachments.to_json
(that is half backed) but with
parser(mail).write_attachments
that renders embedded attachments inside the documents in a more reasonable readable way. And sometimes more often than a python guess, what you want is a heavy client for READING MAIL that embedds decade of wisdom on how to deal with broken norms.

Hence, practically, when you deal with mails what you want to store in the database what you want to store is :
  • path to the DATA-STORE (hence HARD DRIVE)
  • from
  • to (BCC, CC for me are tos too, since I want to draw sociograms)
  • message-id
  • subject
  • thread-id
  • DATE which is a nightmare since it's a clear hell to PARSE
  • and because experimenting with JSON is fun the attachments metadata object/dict (application type, content type, filename) as an JSON
  • and because FTS (full text search that can survive typo) is fun the FIRST text body (but you will miss when astute individuals will use more than one)
Legal archiver and geeks MIGHT want to add : the chain of mail servers, and all data concerning validation to try to detect signs of spoofing. Which gives us almost this :
CREATE TABLE IF NOT EXISTS public.mail
(
    filename text COLLATE pg_catalog."default" NOT NULL,
    "from" text[] COLLATE pg_catalog."default" NOT NULL,
    "to" text[] COLLATE pg_catalog."default" NOT NULL,
    subject text COLLATE pg_catalog."default",
    attachments json,
    text_plain text COLLATE pg_catalog."default",
    date date,
    message_id text COLLATE pg_catalog."default",
    thread_id text COLLATE pg_catalog."default",
    CONSTRAINT mail_pkey PRIMARY KEY (filename)
)

TABLESPACE pg_default;
This is ALREADY speeding up without hammering my hard drive all of the following funny use cases : This said, I have already fun using postgres and shell doing sociogram by first ranking people in to/from relationship above a threshold and then drawing the sociogram.

Here is a simple request to see who are the top senders

cat <<EOM  | psql ml jul | less
SELECT DISTINCT ON (times_seen, element) element
    ,COUNT(element) OVER (
        PARTITION BY element
        ) AS times_seen
FROM mail
    ,unnest("to") WITH ordinality AS a(element)
ORDER BY times_seen DESC, element DESC;
EOM
Making histograms about how much « to » you have per mail

psql ml -c ' 
WITH len as (SELECT cardinality("to") as cto from mail) 
SELECT 
	width_bucket(cto, 1, 43, 10) as bucket, 
    count(*) as freq 
FROM len 
GROUP BY bucket ORDER BY bucket;
'
# too lazy to put the min/max functions

 bucket | freq  
--------+-------
      0 | 41083
      1 |  9965
      2 |  1096
      3 |   227
      4 |   116
      5 |    88
      6 |    12
      7 |    14
      8 |     4
      9 |     2
     10 |     2
     11 |    26
(12 rows)


Making dot diagram of who speaks to whom in a one to one relationship (strong link) when it it more than 20 times :
( echo "digraph G {" ; 
	PSQL_PAGER="cat" psql ml jul -ta -c "SELECT \"from\"[1] || '->' ||  \"to\"[1]  FROM mail WHERE cardinality(\"to\") = 1 and cardinality(\"from\") = 1; " \
    | sort | uniq -c \
    | perl -ane 's/^\s+(\d+)  (.*)\-\>(.*)/"\2" -> "\3" [label="\1"];/g; print $_ if $1 > 20'  ; echo "}" ) \
    | dot -Txdot | xdot -
And that's already neat. By treating the envelope as an orthognal data space than datastore, you treat a new shit-ton of informations that are normally hidden in the mail that is already (too) full of information. I stop writing this post to come back having fun with my database while I leave the mail in the cold of the data-store :D

Addendum : And I was furious that the SQL was so complex for querying reciprocal relationships in arrays that I came (since my dataset is small and I have A LOT OF MEMORY (4Gb) with a neat more exact python/SQL/BASH/xdot solution that will cringe any SQL/bash/python purist (lol) but that is a pipe of simple operations anyone can understand.
#!/usr/bin/env bash
MIN_MAIL=40
TF=$( mktemp )
PSQL_PAGER="cat" psql ml jul -ta -c "
SELECT \"to\"[1] || '->' ||  \"from\"[1] 
FROM mail 
WHERE cardinality(\"to\") = 1 and cardinality(\"from\") = 1; " > $TF
python -c "from archery import mdict" || python -mpip install archery

python <<EOF | xdot -
import re
from archery import mdict
patter=re.compile(r"""^ (\S+)\->(\S+)$""")


direct=mdict()
final = mdict()
    
with open("$TF") as f:
    for l in f:
        try:
            (fr, to) = patter.search(l).group(1,2)
            direct += mdict({ (fr,to) : 1 })           
        except:
            pass

tk= list(direct.keys())

for k in tk:
# dont modify a dict you iterate hence copy of keys
    if k in direct and k[::-1] in direct and direct[k]+direct[k[::-1]]>$MIN_MAIL:
        final[k]=direct[k]
        final[k[::-1]]=direct[k[::-1]]
    else:
        try: del(direct[k]) 
        except KeyError: pass

        try: del(direct[k[::-1]]) 
        except KeyError: pass

print("digraph Sociogramme {")
print("\n".join(['"%s"->"%s" [label=%d];' % (k[0],k[1],v) for k,v in final.items()]))
print("}")


EOF



No comments: