Foreword: what is yahi?
Yahi is a python module that can be installed with pip to make an all-in-one static HTML page aggregating the data from a web server. Actually, as shown with the parsing of auth.log in the documentation, it is pretty much a versatile log analyser based on regexps enabling various aggregations: by geo_ip, by histograms, by chronological series.
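Installation should be a matter of something like this (assuming the PyPI package name is simply yahi):
pip install yahi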
The demo is here. It is pretty much close to awstats (which is discontinued) or goaccess.
As the author of yahi I may not be very objective, but I claim it has a quality other tools don't have: it is easy to abuse for formats other than web logs, and here is an example out of the norm: parsing CSV.
Plotting histograms or time series from CSV
CSV that can be parsed with a regexp
There are simple cases where the CSV has no embedded strings and is literally comma-separated integers/floats.
In this case, the CSV can be parsed with a regexp, and it's all the more convenient when the CSV has no header line.
Here is an example using the CSV generated by trollometre.
A line is made of a timestamp followed by various (int) counters.
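A hypothetical line (made-up values) could thus look like this, where the first field is the epoch timestamp and the rest are the counters captured as nb_fr and nb_total in the pattern below:
1438387200,3,17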
Tip
For the sake of ease of use I hacked the date_pattern format to accept "%s" as an epoch timestamp (while normally only valid strptime formatters are accepted).
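As a side note, here is a minimal sketch of what that "%s" special case amounts to; datetime.fromtimestamp is my assumption of the underlying call, not necessarily yahi's actual code:

from datetime import datetime

def parse_date(raw, date_pattern):
    # Special-case "%s": epoch seconds are not handled portably by strptime
    if date_pattern == "%s":
        return datetime.fromtimestamp(int(raw))
    return datetime.strptime(raw, date_pattern)

print(parse_date("1438387200", "%s"))  # a datetime around 2015-08-01, depending on the local timezone

And now the actual yahi script: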
from archery import mdict
from yahi import notch, shoot
from json import dump
import re

context = notch(
    off="user_agent,geo_ip",
    log_format="custom",
    output_format="json",
    date_pattern="%s",
    log_pattern="""^(?P<datetime>[^,]+),
(?P<nb_fr>[^,]+),
(?P<nb_total>[^,]+),?.*
$""")
date_formater = lambda dt: "%s-%s-%s" % (dt.year, dt.month, dt.day)

res = shoot(
    context,
    lambda data: mdict({
        "date_fr": mdict({date_formater(data["_datetime"]): int(data["nb_fr"])}),
        "hour_fr": mdict({"%02d" % data["_datetime"].hour: int(data["nb_fr"])}),
        "date_all": mdict({date_formater(data["_datetime"]): int(data["nb_total"])}),
        "hour_all": mdict({"%02d" % data["_datetime"].hour: int(data["nb_total"])}),
        "total": 1,
    })
)

dump(res, open("data.js", "w"), indent=4)
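The aggregation itself relies on archery's mdict: adding two mdicts merges them key by key and sums the values recursively, which is what turns the per-line records above into histograms. A minimal sketch of that behaviour (made-up numbers):

from archery import mdict

# Two per-line records, shaped like what the lambda above returns
line1 = mdict({"hour_fr": mdict({"08": 3}), "total": 1})
line2 = mdict({"hour_fr": mdict({"08": 2, "09": 5}), "total": 1})

merged = line1 + line2
print(merged)  # roughly {"hour_fr": {"08": 5, "09": 5}, "total": 2}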
Then, all that remains to do is
python test.py < ~/trollometre.csv && yahi_all_in_one_maker && firefox aio.html
You click on "time series" and can see either the chronological time series

Or the profile by hour

Raw approach with csv.DictReader
Let's take the use case where my unemployment insurance sent me the data of all the 10000 jobless persons in my vicinity, each line consisting of:
opaque id, civility, firstname, lastname, email, email of the counselor following the jobless person
For this CSV, the first line is a header, and there are strings that may contain ",", hence the regexp approach is strongly ill-advised.
What we want here is 2 histograms:
- the frequency of the first names, which does not violate the RGPD (GDPR) and which I can therefore share,
- how many jobless persons each counselor is following.
Here is the code
from csv import DictReader
from json import dump
from archery import mdict

res = mdict()
with open("/home/jul/Téléchargements/GEMESCAPEG.csv") as f:
    for l in DictReader(f):
        res += mdict(
            by_ref=mdict({l["Referent"]: 1}),
            by_prenom=mdict({l["Prenom"]: 1}),
        )
dump(res, open("data.js", "w"), indent=4)
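For the record, the resulting data.js is just the JSON dump of the two histograms; with entirely made-up names and counts, it would look roughly like:

{
    "by_ref": {"DUPONT": 248, "MARTIN": 252},
    "by_prenom": {"Kevin": 120, "Jean": 40}
}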
Then, all that remains to do is
yahi_all_in_one_maker && firefox aio.html
And here we can see that each counselor is following ~250 jobless persons on average.

And the frequency of the first names

Which, correlated with the demographics of first names included here below, tends to prove that the older you are, the less likely you are to be jobless.
I am not saying ageism, the data are doing it for me.


