Magic in Python and Verschlimmbesserung

The other day, I was scraping the national treasure of my country: public domain periodicals. In this case, L'Assiette au Beurre.


Being as broke as Job (trying to reach religious enlightenment), my computers are pretty slow, and using a web site that first loads 19 megabytes of JSON to start a poorly made online reader is a no-go.

Long story short, this hellish 9-level JSON with keywords in both French and English does not set a good mood when you want a result fast. And, well, this site includes a tracking probe from the (in)famous Xiti that cannot be opted out of (not GDPR compliant). Do you mind if I read anonymously?

So much « savoir-faire » from French elite coders and sysadmins. Lol.

And sometimes the web server lies, for instance by putting JSON (application type = JSON) inside an HTML document, doctype and all.

So, since scraping this content is for me a kleenex script (write only, use once, throw away), and the web site sometimes says « here is your JPEG » but gives you a valid HTML document saying the server could not serve the JPEG, I had to do magic number detection \o/
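To make the « HTML posing as a JPEG » case concrete, here is a minimal sketch of how one could spot the error page from the body alone. The function name looks_like_html and the 256-byte window are my own assumptions for illustration, not something the web site or any library prescribes.

```python
def looks_like_html(body):
    """Guess whether a response body is an HTML page rather than binary data."""
    # Skip an optional UTF-8 BOM and leading whitespace, then compare in lowercase.
    head = body[:256].lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

It is deliberately dumb: an HTML page starting with a comment would slip through, which is fine for a throwaway script.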

So this is not a rant; it is actually the depressing state of the art of today's web sites.


Is so much context important?

Yes, my lords and ladies. In matters of detection, context ALWAYS matters. It helps you build your cost matrix.

Detection has costs: the cost of detecting properly, and the cost of detecting improperly.
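A cost matrix can be as small as a dict. The sketch below uses made-up costs in arbitrary units for my scraping case; every number and the name expected_cost are assumptions for illustration only.

```python
# Hypothetical costs (arbitrary units). Key: (what the payload really is, what the detector says).
COSTS = {
    ("jpeg", "jpeg"): 0,  # correct: keep the picture
    ("jpeg", "html"): 1,  # false alarm: one needless retry
    ("html", "jpeg"): 5,  # miss: an error page saved as a picture
    ("html", "html"): 0,  # correct: retry later
}

def expected_cost(p_html, miss_rate, false_alarm_rate, costs=COSTS):
    """Average cost per download for a detector with the given error rates."""
    p_jpeg = 1.0 - p_html
    return (
        p_jpeg * (1 - false_alarm_rate) * costs[("jpeg", "jpeg")]
        + p_jpeg * false_alarm_rate * costs[("jpeg", "html")]
        + p_html * miss_rate * costs[("html", "jpeg")]
        + p_html * (1 - miss_rate) * costs[("html", "html")]
    )
```

With costs this low on both sides, a crude detector is hard to beat; change the context (say, files coming from strangers) and the numbers, hence the right tool, change too.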

Let's go down the rabbit hole of non-determinism in computers and the impossibility of actually detecting a file type reliably.

At work, you write enterprise-grade code, the kind that passes review without the clergy of code raising an eyebrow because you did not follow the one best way.

What is the enterprise « one best way »?


Well, don't reinvent the wheel and use the most accepted tool for the job.

Being on Linux, I could run file on the downloaded picture, a C program that is always installed when you install Linux (libmagic1).

Plus, the maintainers of the file C library provide a very nice integration using ctypes (an underappreciated tool that lets you use C libraries natively from Python).
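For the record, the « just call file » route can be sketched with the standard library alone. This assumes the file command is on the PATH; the helper name mime_type_via_file is mine, and it returns None when file is not installed.

```python
import shutil
import subprocess

def mime_type_via_file(path):
    """Return the MIME type reported by file(1), or None if file is missing."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

One process spawned per downloaded picture: acceptable for a one-shot scrape, less so in a tight loop.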

https://github.com/file/file/blob/master/python/setup.py

Actually, the source of file points to file-magic, but there are 1440 Python packages with magic in the name.

There is python-magic, more downloaded than file-magic, which is actually a rewrite of the original without mention of the original source; even the licence has changed, lol. But it has more badges, looks nicely reactive and the author is very nice.

https://github.com/file/file/blob/master/python/magic.py

Compared to that, this one is full of lol:

https://github.com/ahupp/python-magic/blob/master/magic/__init__.py


Well, reading the code of the rebranded version is actually funny. It uses a lock, and it seems to have global state (set_params) that smells of thread-safety issues.

And you have a headache before even getting to the import. And then you remember that a JPEG can be detected by looking at its first four bytes.

The famous magic number.

So when you write kleenex code that is meant to be thrown away, in a well understood context where you know for sure that looking at the first 4 bytes of the content is enough, you write:

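import requests
magic_jpeg = [255, 216, 255, 224]  # FF D8 FF E0
# reformat(), im_url and index come from the rest of the scraping script
got_jpeg = list(map(int, requests.get(reformat(im_url, index)).content[:4])) == magic_jpeg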

I could have just compared the first 4 bytes with a bytearray without going through the trouble of making a list of int, sure. I come from the BASIC PEEK/POKE dance on the 6502, and I prefer to have my raw data as literal ints; I am used to it, and I find it readable.
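For those who prefer the bytes route, here is a sketch of the same check without the list of ints. The names JPEG_JFIF, is_jfif and is_any_jpeg are mine; the second function is a looser variant I am adding, since JPEG files with Exif data start with FF D8 FF E1 rather than FF D8 FF E0.

```python
JPEG_JFIF = bytes([255, 216, 255, 224])  # the same four values: FF D8 FF E0

def is_jfif(content):
    """Strict check: exactly the four bytes used in the one-liner above."""
    return content[:4] == JPEG_JFIF

def is_any_jpeg(content):
    """Looser check: start-of-image marker FF D8, then the FF of the next marker."""
    return content.startswith(b"\xff\xd8\xff")
```

In my context the strict version was enough, because every picture on that site was a JFIF.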

This is definitely the code I write at home when I am relaxed, when PEP8 and good choices that have to be argued over with ball-breaking academics in written Oxbridge English are not a topic.

And it feels great.

So I hear quite a few people screaming that libmagic does way more than just grabbing the first characters and comparing them to a list.

Welcome to the fantastic rabbit hole of « the computer world doth not maketh sense ».

The naïve would expect a database of magic numbers to exist, so that you could compare the beginning of the content with these patterns and deduce the type of the file.

#!/usr/bin/env python3
from requests import get
import re

def build_magic_to_type():
    """Build the dict below by parsing the gist with a regexp, for fun (not called here)."""
    content = get("https://gist.githubusercontent.com/leommoore/f9e57ba2aa4bf197ebc5/raw/e59c296951e0588509b1f777d1f98b2ce08272ad/file_magic_numbers.md").content.decode("utf8")
    # One table row: type name, an ignored cell, then the magic number in hex
    row = re.compile(r"""
    <tr>\s*<td>(?P<type>.+)</td>\s*
    <td>[^<]+</td>\s*
    <td>(?P<magic>[a-f0-9\ ]+)</td>
    """, re.MULTILINE | re.VERBOSE | re.I)
    # Key: tuple of ints parsed from the hex bytes; value: the type name
    return dict(
        map(
            lambda t: (tuple(map(lambda st: int(st, 16), t[1].split())), t[0]),
            row.findall(content)
    ))

# The result, pasted as a literal so the script does not hit the network for it
magic_to_type = {(66, 77): 'Bitmap format',
 (83, 73, 77, 80, 76, 69): 'FITS format',
 (71, 73, 70, 56): 'GIF format',
 (71, 75, 83, 77): 'Graphics Kernel System',
 (1, 218): 'IRIS rgb format',
 (241, 0, 64, 187): 'ITC (CMU WM) format',
 (255, 216, 255, 224): 'JPEG File Interchange Format',
 (73, 73, 78, 49): 'NIFF (Navy TIFF)',
 (86, 73, 69, 87): 'PM format',
 (137, 80, 78, 71): 'PNG format',
 (37, 33): 'Postscript format',
 (89, 166, 106, 149): 'Sun Rasterfile',
 (77, 77, 0, 42): 'TIFF format (Motorola - big endian) ',
 (73, 73, 42, 0): 'TIFF format (Intel - little endian) ',
 (103, 105, 109, 112, 32, 120, 99, 102, 32, 118): 'XCF Gimp file structure',
 (35, 70, 73, 71): 'Xfig format',
 (47, 42, 32, 88, 80, 77, 32, 42, 47): 'XPM format',
 (66, 90): 'Bzip',
 (31, 157): 'Compress',
 (31, 139): 'gzip format',
 (80, 75, 3, 4): 'pkzip format',
 (117, 115, 116, 97, 114): 'TAR (POSIX)',
 (77, 90): 'MS-DOS, OS/2 or MS Windows',
 (127, 69, 76, 70): 'Unix elf',
 (153, 0): 'pgp public ring',
 (149, 1): 'pgp security ring',
 (149, 0): 'pgp security ring',
 (166, 0): 'pgp encrypted data'}

def magic_detect(a_byte_array):
    for pattern, name in magic_to_type.items():
        to_read=len(pattern)
        if len(a_byte_array) < to_read: continue
        if tuple(a_byte_array[:to_read]) == pattern:
            return name

print(magic_detect(get("https://www.python.org/static/img/python-logo.png").content))
# prints 'PNG format'


Well, you could be brave and do man magic, and discover that file's magic actually resorts to a description language with states. I don't know if it's Turing complete, but it requires a lot of code to run. man file also tells you that, if compiled « correctly », file can be sandboxed for security reasons. Scary issues for just grabbing a few bytes to recognize the type of a file.
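A first hint of why a flat table of prefixes does not cut it: some signatures do not sit at offset 0. The table above lists « ustar » for tar, but in a tar header those bytes live at offset 257. Here is a toy sketch with (offset, bytes, label) rules; the rule list and the name detect_with_offsets are mine, and this is nowhere near what the real magic language does.

```python
# (offset, expected bytes, label): three illustrative rules, not a real database
RULES = [
    (0, b"\x89PNG", "PNG format"),
    (0, b"\xff\xd8\xff", "JPEG"),
    (257, b"ustar", "TAR (POSIX)"),
]

def detect_with_offsets(data, rules=RULES):
    """Return the label of the first rule whose bytes match at its offset."""
    for offset, expected, label in rules:
        if data[offset:offset + len(expected)] == expected:
            return label
    return None
```

Add endianness, masks, nested tests and string searches, and you start to see why man magic reads like a small programming language.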

 

man magic

Well, for some formats, the only way to be sure something is what it claims to be is... to actually open it with the appropriate application.

For instance, do you remember opening a PDF (a close descendant of PostScript) in an editor and seeing %PDF at the top?

In practice, years of ignoring this magic number resulted in most PDF tools not caring about it.

So the only way to know if a file is a PDF is to try to use it. And PostScript, the ancestor of PDF, is not only a language: it is also a virtual machine with job control and, possibly, access to files! Lol.
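A lenient check, in the spirit of those tolerant tools, could be sketched as follows. The 1024-byte window is an assumption on my side, mirroring the tolerance some viewers are reported to have, and find_pdf_header is a name I made up.

```python
def find_pdf_header(data, window=1024):
    """Return the offset of '%PDF-' within the first `window` bytes, or -1."""
    return data[:window].find(b"%PDF-")
```

A result of 0 is a well-behaved file, a positive result is a file with junk in front of the header that many readers will still open, and -1 tells you nothing certain either way.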


How?

In the long-lost time of printers driven by microcontrollers, PostScript was a kind of Forth. To handle everything (including configuration, font loading, drivers), hardware vendors shared the code. Hence CUPS PPDs look like PostScript.

It's because there is a long way between what standards say (when you can get access to them) and what tools actually do. As a result, two distinct tools may disagree on whether you have a valid file. Like, for instance... a C file? :D Until it compiles, you cannot tell if a C file is actually a C file.

Hence, magic numbers are a good enough tool when I want to tell a picture from an octet stream, for most use cases.

But for stuff like archives (tar, zip), you have to actually run an archiver, which involves quite a few risks, such as zip bombs (an archive crafted to decompress into petabytes of a single character).
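One cheap precaution before extracting anything is to read the sizes the archive declares for its members. The sketch below uses only the standard zipfile module; the names and the max_ratio threshold of 100 are arbitrary assumptions, and declared sizes can be forged, so this is a smoke test, not a defence.

```python
import io
import zipfile

def declared_unpacked_size(zip_bytes):
    """Sum the uncompressed sizes declared in the archive's directory (no extraction)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sum(info.file_size for info in archive.infolist())

def looks_like_bomb(zip_bytes, max_ratio=100):
    """Flag archives whose declared expansion ratio exceeds max_ratio (arbitrary)."""
    return declared_unpacked_size(zip_bytes) > max_ratio * len(zip_bytes)
```

Nested archives (a zip of zips of zips) defeat it, which is exactly the point: the better you want to detect, the more of the hostile file you have to run through your own code.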

So we agree that for scraping a website once, to tell a picture from an HTML document, looking at the first 4 bytes « is good enough » and using a library is overkill.

Now you understand that certain file formats (PDF, XML, archives) may need to be opened by an application to check their real nature. But that is the cost of better detection: it may be a security breach (when better is worse). There is of course a German word for it: Verschlimmbesserung. When improving makes things worse.

More interestingly, some formats are archives, which can be fun, like OpenOffice documents. In which you can insert PDFs, lol.

The bloatware we have created makes the idea of preserving our digital data foolish. Either the data we want to preserve could be wormed and doom our whole vault of precious knowledge by infecting the console used to read it, or we will rely on pieces of software that are the de facto standard (hello Adobe), since standards are either not shared with the public, or the tools made by the company diverge way too much from the standards, creating a lot of potential uncertainty.

At the end of the day, I think file type detection with magic numbers is pretty much as good an idea as getting data from XML with a regexp.

The stuff you do in a company sometimes doesn't make sense, because there is peer pressure. And at home you can just relax and enjoy discovering all the complexity behind concepts we deem « so trivial ». Sometimes you want to use a regexp to parse XML for a specific pattern; at least, lol, you are not vulnerable to a billion laughs attack.
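To illustrate the last point, here is a small sketch: a document with nested entity definitions (a tame, two-level cousin of the billion laughs payload) and a regexp that grabs the title without ever expanding anything. PAYLOAD, TITLE and extract_title are names I made up for the example.

```python
import re

# A tame sample: real attacks nest many more levels of entities.
PAYLOAD = """<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
]>
<doc><title>L'Assiette au Beurre</title><body>&lol2;</body></doc>
"""

TITLE = re.compile(r"<title>(.*?)</title>", re.DOTALL)

def extract_title(xml_text):
    """Return the text of the first <title> element, entities left untouched."""
    match = TITLE.search(xml_text)
    return match.group(1) if match else None
```

The regexp never looks at the DOCTYPE, so there is nothing to blow up. It is also wrong for CDATA, comments, namespaces and the rest, which is the usual trade.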

 

But, seriously, centuries after Gutenberg invented the « portable » printer, making it easy to print « en masse », computers have failed miserably at finding a good substitute for his invention. I am pretty scared when I see that, for sending basic jobs to print (LibreOffice, PDF, PS, whatever), and unlike Gutenberg, who created a massive disruption through simplification, we are spiraling out of control into more and more complexity that does not result in information being shared easily, nor in the masses having access to knowledge and information.

There is something of a failure in modern computing that makes it brittle as soon as cash stops being massively poured in.

But remember ImageMagick? The famous lib maintained by a random person in Nebraska since 2003?

PDF conversion was disabled on Debian for this package because of Ghostscript, another under-maintained project (for PostScript interpretation) that had a security hole. It is software that is massively used by printing solution vendors, and it might not have been updated on IoT devices...


And as of today, the Debian distribution I use is vulnerable because of this package... without a fix.


