So I wrote a Proof of Concept language to address the problem of safe eval

I told fellow coders: «hey! I know a solution to the safe eval problem: it is right under my eyes». I thought I could code it from scratch in less than 24 hours. It would support safe templating, because that is its primary purpose.


TL;DR:


I was told my solution was overengineering because writing a language takes so much effort. Actually, it took me less time to write a language without any theoretical knowledge than the time I have lost in my various jobs, every single time, dealing with unsafe eval.

Here is the result in Python: a Forth-based templating language that actually covers 90% of the real use cases I have experienced. It is a fair balance between time to code and the features people really use.


You don't actually need that many features.

https://github.com/jul/confined (+pypi package)

NB Work in progress

 

How I was tortured as a student


When I was a student, I was nicely helped through the hell of my chaotic studies by people in a university called ENS.

In exchange for their help, I had to code for data measurement and labs in various languages, OSes and environments.

I was tortured because I liked programming and I did not have the right to do OOP, to malloc, or to use new languages... Perl, Python, new versions of the C standard...

Even for handling numbers, the scientists despised Perl/Python because of their inability to handle maths safely. I had to use the «Numerical Recipes» and/or Fortran. (I checked in 2005: they had tried Python and were disappointed. I guess that since then they might use numpy, which is basically bindings on safe ports of Numerical Recipes in Fortran.) I was working on chaotic systems that are really sensitive to initial conditions... a small error in the input propagates fast.

The people were saying: we need this code to work and we need to be able to reuse it, and we need our output to be reproducible and verifiable : KISS. Keep It Simple Stupid. And even more stupid.

So I was barred from any unbounded resource behaviour and from any unsafe behaviour with base types.

Actually, out of curiosity, I recompiled code I made at the time, which used C and piped its output to Tcl/Tk to draw graphical representations of multi-agent simulations, and it still works... It was written in 1996.

That's how I learnt programming : by doing the worst possible unfunky programming ever.  I thought they were just stupid grumpy old men.

And I also had to use scientific equipment and software. Oddly enough, they all used Forth-like RPN notation to let users do some basic manipulation.

Like:
  1. ASYST
  2. RRDtool
  3. pytables numexpr extension http://code.google.com/p/numexpr

And I realized I understood:

Forth-like languages are easy to implement:
  • parsing is a simple left-to-right technique: no backtracking, no states;
  • the grammar is easy to write;
  • the memory model makes it easy to confine within boundaries;
  • the serialization is immutable (you can dump the exec and data stacks and safely resume/start/transport them);
  • it is thus efficient for parallelization;
  • it can thus be used in embedded stuff (like measurement instruments that need to be autonomous AND programmable).

So I decided to give myself one day to code a safe confined interpreter in Python.
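
To give an idea of how little is needed, here is a minimal sketch of a left-to-right RPN evaluator with a bounded data stack. It is not the code of the confined package: the operator table, the `evaluate` name and the `MAX_STACK` limit are assumptions made only for the illustration.

```python
from decimal import Decimal

# Hypothetical operator table: each word pops two numbers and pushes one.
OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

MAX_STACK = 16  # assumed limit, chosen only for the illustration


def evaluate(source, max_stack=MAX_STACK):
    """Evaluate a whitespace-separated RPN expression, strictly left to right."""
    stack = []
    for token in source.split():
        if token in OPERATORS:
            if len(stack) < 2:
                raise ValueError("stack underflow at %r" % token)
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        else:
            # Anything that is not a known word must be a number;
            # Decimal raises decimal.InvalidOperation otherwise.
            stack.append(Decimal(token))
        if len(stack) > max_stack:
            raise MemoryError("stack limit exceeded")
    return stack
```

With this sketch, `evaluate("1 2 + 3 *")` returns a one-element stack holding the decimal 9, and an input that pushes more than `max_stack` values is refused instead of growing without bound.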

I was told it was complex to write a language, especially when, like me, you never had any lessons or interest in parsing/language theory and you suck at mathematics.


Design choices


Having the minimum dependency requirements: stdlib.

 One number to rule them all 

I have been beaten so many times in web development by floating point numbers, especially for monetary values, that I wanted a number that could do fixed-point arithmetic. I have also been beaten so many times by problems where the input was sensitive to initial conditions that I wanted a number that would be better than IEEE 754 floats at keeping errors under control.
So I went for the stdlib number unofficially based on the IEEE 854 standard: https://docs.python.org/2/library/decimal.html
Other advantages: the string representation (IEEE 754) is canonical and the regexp is well known. Thus it is easy to parse.
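
A short sketch of what the stdlib decimal type buys here. The precision of 6 and the `NUMBER` pattern are assumptions for the example (the pattern is a simplified one that ignores NaN and Infinity), not what the confined package necessarily uses.

```python
import re
from decimal import Decimal, Context, ROUND_HALF_EVEN

# Binary floats cannot represent 0.1 exactly, decimals can.
float_ok = (0.1 + 0.2 == 0.3)
decimal_ok = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# An explicit context fixes precision and rounding instead of relying on defaults.
ctx = Context(prec=6, rounding=ROUND_HALF_EVEN)
third = ctx.divide(Decimal(1), Decimal(3))

# Simplified pattern for decimal literals (assumption: no NaN/Infinity).
NUMBER = re.compile(r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?", re.ASCII)


def parse_number(text):
    """Accept only strings that match the pattern, then build a Decimal."""
    if not NUMBER.fullmatch(text):
        raise ValueError("not a number: %r" % text)
    return Decimal(text)
```

Here `float_ok` is False while `decimal_ok` is True, and `third` is the decimal 0.333333, rounded according to the context rather than to whatever the platform float does.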

In face of ambiguity refuse to guess

I will try to see input as (char *) and have the decoding be explicit.
Rationale: if you work with SIP (I do), headers are latin1, and if you work in an international environment you may have to face incorrectly encoded data that can also represent UTF8. People in this place (Québec) love to use accents éverywhere. So I want to use it myself.
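
The problem in a few lines of Python 3. The `decode_explicit` helper is a hypothetical name used only to show the idea: bytes come in, and the caller has to name the encoding.

```python
raw = "Québec".encode("utf-8")     # bytes as they might arrive from the wire

as_utf8 = raw.decode("utf-8")      # the intended text
as_latin1 = raw.decode("latin-1")  # never fails, but silently yields mojibake


def decode_explicit(data, encoding):
    """Decode bytes with a caller-supplied encoding: no default, no guessing."""
    if not isinstance(data, bytes):
        raise TypeError("expected bytes, got %s" % type(data).__name__)
    return data.decode(encoding, errors="strict")
```

Decoding the UTF-8 bytes as latin1 gives 'QuÃ©bec' without any error, which is exactly the kind of guess that should be refused. Going the other way, latin1 bytes such as `b"Qu\xe9bec"` decoded strictly as UTF-8 raise `UnicodeDecodeError`, which is at least an honest failure.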

It is also the reason I used my check_arg library to enforce type checking of my operators and document stuff by using a KISS approach: function names should be explicit and their args should tell you everything.
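
To illustrate the idea only: the decorator below is a hypothetical stand-in and is NOT the check_arg API. It shows how an operator with an explicit name can refuse arguments of the wrong type before doing anything.

```python
from decimal import Decimal
from functools import wraps


def expects(*types):
    """Hypothetical decorator: reject positional arguments of the wrong type."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args):
            if len(args) != len(types):
                raise TypeError("%s expects %d arguments" % (func.__name__, len(types)))
            for position, (value, expected) in enumerate(zip(args, types)):
                if not isinstance(value, expected):
                    raise TypeError(
                        "%s: argument %d must be %s"
                        % (func.__name__, position, expected.__name__)
                    )
            return func(*args)
        return wrapper
    return decorator


@expects(Decimal, Decimal)
def add_two_numbers(left, right):
    return left + right
```

Calling `add_two_numbers(Decimal(1), "2")` raises `TypeError` instead of letting Python decide what adding a number and a string should mean.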

Having a modular grammar so that operators/base types can be added/removed easily. 


I mentioned in a previous post how we cannot do safe eval in Python because keywords cannot be controlled. So I decided to have a dynamic grammar built at tokenization time (the code has the possibility to do it, but it is not yet available through the API).
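
A sketch of what a grammar built at tokenization time can look like: the set of accepted words is compiled into a regexp from whatever operators are enabled at that moment. The function names and the simplified `NUMBER` pattern are assumptions for the example, not the package's API.

```python
import re

# Simplified number pattern, assumed for the example.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def build_word_pattern(operators):
    """Compile one regexp from the operator names that are currently enabled."""
    if not operators:
        raise ValueError("at least one operator is required")
    # Longest names first; this matters if the pattern is later used with
    # match() instead of fullmatch().
    names = sorted(operators, key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in names))


def rejected_tokens(source, operators):
    """Return the tokens that are neither numbers nor enabled operators."""
    words = build_word_pattern(operators)
    return [
        token for token in source.split()
        if not NUMBER.fullmatch(token) and not words.fullmatch(token)
    ]
```

With only `+` enabled, `rejected_tokens("1 2 + 3 *", {"+"})` reports `["*"]`; enable `*` as well and the same input is fully accepted. Removing an operator removes it from the language itself, not just from a lookup table.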

Avoid nested data structures and recursive calls


I wanted to make a language my fellow mentors could use safely. I may implement recursive eval in the future, but I will enforce a very limited level of recursion. Still, I see a solution to replace nested calls by using the stack.

Stateless and immutables only


I have seen people pickling functions so many times that I decided to have something more usable for remote execution. I also wanted my code to be idempotent. If parsing is seen as a function, I wanted to guarantee that

parsing(Input, Environment) => output

always gives the same output.
We can also serialize the exec stack and the data stack at any given moment to change it later. I want no side effects. As a result there will be no time-related functions.

As a result you can safely execute remote code.
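
A minimal sketch of that property, assuming the whole interpreter state is just two stacks of plain values. The `step`, `run`, `dump` and `load` names and the JSON layout are hypothetical, chosen for the illustration.

```python
import json
from decimal import Decimal

OPERATORS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def step(exec_stack, data_stack):
    """Consume one token and return new stacks; the inputs are not modified."""
    token, rest = exec_stack[0], exec_stack[1:]
    if token in OPERATORS:
        if len(data_stack) < 2:
            raise ValueError("stack underflow at %r" % token)
        left, right = data_stack[-2], data_stack[-1]
        return rest, data_stack[:-2] + (OPERATORS[token](left, right),)
    return rest, data_stack + (Decimal(token),)


def run(exec_stack, data_stack=()):
    while exec_stack:
        exec_stack, data_stack = step(exec_stack, data_stack)
    return data_stack


def dump(exec_stack, data_stack):
    """Serialize the full state; decimals travel as strings to stay exact."""
    return json.dumps({"exec": list(exec_stack), "data": [str(n) for n in data_stack]})


def load(text):
    state = json.loads(text)
    return tuple(state["exec"]), tuple(Decimal(n) for n in state["data"])
```

Usage: run the first few steps of `("1", "2", "+", "4", "*")`, `dump` the state, send the string anywhere, `load` it and `run` the rest. The result is the same as running the program in one go, since nothing but the two stacks influences the outcome.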

Resource use should be controlled


Stack size, size of the input, recursion level, the initial state of the interpreter (default encoding, precision, number behaviours): I want to control everything (that is what the context will be for, and all parameters WILL be mandatory), so that I can guarantee as much as I can. (I was thinking of writing C extensions to ensure we DON'T use atof/atoi but strtol/f ...)

This way I can avoid using an awful lot of virtual machines, docker, jails, whatever.
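
What such a context could look like, as a sketch only: the class name, the field names and the checks are assumptions, not the package's actual context. The point is that no field has a default, so nothing is left implicit.

```python
from dataclasses import dataclass
from decimal import Context as DecimalContext, ROUND_HALF_EVEN


@dataclass(frozen=True)
class InterpreterContext:
    """Hypothetical context: every parameter is mandatory and immutable."""
    max_input_size: int
    max_stack_size: int
    max_recursion: int
    encoding: str
    precision: int

    def decimal_context(self):
        # Number behaviour is derived from the context, never from globals.
        return DecimalContext(prec=self.precision, rounding=ROUND_HALF_EVEN)

    def check_input(self, raw):
        """Refuse oversized input, then decode with the declared encoding."""
        if len(raw) > self.max_input_size:
            raise ValueError("input larger than %d bytes" % self.max_input_size)
        return raw.decode(self.encoding)
```

Calling `InterpreterContext()` without arguments fails with `TypeError`, and since the dataclass is frozen the limits cannot be loosened once the interpreter has started.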

Grammar should be easy to read


Since I don't know how to parse, but I love Damian Conway, I looked at Regexp::Grammars and said: Oh! I want something like this.

There are numerous resources on Stack Overflow on how to parse various base types exactly (floats, strings), how to write alternations and patterns... So it took me 3 hours to imagine a way to do it. I still know nothing of parsing and stuff, but I knew I would have a result.

I chose a grammar that can be written in a way that avoids backtracking (left to right helped a lot), to keep the regexps from running out of control.

I am not sure of what it does, but I am pretty sure it can be ported to C or whatever guarantees NO nested/recursive use of resources. (The regexps are not supposed to stay in a hardened version: this is just a good enough parser written in 3 hours with my insufficient knowledge.)
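
A sketch in the same spirit, using the named-group scanner technique from the Python `re` documentation rather than the package's real grammar. The three rules are assumptions for the example; each one is tried at the current position and the input is consumed strictly left to right, so the scanner never revisits a token it has already emitted.

```python
import re

# Hypothetical rules, tried in this order at each position.
RULES = [
    ("number", r"[-+]?\d+(?:\.\d+)?"),
    ("string", r'"[^"]*"'),      # assumption: no escape sequences
    ("word", r'[^\s"]+'),
]
TOKEN = re.compile("|".join("(?P<%s>%s)" % rule for rule in RULES))
SPACE = re.compile(r"\s*")


def scan(source):
    """Split the source into (rule name, text) pairs, left to right."""
    tokens = []
    position = SPACE.match(source).end()
    while position < len(source):
        match = TOKEN.match(source, position)
        if match is None:
            raise SyntaxError("cannot parse at offset %d" % position)
        tokens.append((match.lastgroup, match.group()))
        # Skip the whitespace that separates this token from the next one.
        position = SPACE.match(source, match.end()).end()
    return tokens
```

For example, `scan('1 "two words" dup')` yields a number, a string and a word, while an unterminated string raises `SyntaxError` at its opening quote. The rule table stays readable, and adding a base type means adding one line.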

I still think Perl is right


We should run our unit tests before we install. So my module refuses to install if the single actual test I put in (as a POC) does not pass.


Conclusion


So it was really worth the time spent. And now I may be in the «cour des grands» (the big league) of the coders who implemented their own language, from scratch and without any prior theoretical knowledge of how to write one. So I have been geeking alone in front of my computer, and my wife is pissed at me for not enjoying the day and behaving like an autist, but I made something good enough for my own use case.

And handling requirements in Python and running tests before install is hellish.

(Arg... And why does my doc not show up on PyPI?)
