Why big data is a fraud: the actual dot com bubble according to CS 101

Just yesterday I was told: you know, this big O notation, this database index, it is 30 years old. It is not true any more.

And I answered: "well, Newton said that if you jump from the 7th floor of this building you should die. It is more than 300 years old, so why don't you give it a try? This knowledge is so old it should be obsolete."

The fact is big O notation still matters.


CS 101 cheatsheet


Basically, the first lesson in CS 101 is about complexity. It is very basic (except for the readers of hackernews who need D3js animations to think they understand something and cannot read text).

It says: whatever you do, the bigger the container, the more resources it takes to retrieve an item. However, you can trade memory for speed (and vice versa).

Electronics 101 says: you can have all the memory you want, but the bigger it gets, the more it costs you in wiring. You can trade indirection (speed) for money, but since the cost of linear addressing grows more than linearly, you end up trading speed for money. Which means you get diminishing returns.

What does it mean?

Imagine you are poor and have one pair of socks. How much time to get it? One iteration.

You are rich, you have 1000 pairs of socks and want to find one. How many iterations?
Well, it depends. If you are organized and have space (memory), you can pre-organize your socks in a fancy way. It will then take on the order of log2(1000) ≈ 10 steps to find your socks. If you are poor (or not knowledgeable in the art of organizing socks), you will have to examine your socks one by one, with an expectation of about 1000 / 2 = 500 steps on average (the worst case being 1000, with a probability of 1/1000).

Hum, 10 CPU cycles vs 500 seems pretty good.
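To make the socks concrete, here is a minimal sketch that counts the steps of both strategies on 1000 sorted integers. The function names are made up for the illustration, and "one step" is counted as one comparison, which is an optimistic stand-in for a CPU cycle.

```python
# Count the steps needed to find one "sock" among 1000 sorted ones.
# Helper names are hypothetical; a step here is one comparison.

def linear_search_steps(items, target):
    """Scan left to right; return how many items were examined."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps
    return len(items)


def binary_search_steps(items, target):
    """Halve the sorted search range at each probe; return the probe count."""
    low, high, steps = 0, len(items) - 1, 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if items[mid] == target:
            return steps
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps


socks = list(range(1000))
# Average cost of the unorganized drawer over every possible target.
linear_avg = sum(linear_search_steps(socks, s) for s in socks) / len(socks)
# Worst case of the organized drawer.
binary_worst = max(binary_search_steps(socks, s) for s in socks)
print(linear_avg, binary_worst)
```

The unorganized drawer averages 500.5 steps, the organized one never needs more than 10.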

After all, current computers have up to 16 cores doing 2.8 billion cycles per second (the limits are the speed of light and heat dissipation).

Well, no.

Transactions (stuff that yields money) require a timeline. An order. This is guaranteed by using a single core. So you are bound to 2.8 billion cycles per second for transactions, as long as you don't have relationships...

You may use multithreading, but you will have points of serialization (join/lock/memory barrier), and doubling the cores/processes/threads/instances/containers tends to yield a +40% increase in speed. It has diminishing returns.
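One standard way to model this is Amdahl's law, which is not named in the text above and is only used here as an illustration. In this sketch the parallel fraction of 4/7 is an assumption, picked solely so that the first doubling gives the +40% mentioned above.

```python
# Amdahl's law: the serial part of the work caps the speedup.
# The 4/7 parallel fraction is an assumption chosen so that
# going from 1 to 2 workers gives exactly a 1.4x speedup.

def amdahl_speedup(parallel_fraction, workers):
    """Speedup over one worker when only parallel_fraction can be split."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


PARALLEL_FRACTION = 4 / 7

for workers in (1, 2, 4, 8, 16):
    print(workers, round(amdahl_speedup(PARALLEL_FRACTION, workers), 2))
```

Under this model each further doubling buys less than the previous one, and no number of cores gets past 1 / (1 - 4/7), about 2.33 times the single-core speed.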

In the long run CS 101 basically says: "the more data you store, the more resources you will use to search for an item". Costs are a monotonically growing function of the size, with an efficiency that is less than linear.

Basically, if 100 customers cost you $1 to handle, 1000 customers will always cost you more than $10.


How bad is it?

Bad.

Going to a register is ~1 cycle
Going to cache is a few cycles to a few tens of cycles
Going to RAM is hundreds of cycles
Going to a hard drive is millions of cycles (a seek takes milliseconds)...

You can trade memory for speed, but the cost of indirection gives the system an absolute minimum.

You may use a cache. But it just diminishes the latency as long as you are in an under-run situation. It just masks the problem with a delayed effect.


But however smart you are, resource use is a more than linear function of the size of the collection you search in, aggravated by relationships.

It basically means that even if you are Google or Facebook, the more data and customers you have, the more your costs will increase.

What it means in terms of economy is: the more customers you have, the lower your profits.


Are dot coms stupid?



Hell no. It is all a question of opportunity. Thanks to QE the stock exchange is full of liquidity.

These diminishing returns are noticeable/measurable only after a big enough growth. Time series grow linearly with time, and the customer base... grows its own way (slowly most of the time, and the IPO often comes before the full success).

Venture capitalists are in for the money. They don't care if the market will not be sustainable in 15 years, they aim for a profit in 10 years where the effects will not be noticeable.

Developers are ... well ... clueless, or in need of money to repay their student loans, and powerless.


Customers... well... if they don't adopt the new technology that seems to be 10 times less expensive (for now), they need not care about the next 5 years, since they will be wiped out by the competition in 2 years.

And founders are either clueless (and lucky), or they have enough money to mask the problem. (When you will sell your shares for 10 times your investment after 5 years, you can invest in diminishing returns for 5 years. It is just a matter of how wealthy you are, and in most Western economies that basically correlates with being born wealthy.)

I say it loud: a business with systematic diminishing returns is doomed. If more customers means lower profits, then something is wrong.

The stock market is totally out of touch with economic reality, and this is a sign of a bubble.






On my far-fetched yet realistic hypotheses for eradicating bedbugs

I will pass over in silence a possible experiment whose outcome I do not know, but whose circumstances are so preposterous that any outcome would certainly tarnish forever my reputation as a man of good sense.

Nevertheless, I have a hypothesis.

I thought I would look for a repellent, because that is what we want, and I ended up looking for an attractant, because it is easy to verify that you have found one.

However, this is a cost/risk analysis. The calculation is admittedly rational economically (how to trust research that cannot be proven is a real problem). We wrongly imagine that finding a true positive is harder than detecting a true negative.

We imagine that it is easier to describe what is than what is not.

When I have no bedbugs, how do I prove that I repel them with a substance?

I cannot, I have to trust. Money is trust after all (fiducia).

And it is easier for anyone to believe in what they observe...

Nobody will buy a repellent on the market, because people will end up thinking (and not always wrongly) that they are being sold snake oil.

So repellent substances, although they are the ones we need, cannot be bought, for reasons of trust and therefore of logic. Plainly: if you buy one, you may be a sucker.

The eradication of bedbugs will come through a revolution in hygiene, one that I hate (since it led to the ban on absinthe).

Not the stupid kind, "don't do this, don't do that", but rather one that involves people in creating this new hygiene through their own intelligence, in order to ease its adoption.

If people can experiment and build their own body of knowledge that they can trust, they will more readily adopt measures to fight this scourge.

I admittedly feel I am in a cause as important as the one that sparked one of the first jacqueries of the Vexin, to forbid salmon being served every day of the week (yes, the Oise was full of salmon in the Middle Ages). But I stand by it.

I am like St Louis, not really the big name who hung around these parts, but the guy who came back from there dazed. I feel like Villon, ready to say "Paris près de Pontoise". A crowd of obscure unknowns whose zaniness and valor rubbed shoulders with tragedy and cowardice.

Like those who inspired me, I admittedly find it far-fetched and zany to say that citizens must take research in this applied field into their own hands, with their own intellectual and physical means.

It is not profitable research, and in any case you would be right to doubt any repellent substance empirically.

You have to be part of the scientific and experimental resolution of this problem to get rid of it: by bringing ideas, and by reproducing experiments to validate or invalidate them. You must also propose solutions, to be validated or invalidated.

When I talk about bedbugs, I actually think this applies to every problem.











So I wrote a Proof of Concept language to address the problem of safe eval

I told fellow coders: «hey! I know a solution to the safe eval problem: it is right under my eyes». I think I can code it in less than 24 hours from scratch. It will support safe templating... because that's the primary purpose for it.


TL; DR:


I was told my solution was overengineering because writing a language is so much effort. Actually it took me less time to write a language without any theoretical knowledge than the time I have been losing in my various jobs, every single time, dealing with unsafe eval.

Here is the result in python: a Forth-based templating language that actually covers 90% of the real use cases I have experienced, and that is a fair balance between time to code and the features people really use.


You don't actually need that many features.

https://github.com/jul/confined (+pypi package)

NB Work in progress

 

How I was tortured as a student


When I was a student, I was nicely helped through the hell of my chaotic studies by people in a university called ENS.

In exchange for their help I had to code for data measurement/labs with various languages, OSes and environments.

I was tortured because I liked programming and I was not allowed to do OOP, use malloc, or use new languages ... Perl, python, new versions of the C standard...

Even for handling numbers, scientists despised perl/python because of their inability to handle maths safely. I had to use «Numerical Recipes» and/or Fortran. (I checked in 2005: they had tried python and were disappointed. I guess since then they might use numpy, which largely binds to well-tested numerical routines, some of them of Fortran origin.) I was working on chaotic systems that are really sensitive to initial conditions ... a small error in the input propagates fast.

The people were saying: we need this code to work and we need to be able to reuse it, and we need our output to be reproducible and verifiable : KISS. Keep It Simple Stupid. And even more stupid.

So I was barred from any unbounded resource behaviour and any unsafe behaviour with base types.

Actually, out of curiosity, I recompiled code I wrote at that time, which used C and piped its output to tcl/tk to draw graphical representations of multi-agent simulations, and it still works... It was written in 1996.

That's how I learnt programming : by doing the worst possible unfunky programming ever.  I thought they were just stupid grumpy old men.

And I also had to use scientific equipment and software. Oddly enough, they all used Forth-like RPN notation to give users some basic manipulation.

Like:
  1. ASYST
  2. RRDtool
  3. the pytables numexpr extension http://code.google.com/p/numexpr
And I realized I understood:

Forth-like languages are easy to implement:
  • parsing is a simple left-to-right technique: no backtracking/no states;
  • the grammar is easy to write;
  • the memory model makes it easy to confine within boundaries;
  • it is immutable in its serialization (you can dump the exec and data stacks and safely resume/start/transport them);
  • it is thus efficient for parallelization;
  • it can thus be used in embedded stuff (like measurement instruments that need to be autonomous AND programmable).
So I decided to give myself one day to code a safe confined interpreter in python.
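As an illustration of why such a language is easy to confine, here is a minimal sketch of a left-to-right RPN evaluator with a bounded stack. It is not the code of the confined package: the word list, the limits and the function name are assumptions made for this example.

```python
from decimal import Decimal

# Hypothetical limits, chosen only for the illustration.
MAX_STACK = 16
MAX_TOKENS = 64

# Words that pop two values and push one.
BINARY_WORDS = {
    "+": lambda left, right: left + right,
    "-": lambda left, right: left - right,
    "*": lambda left, right: left * right,
}


def evaluate(source):
    """Evaluate a whitespace-separated RPN program, left to right."""
    tokens = source.split()
    if len(tokens) > MAX_TOKENS:
        raise ValueError("program too long")
    stack = []
    for token in tokens:
        if token in BINARY_WORDS:
            if len(stack) < 2:
                raise ValueError("stack underflow at %r" % token)
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_WORDS[token](left, right))
        elif token == "dup":
            if not stack:
                raise ValueError("stack underflow at 'dup'")
            stack.append(stack[-1])
        else:
            # Anything else must be a number; Decimal raises on junk.
            stack.append(Decimal(token))
        if len(stack) > MAX_STACK:
            raise ValueError("stack overflow")
    return stack


print(evaluate("1 2 + 3 *"))
```

There is no recursion and no lookahead: one pass over the tokens, one list that can never exceed MAX_STACK entries. A hardened version would also have to bound the size of the numbers themselves.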

I was told it was complex to write a language, especially when, like me, you never had any lessons/interest in parsing/language theory and you suck at mathematics.


Design choices


Having the minimum dependency requirements: stdlib.

 One number to rule them all 

I have been bitten so many times in web development by floating point numbers, especially for monetary values, that I wanted a number that could do fixed point calculus. And I have also been bitten so many times by problems where the input was sensitive to initial conditions that I wanted a number that would be better than IEEE 754 to potentially control errors.
So I went for the stdlib number based on the IEEE 854 standard: https://docs.python.org/2/library/decimal.html
Other advantages: the string representation (IEEE 754) is canonical and the regexp is well known. Thus it is easy to parse.
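A short sketch of the difference for monetary values, using only the stdlib decimal module. The price and the rounding mode are assumptions picked for the example.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Binary floats cannot represent 0.1 or 0.2 exactly.
float_sum = 0.1 + 0.2
# Decimals built from strings keep the digits that were written.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(float_sum == 0.3)
print(decimal_sum == Decimal("0.3"))

# Fixed point: force the result to exactly two decimal places.
price = Decimal("19.99")  # hypothetical price
total = (price * 3).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
print(total)
```

Building the Decimal from a string matters: Decimal(0.1) would faithfully copy the float's binary error.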

In face of ambiguity refuse to guess

I will try to see input as (char *) and have the decoding be explicit.
Rationale: if you work with SIP (I do), headers are latin1, and if you work in an international environment you may have to face incorrectly encoded data that can also be valid UTF8, and people in this place (Québec) love to use accents éverywhere. So I want to use it myself.
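A minimal sketch of what refusing to guess can look like in python: input stays bytes until the caller names an encoding. The helper name decode_explicit is hypothetical.

```python
def decode_explicit(raw, encoding):
    """Decode bytes with an encoding the caller must name; never guess."""
    if not isinstance(raw, bytes):
        raise TypeError("expected bytes, got %s" % type(raw).__name__)
    # Strict mode: invalid input raises UnicodeDecodeError.
    return raw.decode(encoding)


utf8_bytes = "Québec".encode("utf-8")
latin1_bytes = "Québec".encode("latin-1")

right = decode_explicit(utf8_bytes, "utf-8")
# The same bytes are also valid latin-1, just not the text that was meant.
wrong = decode_explicit(utf8_bytes, "latin-1")
print(right, wrong)

# The other way round fails loudly instead of silently.
try:
    decode_explicit(latin1_bytes, "utf-8")
except UnicodeDecodeError as error:
    print("refused:", error.reason)
```

The second decode shows why guessing is dangerous: nothing fails, the accent just turns into two wrong characters.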

It is also the reason I used my check_arg library to enforce type checking of my operators and document stuff by using a KISS approach: function names should be explicit and their args should tell you everything.

Having a modular grammar so that operators/base types can be added/removed easily. 


I mentioned in a previous post that we cannot do safe eval in python because keywords and base types cannot be controlled. So I decided to have a dynamic grammar built at tokenization time (the code has the possibility to do it, but it is not yet available through the API).

Avoid nested data structures and recursive calls


I wanted to make a language my fellow mentors could use safely. I may implement recursive eval in the future, but I will enforce a very limited level of recursion. I do, however, see a way to replace nested calls by using the stack.

Stateless and immutables only


I have seen people pickling functions so many times that I decided to have something more usable for remote execution. I also wanted my code to be idempotent. If parsing is seen as a function, I wanted to guarantee that

parsing(Input, Environment) => output

would always be the same.
We can also serialize the exec stack and the data stack at any given moment to resume later. I want no side effects. As a result there will be no time related functions.

As a result you can safely execute remote code.
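The following sketch shows the idea of stopping, serializing and resuming an evaluation. It is an assumption about how this could look, not the confined implementation: the state is just two tuples of strings, the only word is "+", and error handling is left out.

```python
import json
from decimal import Decimal


def step(exec_stack, data_stack):
    """Consume one token and return the new state; the input is not mutated."""
    token, rest = exec_stack[0], exec_stack[1:]
    if token == "+":
        left, right = data_stack[-2], data_stack[-1]
        result = str(Decimal(left) + Decimal(right))
        return rest, data_stack[:-2] + (result,)
    return rest, data_stack + (token,)


def run(exec_stack, data_stack=()):
    """Step until the exec stack is empty; return the data stack."""
    while exec_stack:
        exec_stack, data_stack = step(exec_stack, data_stack)
    return data_stack


program = ("1", "2", "+", "3", "+")

# Run two steps, then freeze the whole state as plain text.
state = step(*step(program, ()))
frozen = json.dumps(state)

# Later, or on another machine: thaw and finish.
exec_part, data_part = json.loads(frozen)
resumed = run(tuple(exec_part), tuple(data_part))
print(frozen, resumed)
```

Because the state is only data, what travels over the wire is JSON text, not a pickled function, and the resumed run gives the same result as an uninterrupted one.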

Resource use should be controlled


Stack size, size of the input, recursion level, the initial state of the interpreter (default encoding, precision, number behaviours): I want to control everything (that is what the context will be for, and all parameters WILL have to be mandatory), so that I can guarantee as much as I can (I was thinking of writing C extensions to ensure we DON'T use atof/atoi but strtol/f ...).

This way I can avoid using an awful lot of virtual machines/docker/jails, whatever.

Grammar should be easy to read


Since I don't know how to parse, but I love Damian Conway, I looked at Regexp::Grammars and I said: Oh! I want something like this.

There are numerous resources on stackoverflow on how to parse various base types exactly (floats, strings), how to alternate patterns... So it took me 3 hours to imagine a way to do it. I still know nothing about parsing and stuff, but I knew I would have a result.

I chose a grammar that can be written in a way that avoids backtracking (left to right helped a lot), to keep the regexp under control.
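A minimal sketch of such a left-to-right tokenizer with the stdlib re module. The token classes and the length limit are assumptions for the example, not the grammar of the confined package. None of the alternatives nests a quantifier inside another, so the regexp engine has very little to backtrack over.

```python
import re

# One alternative per token class; the names are hypothetical.
TOKEN = re.compile(r'''
    (?P<number>[+-]?\d+(?:\.\d+)?)(?=\s|$)
  | (?P<string>"[^"]*")
  | (?P<word>[^\s"]+)
''', re.VERBOSE)


def tokenize(source, max_length=1024):
    """Split source into (kind, text) pairs in one left-to-right pass."""
    if len(source) > max_length:
        raise ValueError("input too long")
    position, tokens = 0, []
    while position < len(source):
        if source[position].isspace():
            position += 1
            continue
        match = TOKEN.match(source, position)
        if match is None:
            raise ValueError("cannot parse at offset %d" % position)
        tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


print(tokenize('1 2.5 + "hi there" dup'))
```

The position only ever moves forward, and an unterminated string is rejected at the offset where it starts instead of being guessed at.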

I am not sure of what it does, but I am pretty sure it can be ported to C or whatever in a way that guarantees NO nested/recursive use of resources (regexps are not supposed to stay in a hardened version; this is just a good enough parser written in 3 hours with my insufficient knowledge).

I still think Perl is right


We should run our unit tests before we install. So my module refuses to install if the single actual test I put in (as a POC) does not pass.


Conclusion


So it was really worth the time spent. And now I may be in the «cour des grands» of the coders who implemented their own language, from scratch and without any prior theoretical knowledge of how to write one. So I have been geeking alone in front of my computer, and my wife is pissed at me for not enjoying the day and being antisocial, but I made something good enough for my own use case.

And handling requirements with python while running tests before install is hellish.

(Arg ... And why does my doc not show up on pypi?)

Eval really is even more dangerous than you think


Preamble, I know about this excellent article:
http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html

I have a bigger objection than Ned's to using eval: python has potentially unsafe base types.

I had this discussion with a guy at pycon about being able to safely process templates and do simple user defined formatting operations, with data coming from user input interpolated by python, without rolling your own home made language. Using python for only the basic operations.

And my friend told me that interpolating some data from python with all builtins and globals removed could be faster. After all, letting your customer specify "%12.2f" in his custom preferences for item prices can't do any harm. He even said: nothing wrong can happen, I even reduce the possibilities with a regexp validation. And they don't have the room to put Ned's trick in 32 characters, so how much harm can you do?

His regexp was complex, and I asked him: can I try something?

And I wrote "%2000.2000f" % 0.0, then '*' * 20, and 2**2**2**2**2

All of them validated.

Nothing wrong, is there?
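To see what these three short inputs turn into, here is a small measurement sketch. The integer is measured in bits rather than printed, because it has thousands of digits. Python groups ** from the right, so the last expression is 2**(2**(2**(2**2))).

```python
# Three inputs that fit easily in 32 characters.
padded = "%2000.2000f" % 0.0     # 11 characters of format spec
stars = "*" * 20                 # replace 20 by any number that validates
tower = 2**2**2**2**2            # 13 characters of arithmetic

# ** groups from the right: the exponents are evaluated first.
right_grouped = 2 ** (2 ** (2 ** (2 ** 2)))

sizes = (len(padded), len(stars), tower.bit_length())
print(sizes)  # (2002, 20, 65537)
```

Thirteen characters of input give a 65537-bit integer, and adding one more **2 gives a number with more bits than any machine has memory.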

My point is that even if we patched python's eval function and/or managed sandboxing in python, python is inherently unsafe in its base types, as are ruby, php (and perl).

And since we can't change the behaviour of base types, we should never let people use a python interpreter, even a reduced one, as a calculator or a templating language with uncontrolled user input.

Base types and keywords cannot be removed from any interpreter.

And take the string defined as:

"*" * much

this will repeat the string much times and thus allocate that much memory ... (the same goes for perl, php, ruby, bash, python, vimscript, elisp)
And it can't be removed from the language: the * operator and the base types are part of the core of the language. If you change them, you have another language.

"%2000000.2000000f" % 0.0 is funny to execute, it is CPU hungry.

We may change it. But I guess that a lot of applications out there depend on python/perl/PHP/ruby NOT throwing an exception when you do "%x.yf" with x+y bigger than the possible size of the number. And where would we set the limit?

Using any modern scripting language as a calculator is like being a C coder who still does not understand why using printf/scanf/memcpy deserves direct elimination from the C dev pool.

Take the int... when it overflows, python dynamically allocates a bigger number. And since the exponentiation operator groups from the right (2**3**2 is 2**(3**2), like a power tower in math), it grows even faster, allocating huge amounts of memory in a few iterations. (ruby does too; Perl requires Math::BigInt to get this behaviour)

It is not that python is a bad language. It is an excellent one, because of «these flaws». C knight coders like to bash python for this kind of uncontrolled use of resources. Yes, but in return we avoid the hell of malloc and have far fewer buffer overflows, bugs that cost resources too. And C does not avoid this:

#include <stdio.h>

int main(void){
    /* a 12 character format string asks for 100000 characters of output */
    printf("%100000.200f", 0.0);
    return 0;
}

And ok, javascript does not have the "%what.milles" bug (nicely done js), but it probably has other ones.


So, the question is how to be safe?

As long as we don't have powerful interpreters like python and others with resource control, we have to resort to other languages.


I may have an answer : use Lua.

https://pypi.python.org/pypi/lupa

I checked: most of this explosive base type behaviour doesn't happen there.

But, please, never use ruby, php, perl, bash, vim, elisp, ksh, csh or python as a reduced interpreter for doing basic scripting operations or templating with uncontrolled user input (I mean input not controlled by a human who knows coding). Even for a calculator it is dangerous.

What makes python a good language also makes it a dangerous language. I like it for the same reasons I fear letting user input be interpreted by it.

EDIT: format http://pyformat.info/ is definitely a good idea.
EDIT++: http://beauty-of-imagination.blogspot.ca/2015/04/so-i-wrote-proof-of-concept-language-to.html

Brave HN-ew world

Hello,

I am a troll, and I feel wrongly attacked and pained by HN's new guidelines.

First, to get your attention, I can help solve this mystery: «why do people troll? What is in the brain of these (sick?) persons?»

Well; nothing. It is purely gratuitous.

It is very often a strike of bad luck.

First you have to be partially extroverted, and a little dense towards actual people. Then you have to be in a bad mood, or inspired, or worse, listening to an old argument that leads to a big stupidity.

You know, like: let's take a problem that actually boils down to a K-SAT problem, famous NP-complete stuff that already takes minutes to solve, and pretend we can make that scale and become a viable product...

But, you know dependency hell/devops fortune is a K-SAT problem.

Well, now that the mystery is solved: trolls are pure random events (at least for me).

Let's first show that Aldous Huxley predicted this moment.

My favourite SF book of all time (when I was 12): Brave New World(!)

The story of an asocial guy who lives in a hedonistic society of clones conforming to the standards of «likes» and refusing to hear how they may be wrong.

The poor guy becomes an emo at the end, listening to a Three Days Grace soundtrack...


I am doubly pained because, in fact, I am also a fan of Three Days Grace. I nearly got killed at the concert though... when he marched doing his weird gestures so seriously, I was openly making fun.

These emo fans are so violent....


But you know, unlike Hacker News, I don't think emo bands will get tarnished by my absolutely uncanny trollesque humour.

And they still have not posted an anti-troll guideline for concerts. You know, I am really relieved, because, shit, I am a stupid fan. And I both love them, and love making fun of them (when they deserve it).

Did people try to beat me up in real life?

Well ... Especially in metal bars. I am a metal fan, but sometimes they take it too seriously, so I make fun, and they love to look scary, which is even funnier, so I may have gotten into trouble.

But Three Days Grace does not care: it is one random event out of millions of others that can affect their life.

You know, you can see them as entrepreneurs too. And me as a troll for their business. I am degrading the life of their community and do not «avoid gratuitous negativity» towards this thick-skinned creator...


I may be wrong, but I am a butterfly to the true entrepreneurs: if my wings affect their business, it is either that their business is weak, or the bad luck of a chaotic system, or that their personalities are weak.

I will accept that Three Days Grace could have been hurt a lot when they were young and emotional. They may be Canadians, but if they were bullied goths, it made them write very nice lyrics about it. Yes, I might have been a bully. And you want your startup to be free of them.

You may not want your startup to face early the fact that there are gratuitously negative people, but this will happen.

Okay, I may see the point: you may not want your startups to become emo-based businesses.

But trolls are random events. Believe it or not, in truth it is often a combination of misunderstanding (most of the time that I am right), poor words, and bad mood. And maybe a bad nature (I mean, I really have a hard time not laughing sometimes when reading HN).

A troll has a use: it is a noise that your startup will have to face. Sooner or later... You may not want to discover that a poor 1¢ random event destroyed your $1M toy before you get your investment back.

And, to conclude, under this troll, there is a human, with a soul...

In fact, I prefer to conclude about Brave New World/1984 synopsis:

I don't know why in dystopias the hero has a suicidal tendency to be a troll.

And society has a tendency to hate trolls.

And society wins way more often than heroes that are trolls.

But I will stand as a proud troll! I shall win!

April fool

I love April Fools' Day: I have 30 minutes to say what I think without people knowing whether I mean it or not, whether it is true or false.

IT sux most of the time

What we are doing is insanely complex and breakable and we are overpaid for it.


7 clicks to set an alarm, 20 minutes to begin playing a legally bought Blu-ray disc, having to pirate a Windows to install a Windows that was genuinely bought from Microsoft, my ubuntu distribution actually frying my computers... And thanks to google, music taste is almost regressing from the 2000s to the 1990s...

And it is supposed to be called progress...

Okay young lads, April fools': it really used to be better in the ol' times.





And for taking part in this, my income is twice the median in a world where the rich get richer: I belong to the best of our society...

April fool, this is not true; the truth is a lie

The more I code, the more I love my rice cooker

There is stuff that used to be easy to do in life:

  • set up an alarm on a clock;
  • buy a good and have the warranty magically work;
  • play a video;
  • find and listen to music.
Since the days of webapps, thousands of wannabe Bill Gates have been reinventing the alarm on phones.

The ones that follow your sleep patterns taking light exposure into account, the ones that synchronize with your MyProvider(c)(tm) calendar interoperable with an obscure IETF standard in draft mode, the ones that have a nice interface.

But nothing that actually increased my chances of waking up on time.

My dreamed life, my real life, and all the mistakes coming in the middle

So the other day my lady forgot to set her alarm clock and nearly got fired.

How?

We slipped at the 5th step of the UI enhanced experience of the alarm clock app, after the 4th step of validation... but we had already managed the 4 steps of task switching while tired.

Then we bought an old-fashioned 2-step "I can set an alarm" clock and our problem disappeared.


I like to rant, and I will rant: our so-called improvements are shit.

We add levels of indirection on a phone to handle a task it is not supposed to do: can you really trust a clock with 24 hours autonomy in the first place for waking you up every morning of the year?

And then we add way more tasks in a phone in the name of progress....

Sure we have enough memory and power to do everything but can we do it well?


And then I made a fondue with my rice cooker


I may be disappointed with computers, but I still believe in progress. And tonight I discovered I could use my rice cooker to cook fondue.

When I was a kid I loved this shit: thou shalt melt cheeses together and shalt eat them with dried crunchy bread you lovingly rubbed with garlic (and a tinge of olive oil with basil (and thou shalt drink a wine that complements it or thou shalt rot in hell)).

I am pretty sure it is the lost eleventh commandment of Moses.

The problem was finding the very specific pans and heaters to make it, which cost a lot when I was a kid. Finding in the attic this stupid stuff we used once a year was half of the mission. And had we not had these artefacts, the ceremony of the fondue would have been cancelled, leaving our friends in torments worse than hell.

And then I discovered I could put the cheeses in my rice cooker, use the cook button, and like god on earth descending to atone for my sins, a perfect cheese fondue would be there.

What the rice cooker says about our code

With a rice cooker, the same state machine/interface makes for a lot of awesome uses:
  • cooking rice;
  • cooking amazing al dente broccoli and asparagus;
  • making bread;
  • making Savoyard and Vietnamese fondues ...
With a dazzlingly amazing interface: cook // keep warm.

A cook-them-all in one click...

On the other hand, in my code, every time I make a new feature I have to add a new distinct routing option with at least a new YES/NO branch (that can be implicit).

I must admit in terms of UI, I am freaking jealous of rice cookers: with one interface they solve more than one problem, and me, I have to add new branches every time a new choice is made and I make the application weaker.


Rice cookers should be the model of UI we aim for, going the other way from smartphones:

Whereas smartphones acquire new capacities by making the interface more complex, rice cookers are so well designed that with the same interface, the simplest and most efficient one in the world, they can cope with more than one problem.

I am French, making fondue when feeling homesick means a lot to me...

I have all the more respect for these eastern geniuses that devised the smartest versatile device that is my model of simplicity.

I wish my code was a rice cooker.