On geometry, commutativity and relativityTLDR; It all boils down to the definition of time and the rupture of symmetries.
A distributed system is a system on which code can be executed on more than one instance independent and will give the same results wherever it is executed.
On ideal distributed system as a vectorial system.
For a distributed to work, you need a minimal property of the functions that are passed: the operation needs to be commutative (and distributive).
Be A a set of data, f, g functions that applies on the data and A[i] subset of data on instance i.
f(g(A)) == «Sum» of (f x g ( A[i])) on all instances/partitions.
Distributed system for avoiding SPOF are rerouting operations on any instances that is available. Thus the results should be idempotent wherever they are made.
We can either work iteratively on a vector of data, or in parallel on each element as long as there is no coupling between each elements (which can be expressed as for k, l with k!=l and k, l < i then A[k] dot A[l] == 0, or that each element are orthogonal/without relationships, thus the set of elements is a base of size i)
map reduce philosophy is about stating that data in n different location can be treated indepently and then reduced.
They are 2 kinds of functions (given you work on the base):
* Transformations ( V ) => V These functions applies a geometric transformations into space (rotation, translation, homothetia, permutation) also called Observables.
* Projectors ( V ) => Vi that are reducing the number of dimensions of a problem.
Data is a ket |ai> of states
Transformations are Operator applying on the Kets such as O|ai> = |bi>
if there exists an operator O^-1 such as O x O^-1 = identity than O is reversible, it is a Transformation or mapping.
O is called functions
|ai> is input data
|bi> is called output
If dim(| bi >) < dim(| ai>) we have a projector
If dim(| bi >) > dim(| ai>) we have a local increase of information in a closed system.
Given a well known function that are linear we have for a composed function to be a transformation of the significant space of data the property that O x P = P x O or that [P, O] = 0 (the commutator of f, g) then you can do out of order execution.
But sometimes Projectors and Transformations are commutative :
from __future__ import division
from random import randint
from numpy import array as a
DATA_PER_SERIE = 10
MAX_SERIE = 100
data = a([ a([ randint(0,MAX_INT) for i in range(DATA_PER_SERIE) ]) for s in range(MAX_SERIE)])
In actual CPU, DIV and ADD are NOT commutative.
time(ADD) != time(DIV), at the least reasons, because the size of the circuits is not the same and because min(time) = distance/c where c is the celerity of the propagation of the information carrier. If the information carrier is the pressure of the electron gaz in the substrate (electron have a mass, they travel way slower than light, but pressure is a force that is causal thus c is the speed of light). What is true in a CPU is also true when considering a distributed system.
Computer are introducing loss of symmetries, that is the root of all the synchronization problems.
It happens when we have less degrees of liberty in the studied system than in the space of the input.
When we do this, it means that we are storing too much data.
For storing enough data you need to have a minmal set of operators such as given O, P ... Z each operators commutating with each others. It is called a base.
Given a set of data expressed in the base, the minimal operations that are commutative are also called symmetries of the system.
Applied to a computer problem a computer scientist might be puzzled.
I am globally saying that the useful informations that makes you able to make sense of your data are not in the data, nor in the function but in the knowledge of the functions that as a pair commutes when applied to the data.
Knowing if two dimensions i, j in a set of data projected in the base is equivalent as saying that i and j are generated by two commutative operators
I am saying that I don't know the base of the problem and/or the coupling if I find to operator such as for any input [O,P]=0. // OP|a> = PO|a> THEN I discovered an element of the absolute pertinent data.
given Actual Data |ai> and |aj> where max(i) = n
then <ai|aj> = 0 if and only if there exists 1 Projector I that projects the |ai> and |aj> on two different transformations.
The iron rule is the number of degrees of liberties of lost resulting by applying I must never results in having less dimension than the base.
First question: How to I get the first function?
Second one, how do I know the size of the base of the functions that combined together describes the system in its exact independent degrees of liberty (the minimum set of valuable data)?
And last how do I get all the generators once I know one?
Well, that is where human beings are supposed to do their jobs, that is where our added value is. In fact, you don't search for the first operator of the base, you search for sets of operator that commutes.
Determining the minimum sets of information needed to describe a problem exactly with independent informations is called compression.
So what is the problem with big data? And time?
Quantum mechanic/Vectorial/parallel computation is nice but is has no clock.
In fact I lie.
If given n operations [ O0 ... On ] applied to a set of data there is one called P such as [ On, P ] !=0 then we can't choose the order of the operation.
The rupture of symmetry in a chain of observable applied to data introduces a thing called time.
As soon as this appears, we must introduce a scheduler for the operation to make sure the chain of observables commuting are fully applied before chaining the next set of operations. This operation is called reduce.
That is the point where a system MUST absolutely have a transactional part in its operations.
Now let's talk about real world.
Relativity tells us that time for any system is varying. On the other hand our data should be immutable, but data that don't change are read only.
And we want data to change, like our bank account figures.
And we also want that we don't need to go physically to our bank to withdraw money. And bank don't want you to spend more money than you have.
This property is called transactionality. It is a system accepting no symmetry, thus no factorisation.
It requires that a chain of operations MUST not be commutative.
At every turn a non linear function must be performed:
if bank account < 0 : stop chain.
This breaks the symmetry, and it requires a central point that acts as an absolute referential (with its clock for timestamping).
Banks are smart, they just don't use fully transactionnal systems, nor distributed systems ; they just use logs and some heuristics. There must be a synchronicity time attack possible on this system.
On the other hand since operations are not possibly chronologically commutative on a computer and all the more on a set of computers, it means distributed system main challenge is «time stamping» the events.
We know since Einstein that clocks cannot be the same on a distributed system without a mechanism.
Every body thinks NTP is sufficent.
But, NTP has discreet drifts. This drifts that are almost non predictable (sensitivity to initial conditions) introduces a margin of uncertainty on time.
Thus for every system the maximum reliable granularity should be computed so that we can ensure the information has physically the possibility to be know before/after every change.
The bigger the system, the higher the uncertainty (relativity ++).
Given the reduced operations that can commute on set of data, the clock should also be computed given the length of the maximal operation.