Data Types are taught erroneously.


On irc #python-fr, we see numerous students struggling against their assignments. Mostly, they have not learnt the difference between an abstraction and an implementation. So I made a sample of Conway's game of life on github to illustrate what is wrong.

https://github.com/jul/game_of_life


http://en.wikipedia.org/wiki/File:Gospers_glider_gun.gif

 

 

Back to my first lesson in ... VLSI design. 



I did not graduate in CS but in micro electronics. I think our first lesson should be taught to CS students, because it highly impacts the way you code.

In electronic we think at wire level, with transistors. But, a circuit might have millions of transistors. Since, this complexity is not manageable either if you ignore how transistors works, or if you get lost in the wire schematics our first lesson is how not to blow our mind by mixing Bottom/Up approach and Top/Down.

Top Down or the blackbox approach


Top Down is about writing high level code, the more abstract possible, and about wiring sub blackboxes in bigger blackboxes. What is important in blackboxes is not how smart you are at what and how you implement inside the box, but how smart your at wiring your components, so that you can change your approach in case something goes wrong. When programming it is mostly the API level.

Top/Down is about designing smart interfaces

Bottom Up approach or knowing your limitations



Bottom Up is about focusing on the limitations of your circuitry (asynchronous ...) and its strengths. In a language such as python, basic Data Types, and the functional domain (recursion limited @1000 frames) ... This constrains your  development.

Bottom/Up approach is about building the inside of the blackbox knowing your limitations

The problem with students



Many students confuses the abstraction with the implementation, and game of life code is like a painful struggle to figure out what belongs to the game of life, and what belongs to the implementation.

See here a classical code

It is unreadable: you suffer with the coder.

If you want to change the rules you need to change the data type used. And you cannot change your implementation easily. Everything is mixed up.

Do you really think this is a good way of coding?

Top Down : Game Of Life main code is about stating explicitly the rules of the games.



from util.matrix import matrix

X = 16
Y = 14
DEAD = 0
ALIVE = 1

grid = matrix(X, Y, [DEAD]* X*Y)
oscillator = [(0,0),(0,1),(0,2)]
still = [(0,0),(0,1),(1,0),(1,1)]
glider = [(0,2),(1,2),(2,2),(2,1),(0,0)]

def at(grid,x,y, pattern):
    for dx,dy in pattern:
        grid.set(x+dx,y+dy,ALIVE)

at(grid, 1,8, glider)
at(grid, 2,3, oscillator)
at(grid, 6,5, still)

while True:
    time+=1
    print grid
    sleep(1)
    n_grid = grid.copy()
    for x in range(X):
        for y in range(Y):
            if grid.get(x, y):
                n_grid.set(x, y, grid.nb_living_around(x, y) in [2, 3])
            else:
                n_grid.set(x, y, grid.nb_living_around(x, y) == 3)
    grid = n_grid


By importing matrix, I clearly state the grid is a matrix like blackbox. I say the blackbox needs : a constructor, a copier, a 2D setter, and 2D getter.

My code is naive, therefore it is not abstract? How wrong: writing human readable code in a natural language fashion that seems idiotic is the highest level of abstraction.

The more your exposed code seems simple and human readable, the more abstract your code is.

Bottom Up : Flat is Better than nested !



In any programming language / architectural design flat is better than nested, since an array of array requires a double allocation, double addressing  ... it brings twice the trouble of a flat array.

If you access  a framebuffer, a picture, you work on matrices, but in reality it is a 2D array  abstraction given to you. The truth is video card, computers are working with linearly addressed memory because they are very good with contiguous chunks of data.

So if you have a look at matrix you'll see :


class matrix:
    
    def __init__(self,size_x,size_y, array_of_mutable):
        self.size_y=size_y
        self.size_x=size_x
        self.matrix= array_of_mutable

    def _oneD_offset(self,ix,iy):
        x,y= ix%self.size_x, iy%self.size_y
        offset = y*self.size_x+x
        return offset
        
    def get(self,x,y):
        """CBP powa"""
        return self.matrix[self._oneD_offset(x,y)]
    
    def nb_living_around(self,x,y):
        around = [ -1, 0, 1]

        return sum([
            int(self.get(x+dx,y+dy)) for dx in around for dy in around
                if (dx,dy) != (0,0)
        ])
        
    def set(self,x,y,val):
        self.matrix[self._oneD_offset(x,y)]=val
    
    def copy(self):
        copied_matrix = None
        if hasattr(self.matrix, "copy"):
            copied_matrix = self.matrix.copy()
        else:
            copied_matrix = [ x for x in self.matrix ]
        return matrix(self.size_x, self.size_y,copied_matrix )
    
    def __str__(self):
        to_print=" "
        to_print+=" ".join([ "%2d" % x for x in range(self.size_y) ])
        for x in range(self.size_x):
            for y in range(self.size_y):
                if (y==0):
                    to_print+="\n%2d " % x
                to_print+=" %2s" % ( "X" if self.get(x,y) else "." )
        return to_print


Well : this code states matrices are an abstraction, we are still using a blackbox approach I am even agnostic on the backend. My matrices are a view on any 1D Sequence.

Benefits : 

  • transposing is just about changing size_x, size_y thus I don't need costly transposition operations; 
  • copy is flat; 
  • tr(A+B) = tr(A) + tr(B) same goes for and/or/not/sub. If the Sequence supports element by element additions for two Sequences (like numpy arrays) theses operations will be dazzlingly fast;
  • I can change my «backend/data storage when I want». 

Duck Typing now (using interfaces)



Since I have mixed my approach I can now do neat things if my constraints are changing since my matrix is agnostic about the real data type I can use any Sequence :


### All these works !
#grid = matrix(X, Y, bytearray(X*Y))
#from numpy import array,zeros
#grid = matrix(X, Y, zeros(X*Y))
#grid = matrix(X, Y, [DEAD]* X*Y)
#from collections import defaultdict
#grid = matrix(X,Y,defaultdict(int,{}))
from util.weird_array import Bitmap, SparseArray
#grid = matrix(X,Y,Bitmap(DEAD))
grid = matrix(X,Y,SparseArray(set()))
#### end of golf

Bitmap is a Sequence interface wrapping an integer, SparseArray is a sparse implementation based on a set, defaultdict, or numpy.array and even bytearray work. A good coder should require as few constraints as possible on its real implementation for having the more escape routes in case something bad happens :

  • if you have very little memory and don't care about speed, Bitmap is the best;
  • if you want speed, the list implementation is slightly behind bytearray in performance, but is more polyvalent (you can implement a cellular automata with states > 255);
  • if you have a sparse board (the number of cells being alive being very more important than the dead one), then the sparse array is for you ...
  • Still the naive list implementation is the optimum regarding all criteria. 

Data literacy is not implementation, it is about properties and interfaces 



Most student we meet on IRC think knowing data type is knowing the exact big O notation for every concrete class operations. And, they never think of encapsulating data as views.

So, you painfully have to try to explain them, the interface of the data types are more important than the data type implementation, and also that data types are more then concrete implementation they are abstractions. And they struggle painfully because, they use the good abstractions, but the wrong implementation.Because they can't dissociate both.

I strongly suggest students to reason in terms of abstract base classes as defined here (which are interface based definitions of the data types) :

http://docs.python.org/library/collections.html#collections-abstract-base-classes

rather than in concrete implementation. But the problem is they are taught to do so!


Wrapping it up  : it's all about Interfaces



Always present a human friendly interface in TopDown approach, implement computer friendly data types in BottomUp approach.


I present a grid interface in the main file, this is the top down approach, the code real value lies in the fact it will be used by human. Still, the implementation, is computer friendly. 



As a result, the best compliment for a coder is when you are despised for writing naive code (at top level), and hated for not playing the same song at the wire level, because that is the nature of code.  

Coders are the mediators between two consistent yet exclusive logic : the one of the computers, and the one of the human beings. Coding is about making this mediation transparent.


So please, stop showing me the guts of your code and expect me to be impressed: I find it gross and a proof of your ignorance of what coding is.  And please teachers :  teach abstraction to your student, or leave the kids alone.

2 comments:

Bruno said...

http://swanson.github.com/blog/2012/05/31/primitive-obsession.html about another issue with relying on basic data types at the API level. You can easily be locked down with not enough abstraction.

jul said...

That looks like the perfect example of a leaky abstraction : an abstraction spilling its gusts at the higher level.