My problem with Computer Science pseudo code

I remembered opening books of Computer Science to learn algorithm when I began. And my first reaction was: it seems unnecessarily complicated like maths, but I must be an idiot since I learnt physics.

So 15 years later I reopened Introduction To Algorithms By Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, Clifford Stein. And I decided to see if my experience in coding would help me appreciate this book for all the knowledge it could bring me.

So I decided to take a look at the simple heap structure and algorithms:

Let's look at the pseudo code (see here for the whole chapter):

When I see this pseudo code, and think of actual production code, I am dazzled:

  • it has meaningless variable;
  • API and variable names (like the global heap_size) are quite hard to grok; 
  • I wonder if recursion is really important knowing it is hard to read, debug, and may reach a recursion limit;
  • I have a hard time understanding how it works. 
So  I thought, maybe I am just an idiot and I tried to see if I could do a better job at writing pseudo code which would help understand. (I used python as a pseudo code checker)

Here is my heapify function (the whole code is here  https://gist.github.com/3927135 if you want to check the correctness):

def heapify(_array,index=1, heap_size=SENTINEL):
    """bubbles down until it is correctly located.
    heap_size can be used internally for optimization purpose"""
    if heap_size == SENTINEL:
        heap_size= len(_array)
    a_child_is_bigger= True
    while a_child_is_bigger:
        largest= index
        left_pos, right_pos = left_offset(index), right_offset(index)

        #check left
        if left_pos < heap_size and _array[left_pos] > _array[index]:
            largest= left_pos

        #check right
        if right_pos < heap_size and _array[right_pos]>_array[largest]:
            largest= right_pos

        if largest == index:
            # we are finally the king of the hill, end of story
            a_child_is_bigger= False
        else:
            #swap with child // bubble down
            _array[index], _array[largest] = _array[largest], _array[index]
            index= largest

And by coding it revealed what I feared the most: CS and computer developers don't live in the same world: the way a code is written matters.

  1. getting read of the asymmetric test line 5 greatly improves the understanding of the logic;
  2. by using full name for tests conditions you really help people (you included) understanding thus maintaining your code;
  3. recursion comes in the way of understanding straightforward code; 
  4. one letter variable names are really unhelpful variable names;
  5. their code typography is even worse than mine.
I have made python code to check the whole logic: writing a tinge more readable code does not seems to prevent it from working (I may have a problem on the boundary because this hidden heap_size variable is a pain to understand).

By the way if you want really efficient heap structure in python please use : http://docs.python.org/library/heapq.html because it is just plain better, superbly documented, tested and the source code is quite nice too.

Now, I am plain lost: when I check CS papers because I want to solve a problem I often stumble on unreadable pseudo code and verbiage. Understanding the API, the variable name ... takes an awful lot of time. When I write in python I have the feeling that it is a Pseudo Code Translator, so if we assume my implementation is close visually and logically from pseudo code, does it costs that much to improve pseudo code readability?

If I can do it knowing that I am one of the most stupid developer among the one I know, why or how can't they do the same. Pseudo code is for sharing knowledge, do we really need to waste time de-ciphering these pseudo codes to be a real or better developer?

When you see equivalent code in industry grade application, you normally hate that. It is plain bad practices. Isn't computer sciences supposed to teach elite grade developers?

And when I see such code, I sometime wonder if the guy understood what he wrote.

Meaningful variables names, test conditions, readability is what matters, because code is meant to be improved and fixed.

With all these questions in my head I came to a conclusion: I don't take for profound what I think is uselessly obscure, so my opinion of CS papers and academics has dramatically dropped.


EDIT: now, I see that bulbbling down and bubbling up can be extracted from the logic, and I guess it can be used to make a heap usable for min/max efficient inserting/extraction.


10 comments:

rodrigob said...

Please note that 15 years is "very very old" in computer science terms. You are judging the work done 15 years ago with today's perspective.

This is as judging last century medicine with today's perspective.

In the last 15 years the practices in CS research and publishing have evolved. Open access and open source publications are now much more common and "popular methods" usually also come with reference implementations (in real code !).

Florian said...

CS aims to produce degrees which are built on papers. The formalism in use isn't to make things easy to actually understand and do, it's designed to make it more likely your peers will accept your paper leading to your degree.

CS and real world coding have different goals. One is preoccupied with mental masturbation, while the other is with making stuff work.

Anonymous said...

Computer Science is not programming/Software Engineering, it's almost a branch of mathematics, and in mathematics they don't need verbose "variable" names because theorems and proofs are relatively short (oh, and they have a lot of cool greek symbols spread in multi-level sub/super-scripts). So if an algorithm is derived from mathematical definitions I find it natural to have mathematical-style naming.

Also, to be honest, I like the original pseudo-code more. I think every real-world programming language has too many micro-semantics to serve good as pseudo-code (in your example it would be: default function arguments, tuple assignments, _array name to avoid conflict with array module).

Anonymous said...

Your code would be easier on the eyes if you used a style checker. This one is quite good: https://github.com/jbalogh/check

Michael Bastos said...

I knew how to do real world programming long before I ever step foot in a classroom to get my CS degree. The battle here is not between CS and Real World Development, it's between the basic concepts of needing a degree to do what you do and just being intelligent enough to learn it on your own. All major degree requiring jobs that exist today followed that same path and were once mastered by people who could simply learn it from home by doing nothing more than reading books and practicing yourself. Lawyers once could once be self taught and as long as they could pass the bar they could practice law. Doctors once could practice on their own and as long as they were good and could pass a few tests they could do their job and practice their skills. So though the CS programs for most schools and the research in current CS departments outside of Stanford or MIT have a ton to catch up on with the private sector, eventually every industry takes this same path. Again the problem is not with CS but with the whole notion of whether or not you need a degree to do what we do, pseudo code or not, there is still a ton of catch up that the scholarly side of academia can catch up to what a trillion dollar industry is doing in what has become a development space race.

jul said...

On one one hand it seems to be still the algorithm used:
http://en.wikipedia.org/wiki/Binary_heap see the references? This book is quoted (the same one I link too). (And so is python heapq:)
And yes the wikipedia pseudo code is copied from this very book.
So my point is still valid


On the other hand, looking at recent papers (lru cache, web mining, data clustering) when I can find an article free to read with pseudo code, I am not convinced the situation really improved.


John Haugeland said...

It was terrible code 15 years ago, too.

HaakonLøtveit said...

Which one is trying to make stuff work?

Because sometimes I'm not so sure anymore... :(

bellyfullofrage said...

Universities are degree factories and they sell them to people, that's how they stay in business. You don't get a CS degrees to get a job, or at least you shouldn't. You study CS because you love it and have a desire to learn and advance the craft. Mathematics and physics both have their own formalisms that allow people to share ideas and computer science (relatively new compared to math/phys) is still trying to come up with universal formalisms. Pseudo code is just another language, a tool, if you can't "grok" it then you're in the wrong business. You can't go running back to something comfortable (python) every time you run into something that scares you, that just means you don't know anything about computers, you just know python. Go deeper. Develop a real understanding and use python (or your tool of choice) to apply your understanding.

jul said...

you are full of rage indeed. :)

If I can make a clearer code that highlight two sub functions bubble up and bubble down, and that then I spot that a certain symmetry has been missed (min/max up/down) then it means I understood the pseudo code, and I see that we even could achieve a min-max-heap easily.

But the only reason I can make code that is clear (with full name variables) is I understood. I don't believe that messy unreadable code is the sign of comprehension.

That's because I can grok it that I feel the author was not comfortable with his topic.

What is obscure it not profound, it is just the sign of a confused mind.