Lost in cache optimization

Premature (and Micro) optimisation is the root of all evil

We all know this, don't we? And even though I know it, I do still fall in the trap.

Let me tell you my story: it all began with profiling a basic web log parser  I did fairly fast (core of the project was done in 2 days (demo here)). But before releasing it, bmsipelon wanted to check if the implementation was efficient enough so he made a benchmark of the algorithms. Thanks to him I corrected a bug in my library.

You may notice we've had a basic memoizer since the begining?

I was quite raging that his code was running faster than mine, so ... I decided to begin optimizations... And all hell broke loose.

Context : I was parsing a 191K lines apache log in 24 seconds with bmispelon implementation and in 30 seconds with my implementation. And I wanted to beat him badly.  That's called coopetition (competing in cooperation).


After running runsnake I had this gorgeous report :

Well, I also discovered map reduce is more costly and not that memory efficient compared to a simple loop Simple is better than complex.

Also, Hankyu is not the most efficient (it is beaten by default dict). The fractal look of __iadd__ and isintance is the cost of recursion. Remember: Flat is better than nested.

However, to change the shape of his results he needs to change a lot of code, I only need to change one argument. Cost vs advantages.

I thought that by getting rid of the fat (in httpagent parser, and date formating) I would win my 6 seconds.

As a result, I tried to improve the cache, by measuring its impact. Then I measured and here was the result for the cached date_formater:

    "date_format": {
        "len": 143142,
        "hit": {
            "hit": 49972, 
            "time": 0.08554697036743164
        "miss": {
            "hit": 143142, 
            "time": 4.912769079208374

It was a pleasure, and a drawback: my strategy seemed efficient in terms of performance, but explosive in terms of memory. So I decided to give a try to py3 functools.lru_cache and beaker.

functools.lru_cache and beaker were taking 30 seconds instead of 26 seconds! I thought to myself: these guys must over engineer. So I made my own crude fixed size LookupTable. As usual, Baptiste bested my circular buffer ring like implementation with its implementation. I had carefully looked at all big O notation for the complexity and tried to choose the containers according to my needs. So I was raging.

I also thought: Hey genius, date in logs are monotonically growing but sometimes constant, then why handle the burden of a dict that will blow up your memory, when storing a simple pair of values works? So I implemented, and «measured» with my tool the impact of a monotonic cache:

    "date_format": {
        "hit": {
            "hit": 49972, 
            "time": 0.0680077075958252
        "miss": {
            "hit": 143142, 
            "time": 4.508622646331787
Hurray! I won an award of the best idea of the century (probably existing in some old FORTRAN code) and rediscovered that caching strategies should take the shape of the data in consideration. Not only was my algorithm more efficient, but also sparser in terms of memory. So it was time to prepare some nice data to share my discovery.

So I decided to add more caching backends and test them all to look serious. I added repoze.lru_cache.

Here were my first data :

  • dict is a basic memoizer (the canonical example given everywhere),
  • repoze is repoze.lru_cache with maxsize=10000,
  • fixed is memoizer with Baptiste's custom fixed size lookup table  maxsize=10000
  • beaker is beaker.cache in memory and expire time = 600


Repoze.lru_cache is 27.5 seconds where functools lru_cache was 30 seconds, and faster than the crude fixed size lookup table?

So I looked the source code. Even though it was more complex it was thread safe and faster. I did not imagine a brain dead close to the metal implementation would be beaten by such a complex code (in terms of single lines of code and tests (they even recoded/optimized modulo!)).

At least, I still had the pride to think my monotonic messy but efficient cache  would be quite a news. But, repoze.lru_cache had seeded some doubt in my brain. So I decided to double check my results. On a tinge of wit I removed my monotonic cache in contradiction with my measurements. So I tested both repoze.lru_cache in replacement of my monotonic cache and the basic memoizer without any cache on the date formater ... just in case.

  • dict + nc is the dumb memoizer with no cache on date formating,
  • repoze wo mono is repoze.lru_cache caching also the date time formater

What happened? 

In my head, I had already anticipated that on the opposite of web caching, my function could be insufficiently costly to use a complex cache. So I made a «close to the metal one value cache». But I relied too much on my own measurements, and forgot to make the test that matters: real time analysis with the dumb time from bash on real data...

Yes, I won : my monotonic cache is indeed faster than repoze.lru_cache which is my archnemesis in this test, but caching this very fast function is dumb useless.

How could I have avoided all this loss of time?

By remembering that Premature optimization is the root of all evil. Let me show you the big picture
Had I begun optimization after a no cache version would I have discovered I was doing breadcrumbs optimization. 4 seconds faster than beaker is impressive (more than 15%).
Winning 4 seconds in regards to 80 seconds (5%) at the cost of :
  • development time;
  • non thread safe implementation;
  • my code is a battlefield;
  • no unit tests;
  • headaches;
  • close failure
is pathetic. I will probably finalize my API without my cache implementations.

Morals of the story


My measurement missed the overhead of the decorator's call time. A decorator adds a call to a function call, and there is no way to measure it, I over focused on improving cache but I lacked the point that no cache were needed. Never dumbly trust your measures. 

I also (re)learnt my lessons: measures can trigger artefacts, the real data world is often more complex and diverse than the one tested in a beaker. There are some one best way of caching however caching is still data dependant: there is no «one size fits all» caching. If you want to use a cache, it is a good practice to measure objectively that your caching strategy is the good one in situ. There are however some pretty safe strategies.

You should always test your new ideas against real data. Even if you carefully chose your algorithms you might have overlooked one bias (such as overlooking the pathological cases for key deletion in a dict). And what your unit tests may not reveal may appear in production which is the worst place to discover a design bug. Theory is great, so are experiments. They should not be seen as opposed, but rather as complementary.

Caching is like cryptography: you'd best use other's framework

Try to follow the general API, so you can change your backend. There is no such thing as an ultimate cache since caching is really data shape and context dependant.

Never lose sight of the big picture

Had I watched carefully my profiler would I have had noticed what I rediscovered: I was losing time with micro optimization.

Cache benchmarking results

In my test caches are never over filled, because I begin with empty cache and I allocate 10000 entries per cache as a default. yahi.speedshoot taking more than one file as an input, users might have a slight decrease in performance for big series of file with fixed size cache (cost of deleting keys in MutableMappings). However since caching is context dependant (do you parse files from a same sites, or different sites) thus these results may vary with the data you really parse. I cannot guess what is the best strategy, however any cache decorator should be made in a way that you can clean the caches. repoze.lru_cache misses this feature (and I have it \o/).
My monotonic cache is a good idea (ego boosting), but I was wrong on the use case. And as long as I have no case in which it works, it is a dumb idea.

Cache APIs are standard enough (cache provider handling the implementations for the real cache, providing cache invalidation and providing a cache decorator) that you should always make an adapter so that you can change your caching backend.

Beaker.cache is impressive in terms of functionality and could make my code thread safe and memory safe. Complexity is improved and it has a cost, but also some returns on investment.

Memoizer strategy is the fatest cache, and also the simpler to code. It is a good arrow to keep in your quiver in most circumstances, but remember your cache will grow without control.

Read before coding

I learnt a lot about caching with wikipedia, stackoverflow and pycon slides that were all over internet. Before coding, it is always a good idea to search internet, not for code but studies. Standing on the giant's shoulders is not lame, it is about focusing on the exciting new ideas that were not coded. Sometimes, it is fun to try to test the established wisdom by implementing your variant. You may win, you might lose, but you will have fun. Maybe I did not find any reference to a monotonic cache because it is lame, maybe I don't have the good keywords, maybe it is a good idea, I'll never know but  I took the chance. You'll never win if you never fight.

Algorithms are evolving, nothing grants that what you learnt is still up to date (even the iFFT algorithm is being renewed).

One pride

I did not ridiculed myself!

My teachers used to say if your results are astounding you should double check them since there is a 99% probability there is a mistake somewhere.

And they said: even if you fail, remember errors are fruitful since one never learns from its success but from its mistakes. Sharing your mistakes with the community is less rewarding but is as important as sharing your success.

This was my mistake: Premature optimization is the root of all evil.

PS here is my code for plotting:


Marius Gedminas said...

Timestamps in Apache access.log are not necessarily monotonic. Requests are logged as they complete, but the timestamp shows the beginning time.

jul said...

Indeed :) Even though I dropped this as a good idea.

We try to have our content fast enough so that load time is significantly less than 120ms per content (imagine how awkward an HTML page containing 5 js 4 pictures 2 css one favicon would be if it took .5 second each). So I am absolutely wrong and you are right, still experimentally I guess I am often enough right. ^^
Okay, I'll take a few minutes today to measure how often I am right, and I'll give you the result.(I guess a oneliner in Perl will do it)