Why big data is a fraud: the actual dot com bubble according to CS 101

Just yesterday I was told: you know this big O notation, this database index, it is 30 years old. It is not true any more.

And I answered: "Well, Newton said that if you jump from the 7th floor of this building you should die. That is more than 300 years old, so why don't you give it a try? This knowledge is so old it should be obsolete."

The fact is big O notation still matters.


CS 101 cheatsheet


Basically, the first lesson in CS 101 is about complexity. It is very basic (except for the readers of Hacker News, who need a D3.js animation to think they understand something and cannot read text).

It says: whatever you do, the bigger a container is, the more resources it takes to retrieve an item from it. However, you can trade memory for speed (and vice versa).

Electronics 101 says: you can have all the memory you want, but the bigger it is, the more it will cost you in wiring. You can save money by adding indirection, which costs speed. And since the wiring for linear addressing grows more than linearly with size, you end up having to trade speed for money. Which means you get diminishing returns.

What does it mean?

Imagine you are poor and have one pair of socks. How much time to get it? One iteration.

You are rich, you have 1000 pairs of socks and want to find one. How many iterations?
Well, it depends. If you are organized and have space (memory), you can pre-organize your socks in a fancy way. It will then take on the order of log2(1000) ≈ 10 steps to find your socks. If you are poor (or not knowledgeable in the art of organizing socks), you will have to examine your socks one by one, with an expectation of about 1000 / 2 = 500 socks examined on average (the worst case being 1000, with a probability of 1/1000).

Hmm, 10 steps vs 500 seems pretty good.
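The sock drawer can be checked with a small sketch. It assumes the "fancy" organization is simply a sorted list searched by binary search (log base 2), and it counts examined socks rather than real CPU cycles; the function names are made up for the illustration.

```python
def linear_search_steps(items, target):
    """Count how many items are examined before target is found."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps
    return len(items)


def binary_search_steps(sorted_items, target):
    """Count how many probes a binary search needs on a sorted list."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return steps
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


socks = list(range(1000))  # 1000 pairs of socks, already sorted


def average_steps(search):
    """Average number of steps over every possible sock we may look for."""
    return sum(search(socks, target) for target in socks) / len(socks)


avg_linear = average_steps(linear_search_steps)
avg_binary = average_steps(binary_search_steps)
worst_binary = max(binary_search_steps(socks, target) for target in socks)

print(f"unsorted drawer: {avg_linear} socks examined on average")
print(f"sorted drawer:   {avg_binary:.2f} on average, {worst_binary} at worst")
```

The unsorted drawer averages 500.5 examined socks, while the sorted one never needs more than 10 probes.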

After all, current computers have up to 16 cores running at 2.8 GHz, that is 2.8 billion cycles per second (the limits being the speed of light and heat dissipation).

Well, no.

Transactions (the stuff that yields money) require a timeline. An order. This is guaranteed by using a single core. So you are bound to 2.8 billion cycles per second for transactions, and that only as long as your data has no relationships...

You may use multithreading, but you will have points of serialization (join/lock/memory barrier), and doubling the cores/processes/threads/instances/containers tends to yield only a +40% increase in speed. It has diminishing returns.
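To see what the "+40% per doubling" rule of thumb does to a 16-core box, here is a rough sketch. It takes the 1.4 factor at face value as an assumption (it is a rule of thumb, not a measurement), and `speedup_from_doubling` is a name invented for the example.

```python
import math


def speedup_from_doubling(cores, gain_per_doubling=1.4):
    """Speedup over one core if every doubling of cores multiplies speed by 1.4."""
    doublings = math.log2(cores)
    return gain_per_doubling ** doublings


for cores in (1, 2, 4, 8, 16):
    speedup = speedup_from_doubling(cores)
    efficiency = speedup / cores  # share of each core that does useful work
    print(f"{cores:2d} cores: x{speedup:.2f} speed, {efficiency:.0%} efficiency per core")
```

Under that assumption, 16 cores buy you a bit less than 4 times the speed of one core: you pay for 16 and get the work of 4.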

In the long run CS 101 basically says: "the more data you store, the more resources you will use to search for an item". Costs grow monotonically with size, and efficiency scales less than linearly: each customer adds data, and each lookup gets slower as the data grows, so the total bill grows faster than the customer count.

Basically, if 100 customers cost you $1 to handle, 1000 customers will always cost you more than $10.
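A back-of-the-envelope sketch of that claim, under one loudly stated assumption: every customer triggers one lookup, and a lookup in a well-indexed store of n customers costs about log2(n). That is the friendly case; the cost model and the names below are mine, for illustration only.

```python
import math


def total_cost(customers, unit_cost):
    """Assumed model: one indexed lookup per customer, each costing log2(customers)."""
    return unit_cost * customers * math.log2(customers)


# Calibrate the (hypothetical) unit cost so that 100 customers cost $1.
unit_cost = 1.0 / (100 * math.log2(100))

cost_100 = total_cost(100, unit_cost)
cost_1000 = total_cost(1000, unit_cost)

print(f"100 customers:  ${cost_100:.2f}")
print(f"1000 customers: ${cost_1000:.2f}")
```

With this model, 10 times the customers cost 15 times the money, and this is with a perfect index and no relationships between records.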


How bad is it?

Bad.

Going to a register is ~1 cycle
Going to cache is ~4 to ~40 cycles, depending on the cache level
Going to RAM is a few hundred cycles
Going to a hard drive is millions of cycles (a seek takes milliseconds)...

You can trade memory for speed, but the cost of indirection gives the system an absolute minimum latency.

You may use a cache. But it only reduces latency as long as your working set fits in it. It just masks the problem, with a delayed effect.
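The delayed effect of a cache can be sketched with a toy model. Assumptions, all mine: accesses are uniformly random over the working set, so the hit rate is simply cache size divided by working set size, and costs are in relative units (fast tier = 1, slow tier = 30, an illustrative ratio and not a measurement).

```python
def average_access_cost(working_set, cache_size, fast_cost=1.0, slow_cost=30.0):
    """Average cost per access under uniformly random access (toy model)."""
    hit_rate = min(1.0, cache_size / working_set)
    return hit_rate * fast_cost + (1.0 - hit_rate) * slow_cost


cache_size = 1_000  # number of items the cache can hold (hypothetical)

for working_set in (500, 1_000, 10_000, 100_000):
    cost = average_access_cost(working_set, cache_size)
    print(f"working set {working_set:>7}: average cost {cost:.2f}")
```

As long as the data fits, everything looks cheap. Once it outgrows the cache, the average cost climbs back towards the cost of the slow tier, as if the cache were not there.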


But however smart you are, resource use is a more-than-linear function of the size of the collection you search in, and relationships make it worse.

It basically means that even if you are Google or Facebook, the more data and customers you have, the more your costs will increase.

In economic terms, it means: the more customers you have, the lower your profit on each of them.


Are dot-coms stupid?



Hell no. It is all a question of opportunity. Thanks to QE, the stock exchange is awash with liquidity.

These diminishing returns are noticeable/measurable only after big enough growth. Time series grow linearly with time, and the customer base... grows its own way (slowly most of the time, and the IPO often comes before the full success).

Venture capitalists are in it for the money. They don't care if the market is not sustainable in 15 years; they aim for a profit within 10 years, when the effects will not yet be noticeable.

Developers are... well... clueless, or in need of money to repay their student loans, and powerless.


Customers... well... if they don't adopt the new technology that seems to be 10 times cheaper (for now), they get wiped out by the competition within 2 years. So they don't care about the next 5 years.

And founders are either clueless (and lucky), or they have enough money to mask the problem. (When you will sell your shares for 10 times your investment after 5 years, you can afford to invest in diminishing returns for 5 years. It is just a matter of how wealthy you are, and in most Western economies that basically correlates with being born wealthy.)

I say it loud: a business with systematic diminishing returns is doomed. If more customers means less profit, then something is wrong.

The stock market is totally out of touch with economic reality, and this is the sign of a bubble.

