The web is big. It is a network whose value lies in being connected, and we need an index to access it, or information gets lost. That index is the search engine. And I think they work too well.
Okay, some engines are plain unusable; I am talking about the good ones.
What is a document's meaning?
At a first approximation, it is a vector in a non-orthogonal, multi-dimensional basis formed by the invariant forms of words together with their occurrence counts.
This vector points in a direction. To describe that direction, we use "keywords".
You can visualize it as a transformation of a whole text into the smallest canonical, irreducible set of keywords that stands in for the larger text: a mapping from the space of sets of words to a subspace of sets of words, the keywords forming a new basis that expresses the meaning of thousands of words in a synthetic way.
A document's contents can thus be expressed in a basis of keywords, the "strong, meaningful words". You can reduce language and ideas without too much loss of meaning.
These vectors can be measured: you can normalize them and take scalar products.
A scalar product is the projection of one vector onto another; it tells you how much of vector A lies along vector B. Hence, after normalization, you can easily sort and compare texts against the ideal texts for each keyword. You can also "compress with loss" a text into a smaller basis made of keywords. That is what being a basis means: a reduction to the smallest set of orthogonal dimensions, a reduction of the degrees of freedom. Geometrically, it makes sense.
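As a minimal sketch of this geometry (assuming a toy bag-of-words model where tokens are just lowercased, whitespace-separated words; real engines stem, weight, and tokenize far more carefully):

```python
import math

def bow(text):
    """Bag-of-words vector: word -> occurrence count."""
    v = {}
    for w in text.lower().split():
        v[w] = v.get(w, 0) + 1
    return v

def dot(a, b):
    """Scalar product: projection of one word vector onto another."""
    return sum(a[w] * b.get(w, 0) for w in a)

def norm(a):
    """Euclidean (L2) length, used for normalization."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Normalized scalar product: 1.0 means same direction, same 'meaning'."""
    return dot(a, b) / (norm(a) * norm(b))
```

With this, `cosine(bow("chocolate cake recipe"), bow("cake recipe with chocolate"))` comes out around 0.87: the two texts point in nearly the same direction.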
Given that we have very fine tools in Euclidean geometry, backed by 2,500 years of knowledge, this is a very convenient way to represent text.
There are some caveats, of course.
Unlike school geometry, the basis is not complete; languages are not all constructed the same way; words have more than one form, and there are ambiguities. This is what NLP deals with, and it is freaking harder than doing geometry. But I am focusing on the geometry right now; I consider NLP an accidental problem here, not an essential one.
The meaning of a word can change with context, which means that "diagonalization" sometimes requires splitting one dimension (a word's meaning) into several, depending on the surrounding words.
Words carry a little uncertainty in their meaning. And a small step for an algorithm is not a step for a staircase.
So... how do you actually perform the magic of compressing a 10,000-word document into one or more keywords?
The way it is done is by taking human beings who are very good at tagging text and letting them define the keywords for corpora of text, and then learning from that. I guess machine learning automates this process. By using enough taggers, you can "diminish the bias" of the individual humans with the same statistical treatment used in everyday experimental work. It works.
You can then run a statistical analysis to determine, from the input, the separate probabilities for one or n orthogonal dimensions (a single word, or a linear combination of words from the input text) to appear when a given keyword is present. The chi-squared (χ²) test is a great tool for this. You measure positive and negative contributions, and for each dimension considered you check that the association is not random. For instance, "the", "that", "a", "an" tend not to correlate with any keyword, so you can filter them out as not being part of the basis of any keyword. You diminish the degrees of freedom without losing meaning.
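For illustration, here is one way to compute that statistic from a 2x2 contingency table (a toy sketch: the whitespace tokenization and the tiny labeled corpus below are assumptions, not how a production engine works):

```python
def chi2_word_keyword(docs, word, keyword):
    """Chi-squared statistic for the association between a word's presence
    and a keyword label, built from a 2x2 contingency table.
    docs is a list of (text, set_of_keywords) pairs."""
    # Observed counts: [word absent/present] x [keyword absent/present]
    obs = [[0, 0], [0, 0]]
    for text, labels in docs:
        w = 1 if word in text.split() else 0
        k = 1 if keyword in labels else 0
        obs[w][k] += 1
    n = sum(map(sum, obs))
    chi2 = 0.0
    for i in (0, 1):
        for j in (0, 1):
            row = sum(obs[i])
            col = obs[0][j] + obs[1][j]
            exp = row * col / n  # expected count under independence
            if exp:
                chi2 += (obs[i][j] - exp) ** 2 / exp
    return chi2
```

On a tiny corpus, a topical word like "cake" scores high against the "cooking" keyword, while "the", present everywhere, scores zero and can be dropped from every basis.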
From this, or from other methodologies applied to a learning corpus, you deduce ways to guess keywords from frequency analysis (text to keyword).
Of course, you can use metadata to adjust the occurrence counts (a word in the title can be weighted more heavily than a word in a chapter, section, or paragraph).
Then you can simply compute cosine similarities against the "ideal documents" triggered by the matching keywords, scoring with distances. A distance must obey the usual axioms (positive, definite, normed), so you can actually choose norms other than L2 (the classical Euclidean norm).
This gives you an order relation, hence a ranking.
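Putting the last few steps together, a hedged sketch (the title weight of 3 is an arbitrary assumption, as is the toy tokenization):

```python
import math

TITLE_WEIGHT = 3  # assumption: a title word counts as three body words

def vectorize(title, body):
    """Build an occurrence vector with title words weighted more heavily."""
    v = {}
    for w in title.lower().split():
        v[w] = v.get(w, 0) + TITLE_WEIGHT
    for w in body.lower().split():
        v[w] = v.get(w, 0) + 1
    return v

def cosine(a, b):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    """Return document indices sorted by similarity to the query: a ranking."""
    q = {w: 1 for w in query.lower().split()}
    scored = [(cosine(q, vectorize(t, b)), i) for i, (t, b) in enumerate(docs)]
    return [i for s, i in sorted(scored, reverse=True)]
```

Here `rank("cake recipe", docs)` puts the cooking document first even when its body shares no word with the query, because the match comes entirely from the weighted title.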
Up to this point, I am fine with it.
I guess machine learning comes in handy for industrializing all of this.
However, one thing bugs me, as much as the difference between precise and exact does.
Feedback loops... with amplification.
S = −k · ln(p) (with p the probability of an outcome; for Ω equally likely outcomes, p = 1/Ω and S = k · ln(Ω), Boltzmann's formula)
What makes information valuable is being shown the fewest, most relevant choices out of the whole corpus of information: "first page" accuracy.
Basically, search-engine relevance tends to minimize informational entropy as much as possible. Which seems like a good goal.
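Concretely, with the Shannon form of the formula above (k set to 1 for convenience; the result distributions are made-up examples):

```python
import math

def shannon_entropy(probs, k=1.0):
    """S = -k * sum(p * ln p): low when one result dominates,
    maximal (k * ln n) when all n results are equally likely."""
    return -k * sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25] * 4                 # four equally relevant results
focused = [0.97, 0.01, 0.01, 0.01]   # one dominant result
```

Four equally likely results give the maximal entropy ln 4 ≈ 1.39; one dominant result drives it toward 0. Relevance tuning pushes the distribution toward the second case.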
If a kid asks for a cake recipe and lands on porn, that is not cool.
However, because we use the network/social/link/domain context, we introduce a feedback loop based on how strongly "other people" rate the keywords' validity. And without being a wizard, I guess mathematicians figured out long ago that collecting data on the "personal" context of a user helps increase relevance, based on what you expect and on what your social context tends to find relevant. And that is cool too.
If I need to do text processing, it might point me to the state of the art if my neighborhood is professional. In a professional context it standardizes education, leading you to places like Stack Overflow where discussion happens and sparing you a lot of pitfalls. To be honest, I don't know whether search engines go as far as using the sociogram as an input, but that would increase the relevance of keywords in a given social context.
And if, for instance, I search for an ambiguous keyword on which I am biased, showing me what I already like is the surest way to increase my liking of the results.
Just like when you go on YouTube and ask for "Joe Dassin": if you are a metalhead (like me), YouTube's next suggestions will be a lot of Joe Dassin and metal. Not rap, not traditional music, not anything else. Just metal. (To be honest, it behaves more like a moving average with weights decaying over time, but that is accidental, not essential.)
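That decaying-weight moving average can be sketched like this (alpha is an assumed smoothing parameter; the real recommender is of course far more elaborate):

```python
def ema_update(score, observation, alpha=0.3):
    """Exponentially weighted moving average: each new observation
    counts with weight alpha, and older ones decay geometrically."""
    return alpha * observation + (1 - alpha) * score

# A listener's 'metal' affinity after ten metal clicks in a row:
affinity = 0.0
for _ in range(10):
    affinity = ema_update(affinity, 1.0)
```

After ten metal clicks the affinity is already above 0.97, and every other genre has decayed accordingly.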
It is indeed what I like and I am often pleased with it, and I do indeed make some nice discoveries.
It also reinforces my biases over time. Sometimes I go on YouTube to be surprised, to discover stuff.
And I feel cornered into a caricature of my own self.
And I fear that most of us get reinforced in our own biases. But these are just feelings, theories, and vague intuitions. Nothing tangible.
I guess that, with time and enough data about people's queries, we could measure whether my hypothesis holds. We could measure the evolution of musical choices and of the diversity of "patterns" in playlists (melodies, arrangements, artists...) according to people's age over time. And we could influence people's culture.
Clustering of opinions reinforced by social networks.
This one is simple: some people don't want to change. Some people don't want to hear that the earth is flat; others, that it is potato-shaped; and some, that it is spherical.
Me, I love Sir Terry Pratchett's Discworld and Eratosthenes, and I watch NASA pictures of the earth, so I am okay with all the possible shapes of the earth. Even the ring-shaped world from Niven's SF.
However, that is not everybody's case, and some biased people prefer to concentrate in clusters of reading and writing that mutually reinforce their beliefs... like some conspiracy theorists.
For instance, we all fear terrorist propaganda on the internet. But how is it that you never randomly fall on one of those sites, and conversely, how is it that such a person never comes in touch with your culture? You know they exist, but you never get the occasion to speak with them so that the magic of humanity, which does sometimes happen, might help them turn into better persons. (You could also fall for their ideas, to be honest.) So should we be scared? Are some people irreversibly bad?
I am a great fan of Pericles, who used to say that polemic is life.
Progress (as opposed to immobility) comes from ideas, not words.
Words are an imperfect medium for ideas, because ideas are grey, intangible, a moving target...
And for this moving target to progress, it requires dialogue and exchanges that are not always comfortable. Yes, basically I am saying that polemicists (called trolls nowadays) are a necessary evil of any progressive regime. Do we need progress? Tell me: is the world on a trajectory you like? Is global warming cool? Are wars cool? Is terrorism cool? Is the increase in pollution and poverty cool?
Well, I don't benefit from any of these, so my own personal, contextual, selfish answer is no. I want society to progress. Can I do it alone? No. So I have to be able to stay in touch with other people and talk.
Showing people what they want to see above everything else goes, in my own personal opinion (which I share with myself), against the acceptance of diverse points of view and against dialogue.
Search algorithms will get all the more precise as the feedback loop reinforces the contextual meaning of keywords.
But exact is not precise: pi = 3.14159 is precise (but not exact). pi = 4 ± 2 is exact in Euclidean geometry (but not precise). pi = 4 is both exact and precise in taxicab geometry.
A potential solution to this unproven problem?
In multi-agent simulations based on statistical physics, agents' rationality used to be modeled with Fermi-Dirac or Maxwell-Boltzmann distributions of energy. The "economic agents" were rational, but to model the uncertainty and irrationality of human behaviour, a temperature factor was added. Something saying: well, there is a clear advantage for agent X to behave this way, BUT you never know. This temperature factor could vary more or less, a parameter you could set according to real-world observations. Basically, you replace a fully deterministic algorithm with one tainted by some randomization, the "amount" of randomization being the equivalent of a physical temperature.
In some models, magnetic interactions could be used to model the influence of the neighborhood: sometimes positive (better to use the same software the industry demands), sometimes negative (I don't want to wear the same shirt as my neighbor).
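A minimal sketch of such Boltzmann-style exploration (the scores, names, and temperatures below are made-up assumptions):

```python
import math, random

def boltzmann_choice(options, temperature, rng=random):
    """Pick an option with probability proportional to exp(score / T).
    T -> 0: always the best-scored option; T large: nearly uniform."""
    weights = [math.exp(score / temperature) for _, score in options]
    total = sum(weights)
    r = rng.random() * total
    for (name, _), w in zip(options, weights):
        r -= w
        if r <= 0:
            return name
    return options[-1][0]  # guard against floating-point rounding
```

At T = 0.1 the best-scored result is picked essentially always; at T = 100 the choice becomes nearly uniform. The knob between determinism and noise is a single number.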
What I loved was a simulation of the behaviour of the fish market in Marseilles.
It basically validated an empirical strategy used by buyers: be loyal to ONE seller (because you get discounts, for instance), but sometimes explore the competition, in case a competitor becomes more competitive or your usual seller less so.
Trust that you are right, but check.
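That loyal-but-checking strategy is essentially epsilon-greedy exploration; a sketch under assumed names and an assumed epsilon:

```python
import random

def choose_seller(usual, sellers, epsilon=0.1, rng=random):
    """Loyalty with occasional exploration: stay with your usual seller,
    but with probability epsilon sample a competitor to check whether
    the market has shifted."""
    others = [s for s in sellers if s != usual]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return usual
```

With epsilon = 0.1, roughly nine purchases out of ten go to the usual seller, and the remaining tenth keeps the buyer informed about the competition.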
These simulations were not the real world, though they sometimes matched the experimental evidence, and they made the market converge toward less price instability and less fish thrown away.
However, some results were disturbing. In a simulation without temperature, the agents can evolve into non-interacting clusters or into constant noise. Both cases lead to an unstable market with a global loss of utility and income for everyone: lose-lose situations.
Another disturbing result: if the "influence" parameter was recomputed every turn according to the distance and effectiveness of the influence, then the more polarized and strong a cluster became, the more it hardened itself, and the higher the temperature needed to compensate would climb, until the state became irreversible.
Slowly ghettoizing people in their own behaviour.
When I look at my social networks, they all seem clustered this way... in a sort of progressive radicalization of opinions.
I do feel a disturbance in the increasing use of algorithms that work too well at showing me what I want.
My little brain could be wrong. Who am I to question the smartest engineers in the world, when I am a kind of small, imperfect person? Maybe you are perfect. I am not. I am human: I err, I make mistakes, and I like to believe in my capacity to correct myself. For this, I need to be exposed to "noise".
Please, dear search engines, give me back my capacity to lower my biases: give me a setting for loosening the "precision" of your results. I want a simple knob ranging from "I don't want to see noise because I am focused on solving a technical problem the one best way" to "I am in the mood to explore the world, question myself, and see totally crazy, surprising stuff".
I would gladly accept being "polarized" into the one best way of thinking if, first, I believed in an unambiguous, proven, immutable truth, and if it did not result in increasingly violent exchanges. In my opinion, something about this could trigger instability and violent swings.
And I dare say it could be measured, by applying entropy measures to searches over time. But economically, I fear there are indirect incentives for polarizing people's opinions when you are both the judge of what is relevant and the beneficiary of directing people into comfortable, revenue-generating clusters. Why don't we want to see it? Because we all enjoy living a peaceful life without conflicts. Sometimes, like a frog in slowly heating water, numbed by comfort, we forget to jump out when the temperature gets critical.
But I am not like every frog: I also want to see what I do not want to see, so that I can apply my own critical judgement to myself and improve.
You might say I could use more than one search engine, like on a fish market. But actually we know one engine has the biggest overall relevance, and it also influences the direction of the other ones. So maybe the "temperature" factor is fubar for this case, and we may have to rely on the big elephant in the room to wake up.
I just wonder whether, by avoiding small conflicts now for comfort's sake, we are not building up a bigger, more violent one later.