Determining Link Quality – Statistical Inference
This post was made Jan 22, 2008 by Carlos del Rio
At the risk of butchering a statistical principle, I will explain how search engines can determine the quality of links.
Search engines have a difficult job to perform. They need to take quantitative values (data points) and satisfy a qualitative criterion: users’ sense of relevance. To accomplish this, the search engines create algorithms (sets of rules for solving a problem in a finite number of steps) for users who solve problems heuristically. Heuristics are shortcuts for making decisions; they often save time, but they are not consistently right. A heuristic process is the reason why a child may call every object with four legs and fur a dog, regardless of any other characteristics.
The search engines bridge the gap between human shortcuts and mathematical processes by employing statistical inference. Statistical inference means taking known values and extrapolating the behavior of the unknown whole. One application of statistical inference, called the likelihood principle, holds that if you have a known outcome you can mathematically estimate which unknown factor is most likely responsible. This allows the context of a link to be given a value in relation to other links.
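To make the idea concrete, here is a minimal sketch in Python of picking the unknown factor under which a known outcome is most probable. Everything in it is an assumption for illustration: the factor names and the probabilities are invented, and nothing here reflects how any real search engine scores links.

```python
# Hypothetical illustration: every number below is invented for the example.
# Each candidate "factor" assigns a probability to the outcome we observed.
likelihoods = {
    "editorial_link": 0.60,  # P(observation | link was placed editorially)
    "paid_link": 0.25,       # P(observation | link was paid for)
    "spam_link": 0.05,       # P(observation | link is automated spam)
}


def most_likely_factor(likelihoods):
    """Return the factor under which the observed outcome is most probable."""
    return max(likelihoods, key=likelihoods.get)


def relative_support(likelihoods):
    """Rescale the likelihoods so they sum to 1, for easier comparison."""
    total = sum(likelihoods.values())
    return {factor: value / total for factor, value in likelihoods.items()}


best = most_likely_factor(likelihoods)
support = relative_support(likelihoods)
```

With these made-up numbers, `best` is `"editorial_link"`, and its relative support works out to about two thirds. The point is only that a link can be given a value relative to the alternatives, not an absolute one.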
It Is Better to be Popular
It can be taken as given that humans qualitatively feel that popular is better than not popular; it is a psychological phenomenon sometimes called a popularity effect. So we can create a qualitatively “trustworthy” value by weighting toward volume of links, because humans accept popularity as value. As we inspect other properties of the popular sites, we can assess the likelihood that each one will produce a positive value for the end user.
Comparing Apples to Search Engines

Here is how it works, greatly simplified. Ask a room of 100 people four questions:
- Did you eat an apple at lunch?
- Name anyone who ate an apple at lunch?
- Name anyone you know in this room?
- Name the people you ate lunch with?
If you assign a numeric value to the response to each question, you will get a ranking that tells you not only the likelihood that a given person ate an apple but also the likelihood that any given person ate in the room (if a person did not eat in the room, halve all of their points). People with the lowest scores who also list non-participants in their answer to the fourth question are likely to have eaten outside of the room. If we know that at least one person ate an apple, then we are not only finding the answer to “Who ate an apple?”; we are also collecting data about “Who ate in this room?” and assessing a quality of trust.
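The scoring above can be sketched in a few lines of Python. This is only one possible reading of the exercise: the point values in `POINTS`, the `Answers` structure, and the rule that naming a lunch companion who is not in the room marks you as having eaten elsewhere are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Answers:
    ate_apple: bool                                    # question 1
    apple_eaters: list = field(default_factory=list)   # question 2
    known: list = field(default_factory=list)          # question 3
    lunch_with: list = field(default_factory=list)     # question 4


# Hypothetical point values; the post does not specify any.
POINTS = {"self_apple": 3, "named_apple": 2, "known": 1, "lunch": 1}


def score_room(room):
    """Rank everyone in the room by the points their own and others' answers give them."""
    scores = {name: 0.0 for name in room}
    for name, answers in room.items():
        if answers.ate_apple:
            scores[name] += POINTS["self_apple"]
        # Points go to the people who are named, if they are in the room.
        for key, listed in (("named_apple", answers.apple_eaters),
                            ("known", answers.known),
                            ("lunch", answers.lunch_with)):
            for other in listed:
                if other in scores:
                    scores[other] += POINTS[key]
    # Assumption: listing a non-participant in question 4 means the person
    # probably ate outside the room, so all of their points are halved.
    for name, answers in room.items():
        if any(other not in room for other in answers.lunch_with):
            scores[name] /= 2
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


room = {
    "Ann": Answers(True, ["Bob"], ["Bob", "Cy"], ["Bob"]),
    "Bob": Answers(True, ["Ann"], ["Ann"], ["Ann"]),
    "Cy": Answers(False, ["Ann"], ["Ann"], ["Dee"]),  # Dee is not in the room
}
ranking = score_room(room)
```

In this three-person room, Ann ranks first with 10 points, Bob second with 7, and Cy last with 0.5 after the halving, which matches the intuition that Cy most likely ate somewhere else.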
Who Do You Trust?
Based on a known outcome, “Who ate an apple?”, we know which individuals are most likely to offer correct information on subsequent questions. The people who are ranked highest for “Ate an apple” are most likely to have eaten in the room, with other people, and are likely the most widely known individuals, so they are most likely to offer accurate data about any given individual in the room.
With each subsequent question, individuals continue to differentiate themselves qualitatively by the commonality between their answers and the answers of the winners of our known outcome. As the number of questions increases, patterns of expertise form, e.g. one group is more likely to produce valuable responses than another group. Given an outside observation, all members who share a defined commonality of answers are likely to be alike in most other respects. Here things start to become recursive, but, in theory, we can assess the likelihood that any given answer is qualitatively good by its relation to the individuals who gave the winning answer in the known outcome.
Or
If I know one statement to be true, I can figure out the likelihood that a later answer is valuable based on whether it comes from someone who is sufficiently similar to someone who answered the first question correctly.
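As a rough sketch of "sufficiently similar", the Python below weights a new respondent by how much their answers overlap with those of people who got the known outcome right. Jaccard similarity is my choice of measure, not something the principle requires, and the answer labels are invented for the example.

```python
def jaccard(a, b):
    """Overlap between two sets of answers: shared answers / all distinct answers."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def trust_weight(answers, winners_answers):
    """Weight a respondent by their closest match among the known winners."""
    return max((jaccard(answers, w) for w in winners_answers), default=0.0)


# Hypothetical answer sets for two people who answered the known question correctly.
winners = [
    {"q1:yes", "q2:Ann", "q3:Bob"},
    {"q1:yes", "q2:Bob", "q3:Ann"},
]

candidate_close = {"q1:yes", "q2:Ann", "q3:Cy"}   # shares two answers with a winner
candidate_far = {"q1:no", "q2:Dee", "q3:Eve"}     # shares nothing with any winner

close_weight = trust_weight(candidate_close, winners)
far_weight = trust_weight(candidate_far, winners)
```

Here `close_weight` comes out at 0.5 and `far_weight` at 0.0, so a later answer from the first candidate would be given more credit than one from the second.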
Putting It Together
A given link may be biased by any number of factors, but the intersection of any measurable data points that can be related to another link can override the individual link's bias in favor of an assumed value.
There is a major caveat to this post: I am not implying that the above process means there is such a thing as relevant links. All links are, at core, just a reference to the destination, but there are principles that allow quantitative aspects of a link to stand in place of the qualitative assessments that a human would make.

