Posted to dev@lucene.apache.org by Joaquin Delgado <jo...@triplehop.com> on 2004/11/01 05:55:59 UTC

RE: About Hit Scoring

Note that the dot product in the vector space world is heavily associated with the concept of the correlation coefficient in statistics:
 
"A correlation coefficient is a number between -1 and 1 which measures the degree to which two variables are linearly related. If there is perfect linear relationship with positive slope between the two variables, we have a correlation coefficient of 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope between the two variables, we have a correlation coefficient of -1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation coefficient of 0 means that there is no linear relationship between the variables. 

There are a number of different correlation coefficients that might be appropriate depending on the kinds of variables being studied."

Yet another definition: "The correlation coefficient is a quantity that gives the quality of a least squares fitting to the original data." Least squares fitting is also used in k-nearest neighbor algorithms for other types of classification and similarity/relevance calculations.
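To make the first definition above concrete, here is a small sketch (plain Python, not Lucene code) of the Pearson product-moment correlation coefficient, the quantity both quoted definitions describe:

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Covariance term over the product of the standard deviations.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfect linear relationship with positive slope gives ~1,
# one with negative slope gives ~-1:
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```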

From the viewpoint of probabilistic models of information retrieval, the dot product of TF-IDF weights is equivalent to a Bayesian inference of the probability of a document being relevant given the query, with no prior (knowledge) of relevancy, assuming (very naively) that the words are independent of each other.
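The TF-IDF dot product mentioned above can be sketched on a toy corpus as follows. This uses the textbook TF-IDF weighting, not Lucene's exact formula (which adds coord, boost, and norm factors), and the corpus and query are invented for illustration:

```python
import math
from collections import Counter

# Toy corpus; tokenization is a simple whitespace split.
docs = [
    "the quick brown fox",
    "the lazy brown dog",
    "the quick dog barks",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(t for doc in tokenized for t in set(doc))

def idf(term):
    # Textbook inverse document frequency; Lucene uses a similar log-based variant.
    return math.log(N / df[term]) if term in df else 0.0

def tfidf(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * idf(t) for t in tf}

def score(query, doc_tokens):
    # Dot product of the TF-IDF weight vectors of query and document.
    q, d = tfidf(query.split()), tfidf(doc_tokens)
    return sum(w * d.get(t, 0.0) for t, w in q.items())

for doc, toks in zip(docs, tokenized):
    print(f"{score('quick fox', toks):.3f}  {doc}")
```

The document matching both query terms outscores the one matching only "quick", and a document matching neither term scores zero.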

A very interesting presentation about formal models for IR, including the most recent language (generation) models, is available at http://www.sis.pitt.edu/~erasmus/week4.ppt

More information about the different models and how they relate can be found in "Modern Information Retrieval": http://www.sims.berkeley.edu/~hearst/irbook/chapters/chap2.html

The main problem that I have with the vector-space and probabilistic IR models and algorithms is that they all assume that there is a linear relationship between the query and the document (as to how a document-to-document distance or similarity is calculated) and that words are independent and follow a jointly normal distribution. I find it interesting how, in statistics, people have been able to work around these assumptions and come up with things such as the Spearman rank correlation coefficient:

"The Spearman rank correlation coefficient is one example of a correlation coefficient. It is usually calculated on occasions when it is not convenient, economical, or even possible to give actual values to variables, but only to assign a rank order to instances of each variable. It may also be a better indicator that a relationship exists between two variables when the relationship is non-linear.

Commonly used procedures, based on the Pearson's Product Moment Correlation Coefficient, for making inferences about the population correlation coefficient make the implicit assumption that the two variables are jointly normally distributed. When this assumption is not justified, a non-parametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate"
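For sequences without ties, the Spearman coefficient has a well-known closed form, rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the difference between the ranks of the two variables. A minimal sketch (illustrative only, no tie handling):

```python
def spearman(xs, ys):
    """Spearman rank correlation for sequences with no tied values,
    via rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rank = lambda vs: {v: i + 1 for i, v in enumerate(sorted(vs))}
    rx, ry = rank(xs), rank(ys)
    d2 = sum((rx[x] - ry[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A monotone but non-linear relationship: Pearson would be below 1,
# but the ranks agree perfectly, so Spearman is exactly 1.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]   # ys = xs squared
print(spearman(xs, ys))  # 1.0
```

This is exactly the property the quote points at: the coefficient only sees rank order, so a non-linear but monotone relationship is still detected as perfect correlation.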

This also seems like a good start for calculating a merged relevance ranking when exact scores from multiple ranking systems/algorithms cannot be obtained or are not comparable, but a rank order is available.

Sorry for the long email. Just my 2 cents.

Joaquin Delgado, PhD

CTO, TripleHop Technologies, Inc.



________________________________

From: Christoph Goller [mailto:goller@apache.org]
Sent: Sun 10/31/2004 11:00 AM
To: Lucene Developers List
Subject: About Hit Scoring



I looked at the scoring mechanism more closely again. Some of you may
remember that there was a discussion about this recently. There was
especially some argument about the theoretical justification of
the current scoring algorithm. Chuck proposed that at least from
a theoretical perspective it would be good to apply a normalization
on the document vector and thus implement the cosine similarity.

Well, we found out that this cannot be implemented efficiently.
However, I have now found that the current algorithm has a very
intuitive theoretical justification. Some of you may already know
this, but I never looked into it that deeply.

Both the query and all documents are represented as vectors in term
vector space. The current scoring is simply the dot product of the
query with a document normalized by the length of the query vector
(if we skip the additional coord factor). Geometrically speaking, this
is the distance of the document vector from the hyperplane through
the origin that is orthogonal to the query vector. See the attached
figure.
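That geometric picture can be sketched in a few lines. This is a simplified view only (the vectors and the helper name are invented for illustration; Lucene's actual scoring adds coord and other factors):

```python
import math

def query_norm_score(query_vec, doc_vec):
    """Dot product of query and document, normalized by the query length.

    Geometrically, this is the (signed) distance of the document vector
    from the hyperplane through the origin orthogonal to the query vector,
    i.e. the length of the document's projection onto the query direction.
    """
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    qnorm = math.sqrt(sum(q * q for q in query_vec))
    return dot / qnorm

q = [1.0, 1.0]
# Two documents with equal projection onto the query direction score the
# same, regardless of their own lengths -- there is no document-side
# normalization, which is exactly where this differs from cosine similarity.
print(query_norm_score(q, [2.0, 0.0]))
print(query_norm_score(q, [1.0, 1.0]))
```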

Christoph