Posted to dev@lucene.apache.org by Chuck Williams <ch...@manawiz.com> on 2004/10/15 02:57:49 UTC

idf and explain(), was Re: Search and Scoring

Doug,

That's a good point on how the standard vector space inner product
similarity measure does imply that the idf is squared relative to the
document tf.  Even having been aware of this formula for a long time,
this particular implication never occurred to me.  Do you know if
anybody has done precision/recall or other relevancy empirical
measurements comparing this vs. a model that does not square idf?

Regarding normalization, the normalization in Hits does not have very
nice properties.  Due to the > 1.0 threshold check, it loses
information, and it arbitrarily defines the highest scoring result in
any list that generates scores above 1.0 as a perfect match.  It would
be nice if score values were meaningful independent of searches, e.g.,
if "0.6" meant the same quality of retrieval independent of what search
was done.  This would allow, for example, sites to use a simple
quality threshold to only show results that were "good enough".  At my
last company (I was President and head of engineering for InQuira), we
found this to be important to many customers.
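
To make the issue concrete, here is a sketch of the effect (not the actual
Hits.java code; it assumes the normalization simply rescales every score by
the top raw score whenever that score exceeds 1.0):

  import org.apache.lucene.search.ScoreDoc;

  // Sketch only -- not the actual Hits code.  If the best raw score exceeds
  // 1.0, every score is divided by it, so the top hit always reports exactly
  // 1.0 and any information about absolute match quality is lost.
  public class HitsNormSketch {
    public static void normalize(ScoreDoc[] docs) {
      if (docs.length == 0) return;
      float top = docs[0].score;                      // best raw score
      float norm = (top > 1.0f) ? 1.0f / top : 1.0f;  // the > 1.0 threshold check
      for (int i = 0; i < docs.length; i++) {
        docs[i].score *= norm;                        // top hit becomes 1.0
      }
    }
  }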

The standard vector space similarity measure includes normalization by
the product of the norms of the vectors, i.e.:

  score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) ) /
                sqrt [ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]

This makes the score a cosine which, since the values are all positive,
forces it to be in [0, 1].  The sumOfSquaredWeights() normalization in Lucene
does not fully implement this.  Is there a specific reason for that?
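
Spelled out over explicit term-weight maps (an illustrative sketch, not
Lucene code), the fully normalized measure would look like this:

  import java.util.Iterator;
  import java.util.Map;

  // Illustrative sketch of the fully normalized (cosine) measure above.
  // The maps hold term -> weight(t,q) and term -> weight(t,d); since all
  // weights are non-negative, the result always falls in [0, 1].
  public class CosineSketch {
    public static double cosine(Map queryWeights, Map docWeights) {
      double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
      for (Iterator i = queryWeights.entrySet().iterator(); i.hasNext();) {
        Map.Entry e = (Map.Entry) i.next();
        double wq = ((Number) e.getValue()).doubleValue();
        qNorm += wq * wq;
        Object wd = docWeights.get(e.getKey());
        if (wd != null) dot += wq * ((Number) wd).doubleValue();
      }
      for (Iterator i = docWeights.values().iterator(); i.hasNext();) {
        double wd = ((Number) i.next()).doubleValue();
        dNorm += wd * wd;
      }
      if (qNorm == 0.0 || dNorm == 0.0) return 0.0;
      return dot / Math.sqrt(qNorm * dNorm);
    }
  }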

Re. explain(), I don't see a downside to extending it to show the final
normalization in Hits.  It could still show the raw score just prior to
that normalization.  Although I think it would be best to have a
normalization that would render scores comparable across searches.

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Wednesday, October 13, 2004 9:38 AM
> To: Lucene Developers List
> Subject: Re: Search and Scoring
> 
> Chuck Williams wrote:
> > I think there are at least two bugs here:
> >   1.  idf should not be squared.
> 
> I discussed this in a separate message.  It's not a bug.
> 
> >   2.  explain() should explain the actual reported score().
> 
> This is mostly a documentation bug in Hits.  The normalization of scores
> to 1.0 is performed only by Hits.  Hits is a high-level wrapper on the
> lower-level HitCollector-based search implementations, which do not
> perform this normalization.  We should probably document that Hits
> scores are so normalized.  Also, we could add a method to disable this
> normalization in Hits.  The normalization was added long ago because
> many folks found it disconcerting when scores were greater than 1.0.
> 
> We should not attempt to normalize scores reported by explain().  The
> intended use of explain() is to compare its output against other calls
> to explain(), in order to understand how one document scores higher than
> another.  Scores don't make much sense in isolation, and neither do
> explanations.
> 
> Doug
> 
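
For reference, a minimal sketch of getting at the raw scores through the
lower-level HitCollector-based API Doug mentions above (the index path and
query text are placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.HitCollector;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  // Sketch: collect raw, un-normalized scores via the HitCollector-based
  // search, bypassing the normalization performed by Hits.
  public class RawScores {
    public static void main(String[] args) throws Exception {
      IndexSearcher searcher = new IndexSearcher("/path/to/index");  // placeholder
      Query query = QueryParser.parse("some query", "contents", new StandardAnalyzer());
      searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
          // raw score straight from the Scorer -- no division by the top hit
          System.out.println("doc=" + doc + " score=" + score);
        }
      });
      searcher.close();
    }
  }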


> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Wednesday, October 13, 2004 9:25 AM
> To: Lucene Developers List
> Subject: Re: Contribution: better multi-field searching
> 
> Paul Elschot wrote:
> >>Did you see my IDF question at the bottom of the original note?  I'm
> >>really curious why the square of IDF is used for Term and Phrase
> >>queries, rather than just IDF.  It seems like it might be a bug?
> >
> > I missed that.
> > It has been discussed recently, but I don't remember the outcome,
> > perhaps someone else?
> 
> This has indeed been discussed before.
> 
> Lucene computes a dot-product of a query vector and each document
> vector.  Weights in both vectors are normalized tf*idf, i.e.,
> (tf*idf)/length.  The dot product of vectors d and q is:
> 
>    score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) )
> 
> Given this formulation, and the use of tf*idf weights, each component of
> the sum has an idf^2 factor.  That's just the way it works with dot
> products of tf*idf/length vectors.  It's not a bug.  If folks don't like
> it they can simply override Similarity.idf() to return sqrt(super()).
> 
> If someone can demonstrate that an alternate formulation produces
> superior results for most applications, then we should of course change
> the default implementation.  But just noting that there's a factor which
> is equal to idf^2 in each element of the sum does not do this.
> 
> Doug
> 
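
For anyone who wants to experiment with Doug's suggestion, a minimal sketch
(assuming the overridable idf(int, int) hook on DefaultSimilarity; it would be
installed at search time with Searcher.setSimilarity()):

  import org.apache.lucene.search.DefaultSimilarity;

  // Sketch of the override described above: taking the square root of the
  // default idf reduces the idf^2 factor in each term's contribution to a
  // single idf.
  public class SqrtIdfSimilarity extends DefaultSimilarity {
    public float idf(int docFreq, int numDocs) {
      return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }
  }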




Re: idf and explain(), was Re: Search and Scoring

Posted by Christoph Goller <go...@detego-software.de>.
Chuck wrote:
> That's a good point on how the standard vector space inner product
> similarity measure does imply that the idf is squared relative to the
> document tf.  Even having been aware of this formula for a long time,
> this particular implication never occurred to me.

The same holds for me :-)

Chuck wrote:
> I researched the idf^2 issue further and believe that empirical studies
> have consistently concluded that one idf factor should be dropped.
> Salton, the originator of the IR vector space model, decided to drop the
> idf term on documents in order to avoid the squaring.  I hear he did
> this after studying recall and precision for many variations of his
> formula.

The cosine-distance and dot-product motivations should not be treated as dogma.
One can question the idea of using the same representation (a vector of terms)
for a query and a document.  The intuition behind not squaring idf seems just as
well-founded as the dot-product/cosine-distance motivation.  However, I am also
skeptical of empirical results here, since the right scoring of documents is
itself quite subjective :-)

Chuck wrote:
> Regarding normalization, the normalization in Hits does not have very
> nice properties.  Due to the > 1.0 threshold check, it loses
> information, and it arbitrarily defines the highest scoring result in
> any list that generates scores above 1.0 as a perfect match.  It would
> be nice if score values were meaningful independent of searches, e.g.,
> if "0.6" meant the same quality of retrieval independent of what search
> was done.  This would allow, for example, sites to use a simple
> quality threshold to only show results that were "good enough".  At my
> last company (I was President and head of engineering for InQuira), we
> found this to be important to many customers.
> 
> The standard vector space similarity measure includes normalization by
> the product of the norms of the vectors, i.e.:
> 
>   score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) ) /
>                 sqrt [ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> 
> This makes the score a cosine which, since the values are all positive,
> forces it to be in [0, 1].  The sumOfSquaredWeights() normalization in Lucene
> does not fully implement this.  Is there a specific reason for that?

You are right.  The normalization of the document's contribution is missing,
and there are additional coord factors.  The current implementation is a
mixture of dot-product and cosine-distance.  Therefore, the additional
normalization in Hits is necessary if one wants to avoid scores > 1.0.

Doug wrote:
> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each
> time a document is added to the collection, since 'weight(t,d)' is dependent
> on global term statistics.  This is prohibitively expensive.

weight(t,d) is currently computed on the fly in every Scorer.  BooleanScorer and
ConjunctionScorer could accumulate the document normalization, and the searcher
could apply it at the end.  All Scorers would have to be extended to supply this
document normalization.  Does this seem reasonable?
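
Purely as a sketch of what I mean (neither the interface nor the method exists
in Lucene today; the names are hypothetical):

  // Hypothetical sketch: Scorers such as BooleanScorer and ConjunctionScorer
  // would accumulate the sum of squared document weights they encounter, and
  // the searcher would apply the missing cosine factor at the end, e.g.
  //   normalizedScore = rawScore / (float) Math.sqrt(scorer.sumOfSquaredDocWeights());
  public interface DocumentNormalization {
    float sumOfSquaredDocWeights();
  }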

Chuck wrote:
> To illustrate the problem better normalization is intended to address,
> in my current application for BooleanQuery's of two terms, I frequently
> get a top score of 1.0 when only one of the terms is matched.  1.0
> should indicate a "perfect match".  I'd like to set my UI up to present the
> results differently depending on how good the different results are
> (e.g., showing a visual indication of result quality, dropping the
> really bad results entirely, and segregating the good results from other
> only vaguely relevant results).  To build this kind of "intelligence"
> into the UI, I certainly need to know whether my top result matched all
> query terms, or only half of them.  I'd like to have the score tell me
> directly how good the matches are.  The current normalization does not
> achieve this.
>
> The intent of a new normalization scheme is to preserve the current
> scoring in the following sense:  the ratio of the nth result's score to
> the best result's score remains the same.  Therefore, the only question
> is what normalization factor to apply to all scores.  This reduces to a
> very specific question that determines the entire normalization -- what
> should the score of the best matching result be?

I would prefer your first idea with the cosine normalization, if an efficient
implementation is possible. As stated above, I currently think it is possible.

Christoph




Re: idf and explain(), was Re: Search and Scoring

Posted by Doug Cutting <cu...@apache.org>.
Chuck Williams wrote:
> That's a good point on how the standard vector space inner product
> similarity measure does imply that the idf is squared relative to the
> document tf.  Even having been aware of this formula for a long time,
> this particular implication never occurred to me.  Do you know if
> anybody has done precision/recall or other relevancy empirical
> measurements comparing this vs. a model that does not square idf?

No, not that I know of.

> Regarding normalization, the normalization in Hits does not have very
> nice properties.  Due to the > 1.0 threshold check, it loses
> information, and it arbitrarily defines the highest scoring result in
> any list that generates scores above 1.0 as a perfect match.  It would
> be nice if score values were meaningful independent of searches, e.g.,
> if "0.6" meant the same quality of retrieval independent of what search
> was done.  This would allow, for example, sites to use a simple
> quality threshold to only show results that were "good enough".  At my
> last company (I was President and head of engineering for InQuira), we
> found this to be important to many customers.

If this is a big issue for you, as it seems it is, please submit a patch 
to optionally disable score normalization in Hits.java.

> The standard vector space similarity measure includes normalization by
> the product of the norms of the vectors, i.e.:
> 
>   score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) ) /
>                 sqrt [ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> 
> This makes the score a cosine which, since the values are all positive,
> forces it to be in [0, 1].  The sumOfSquaredWeights() normalization in Lucene
> does not fully implement this.  Is there a specific reason for that?

The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document 
each time a document is added to the collection, since 'weight(t,d)' is 
dependent on global term statistics.  This is prohibitively expensive.
Research has also demonstrated that such cosine normalization gives 
somewhat inferior results (see, e.g., Singhal's work on pivoted length normalization).
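
For comparison, the pivoted scheme replaces the cosine denominator with a
linear function of the document's length factor (a sketch only; 'pivot' and
'slope' are tuning constants, and 'lengthFactor' could be the old cosine norm
or the number of unique terms):

  // Sketch of Singhal-style pivoted length normalization.  Documents near the
  // pivot are left roughly untouched; long documents are penalized less
  // steeply than under pure cosine normalization.
  public class PivotedNorm {
    public static float norm(float lengthFactor, float pivot, float slope) {
      return (1.0f - slope) * pivot + slope * lengthFactor;
    }
  }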

> Re. explain(), I don't see a downside to extending it to show the final
> normalization in Hits.  It could still show the raw score just prior to
> that normalization.

In order to normalize scores to 1.0 one must know the maximum score. 
explain() only computes the score for a single document, and the maximum 
score is not known.

> Although I think it would be best to have a
> normalization that would render scores comparable across searches.

Then please submit a patch.  Lucene doesn't change on its own.

Doug


