Posted to dev@lucene.apache.org by Александр Аристов <cl...@mail.ru> on 2008/08/08 11:53:53 UTC

Re[4]: lucene scoring

Relevance ranking is an option, but we still won't be able to compare results. Let's say we have distributed searching: in this case the top 10 from one server are not scored on the same scale as the top 10 from another. Even worse, in the merged result set the document with the highest score may actually be worse than lower-scored ones.
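Roughly the situation I mean, as a sketch (TopDocs and ScoreDoc are Lucene's classes; mergeByRawScore is a made-up name, and the naive merge is exactly what goes wrong):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    // Naive distributed merge: sort the union of per-shard hits by raw
    // score. The comparison is dubious because each shard computes idf
    // (and hence scores) from its own index statistics.
    static ScoreDoc[] mergeByRawScore(TopDocs[] perShard, int k) {
        List<ScoreDoc> all = new ArrayList<ScoreDoc>();
        for (TopDocs td : perShard) {
            for (ScoreDoc sd : td.scoreDocs) {
                all.add(sd);
            }
        }
        Collections.sort(all, new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return Float.compare(b.score, a.score); // descending by raw score
            }
        });
        int n = Math.min(k, all.size());
        return all.subList(0, n).toArray(new ScoreDoc[n]);
    }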

What if we disable normalization or make it constant? Would the results then be completely meaningless?

And another approach: can we calculate the maximum possible score, or at least an approximation of it? We would then be able to compare results against it.
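Something like this sketch is what I have in mind (maxPossibleScore is hypothetical, Lucene provides nothing like it, and the threshold is a knob we would pick; query, searcher and topDocs come from the surrounding search code):

    // Hypothetical: normalize each hit by an upper bound on the score
    // this query could ever produce, so the ratio acts as a confidence
    // value in [0, 1].
    float threshold = 0.3f; // arbitrary confidence level
    float maxPossible = maxPossibleScore(query, searcher); // does not exist; the open question
    for (ScoreDoc sd : topDocs.scoreDocs) {
        float confidence = sd.score / maxPossible;
        if (confidence < threshold) {
            break; // hits are sorted descending: this and everything below is garbage
        }
    }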

Alex


-----Original Message-----
From: Grant Ingersoll <gs...@apache.org>
To: java-dev@lucene.apache.org
Date: Thu, 7 Aug 2008 15:54:41 -0400
Subject: Re: Re[2]: lucene scoring


On Aug 7, 2008, at 3:05 PM, Александр Аристов wrote:

> I want to implement searching with the ability to set a so-called
> confidence level below which I would treat documents as garbage. I
> cannot define the level per query, as the level should be meaningful
> across all documents.
>
> With the current scoring implementation such a level would mean
> nothing. I find it hard to believe that since that time (the thread
> is from 2005) nothing has been done towards resolving the issue.

That's because there is no resolution to be had, as far as I know, but  
I'm open to suggestions (patches are even better.)  What would it mean  
to say that a score of 0.5 for "baby kittens" is comparable to a score  
of 0.5 for "death metal"?  Like I said, I don't think that 0.5 for  
"baby kittens" is even comparable later if you added other documents  
that contain any of the query terms.
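To make that concrete: DefaultSimilarity computes idf as log(numDocs/(docFreq+1)) + 1, so merely adding documents that contain a query term changes every score involving that term. The numbers below are made up:

    // Lucene's DefaultSimilarity idf formula; example numbers invented.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    idf(10, 1000);   // ~5.51: term appears in 10 of 1,000 docs
    idf(510, 1500);  // ~2.08: after adding 500 docs containing the term,
                     // the very same document now gets a lower score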

>
>
> Can you think of any workarounds, like implementing more sophisticated
> queries, so that we get approximately the same normalization values?

I just don't think you will be successful with this, and I don't
believe it is a Lucene issue alone, but one that applies to all search
engines. I could be wrong, though.

I get what you are trying to do, though; I've wanted to do it from
time to time.  Another approach may be to look for significant
differences between scores within a result set.  For example, if doc 1
is 0.8, doc 2 is 0.79, and doc 3 is 0.2, then maybe one could argue
that doc 3 is garbage, but even that is somewhat of a stretch.
Garbage truly is in the eye of the beholder.
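As a sketch of that heuristic (cutoffRatio is an arbitrary knob you would have to tune; nothing like this exists in Lucene):

    import org.apache.lucene.search.ScoreDoc;

    // Cut the result list at the first large relative drop between
    // neighbouring scores; everything from that point on is "garbage".
    static int cutoff(ScoreDoc[] docs, float cutoffRatio) {
        for (int i = 1; i < docs.length; i++) {
            if (docs[i].score < docs[i - 1].score * cutoffRatio) {
                return i; // keep docs[0..i)
            }
        }
        return docs.length; // no big gap found, keep everything
    }

With scores 0.8, 0.79, 0.2 and cutoffRatio = 0.5, this keeps the first two docs and drops the third.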

Another option is to do more relevance tuning to make sure your top 10  
are as good as possible so that your garbage is minimized.

-Grant


Re: Re[4]: lucene scoring

Posted by "J. Delgado" <jo...@gmail.com>.
The only kind of score I can think of that can measure "quality" across
different queries is an invariant score such as PageRank: score the
document on its general information value, then use that as a filter
regardless of the query. This is very different from the problem of
normalizing scores for the same query over different shards (indexes) in
a federated search setting, which has been researched extensively.

The reason two queries have different "scales" for scores is the
probabilistic nature of the algorithms, which view word occurrences as
independent random variables; the occurrence of each word in a document
is treated as an independent event. Joint and conditional probabilities
can be estimated by looking at word co-occurrence, which could be used
to compare two specific results (i.e., how relevant is document X to
both "baby kittens" and "death metal", or if "baby kittens" is present
in a doc, how likely is it that "death metal" is present too), but using
a TF-IDF-based score as an absolute measure is like comparing apples
with pears. Trying to normalize it is an ill-defined task.
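As a sketch of the filter idea (the "quality" field, the minQuality threshold and the method name are all hypothetical; you would have to compute and store such a value yourself at index time):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    // Keep only hits whose query-independent "quality" (e.g. a
    // PageRank-like value stored at index time) clears a threshold,
    // regardless of the query-dependent score.
    static List<ScoreDoc> filterByQuality(IndexSearcher searcher, TopDocs topDocs,
                                          float minQuality) throws IOException {
        List<ScoreDoc> kept = new ArrayList<ScoreDoc>();
        for (ScoreDoc sd : topDocs.scoreDocs) {
            Document doc = searcher.doc(sd.doc); // load stored fields
            float quality = Float.parseFloat(doc.get("quality"));
            if (quality >= minQuality) {
                kept.add(sd);
            }
        }
        return kept;
    }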

-- J.D.


