Posted to java-user@lucene.apache.org by "Matthew W. Bilotti" <mb...@csail.mit.edu> on 2004/04/29 20:09:40 UTC

Help with scoring, coordination factor?

Dear Lucene Users,

We are using Lucene 1.4 RC2, and are experiencing curious results that we 
think are related to the coordination term.  Apparently the default 
implementation for coordination is:

(# query terms matched in a document)/(total terms in query)

That seems to imply that, given the queries "A v B v C v D", a disjunction
of four terms, and "A ^ B ^ C ^ D", a conjunction of four terms, a
document containing only A would have a coordination score of 1/4
in either case.
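
As a minimal sketch of the formula above (plain Java rather than Lucene's own code; the class and method names are made up for illustration), the coordination factor for a four-term query could be computed like this:

```java
// Dependency-free sketch of the default coordination factor; not Lucene code.
final class CoordSketch {

    /** Fraction of the query's clauses that a document matched. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // One of four terms matched (the document contains only A).
        System.out.println(coord(1, 4)); // prints 0.25
        // All four terms matched.
        System.out.println(coord(4, 4)); // prints 1.0
    }
}
```

Note that a document containing only A cannot match the pure conjunction at all, since every clause is required, so in practice the 1/4 factor would only show up for the disjunction.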

We understand the semantics for coordination where the conjunction of
terms is involved, but for our purposes, we would want coordination for
the disjunction to behave differently.  Take for example these two queries
(1) "A ^ B ^ C", a conjunction of 3 terms, and (2)  "(A v A1 v A2 v A3) ^
(B v B1) ^ (C v C1 v C2)", a conjunction of 3 disjunctions, each of which
contains related terms.

We would like to see a document containing A, B and C have the same 
coordination score regardless of which query we were using.  To us, it 
makes sense to model the disjunction "A" as being a single term that 
matches no matter which of its variants (A, A1, A2 or A3) appears in the
document.

The results we are seeing show documents we are interested in (say, ones
that contain A, B and C) taking a rank penalty when we use query (2) 
rather than query (1).  We suspect the coordination term is driving down 
these documents' ranks and we would like to bring those documents back up 
to where they should be. 
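
The suspected penalty can be illustrated with a small model, assuming that each nested disjunction applies its own coordination factor to the sum of its matching terms' scores. The per-term score of 1.0 is invented, and tf, idf and queryNorm are ignored, so only the relative effect is meaningful. All names are hypothetical.

```java
// Dependency-free model of nested coordination; not Lucene code.
final class NestedCoordSketch {

    /** Default-style coordination factor: matched clauses / total clauses. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Score of one disjunction group in which exactly one variant matches. */
    static float groupScore(float termScore, int variantsInGroup) {
        return termScore * coord(1, variantsInGroup);
    }

    /** Query (1): A ^ B ^ C, all three terms match. */
    static float flatScore(float termScore) {
        return (termScore + termScore + termScore) * coord(3, 3);
    }

    /** Query (2): groups of 4, 2 and 3 variants, one match in each group. */
    static float nestedScore(float termScore) {
        float sum = groupScore(termScore, 4)
                  + groupScore(termScore, 2)
                  + groupScore(termScore, 3);
        // All three groups match, so the outer factor is 3/3.
        return sum * coord(3, 3);
    }

    public static void main(String[] args) {
        float s = 1.0f; // hypothetical score of each matching term
        System.out.println("query (1): " + flatScore(s));
        System.out.println("query (2): " + nestedScore(s));
    }
}
```

Under these assumptions the same document scores 3.0 for query (1) but only about 1.08 for query (2), because the inner factors 1/4, 1/2 and 1/3 scale down each group.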

Is there a relatively easy way to implement what we want using Lucene?  
Would it be better to try to supply a Similarity class with a
special-purpose coord method, or would it be better to try to subclass Term
to create some kind of term "glob" that would match any of a number of
strings (a disjunction)?

Any advice you can give us would be greatly appreciated!  Thanks in 
advance!

Best regards,
Matthew

PS: Is it now possible by any chance to merge documents retrieved by two 
Lucene queries by score, owing perhaps to the queryNorm factor?  Just 
curious.

-- 
matthew w. bilotti
computer science and artificial intelligence laboratory
massachusetts institute of technology





---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Help with scoring, coordination factor?

Posted by Julien Nioche <Ju...@lingway.com>.

[I am moving this discussion to the dev list]

> Then use this in place of BooleanQuery when you don't want coordination
> scoring.  I think that should do the trick.

In my case it works perfectly. As we generate multilingual and semantic
expansions of the original words of a query, the coordination factor was
giving lower scores to words with a lot of semantic or morphological variants.
The Query objects that extend BooleanQuery (let's call them WordQueries
just for clarity of the explanation) are combined into a BooleanQuery object
using the default coord factor.

What I'd like to do now is to give those WordQueries an indication
of relevancy. For example, if I have the following user query: "generation
of semantic variants", our system will decide that 'generation' and its
variants (generations, generated, ...) are not particularly important compared
to the term 'variants', which is in turn less important than 'semantic'. Let's
give the terms the following relevancy scores:
generation = 1
semantic = 3
variants = 2

This idea that a given term is carrying more or less information for a given
domain is behind the tf/idf weighting.

Let's take an example. We try this query on an index, but no document is
found that matches all the WordQueries. Instead we get a document containing
one or more expansions of the WordQuery 'generation' and one or more
expansions of the WordQuery 'variants' (i.e. a document with the following
text: "... the generated variant is ..."). On the other hand, we find another
document matching the WordQuery 'semantic' and the WordQuery 'variants'.

In the first case the score would be:
(score WordQuery 'generation' + score WordQuery 'variants')*(2/3)
and in the second:
(score WordQuery 'semantic' + score WordQuery 'variants')*(2/3)

Whatever the scores may be for each WordQuery, what I'd like to have is:

score for the first document:
(score WordQuery 'generation' + score WordQuery 'variants')*((1+2)/(1+2+3))
score for the second:
(score WordQuery 'semantic' + score WordQuery 'variants')*((3+2)/(1+2+3))
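
The weighted factor described above could be sketched as follows. This is plain Java with hypothetical names, not Lucene code: it only computes the ratio (sum of relevancies of the matched WordQueries) / (sum of all relevancies) for the two example documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Dependency-free sketch of a relevancy-weighted coordination factor; not Lucene code.
final class WeightedCoordSketch {

    /** Relevancy scores from the example query "generation of semantic variants". */
    static Map<String, Integer> exampleRelevancy() {
        Map<String, Integer> relevancy = new LinkedHashMap<>();
        relevancy.put("generation", 1);
        relevancy.put("semantic", 3);
        relevancy.put("variants", 2);
        return relevancy;
    }

    /** Sum of the relevancies of the matched WordQueries over the sum of all relevancies. */
    static float weightedCoord(Map<String, Integer> relevancy, Set<String> matched) {
        int total = 0;
        int hit = 0;
        for (Map.Entry<String, Integer> e : relevancy.entrySet()) {
            total += e.getValue();
            if (matched.contains(e.getKey())) {
                hit += e.getValue();
            }
        }
        return total == 0 ? 0.0f : hit / (float) total;
    }

    public static void main(String[] args) {
        Map<String, Integer> relevancy = exampleRelevancy();
        Set<String> first = new HashSet<>(Arrays.asList("generation", "variants"));
        Set<String> second = new HashSet<>(Arrays.asList("semantic", "variants"));
        System.out.println(weightedCoord(relevancy, first));  // (1+2)/(1+2+3)
        System.out.println(weightedCoord(relevancy, second)); // (3+2)/(1+2+3)
    }
}
```

The coord(int overlap, int max) signature quoted in this thread only receives counts, so wiring such a factor into Lucene would presumably need the scorer to know which clauses matched; that part is not shown here.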

I created a new type of Query extending BooleanQuery that combines
WordQueries. However, the coordination information is currently boolean:
it only indicates whether or not a given Query matches a document.

Does anyone have any idea how I can achieve this?

Thanks a lot


Re: Help with scoring, coordination factor?

Posted by Doug Cutting <cu...@apache.org>.

Matthew W. Bilotti wrote:
> We suspect the coordination term is driving down 
> these documents' ranks and we would like to bring those documents back up 
> to where they should be. 

That sounds right to me.

> Is there a relatively easy way to implement what we want using Lucene?  
> Would it be better to try to supply a Similarity class with a
> special-purpose coord method  [ ... ]

I think this is a good approach.

In 1.4, you can do something like:

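import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Similarity;

// A BooleanQuery whose clauses are scored without the coordination factor.
public class NoCoordBooleanQuery extends BooleanQuery {

   // Default scoring, except that coord() no longer rewards matching more clauses.
   private static final Similarity SIMILARITY = new DefaultSimilarity() {
     public float coord(int overlap, int max) {
       return 1.0f;
     }
   };

   // Use the coord-free Similarity for this query only, whatever the searcher uses.
   public Similarity getSimilarity(Searcher searcher) {
     return SIMILARITY;
   }

}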

Then use this in place of BooleanQuery when you don't want coordination 
scoring.  I think that should do the trick.
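
To see what a constant coord does to the nested query from the original question, here is a dependency-free model (not Lucene code; the Coord interface, the method names and the per-term score of 1.0 are all assumptions made for illustration). The inner disjunction groups use either the default factor or the constant one, while the outer conjunction keeps the default.

```java
// Dependency-free model of swapping the coordination factor; not Lucene code.
final class NoCoordSketch {

    /** Stand-in for the single Similarity method that matters here. */
    interface Coord {
        float coord(int overlap, int maxOverlap);
    }

    /** Default behaviour: matched clauses / total clauses. */
    static final Coord DEFAULT = (overlap, max) -> overlap / (float) max;

    /** Behaviour of the override above: always 1.0. */
    static final Coord NO_COORD = (overlap, max) -> 1.0f;

    /** One variant out of `variants` matches, with a hypothetical term score. */
    static float groupScore(Coord inner, float termScore, int variants) {
        return termScore * inner.coord(1, variants);
    }

    /** (A v A1 v A2 v A3) ^ (B v B1) ^ (C v C1 v C2), document contains A, B, C. */
    static float queryTwoScore(Coord inner, float termScore) {
        float sum = groupScore(inner, termScore, 4)
                  + groupScore(inner, termScore, 2)
                  + groupScore(inner, termScore, 3);
        return sum * DEFAULT.coord(3, 3); // outer conjunction: 3 of 3 groups match
    }

    public static void main(String[] args) {
        System.out.println("default inner coord: " + queryTwoScore(DEFAULT, 1.0f));
        System.out.println("constant inner coord: " + queryTwoScore(NO_COORD, 1.0f));
    }
}
```

In this model the constant factor brings the nested query back to the same total as the flat query "A ^ B ^ C", while the outer query still penalizes documents that miss a whole group.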

Doug


Re: Help with scoring, coordination factor?

Posted by Ype Kingma <yk...@xs4all.nl>.

On Thursday 29 April 2004 20:09, Matthew W. Bilotti wrote:

I can't help you with your first question about coordination
of disjunctions in conjunctions.

Actually, I would like to have the possibility to provide
all terms in an OR query with the same idf weight, e.g. some
average of their IDFs, to reflect that they have the same
importance in the query. But that is a slightly different subject.
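
A rough sketch of the averaging idea, in plain Java with hypothetical names: the idf formula below has the same shape as the one DefaultSimilarity uses, log(numDocs / (docFreq + 1)) + 1, and the arithmetic mean is just one possible choice of "some average". How to feed the shared value back into Lucene's term weights is not shown.

```java
// Dependency-free sketch of a shared, averaged idf for OR'ed terms; not Lucene code.
final class AverageIdfSketch {

    /** Same shape as DefaultSimilarity's idf: log(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Arithmetic mean of the idf values of all terms in the disjunction. */
    static float averageIdf(int[] docFreqs, int numDocs) {
        if (docFreqs.length == 0) {
            throw new IllegalArgumentException("need at least one term");
        }
        float sum = 0.0f;
        for (int docFreq : docFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum / docFreqs.length;
    }

    public static void main(String[] args) {
        int numDocs = 1000;            // invented index size
        int[] docFreqs = {9, 99};      // invented document frequencies
        System.out.println("idf of rare term:   " + idf(9, numDocs));
        System.out.println("idf of common term: " + idf(99, numDocs));
        System.out.println("shared average idf: " + averageIdf(docFreqs, numDocs));
    }
}
```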

...
> PS: Is it now possible by any chance to merge documents retrieved by two
> Lucene queries by score, owing perhaps to the queryNorm factor?  Just
> curious.

That's what the query norm is for.
The best comparisons are between queries that differ only in term weights.
But the default scoring method is complex enough to lead to surprises
when the queries differ in more than that.

Regards,
Ype
