Posted to java-user@lucene.apache.org by Eric Jain <Er...@isb-sib.ch> on 2006/01/09 10:34:36 UTC

Scoring by number of terms in field

Lucene seems to prefer matches in shorter documents. Is it possible to 
influence the scoring mechanism to have matches in shorter fields score 
higher instead?

For example, a query for "europe" should rank:

1. title:"Europe"
2. title:"History of Europe"
3. title:"Travel in Europe, Middle East and Africa"
4. subtitle:"Fairy Tales from Europe"

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Scoring by number of terms in field

Posted by Eric Jain <Er...@isb-sib.ch>.

Paul Elschot wrote:
> In case you prefer to use the maximum score over the clauses you
> can use the DisjunctionMaxQuery from the development version.

Yes, that may help! I'll need to have a look...

Re: Scoring by number of terms in field

Posted by Paul Elschot <pa...@xs4all.nl>.

On Tuesday 10 January 2006 07:32, Eric Jain wrote:
> Paul Elschot wrote:
> >>For example, a query for "europe" should rank:
> >>
> >>1. title:"Europe"
> >>2. title:"History of Europe"
> >>3. title:"Travel in Europe, Middle East and Africa"
> >>4. subtitle:"Fairy Tales from Europe"
> > 
> > Perhaps with this query (assuming the default implicit OR):
> > 
> > title:europe subtitle:europe^0.5 body:europe
> 
> This will ensure that match 4 appears at the end, but as far as I can see 
> this won't help with getting matches 1-3 ordered correctly? Note that match 
> 1 for example may have a "description" field that contains a lot of terms, but 
> no mention of the query term.

In general, the length of a field that does not match does not influence
the score, but it's still not easy to predict the order in this case.
With an OR over multiple fields:
- when a shorter field matches, it usually dominates the score for a
document.
- when other fields match, they contribute to the score via the sum
score over the clauses and via the coordination factor in the
DefaultSimilarity.

In case you prefer to use the maximum score over the clauses you
can use the DisjunctionMaxQuery from the development version.
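
The difference between the two combination modes described above can be sketched in plain Java, without Lucene on the classpath. This is only an illustration: the class name, method names and per-clause scores are hypothetical, and the formulas are a simplified model (sum of clause scores times the fraction of matching clauses, versus maximum clause score plus an optional tie-breaker share of the rest), not Lucene's actual scorer code.

```java
// Simplified model of two ways to combine per-field clause scores.
// All names and numbers here are hypothetical, for illustration only.
class ClauseCombinationSketch {

    // Boolean-OR style: sum of the matching clause scores, multiplied by a
    // coordination factor (matching clauses / total clauses).
    static double sumWithCoord(double[] clauseScores) {
        if (clauseScores.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        int matched = 0;
        for (double s : clauseScores) {
            if (s > 0.0) {
                sum += s;
                matched++;
            }
        }
        return sum * matched / clauseScores.length;
    }

    // DisjunctionMax style: the best clause score, plus tieBreaker times the
    // scores of the remaining clauses (tieBreaker = 0 means pure maximum).
    static double disjunctionMax(double[] clauseScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Clause order: title, subtitle, body. A score of 0 means no match.
        double[] shortTitleOnly = {1.0, 0.0, 0.0};   // e.g. title:"Europe"
        double[] longTitlePlusOthers = {0.38, 0.25, 0.30};

        System.out.println("sum*coord, short title: " + sumWithCoord(shortTitleOnly));
        System.out.println("sum*coord, long title:  " + sumWithCoord(longTitlePlusOthers));
        System.out.println("max,       short title: " + disjunctionMax(shortTitleOnly, 0.0));
        System.out.println("max,       long title:  " + disjunctionMax(longTitlePlusOthers, 0.0));
    }
}
```

With these made-up numbers, the document that matches weakly in three fields beats the exact short-title match under the sum, while the maximum puts the short title first, which is closer to the ordering asked for.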

Regards,
Paul Elschot

Re: Scoring by number of terms in field

Posted by Eric Jain <Er...@isb-sib.ch>.

Paul Elschot wrote:
>>For example, a query for "europe" should rank:
>>
>>1. title:"Europe"
>>2. title:"History of Europe"
>>3. title:"Travel in Europe, Middle East and Africa"
>>4. subtitle:"Fairy Tales from Europe"
> 
> Perhaps with this query (assuming the default implicit OR):
> 
> title:europe subtitle:europe^0.5 body:europe

This will ensure that match 4 appears at the end, but as far as I can see 
this won't help with getting matches 1-3 ordered correctly? Note that match 
1 for example may have a "description" field that contains a lot of terms, but 
no mention of the query term.

Re: Scoring by number of terms in field

Posted by Paul Elschot <pa...@xs4all.nl>.

On Monday 09 January 2006 10:34, Eric Jain wrote:
> Lucene seems to prefer matches in shorter documents. Is it possible to 
> influence the scoring mechanism to have matches in shorter fields score 
> higher instead?

A query always matches in at least one field of a document, not in the
document as a whole.

> 
> For example, a query for "europe" should rank:
> 
> 1. title:"Europe"
> 2. title:"History of Europe"
> 3. title:"Travel in Europe, Middle East and Africa"
> 4. subtitle:"Fairy Tales from Europe"

Perhaps with this query (assuming the default implicit OR):

title:europe subtitle:europe^0.5 body:europe

Regards,
Paul Elschot

Re: Scoring by number of terms in field

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Sorry for the quick reply, but yes, you can accomplish this with a
custom Similarity implementation (or a DefaultSimilarity subclass).
Check out IndexSearcher.explain on a query and a document, and then
tinker.
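
As a rough illustration of what such tinkering could change, the sketch below compares two length normalization functions in plain Java, with no Lucene dependency. The first, 1/sqrt(numTerms), is the formula DefaultSimilarity used for lengthNorm at the time. The second, 1/numTerms, is a hypothetical steeper alternative, not something Lucene ships. The class name, the method names and the whitespace term counting are all assumptions made for the example; a real analyzer may drop stop words such as "of" and "in".

```java
// Plain-Java illustration of field length normalization.
// Names and the whitespace tokenization are assumptions for this sketch.
class LengthNormSketch {

    // DefaultSimilarity-style norm: 1 / sqrt(number of terms in the field).
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Hypothetical steeper norm that penalizes longer fields more strongly.
    static double steeperLengthNorm(int numTerms) {
        return 1.0 / numTerms;
    }

    // Crude term count: split on whitespace (a real analyzer differs).
    static int countTerms(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String[] titles = {
            "Europe",
            "History of Europe",
            "Travel in Europe, Middle East and Africa"
        };
        for (String title : titles) {
            int n = countTerms(title);
            System.out.printf("%-42s terms=%d default=%.3f steeper=%.3f%n",
                    title, n, defaultLengthNorm(n), steeperLengthNorm(n));
        }
    }
}
```

In Lucene itself, the corresponding hook in that era was lengthNorm(String fieldName, int numTerms) on Similarity, which a DefaultSimilarity subclass can override. Norms are computed at index time and stored with reduced precision, so a change there most likely requires reindexing, and fields of similar length may still end up with the same norm. Check the javadoc of the version in use before relying on this.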

	Erik


On Jan 9, 2006, at 4:34 AM, Eric Jain wrote:

> Lucene seems to prefer matches in shorter documents. Is it possible  
> to influence the scoring mechanism to have matches in shorter  
> fields score higher instead?
>
> For example, a query for "europe" should rank:
>
> 1. title:"Europe"
> 2. title:"History of Europe"
> 3. title:"Travel in Europe, Middle East and Africa"
> 4. subtitle:"Fairy Tales from Europe"

