Posted to java-user@lucene.apache.org by Andy Liu <an...@gmail.com> on 2006/09/18 23:08:06 UTC

Lopsided scores for each term in BooleanQuery

For multi-word queries, I would like to reward documents that contain a more
even distribution of each word and penalize documents that have a skewed
distribution.  For example, if my search query is:

+content:fast +content:car

I would prefer a document that contains each word an equal number of times
over a document that contains the word "fast" 100 times and the word "car"
only once.  In other words, I would like to compare the scores of the
individual BooleanQuery clauses and adjust the overall score according to
how evenly they are distributed.

Can somebody point me in the right direction as to how I would implement
this?

Thanks,
Andy

Re: Lopsided scores for each term in BooleanQuery

Posted by Andy Liu <an...@gmail.com>.
In our application we have multiple fields that are searched.  So fast car
becomes:

+(field1:fast field2:fast field3:fast) +(field1:car field2:car field3:car)

I understand that the default sqrt implementation of tf() would help the
"lopsided score" phenomenon with searches within the same field.  But when
searching in multiple fields, this effect is obscured since each matching
field adds to the score of that clause.  Is there a way to "peek" at the
scores of each clause, and adjust based on how divergent the scores are?  Or
is there an easier way to do this that I'm just not seeing?
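
To make the multi-field problem concrete, here is a minimal plain-Java sketch.
It is not Lucene code and does not show how to obtain per-clause scores from
a BooleanQuery. It only models the sqrt tf factor summed over three fields
per clause, and then applies a hypothetical "balance" factor (weakest clause
divided by strongest clause). The class name, method names, the balance
formula, and the term frequencies are all assumptions made up for
illustration.

```java
public class ClauseBalanceDemo {
    // Score of one clause: sum of sqrt(tf) over the fields it searches.
    // idf, norms and coord are ignored (simplifying assumption).
    static double clauseScore(int[] fieldFreqs) {
        double sum = 0;
        for (int freq : fieldFreqs) {
            sum += Math.sqrt(freq);
        }
        return sum;
    }

    // Unadjusted score: plain sum of the clause scores.
    static double rawScore(int[][] freqsPerClause) {
        double sum = 0;
        for (int[] clause : freqsPerClause) {
            sum += clauseScore(clause);
        }
        return sum;
    }

    // Hypothetical balance factor in [0, 1]: weakest clause / strongest clause.
    static double balance(int[][] freqsPerClause) {
        double min = Double.MAX_VALUE;
        double max = 0;
        for (int[] clause : freqsPerClause) {
            double s = clauseScore(clause);
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        return max == 0 ? 0 : min / max;
    }

    // Raw score scaled down when the clause scores diverge.
    static double adjustedScore(int[][] freqsPerClause) {
        return rawScore(freqsPerClause) * balance(freqsPerClause);
    }

    public static void main(String[] args) {
        // Rows: clause "fast", clause "car". Columns: field1, field2, field3.
        int[][] lopsided = { {30, 30, 40}, {1, 0, 0} };
        int[][] balanced = { {2, 1, 1}, {1, 2, 1} };

        System.out.println("raw lopsided:      " + rawScore(lopsided));
        System.out.println("raw balanced:      " + rawScore(balanced));
        System.out.println("adjusted lopsided: " + adjustedScore(lopsided));
        System.out.println("adjusted balanced: " + adjustedScore(balanced));
    }
}
```

With these made-up frequencies the raw sum ranks the lopsided document
higher, while the adjusted score ranks the balanced document higher, which
is the behavior I am after.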

Andy

On 9/18/06, Paul Elschot <pa...@xs4all.nl> wrote:
>
> On Monday 18 September 2006 23:08, Andy Liu wrote:
> > For multi-word queries, I would like to reward documents that contain a
> > more even distribution of each word and penalize documents that have a
> > skewed distribution.  For example, if my search query is:
> >
> > +content:fast +content:car
> >
> > I would prefer a document that contains each word an equal number of
> > times over a document that contains the word "fast" 100 times and the
> > word "car" 1 time.  In other words, I would like to compare the scores
> > of each BooleanQuery term and adjust the score according to the
> > distribution.
> >
> > Can somebody point me in the right direction as to how I would implement
> > this?
>
> It's already there in DefaultSimilarity.tf() which is the square root:
>
> (sqrt(1) + sqrt(1)) > (sqrt(0) + sqrt(2))
>
>
> Regards,
> Paul Elschot
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Lopsided scores for each term in BooleanQuery

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 18 September 2006 23:08, Andy Liu wrote:
> For multi-word queries, I would like to reward documents that contain a more
> even distribution of each word and penalize documents that have a skewed
> distribution.  For example, if my search query is:
> 
> +content:fast +content:car
> 
> I would prefer a document that contains each word an equal number of times
> over a document that contains the word "fast" 100 times and the word "car" 1
> time.  In other words, I would like to compare the scores of each
> BooleanQuery term and adjust the score according to the distribution.
> 
> Can somebody point me in the right direction as to how I would implement
> this?

This is already built in: DefaultSimilarity.tf() returns the square root of
the term frequency, so an even distribution scores higher than a skewed one:

(sqrt(1) + sqrt(1)) > (sqrt(0) + sqrt(2))
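
The same arithmetic can be checked with a small stand-alone sketch. It
models only the square-root tf factor; idf, length norms and coord are left
out, and the class and method names are made up for illustration rather
than taken from Lucene.

```java
public class SqrtTfDemo {
    // Square-root term-frequency factor, as in the inequality above.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Sum of the tf factors of a two-term query for one document.
    static double tfSum(int fastFreq, int carFreq) {
        return tf(fastFreq) + tf(carFreq);
    }

    public static void main(String[] args) {
        // The two sides of the inequality above.
        System.out.println(tfSum(1, 1));
        System.out.println(tfSum(0, 2));

        // Same total of 101 occurrences, even vs. skewed split.
        System.out.println(tfSum(50, 51));
        System.out.println(tfSum(100, 1));
    }
}
```

For the same total number of occurrences, the even split (50 and 51) sums
to roughly 14.2, while the skewed split (100 and 1) sums to 11.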


Regards,
Paul Elschot
