Posted to java-user@lucene.apache.org by Scott Smith <ss...@mainstreamdata.com> on 2006/12/09 02:25:45 UTC

de-boosting fields

I have a collection of documents for which I've always returned the
results sorted on the date/time of the document (using a sort object in
the search method on my Searcher).  It works great.

 

Suddenly, I have a requirement to return the documents in relevancy
order.  So, that's easy (I thought); simply call search() without a sort
object.  Unfortunately, the results I got were not what I expected.  So,
I added some code to have lucene explain how it was getting the score
and then things became clearer.  

 

Each document has all of the words in the document indexed in a field
called "Body" (vanilla unstored, indexed field).  However, there is also
some category information which is kept in a keyword field called
"Category".  A document may belong to a large number of categories
(10-70).  

 

When I search, I generate a query which says "give me all of the
documents, in relevancy order, which contain one or more of the
following words: word1, word2, word3, and it must also be in at least
one of the following categories: category1, category2, ..., categoryN."

 

What I found was that lucene was using the category information as part
of what it uses to compute the relevancy score (in hindsight, not too
surprising).  The problem is that the numbers from the category hits in
"Category" overwhelm the numbers from the word hits in the "Body".  So,
my most relevant document may only have a single word hit and a document
way down in the list (in terms of relevancy) might have a number of word
hits.  For example, in one search, the top scoring document scored
.2650.  Of that, the category information contributed .2635 to that
score, meaning the word hits only contributed .0015 to the relevancy.
This is the opposite of what I want.
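
As a quick check of the figures quoted above (simple arithmetic on the numbers in this message, nothing taken from Lucene itself), the shares of the top score work out roughly as follows:

```latex
\[
\frac{0.2635}{0.2650} \approx 0.994
\qquad\text{and}\qquad
\frac{0.0015}{0.2650} \approx 0.006
\]
```

So about 99.4% of that document's score came from the "Category" matches and well under 1% from the "Body" matches.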

 

I'd be happy to simply eliminate the category information from the score
computation altogether (base relevancy scores only on the words which
hit in the "Body" field).  Another solution would be to change the boost
on the category information to some small number (zero?) or raise the
Body field boost to a much larger number or both.

 

What is the best way to do this?  Is changing the boost the right
answer?  Can a field's boost be zero?  Is there a way to write a custom
scorer that gets inserted somewhere?  Any suggestions would be
appreciated.

 

Scott

RE: de-boosting fields

Posted by Scott Smith <ss...@mainstreamdata.com>.

"overkill" - I just meant that lucene walking through all of the category terms to compute the score when I know that part of the result will be zero because the boost is zero seems inefficient.

Based on your comment about filters, it sounds like zero boost is the way to go.  The category information can vary widely from search to search.

________________________________

From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
Sent: Sat 12/9/2006 4:01 PM
To: java-user@lucene.apache.org
Subject: RE: de-boosting fields

: I've googled for custom scorers and haven't found anything.  If anyone
: can point me to some posts, that would be appreciated.

you really don't need a custom Scorer for what you are describing.  custom
Scorers are used with custom Query classes, and there's really nothing
custom about the query you are trying to execute.

at a fundamental level, what you really want to do is construct a "Filter"
that restricts your results to a set of "categories", and then "Query"
against it using some "keywords" -- i pair those concepts up in that
order explicitly because you want your score to be based on the keywords
and not to be based on the categories -- Filtering on keywords and Querying
on categories would not be what you want.

that said, if you've already got a lot of code written that builds up a
boolean query on the keywords and the categories and passes it on to
another part of your code base to execute the search, then setting the
boost of the category queries to 0 is a perfectly valid way to go that
probably requires making the fewest number of changes ... it should have
the same effect as the Filter approach, and the only question is whether or
not it's fast enough for you and if your category constraints are reused
often enough to be worth caching as filters.

: Sounds like setting the boost to zero works (see Daniel Naber's post),
: but that seems like overkill.

"overkill" is a strange description ... can you clarify what you mean?

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: de-boosting fields

Posted by Chris Hostetter <ho...@fucit.org>.

: I've googled for custom scorers and haven't found anything.  If anyone
: can point me to some posts, that would be appreciated.

you really don't need a custom Scorer for what you are describing.  custom
Scorers are used with custom Query classes, and there's really nothing
custom about the query you are trying to execute.

at a fundamental level, what you really want to do is construct a "Filter"
that restricts your results to a set of "categories", and then "Query"
against it using some "keywords" -- i pair those concepts up in that
order explicitly because you want your score to be based on the keywords
and not to be based on the categories -- Filtering on keywords and Querying
on categories would not be what you want.

that said, if you've already got a lot of code written that builds up a
boolean query on the keywords and the categories and passes it on to
another part of your code base to execute the search, then setting the
boost of the category queries to 0 is a perfectly valid way to go that
probably requires making the fewest number of changes ... it should have
the same effect as the Filter approach, and the only question is whether or
not it's fast enough for you and if your category constraints are reused
often enough to be worth caching as filters.
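
The "Filter on categories, Query on keywords" split described above can be pictured with a small toy model. This is only a sketch under stated assumptions: it is plain Java over an in-memory list, not the Lucene API, and the names FilterThenScore, Doc, Hit and search are made up for illustration. The score here is just a count of keyword occurrences, which is far simpler than Lucene's real scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Toy in-memory model, NOT the Lucene API: the category test only decides
// membership, and the score comes only from keyword hits in the body.
class FilterThenScore {

    // Hypothetical document shape for this sketch.
    static final class Doc {
        final String id;
        final List<String> bodyWords;
        final Set<String> categories;

        Doc(String id, List<String> bodyWords, Set<String> categories) {
            this.id = id;
            this.bodyWords = bodyWords;
            this.categories = categories;
        }
    }

    // A matching document together with its keyword-only score.
    static final class Hit {
        final String id;
        final int score;

        Hit(String id, int score) {
            this.id = id;
            this.score = score;
        }
    }

    static List<Hit> search(List<Doc> docs, Set<String> keywords, Set<String> allowedCategories) {
        List<Hit> hits = new ArrayList<>();
        for (Doc doc : docs) {
            // "Filter" step: a yes/no test on categories that never touches the score.
            if (Collections.disjoint(doc.categories, allowedCategories)) {
                continue;
            }
            // "Query" step: the score is the number of keyword occurrences in the body.
            int score = 0;
            for (String word : doc.bodyWords) {
                if (keywords.contains(word)) {
                    score++;
                }
            }
            if (score > 0) {
                hits.add(new Hit(doc.id, score));
            }
        }
        // Highest score first; ties keep their input order because List.sort is stable.
        hits.sort((a, b) -> Integer.compare(b.score, a.score));
        return hits;
    }
}
```

In this model a document in many allowed categories ranks no higher than one in a single allowed category, which is the behavior being asked for in this thread.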

: Sounds like setting the boost to zero works (see Daniel Naber's post),
: but that seems like overkill.

"overkill" is a strange description ... can you clarify what you mean?



-Hoss



Re: de-boosting fields

Posted by Erick Erickson <er...@gmail.com>.

I meant search this mail archive....

Erick

On 12/9/06, Scott Smith <ss...@mainstreamdata.com> wrote:
>
> I've googled for custom scorers and haven't found anything.  If anyone can
> point me to some posts, that would be appreciated.
>
> Sounds like setting the boost to zero works (see Daniel Naber's post), but
> that seems like overkill.
>
> I'll take a look at filters as well.

RE: de-boosting fields

Posted by Scott Smith <ss...@mainstreamdata.com>.

I've googled for custom scorers and haven't found anything.  If anyone can point me to some posts, that would be appreciated.  
 
Sounds like setting the boost to zero works (see Daniel Naber's post), but that seems like overkill.
 
I'll take a look at filters as well.

________________________________

From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Fri 12/8/2006 7:06 PM
To: java-user@lucene.apache.org
Subject: Re: de-boosting fields



I've certainly seen references to writing custom scorers, so it's possible.
You might find valuable hints by searching the mail archive. I'll leave it
to the more expert folks to suggest which is your best option.

Although (and I'm talking beyond my competence here), it *may* work for you
to assemble a Filter for the category part of your query and use that
instead of including the category in your query. As I understand it, filters
don't contribute (or all contribute identically) to the score, leaving the
search you're doing on body to determine your relevance, which seems like
what you're after. Filters even work with something called a
ConstantScoreQuery as I remember, which is a hint <G>.

But again, don't be surprised if one of the more expert folks comes up with
a *much* better idea <G>

Best
Erick




Re: de-boosting fields

Posted by Erick Erickson <er...@gmail.com>.

I've certainly seen references to writing custom scorers, so it's possible.
You might find valuable hints by searching the mail archive. I'll leave it
to the more expert folks to suggest which is your best option.

Although (and I'm talking beyond my competence here), it *may* work for you
to assemble a Filter for the category part of your query and use that
instead of including the category in your query. As I understand it, filters
don't contribute (or all contribute identically) to the score, leaving the
search you're doing on body to determine your relevance, which seems like
what you're after. Filters even work with something called a
ConstantScoreQuery as I remember, which is a hint <G>.

But again, don't be surprised if one of the more expert folks comes up with
a *much* better idea <G>

Best
Erick




RE: de-boosting fields

Posted by Scott Smith <ss...@mainstreamdata.com>.

One other thing I discovered that I mention so no one else is tripped up
by it.  

I set the boost to zero for the categories in the query.  When I ran my
unit tests, some of them started to fail.  I eventually realized that
the failures were in searches where I only wanted to find documents in
the right category (i.e., I wasn't looking for documents with any
particular words in them); I would get zero hits.  I'm assuming this is
because even though the document belonged to a category I was looking
for, because lucene computed its score as 0, it assumed I didn't want
that document.

I guess that all makes sense, it just means I have to be careful as to
which queries I set the category boost to zero and which I don't.
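
The pitfall described above can be reproduced in a toy model. This is a sketch only: it is not Lucene source, and it simply takes at face value the assumption stated in this message that hits with a score of 0 are dropped. The names ZeroBoostPitfall, score, collect and chooseCategoryBoost are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, NOT Lucene source: a collector that keeps only hits whose
// score is strictly positive, which is the behavior assumed in this thread.
class ZeroBoostPitfall {

    // Combined score: the keyword part plus the boosted category part.
    static float score(float keywordScore, float categoryScore, float categoryBoost) {
        return keywordScore + categoryBoost * categoryScore;
    }

    // Keep only positive scores, mimicking the assumed collector behavior.
    static List<Float> collect(List<Float> scores) {
        List<Float> kept = new ArrayList<>();
        for (float s : scores) {
            if (s > 0.0f) {
                kept.add(s);
            }
        }
        return kept;
    }

    // Hypothetical guard: de-boost categories only when keywords can supply a score.
    static float chooseCategoryBoost(boolean hasKeywords) {
        return hasKeywords ? 0.0f : 1.0f;
    }
}
```

With a category-only search the keyword part is 0, so a category boost of 0 leaves every score at 0 and nothing is collected. Choosing the boost based on whether the query has any keywords avoids that case.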

-----Original Message-----
From: Scott Smith [mailto:ssmith@mainstreamdata.com] 
Sent: Tuesday, December 12, 2006 3:31 PM
To: java-user@lucene.apache.org
Subject: RE: de-boosting fields

I've implemented the zero boost solution and it seems to be doing what I
want.  Thanks to everyone who had suggestions.


RE: de-boosting fields

Posted by Scott Smith <ss...@mainstreamdata.com>.

I've implemented the zero boost solution and it seems to be doing what I
want.  Thanks to everyone who had suggestions.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Monday, December 11, 2006 11:45 AM
To: java-user@lucene.apache.org
Subject: Re: de-boosting fields

: Isn't it also true that using Field.Index.NO_NORMS when creating the field will
: remove it from the scoring formula?  I thought I read that somewhere, but now
: can't find where.

queries on fields with NO_NORMS will still contribute to the score, but
the field *length* and/or field boosts won't contribute to the score.

-Hoss


Re: de-boosting fields

Posted by Chris Hostetter <ho...@fucit.org>.

: Isn't it also true that using Field.Index.NO_NORMS when creating the field will
: remove it from the scoring formula?  I thought I read that somewhere, but now
: can't find where.

queries on fields with NO_NORMS will still contribute to the score, but
the field *length* and/or field boosts won't contribute to the score.



-Hoss



Re: de-boosting fields

Posted by Antony Bowesman <ad...@teamware.com>.

Daniel Naber wrote:
> On Saturday 09 December 2006 02:25, Scott Smith wrote:
> 
>> What is the best way to do this?  Is changing the boost the right
>> answer?  Can a field's boost be zero?
> 
> Yes, just use: term1 term2 category1^0 category2^0. Erick's Filter idea is 
> also useful.

Isn't it also true that using Field.Index.NO_NORMS when creating the field will 
remove it from the scoring formula?  I thought I read that somewhere, but now 
can't find where.

Antony



Re: de-boosting fields

Posted by Daniel Naber <lu...@danielnaber.de>.

On Saturday 09 December 2006 02:25, Scott Smith wrote:

> What is the best way to do this?  Is changing the boost the right
> answer?  Can a field's boost be zero?

Yes, just use: term1 term2 category1^0 category2^0. Erick's Filter idea is 
also useful.
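
A minimal sketch of building such a query string for the case described in this thread, assuming the field names "Body" and "Category" from the original post and the query parser's "+" (required) and "^" (boost) syntax. The class and method names are hypothetical. The sketch does no escaping of special characters and does not address how an analyzer would treat terms of a keyword field.

```java
import java.util.List;
import java.util.stream.Collectors;

// Builds a query-parser style string; it does not call Lucene itself.
class DeboostedQueryString {

    // At least one word must match in Body, and at least one category must
    // match in Category; the "^0" suffix zeroes each category term's boost.
    static String build(List<String> words, List<String> categories) {
        String wordClause = words.stream()
                .map(w -> "Body:" + w)
                .collect(Collectors.joining(" "));
        String categoryClause = categories.stream()
                .map(c -> "Category:" + c + "^0")
                .collect(Collectors.joining(" "));
        return "+(" + wordClause + ") +(" + categoryClause + ")";
    }
}
```

For the words word1 and word2 and the categories category1 and category2, this produces: +(Body:word1 Body:word2) +(Category:category1^0 Category:category2^0)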

Regards
 Daniel

-- 
http://www.danielnaber.de
