
Posted to java-user@lucene.apache.org by Boris Goldowsky <bo...@alum.mit.edu> on 2004/03/17 21:12:57 UTC

Demoting results

Is there any way to build a query where the occurrence of a particular
Term (in a Keyword field) causes the rank of the document to be
decreased?  I have various types of documents, and some of them are less
interesting than others, so I want them to be pushed towards the bottom
of the results ranking.  However, I do not want to eliminate them
entirely, so I can't use a boolean not.

Using negative weights would seem logical here, but apparently has no
effect on rankings - negative weights appear to be treated as zeros.

Any ideas would be appreciated.

Thanks,
Boris


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Demoting results

Posted by Boris Goldowsky <bo...@alum.mit.edu>.

On Fri, 2004-03-19 at 11:58, Doug Cutting wrote:
> Doug Cutting wrote:
> >> On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
> >>
> >>> Have you tried assigning these very small boosts (0 < boost < 1) and 
> >>> assigning other query clauses relatively large boosts (boost > 1)?
> > 
> > I don't think you understood my proposal.  You should try boosting the 
> > documents when you add them.  Instead of adding a "doctype" field with 
> > "good" and "bad" values, use Document.setBoost(0.01) at index time.
> 
> Sorry.  My mistake.  You did understand my proposal, it was just a bad 
> proposal.  Boosting documents is a better approach, but is less 
> flexible.  I think the final proposal in my previous message might be 
> the best approach (defining a custom coordination function for these 
> query clauses).

Thanks for the ideas - I love the flexibility of Lucene that there are
so many ways to accomplish what at first seemed so difficult.

Boris




Re: Demoting results

Posted by Doug Cutting <cu...@apache.org>.

Doug Cutting wrote:
>> On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
>>
>>> Have you tried assigning these very small boosts (0 < boost < 1) and 
>>> assigning other query clauses relatively large boosts (boost > 1)?
> 
> I don't think you understood my proposal.  You should try boosting the 
> documents when you add them.  Instead of adding a "doctype" field with 
> "good" and "bad" values, use Document.setBoost(0.01) at index time.

Sorry.  My mistake.  You did understand my proposal, it was just a bad 
proposal.  Boosting documents is a better approach, but is less 
flexible.  I think the final proposal in my previous message might be 
the best approach (defining a custom coordination function for these 
query clauses).

Again, sorry for the false accusation,

Doug


Re: Demoting results

Posted by Doug Cutting <cu...@apache.org>.

Boris Goldowsky wrote:
> On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
>>Have you tried assigning these very small boosts (0 < boost < 1) and 
>>assigning other query clauses relatively large boosts (boost > 1)?
> 
> I was trying to formulate a query like, say 
>   +(title: asparagus) (doctype:bad)^-3
> 
> which would make sure the "bad" document was ranked lower than any other
> value for doctype.  But negative boosts are illegal. 
> 
> I tried your suggestion of putting large boost on the first clause and a
> small one (0.01) on the second, but the "bad" document is still ranked 
> higher than the good one -- it gets a slight improvement from the
> doctype:bad match, times 0.01, which is a very slight improvement but
> still positive.  Then it gets a big boost because it has a 1.0 rather
> than a 0.5 coordination factor, so the bad item gets top billing.

I don't think you understood my proposal.  You should try boosting the 
documents when you add them.  Instead of adding a "doctype" field with 
"good" and "bad" values, use Document.setBoost(0.01) at index time.

Also, you could disable coordination if you like by defining your own 
Similarity class.
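
To illustrate the index-time approach, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the Doc class, its fields, and the rank method are invented for illustration, and the only assumption about Lucene is that an index-time document boost acts as a multiplier on whatever score the query produces.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: an index-time boost scales a document's query score. */
final class DocBoostDemo {
    /** Hypothetical stand-in for an indexed document. */
    static final class Doc {
        final String id;
        final double queryScore; // score the query alone would give
        final double boost;      // value that would be set at index time

        Doc(String id, double queryScore, double boost) {
            this.id = id;
            this.queryScore = queryScore;
            this.boost = boost;
        }

        double score() {
            return queryScore * boost;
        }
    }

    /** Returns document ids ordered from highest to lowest boosted score. */
    static List<String> rank(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort((a, b) -> Double.compare(b.score(), a.score()));
        List<String> ids = new ArrayList<>();
        for (Doc d : sorted) {
            ids.add(d.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("bad", 0.9, 0.01));  // strong match, but demoted
        docs.add(new Doc("good", 0.6, 1.0));  // weaker match, default boost
        System.out.println(rank(docs));
    }
}
```

In this model the "bad" document stays in the result list but sorts below the "good" one, even though its raw query score is higher.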

> I think I've identified a few ways to solve the puzzle, though:
> 
> (a) enumerate all the possible "good" types of documents and search for
> them, rather than the single bad one.  Harder to maintain since doctypes
> can be introduced, but possible.

That would indeed work better in an additive scoring system like this.

> (b) attach boost values less than one to the "bad" Documents at indexing
> time.  Not as flexible as modifying the query, but plausible.

Yes, that's what I proposed.  You can reset boost values later now too.

> (c) a more complex query like this:
>  (title:asparagus) OR (title:asparagus -doctype:bad)
>  so for good documents both clauses will match and the coordination
> factor will be in their favor.  This increases query complexity (they
> aren't really simple one-term queries like this toy example), but
> hopefully that will not be a performance issue.

I think modifying the coordination function would be better.  Note that, 
in the current CVS codebase, you can modify the Similarity 
implementation on a per-clause basis.  So you could construct a query 
that had negative coordination, i.e., that gives lower scores when more 
clauses match.  This could be done by subclassing BooleanQuery and 
overriding its getSimilarity(Searcher) method.
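
The idea of negative coordination can be sketched in plain Java as follows. This is only a toy model of the arithmetic, not the BooleanQuery subclass itself: the Coord interface, the two sample functions, and the score method are hypothetical names, and the real scoring formula has more terms than a sum of boosts.

```java
/** Toy model of swapping the coordination function for one query. */
final class NegativeCoordDemo {
    /** Hypothetical coordination function: (matched clauses, total clauses) -> factor. */
    interface Coord {
        double factor(int overlap, int maxOverlap);
    }

    /** Usual behaviour: more matching clauses give a larger factor. */
    static final Coord STANDARD = (overlap, max) -> (double) overlap / max;

    /** Inverted behaviour: more matching clauses give a smaller factor. */
    static final Coord NEGATIVE = (overlap, max) -> (double) (max - overlap + 1) / max;

    /** Toy score: coordination factor times the summed boosts of the matched clauses. */
    static double score(Coord coord, double[] boosts, boolean[] matched) {
        int overlap = 0;
        double sum = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            if (matched[i]) {
                overlap++;
                sum += boosts[i];
            }
        }
        return coord.factor(overlap, boosts.length) * sum;
    }

    public static void main(String[] args) {
        double[] boosts = {1.0, 0.01};   // title:asparagus, doctype:bad^0.01
        boolean[] bad = {true, true};    // matches both clauses
        boolean[] good = {true, false};  // matches only the title clause
        System.out.println("standard: bad=" + score(STANDARD, boosts, bad)
                + " good=" + score(STANDARD, boosts, good));
        System.out.println("negative: bad=" + score(NEGATIVE, boosts, bad)
                + " good=" + score(NEGATIVE, boosts, good));
    }
}
```

With the standard function the "bad" document wins because it matches more clauses; with the inverted function the extra match costs it, so the "good" document comes out ahead.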

Doug


Re: Demoting results

Posted by Boris Goldowsky <bo...@alum.mit.edu>.

I asked:
> > Is there any way to build a query where the occurrence of a particular
> > Term (in a Keyword field) causes the rank of the document to be
> > decreased?

On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
> Have you tried assigning these very small boosts (0 < boost < 1) and 
> assigning other query clauses relatively large boosts (boost > 1)?

Thanks for the suggestion!  Unfortunately it doesn't have the desired
effect.  I wanted 
  title: asparagus
  various fields...
  doctype: bad

to score lower than 
  title: asparagus
  various similar fields...
  doctype: good

I was trying to formulate a query like, say 
  +(title: asparagus) (doctype:bad)^-3

which would make sure the "bad" document was ranked lower than any other
value for doctype.  But negative boosts are illegal. 

I tried your suggestion of putting large boost on the first clause and a
small one (0.01) on the second, but the "bad" document is still ranked 
higher than the good one -- it gets a slight improvement from the
doctype:bad match, times 0.01, which is a very slight improvement but
still positive.  Then it gets a big boost because it has a 1.0 rather
than a 0.5 coordination factor, so the bad item gets top billing.
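
For anyone following the arithmetic, the effect can be reproduced with a rough sketch in plain Java. It assumes a simplified score of (coordination factor) x (sum of the boosts of the matched clauses), which leaves out most of the real formula; the class and method names are made up for the example.

```java
/** Toy model of additive scoring with a coordination factor. */
final class CoordDemo {
    /**
     * Simplified score: (matched clauses / total clauses) * sum of matched boosts.
     * boosts[i] is the boost of clause i; matched[i] says whether the document hit it.
     */
    static double score(double[] boosts, boolean[] matched) {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            if (matched[i]) {
                hits++;
                sum += boosts[i];
            }
        }
        double coord = (double) hits / boosts.length;
        return coord * sum;
    }

    public static void main(String[] args) {
        double[] boosts = {1.0, 0.01}; // title clause, doctype:bad clause
        double bad = score(boosts, new boolean[] {true, true});
        double good = score(boosts, new boolean[] {true, false});
        // The tiny boost barely adds to the sum, but matching both
        // clauses doubles the coordination factor (1.0 versus 0.5).
        System.out.println("bad=" + bad + " good=" + good);
    }
}
```

Under this model no positive boost on the doctype:bad clause, however small, can push the "bad" document below the "good" one.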

I think I've identified a few ways to solve the puzzle, though:

(a) enumerate all the possible "good" types of documents and search for
them, rather than the single bad one.  Harder to maintain since doctypes
can be introduced, but possible.

(b) attach boost values less than one to the "bad" Documents at indexing
time.  Not as flexible as modifying the query, but plausible.

(c) a more complex query like this:
 (title:asparagus) OR (title:asparagus -doctype:bad)
 so for good documents both clauses will match and the coordination
factor will be in their favor.  This increases query complexity (they
aren't really simple one-term queries like this toy example), but
hopefully that will not be a performance issue.
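
Option (c) can be sketched in plain Java like this. It is a toy model under the same simplifying assumption as above (each matching clause contributes 1.0 and the total is scaled by the fraction of clauses matched); documents are modelled as field maps and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Toy model of: (title:asparagus) OR (title:asparagus -doctype:bad) */
final class RewrittenQueryDemo {
    /** Builds a document as a map of field name to value. */
    static Map<String, String> doc(String title, String doctype) {
        Map<String, String> d = new HashMap<>();
        d.put("title", title);
        d.put("doctype", doctype);
        return d;
    }

    /** The two optional clauses of the rewritten query. */
    static List<Predicate<Map<String, String>>> query() {
        Predicate<Map<String, String>> title = d -> "asparagus".equals(d.get("title"));
        Predicate<Map<String, String>> notBad = d -> !"bad".equals(d.get("doctype"));
        return Arrays.asList(title, title.and(notBad));
    }

    /** Toy score: (matched / total clauses) * matched, each clause weighing 1.0. */
    static double score(Map<String, String> doc, List<Predicate<Map<String, String>>> clauses) {
        int matched = 0;
        for (Predicate<Map<String, String>> clause : clauses) {
            if (clause.test(doc)) {
                matched++;
            }
        }
        return ((double) matched / clauses.size()) * matched;
    }

    public static void main(String[] args) {
        System.out.println("good=" + score(doc("asparagus", "good"), query()));
        System.out.println("bad=" + score(doc("asparagus", "bad"), query()));
    }
}
```

Good documents match both clauses and bad ones only the first, so the bad ones are still returned but score lower.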

Bng


Re: Demoting results

Posted by Doug Cutting <cu...@apache.org>.

Have you tried assigning these clauses very small boosts (0 < boost < 1) and
assigning other query clauses relatively large boosts (boost > 1)?

Boris Goldowsky wrote:
> Is there any way to build a query where the occurrence of a particular
> Term (in a Keyword field) causes the rank of the document to be
> decreased?  I have various types of documents, and some of them are less
> interesting than others, so I want them to be pushed towards the bottom
> of the results ranking.  However, I do not want to eliminate them
> entirely, so I can't use a boolean not.
> 
> Using negative weights would seem logical here, but apparently has no
> effect on rankings - negative weights appear to be treated as zeros.
> 
> Any ideas would be appreciated.
> 
> Thanks,
> Boris