Posted to general@lucene.apache.org by "tavi.nathanson" <ta...@gmail.com> on 2010/03/02 04:03:24 UTC

Boosting on *unique* term matches without using MUST

Hey everyone,

Let me start with an example query: [apple orange banana]

I would like to heavily boost documents containing a greater number of
unique query terms (apple, orange, banana), without MUST'ing the terms. In
other words, a document containing just 2 unique terms (apple, banana)
should score higher than a document containing 10 or 20 occurrences of the
same term (say, apple 10 times). I'm using SHOULD right now, and TF is
defeating me: documents containing a ton of the *same* term are overpowering
documents with a few unique terms.

Is there a standard way to accomplish what I'm looking for? I can think of
several hacks, but I don't really like them:
- I can do a union of a query with MUST and a query with SHOULD, and boost the
MUST part, but that doesn't help me with a document that contains apple and
banana (but not orange).
- Perhaps I could lower the impact of TF (although I'm not sure what the
best way of doing this would be).
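
[Editor's sketch, not part of the original message.] The second idea, lowering the impact of TF, can be illustrated with a toy scorer. In Lucene 3.0, DefaultSimilarity.tf(float freq) returns the square root of the frequency, and a Similarity subclass could override it. The code below is plain Java, not Lucene code: the class TfFlatteningSketch and its methods are hypothetical names, and idf, length norms, and boosts are deliberately ignored, so the numbers only show the direction of the effect.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy scorer (hypothetical, not Lucene): sums a tf factor over matched query terms. */
class TfFlatteningSketch {

    /** Same formula as Lucene 3.0's DefaultSimilarity.tf(float): sqrt(freq). */
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Flattened variant (an assumption for illustration): any match counts once. */
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    /** Simplified score: sum of tf over the query terms; idf, norms, boosts ignored. */
    static float score(Map<String, Integer> docTermFreqs, String[] queryTerms, boolean flatten) {
        float sum = 0f;
        for (String term : queryTerms) {
            int freq = docTermFreqs.getOrDefault(term, 0);
            sum += flatten ? flatTf(freq) : defaultTf(freq);
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] query = {"apple", "orange", "banana"};

        Map<String, Integer> tenApples = new HashMap<>();
        tenApples.put("apple", 10);

        Map<String, Integer> appleBanana = new HashMap<>();
        appleBanana.put("apple", 1);
        appleBanana.put("banana", 1);

        // With sqrt(tf), the repeated term wins; with flat tf, the two unique terms win.
        System.out.println("sqrt tf: " + score(tenApples, query, false)
                + " vs " + score(appleBanana, query, false));
        System.out.println("flat tf: " + score(tenApples, query, true)
                + " vs " + score(appleBanana, query, true));
    }
}
```

Flattening TF entirely discards frequency information, which may hurt ranking among documents that match the same set of terms; a gentler curve would be a middle ground.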

Thanks so much!


Re: Boosting on *unique* term matches without using MUST

Posted by Doug Cutting <cu...@apache.org>.
This question probably belongs on java-user@, not general@.

That said, coord() might be what you're looking for:

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/Similarity.html#coord%28int,%20int%29

Doug
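
[Editor's sketch, not part of the original reply.] To make the coord() suggestion concrete: in Lucene 3.0, DefaultSimilarity.coord(int overlap, int maxOverlap) returns overlap / maxOverlap, the fraction of query terms a document matches, and the score is multiplied by it. The toy model below is plain Java rather than Lucene code. CoordSketch and steepCoord are hypothetical names, squaring the fraction is only one possible way to steepen the factor, and idf, norms, and boosts are ignored.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy scorer (hypothetical, not Lucene): coord factor times the sum of sqrt(tf). */
class CoordSketch {

    /** Same formula as Lucene 3.0's DefaultSimilarity.coord(int, int). */
    static float defaultCoord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Steeper variant (an assumption for illustration): square the matched fraction. */
    static float steepCoord(int overlap, int maxOverlap) {
        float fraction = overlap / (float) maxOverlap;
        return fraction * fraction;
    }

    /** Simplified score: coord * sum of sqrt(freq) over matched query terms. */
    static float score(Map<String, Integer> docTermFreqs, String[] queryTerms, boolean steep) {
        int overlap = 0;
        float sum = 0f;
        for (String term : queryTerms) {
            int freq = docTermFreqs.getOrDefault(term, 0);
            if (freq > 0) {
                overlap++;
                sum += (float) Math.sqrt(freq);
            }
        }
        float coord = steep
                ? steepCoord(overlap, queryTerms.length)
                : defaultCoord(overlap, queryTerms.length);
        return coord * sum;
    }

    public static void main(String[] args) {
        String[] query = {"apple", "orange", "banana"};

        Map<String, Integer> twentyApples = new HashMap<>();
        twentyApples.put("apple", 20);

        Map<String, Integer> appleBanana = new HashMap<>();
        appleBanana.put("apple", 1);
        appleBanana.put("banana", 1);

        System.out.println("default coord: " + score(twentyApples, query, false)
                + " vs " + score(appleBanana, query, false));
        System.out.println("steep coord:   " + score(twentyApples, query, true)
                + " vs " + score(appleBanana, query, true));
    }
}
```

In this toy model the default coord is not enough to overcome 20 occurrences of one term, while the steeper variant ranks the two-term document first. In Lucene itself the equivalent change would be overriding coord(int, int) in a Similarity subclass; how steep it needs to be depends on the real idf and norm values in the index.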
