Posted to solr-user@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2008/07/01 17:41:58 UTC

negative boosting / analysis?

Hi-

I'm working on a case where we have review text that may include words  
that describe what the item is *not*.

Given the text "the kitten is not clean", searching for "clean" should  
not include (at least at the top) the kitten.

The approach I am considering is to copy the text to a negation field  
and do simple heuristic analysis in a TokenFilter.  This analysis  
would only keep tokens for words that follow "not", then we could add  
a negative boost for this field:
   title^2 content^1 negation^0.1
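
A minimal sketch of the heuristic described above, written in plain Python rather than as a real Lucene TokenFilter. The function name negated_tokens, the NEGATORS set, and the window parameter are illustrative assumptions, not part of Solr or Lucene; the sketch only shows which tokens would end up in the negation field.

```python
import re

# Assumption: only "not" triggers negation, as in the example text.
NEGATORS = {"not"}

def negated_tokens(text, window=1):
    """Return the tokens that fall within `window` words after a negator."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = []
    remaining = 0  # how many upcoming tokens are still considered negated
    for tok in tokens:
        if tok in NEGATORS:
            remaining = window
        elif remaining > 0:
            kept.append(tok)
            remaining -= 1
    return kept

print(negated_tokens("the kitten is not clean"))
```

With the default window of one word, "the kitten is not clean" keeps only "clean", while text without "not" keeps nothing. A wider window would also catch phrases like "not very clean", at the cost of keeping filler words such as "very".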

Does this seem like a reasonable approach?  Any other ideas /  
suggestions / pointers?

thanks
ryan

Re: negative boosting / analysis?

Posted by Chris Hostetter <ho...@fucit.org>.
I've never really tackled anything like this, but a few things to 
consider / watch out for are:

1) if a doc *only* matches because of the negated field do you really 
want to consider it a match?  Even in the case of dismax, the 
minNrShouldMatch aspect is going to consider your negation 
field a factor, so you might find documents being considered a match, even 
though they don't contain enough of the input terms in "normal" fields, 
because some terms are in the negated field.

2) the coord factor might wind up throwing your scores off in weird ways 
... something that matches on the title, the content, and the negation 
field could wind up scoring higher than something that matches only on 
title and content because of coord.
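
The coord concern in point 2 can be illustrated with a deliberately simplified scoring model. This is a toy sketch, not Lucene's actual scoring: it ignores tf, idf and norms, treats each field as one optional clause, and assumes coord is the fraction of clauses that matched. The name toy_score and the BOOSTS table are hypothetical.

```python
# Field boosts taken from the example in the original post.
BOOSTS = {"title": 2.0, "content": 1.0, "negation": 0.1}

def toy_score(matched_fields, boosts=BOOSTS):
    """Sum of boosts for the matched fields, scaled by a coord-like factor."""
    matched = [f for f in boosts if f in matched_fields]
    raw = sum(boosts[f] for f in matched)
    # coord-like factor: matched clauses / total clauses
    return raw * len(matched) / len(boosts)

without_negation = toy_score({"title", "content"})
with_negation = toy_score({"title", "content", "negation"})
print(without_negation, with_negation)
```

Under these assumptions the document that also matches the negation field scores roughly 3.1 against roughly 2.0 for the one that does not, so a small positive boost on the negation field rewards the negated document instead of demoting it.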

There's a "BoostingQuery" in the Lucene queries contrib that (in theory) 
helps with some of this by rewriting to a BooleanQuery with a custom coord 
function, but I've never looked at it closely.
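
As a rough illustration of the demotion idea behind BoostingQuery (a toy model of the effect, not Lucene's implementation), the sketch below multiplies the score from the normal fields by a factor below 1 when the document also matches the negation field. The name demoted_score and the 0.1 factor are assumptions chosen for the example.

```python
def demoted_score(match_score, matches_negation, demote=0.1):
    """Score from the normal fields, scaled down if the negation field matches."""
    if match_score <= 0:
        # A doc that matches only the negation field is not a match at all.
        return 0.0
    return match_score * demote if matches_negation else match_score

print(demoted_score(3.0, matches_negation=False))
print(demoted_score(3.0, matches_negation=True))
```

In this model the negation field never adds to the score and cannot make a document match on its own, which sidesteps both concerns listed above.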

: I'm working on a case where we have review text that may include words that
: describe what the item is *not*.
: 
: Given the text "the kitten is not clean", searching for "clean" should not
: include (at least at the top) the kitten.
: 
: The approach I am considering is to copy the text to a negation field and do
: simple heuristic analysis in a TokenFilter.  This analysis would only keep
: tokens for words that follow "not", then we could add a negative boost for
: this field:
:   title^2 content^1 negation^0.1
: 
: Does this seem like a reasonable approach?  Any other ideas / suggestions /
: pointers?



-Hoss