Posted to solr-user@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2008/07/01 17:41:58 UTC
negative boosting / analysis?
Hi-
I'm working on a case where we have review text that may include words
that describe what the item is *not*.
Given the text "the kitten is not clean", searching for "clean" should
not include (at least at the top) the kitten.
The approach I am considering is to copy the text to a negation field
and do simple heuristic analysis in a TokenFilter. This analysis
would only keep tokens for words that follow "not", then we could add
a negative boost for this field:
title^2 content^1 negation^0.1
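A minimal sketch of the "keep only words that follow not" heuristic, written as plain Java rather than as an actual Lucene TokenFilter so it runs without any dependencies. The class and method names (NegationSketch, keepNegated) are made up for illustration, and the whitespace/punctuation splitting is an assumption standing in for whatever tokenizer the field would really use:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NegationSketch {
    // Hypothetical helper: keep only the tokens that directly follow "not".
    // A real implementation would live in a TokenFilter on the negation field.
    public static List<String> keepNegated(String text) {
        List<String> kept = new ArrayList<>();
        boolean previousWasNot = false;
        // Assumption: lowercase and split on non-word characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            if (previousWasNot && !token.equals("not")) {
                kept.add(token);
            }
            previousWasNot = token.equals("not");
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keepNegated("the kitten is not clean")); // prints [clean]
    }
}
```

With this, "the kitten is not clean" would index only "clean" in the negation field, while "the kitten is clean" would index nothing there.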
Does this seem like a reasonable approach? Any other ideas /
suggestions / pointers?
thanks
ryan
Re: negative boosting / analysis?
Posted by Chris Hostetter <ho...@fucit.org>.
I've never really tackled anything like this, but a few things to
consider / watch out for are:
1) if a doc *only* matches because of the negated field do you really
want to consider it a match? Even in the case of dismax, the
minNrShouldMatch aspect is going to consider your negation
field a factor, so you might find documents being considered a match, even
though they don't contain enough of the input terms in "normal" fields
because some terms are in the negated field.
2) the coord factor might wind up throwing your scores off in weird ways
... something that matches on the title, the content, and the negation
field could wind up scoring higher than something that matches only on
title and content because of coord.
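To make the coord concern concrete, here is a small arithmetic sketch. It assumes a simplified model in which each field is one optional clause of a BooleanQuery, the final score is the sum of the matching clause scores multiplied by coord, and coord is matched clauses divided by total clauses (the default in Lucene's classic similarity). The per-clause scores below are invented numbers that simply reuse the boosts 2, 1 and 0.1; real scores would also include tf/idf and norms:

```java
public class CoordSketch {
    // coord in the classic default similarity: matched clauses / total clauses.
    public static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Simplified BooleanQuery score: sum of matching clause scores times coord.
    public static double score(double[] clauseScores, boolean[] matched) {
        double sum = 0.0;
        int overlap = 0;
        for (int i = 0; i < clauseScores.length; i++) {
            if (matched[i]) {
                sum += clauseScores[i];
                overlap++;
            }
        }
        return sum * coord(overlap, clauseScores.length);
    }

    public static void main(String[] args) {
        // Hypothetical clause scores for title, content, negation.
        double[] clauseScores = {2.0, 1.0, 0.1};
        double allThree = score(clauseScores, new boolean[] {true, true, true});
        double titleAndContent = score(clauseScores, new boolean[] {true, true, false});
        System.out.println("title+content+negation: " + allThree);
        System.out.println("title+content only:     " + titleAndContent);
    }
}
```

Under these assumptions the document that also matches the negation field gets roughly 3.1 * 3/3, while the one that does not gets roughly 3.0 * 2/3, so the "negated" document ranks higher, which is the opposite of the intent.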
There's a "BoostingQuery" in the Lucene queries contrib that (in theory)
helps with some of this by rewriting to a BooleanQuery with a custom coord
function, but I've never looked at it closely.
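A rough model of the demotion idea behind that kind of query, not the contrib class itself: the document's score comes only from the "match" query, and if the document also matches a "context" query (here, the negation field) the score is multiplied by a boost below 1. The names DemoteSketch and demote, and the numbers used, are hypothetical:

```java
public class DemoteSketch {
    // Score from the main query only; matching the context query
    // multiplies it by `boost` instead of adding to it.
    public static double demote(double matchScore, boolean matchesContext, double boost) {
        return matchesContext ? matchScore * boost : matchScore;
    }

    public static void main(String[] args) {
        double boost = 0.1; // assumed demotion factor
        // Hypothetical scores: the kitten review matches "clean" strongly,
        // but also matches the negation field.
        double kitten = demote(3.0, true, boost);
        double puppy = demote(2.0, false, boost);
        System.out.println("kitten (negated): " + kitten);
        System.out.println("puppy:            " + puppy);
    }
}
```

In this model a negation match can only push a document down, and a document that matches nothing but the negation field has no match score to begin with, which sidesteps both concerns above.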
: I'm working on a case where we have review text that may include words that
: describe what the item is *not*.
:
: Given the text "the kitten is not clean", searching for "clean" should not
: include (at least at the top) the kitten.
:
: The approach I am considering is to copy the text to a negation field and do
: simple heuristic analysis in a TokenFilter. This analysis would only keep
: tokens for words that follow "not", then we could add a negative boost for
: this field:
: title^2 content^1 negation^0.1
:
: Does this seem like a reasonable approach? Any other ideas / suggestions /
: pointers?
-Hoss