Posted to solr-user@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2014/09/17 23:18:01 UTC

How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term

The Solr wiki says: "A repeated question is "how can I have the
original term contribute more to the score than the stemmed version"?
In Solr 4.3, the KeywordRepeatFilterFactory has been added to assist
this functionality."

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming

(Full section reproduced below.)

I can see how, in the wiki example reproduced below, both the stemmed
and the original term get indexed, but I don't see how the original
term gets more weight than the stemmed term.  Wouldn't this require a
filter that gives terms with the keyword attribute more weight?

What am I missing?

Tom



---------------------------------------------
"A repeated question is "how can I have the original term contribute
more to the score than the stemmed version"? In Solr 4.3, the
KeywordRepeatFilterFactory has been added to assist this
functionality. This filter emits two tokens for each input token, one
of them is marked with the Keyword attribute. Stemmers that respect
keyword attributes will pass through the token so marked without
change. So the effect of this filter would be to index both the
original word and the stemmed version. The 4 stemmers listed above all
respect the keyword attribute.

For terms that are not changed by stemming, this will result in
duplicate, identical tokens in the document. This can be alleviated by
adding the RemoveDuplicatesTokenFilterFactory.

<fieldType name="text_keyword" class="solr.TextField"
positionIncrementGap="100">
 <analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.KeywordRepeatFilterFactory"/>
   <filter class="solr.PorterStemFilterFactory"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldType>"
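
To make the token flow through this analyzer chain concrete, here is a rough Python sketch that imitates the three filters. It is only a toy model, not Solr or Lucene code: TOY_STEMS is a hypothetical lookup table standing in for the Porter stemmer, and the function names are made up for illustration. It assumes the repeated token shares a position with the original and that duplicates are dropped only when both text and position match.

```python
# Toy stand-in for PorterStemFilterFactory: a tiny lookup table,
# NOT the real Porter algorithm (assumption for illustration only).
TOY_STEMS = {"running": "run", "runs": "run"}

def keyword_repeat(tokens):
    """Emit each token twice at the same position: once flagged as keyword, once not."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok, True))   # keyword copy, stemmer must leave it alone
        out.append((pos, tok, False))  # plain copy, stemmer may change it
    return out

def stem(tokens):
    """Stem only the tokens that are not flagged as keyword."""
    return [(pos, tok if kw else TOY_STEMS.get(tok, tok), kw)
            for pos, tok, kw in tokens]

def remove_duplicates(tokens):
    """Drop a token whose text was already seen at the same position."""
    seen = set()
    out = []
    for pos, tok, _kw in tokens:
        if (pos, tok) not in seen:
            seen.add((pos, tok))
            out.append((pos, tok))
    return out

def analyze(text):
    """Whitespace-tokenize, then apply the three toy filters in order."""
    return remove_duplicates(stem(keyword_repeat(text.split())))

print(analyze("running home"))
# [(0, 'running'), (0, 'run'), (1, 'home')]
```

In this sketch "running" yields both the original and the stemmed form at position 0, while "home" (unchanged by the toy stemmer) is collapsed back to a single token.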

Re: How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term

Posted by Diego Fernandez <di...@redhat.com>.

I'm not 100% sure about this, but I imagine this is what happens:

(using -> to mean "tokenized to")

Suppose that you index:

"I am running home" -> "am run running home"

If you then query "running home" -> "run running home", all three query terms match the document, so it gets a higher score than the query "runs home" -> "run runs home", where only "run" and "home" match.
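
A rough way to check this reasoning is to count how many analyzed query terms also occur in the analyzed document. The Python sketch below does only that. It is a toy model under stated assumptions: TOY_STEMS is a hypothetical lookup table standing in for the stemmer, and real Lucene scoring depends on more than the number of matching terms.

```python
# Hypothetical stand-in for the stemmer (not the real Porter algorithm).
TOY_STEMS = {"running": "run", "runs": "run"}

def analyze(text):
    """Return the set of indexed terms: every word plus its toy stem."""
    terms = set()
    for word in text.split():
        terms.add(word)                       # original form (keyword copy)
        terms.add(TOY_STEMS.get(word, word))  # stemmed form
    return terms

def matching_terms(query, document):
    """Terms produced by analyzing the query that also occur in the document."""
    return analyze(query) & analyze(document)

doc = "I am running home"
print(sorted(matching_terms("running home", doc)))  # ['home', 'run', 'running']
print(sorted(matching_terms("runs home", doc)))     # ['home', 'run']
```

The query that uses the same surface form as the document matches one extra term ("running"), which is where the extra score would come from in this simplified picture.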



-- 
Diego Fernandez - 爱国
Software Engineer
GSS - Diagnostics