Posted to solr-user@lucene.apache.org by Matthew Runo <mr...@zappos.com> on 2008/01/24 18:58:21 UTC

"runs" vs. "running" - Query time vs Index Time stemming

Hello folks..

I'm seeing something that makes total sense to me, but the
pointy-haired bosses don't like it, so I've gotta come up with a
solution. We search a pretty standard product catalog, and due to
stemming, a search for "running shoes" matches things with
"Runs 1/2 a size large" in the product description. I've tried
tweaking the query-time and index-time analyzer settings, below, but
I still get the stemming. Any ideas on how I can make "running" not
match "runs" in product descriptions, while still keeping the words
"run", "runs", "running"... searchable in the product descriptions
(just without stemming them)?

Here's my field config...

         <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
             <analyzer type="index">
                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                 <filter class="solr.WordDelimiterFilterFactory"
                         generateWordParts="1" generateNumberParts="1"
                         catenateWords="1" catenateNumbers="1"
                         catenateAll="0" splitOnCaseChange="1"/>
                 <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                 <filter class="solr.LowerCaseFilterFactory"/>
             </analyzer>
             <analyzer type="query">
                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                 <filter class="solr.WordDelimiterFilterFactory"
                         generateWordParts="0" generateNumberParts="1"
                         catenateWords="0" catenateNumbers="0"
                         catenateAll="0" splitOnCaseChange="1"/>
                 <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                 <filter class="solr.LowerCaseFilterFactory"/>
             </analyzer>
         </fieldType>

I don't think I can use stopwords, because I need to be able to search
on all of these words, just not match "runs" when someone searches for
"running". In most cases the other stemming is fine, and if possible
I'd like to not completely turn it off. Turning it off entirely is,
however, an option. It seems to be a solvable problem though - any
ideas would be greatly appreciated.

Thanks!

Matthew Runo
Software Developer
Zappos.com
702.943.7833

Re: "runs" vs. "running" - Query time vs Index Time stemming

Posted by Ryan McKinley <ry...@gmail.com>.

>                 <filter class="solr.EnglishPorterFilterFactory" 
> protected="protwords.txt"/>

Isn't that what protwords.txt does?
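
For illustration, a minimal sketch of what that protwords.txt could
look like, assuming the goal is only to keep "running" from being
reduced to "run". The word list is a guess at what this catalog
needs, not something taken from the original config. The file holds
one word per line, and words listed in it are passed through the
stemmer unchanged.

```text
# protwords.txt (illustrative entries only)
# Words listed here are not stemmed by EnglishPorterFilterFactory.
running
```

With this entry, a query for "running" would stay "running" while
"runs" in a description would still stem to "run", so the two should
no longer match. Since the same file is referenced by both the index
and query analyzers, existing documents would need to be reindexed
for the change to take effect. One thing to check: in the config
above LowerCaseFilterFactory runs after the stemmer, so a capitalized
"Running" may not match a lowercase entry in the file.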