Posted to solr-user@lucene.apache.org by Stephen Weiss <sw...@stylesight.com> on 2009/06/18 22:38:07 UTC

Destemming snafu

Hi,

I've hit a bit of a problem with destemming and could use some advice.

Right now there is a word in the index called "Stylesight" and another
word, "Stylesightings", which was just added.  When users search for
"Stylesightings", the client really only wants them to get results
that match "Stylesightings" and not "Stylesight", as they are two
[relatively] unrelated things.  However, I'm guessing that, because of
the destemmer, "Stylesightings" becomes "Stylesight" internally, which
results in the "wrong" behavior.

I really don't want to turn off the destemmer; that's like killing an
ant with a nuke.  I was thinking that, since we use both index- and
query-time synonyms, I could make a synonym like this:

"Stylesightings" =>  "xlkje0r923jjfsdf"

or some other random string of un-destemmable junk.  That might work,
but I'm not sure, and reindexing all the affected documents will take
quite some time, so it would be good to know in advance whether this is
even a good idea.

Of course, if there's another, better idea, I'd be very open to that  
too.

Thanks for any suggestions!

--
Steve

Re: Destemming snafu

Posted by Stephen Weiss <sw...@stylesight.com>.
Yes, that's exactly what I needed.  I don't know how I missed that.   
Thank you!

--
Steve

On Jun 18, 2009, at 4:49 PM, Brendan Grainger wrote:

> Are you using Porter Stemming? If so I think you can just specify
> your word in the protwords.txt file (or whatever you've called it).
>
> Check out http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> and the example config for the Porter Stemmer:
>
> <fieldtype name="myfieldtype" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>   </analyzer>
> </fieldtype>
>
> HTH
> Brendan


Re: Destemming snafu

Posted by Brendan Grainger <br...@gmail.com>.
Are you using Porter Stemming? If so I think you can just specify your  
word in the protwords.txt file (or whatever you've called it).

Check out http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters  
and the example config for the Porter Stemmer:

<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>

HTH
Brendan
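
For illustration, a minimal sketch of what the protwords.txt file named in
the config above might contain for this case. It assumes the usual layout of
one word per line with "#" starting a comment line, and it assumes the
stemmer compares each entry against the token exactly as it arrives at the
filter, which is why both casings are listed. Neither the file contents nor
the casing behavior come from the thread itself.

```
# protwords.txt: words listed here are passed through unstemmed.
# Assumption: matching may be case-sensitive, so both casings are listed
# in case a lowercase filter runs earlier in the analyzer chain.
Stylesightings
stylesightings
```

If the field type is used at index time as well as query time, documents
indexed before the change would still hold the stemmed term, so the affected
documents would most likely need to be reindexed for searches on
"Stylesightings" to match them.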
