Posted to solr-user@lucene.apache.org by Jamie Johnson <je...@gmail.com> on 2012/03/08 14:18:26 UTC

Stemmer Question

I was previously using the PorterStemmer to do stemming and ran into
an issue where it was overly aggressive with some words or
abbreviations which I needed to stop.  I have recently switched to
KStem and I believe the issue is less, but I was wondering still if
there was a way to set a number of stop words for which you didn't
want stemming to occur or if there was a way to tell the Stemmer to
store the unstemmed version as well.  So for instance if a query came
in for "Ahmed", the PorterStemmer would turn that into Ahm, while in
this case Ahmed is a name and I want to search that unstemmed.  If
there was a stop word list, I could attempt to compile a list of words
I didn't want stemmed.  Alternatively, if there was a way to also create
a token for the unstemmed word, what went into the index for Ahmed
would be "ahmed" "ahm", so we'd cover both cases.  What are the
drawbacks of providing both?

Re: Stemmer Question

Posted by Jamie Johnson <je...@gmail.com>.

Barring the horrible name I am wondering if folks would be interested
in having something like this as an alternative to the standard
kstemmer.  This is largely based on the SynonymFilter except it builds
tokens using the kstemmer and the original input.  I've created a JIRA
for this to start discussion.  I'd be really interested in
comments/thoughts on this.

https://issues.apache.org/jira/browse/SOLR-3231


On Fri, Mar 9, 2012 at 4:04 PM, Jamie Johnson <je...@gmail.com> wrote:
> So I've thrown something together fairly quickly which is based on
> what Ahmet had sent that I believe will preserve the original token as
> well as the stemmed version.  I didn't go as far as weighting them
> differently using the payloads however.  I am not sure how to use the
> preserveOriginal attribute from WordDelimiterFilterFactory, can anyone
> provide guidance on that?
>
> On Fri, Mar 9, 2012 at 2:53 PM, Jamie Johnson <je...@gmail.com> wrote:
>> Further digging leads me to believe this is not the case.  The Synonym
>> Filter supports this, but the Stemming Filter does not.
>>
>> Ahmet,
>>
>> Would you be willing to provide your filter as well?  I wonder if we
>> can make it aware of the preserveOriginal attribute on
>> WordDelimiterFilterFactory?
>>
>>
>> On Fri, Mar 9, 2012 at 2:27 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> Ok, so I'm digging through the code and I noticed in
>>> org.apache.lucene.analysis.synonym.SynonymFilter there are mentions of
>>> a keepOrig attribute.  Doing some googling led me to
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters which
>>> speaks of an attribute preserveOriginal="1" on
>>> solr.WordDelimiterFilterFactory.  So it seems like I can get the
>>> functionality I am looking for by setting preserveOriginal, is that
>>> correct?
>>>
>>>
>>> On Fri, Mar 9, 2012 at 9:53 AM, Ahmet Arslan <io...@yahoo.com> wrote:
>>>>> I'd be very interested to see how you
>>>>> did this if it is available. Does
>>>>> this seem like something useful to the community at large?
>>>>
>>>> I PMed it to you. Filter is not a big deal. Just modified from {@link org.apache.lucene.wordnet.SynonymTokenFilter}. If requested,  I can provide it publicly too.

Re: Stemmer Question

Posted by Jamie Johnson <je...@gmail.com>.

So I've thrown something together fairly quickly which is based on
what Ahmet had sent that I believe will preserve the original token as
well as the stemmed version.  I didn't go as far as weighting them
differently using the payloads however.  I am not sure how to use the
preserveOriginal attribute from WordDelimiterFilterFactory; can anyone
provide guidance on that?

On Fri, Mar 9, 2012 at 2:53 PM, Jamie Johnson <je...@gmail.com> wrote:
> Further digging leads me to believe this is not the case.  The Synonym
> Filter supports this, but the Stemming Filter does not.
>
> Ahmet,
>
> Would you be willing to provide your filter as well?  I wonder if we
> can make it aware of the preserveOriginal attribute on
> WordDelimiterFilterFactory?
>
>
> On Fri, Mar 9, 2012 at 2:27 PM, Jamie Johnson <je...@gmail.com> wrote:
>> Ok, so I'm digging through the code and I noticed in
>> org.apache.lucene.analysis.synonym.SynonymFilter there are mentions of
>> a keepOrig attribute.  Doing some googling led me to
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters which
>> speaks of an attribute preserveOriginal="1" on
>> solr.WordDelimiterFilterFactory.  So it seems like I can get the
>> functionality I am looking for by setting preserveOriginal, is that
>> correct?
>>
>>
>> On Fri, Mar 9, 2012 at 9:53 AM, Ahmet Arslan <io...@yahoo.com> wrote:
>>>> I'd be very interested to see how you
>>>> did this if it is available. Does
>>>> this seem like something useful to the community at large?
>>>
>>> I PMed it to you. Filter is not a big deal. Just modified from {@link org.apache.lucene.wordnet.SynonymTokenFilter}. If requested,  I can provide it publicly too.

Re: Stemmer Question

Posted by Jamie Johnson <je...@gmail.com>.

Further digging leads me to believe this is not the case.  The Synonym
Filter supports this, but the Stemming Filter does not.

Ahmet,

Would you be willing to provide your filter as well?  I wonder if we
can make it aware of the preserveOriginal attribute on
WordDelimiterFilterFactory?


On Fri, Mar 9, 2012 at 2:27 PM, Jamie Johnson <je...@gmail.com> wrote:
> Ok, so I'm digging through the code and I noticed in
> org.apache.lucene.analysis.synonym.SynonymFilter there are mentions of
> a keepOrig attribute.  Doing some googling led me to
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters which
> speaks of an attribute preserveOriginal="1" on
> solr.WordDelimiterFilterFactory.  So it seems like I can get the
> functionality I am looking for by setting preserveOriginal, is that
> correct?
>
>
> On Fri, Mar 9, 2012 at 9:53 AM, Ahmet Arslan <io...@yahoo.com> wrote:
>>> I'd be very interested to see how you
>>> did this if it is available. Does
>>> this seem like something useful to the community at large?
>>
>> I PMed it to you. Filter is not a big deal. Just modified from {@link org.apache.lucene.wordnet.SynonymTokenFilter}. If requested,  I can provide it publicly too.

Re: Stemmer Question

Posted by Jamie Johnson <je...@gmail.com>.

Ok, so I'm digging through the code and I noticed in
org.apache.lucene.analysis.synonym.SynonymFilter there are mentions of
a keepOrig attribute.  Doing some googling led me to
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters which
speaks of an attribute preserveOriginal="1" on
solr.WordDelimiterFilterFactory.  So it seems like I can get the
functionality I am looking for by setting preserveOriginal, is that
correct?
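
For readers following along, here is a rough sketch of what a
preserveOriginal-style switch means for a delimiter-splitting filter.
It is plain Python, not Solr or Lucene code, and the function name
split_on_delimiters and its splitting rule (break on any
non-alphanumeric character) are assumptions made for illustration.  As
the follow-up in this thread points out, this option concerns word
delimiter splitting and does not by itself cover stemming.

```python
import re


def split_on_delimiters(token, preserve_original=False):
    """Split a token on non-alphanumeric characters (toy rule, an assumption)."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if preserve_original and parts != [token]:
        # Keep the unmodified token alongside the parts it was split into.
        return [token] + parts
    return parts


if __name__ == "__main__":
    print(split_on_delimiters("wi-fi"))
    print(split_on_delimiters("wi-fi", preserve_original=True))
```

A token with nothing to split, such as "ahmed", comes back unchanged
either way, which is why this switch alone would not keep an unstemmed
copy of a word next to its stem.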

On Fri, Mar 9, 2012 at 9:53 AM, Ahmet Arslan <io...@yahoo.com> wrote:
>> I'd be very interested to see how you
>> did this if it is available. Does
>> this seem like something useful to the community at large?
>
> I PMed it to you. Filter is not a big deal. Just modified from {@link org.apache.lucene.wordnet.SynonymTokenFilter}. If requested,  I can provide it publicly too.

Re: Stemmer Question

Posted by Ahmet Arslan <io...@yahoo.com>.

> I'd be very interested to see how you
> did this if it is available. Does
> this seem like something useful to the community at large?

I PMed it to you. The filter is not a big deal; it is just modified from org.apache.lucene.wordnet.SynonymTokenFilter. If requested, I can provide it publicly too.

Re: Stemmer Question

Posted by Jamie Johnson <je...@gmail.com>.

I'd be very interested to see how you did this if it is available. Does
this seem like something useful to the community at large?

On Thursday, March 8, 2012, Ahmet Arslan <io...@yahoo.com> wrote:
>> Thanks the KeywordMarkerFilterFactory
>> seems to be what I was looking
>> for.  I'm still wondering about keeping the unstemmed
>> word as a token
>> though.  While I know that this would increase the
>> index size slightly
>> I wonder what the negative of doing such a thing would
>> be?  Just seems
>> less destructive since I always store the unstemmed version
>> and the
>> stemmed version.  By not storing the unstemmed version
>> there is no way
>> to go back without reindexing. If I wanted to implement this
>> I'm
>> assuming a custom tokenizer would be most appropriate?
>> Does something
>> like this already exist?
>
> Not out-of-the-box. Actually I was using your idea, implemented such
> custom token filter by mixing synonym filter and stem filter. This is
> useful for wildcard queries. And for normal queries, this could rank exact
> matches higher.
>

Re: Stemmer Question

Posted by Ahmet Arslan <io...@yahoo.com>.

> Thanks the KeywordMarkerFilterFactory
> seems to be what I was looking
> for.  I'm still wondering about keeping the unstemmed
> word as a token
> though.  While I know that this would increase the
> index size slightly
> I wonder what the negative of doing such a thing would
> be?  Just seems
> less destructive since I always store the unstemmed version
> and the
> stemmed version.  By not storing the unstemmed version
> there is no way
> to go back without reindexing. If I wanted to implement this
> I'm
> assuming a custom tokenizer would be most appropriate? 
> Does something
> like this already exist?

Not out-of-the-box. Actually, I was already using your idea: I implemented such a custom token filter by combining the synonym filter and the stem filter. This is useful for wildcard queries, and for normal queries it could rank exact matches higher.
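
The idea described above can be sketched roughly as follows.  This is
plain Python and not the filter that was exchanged in this thread: the
names toy_stem and stem_preserving_original are hypothetical, and
toy_stem is a deliberately crude suffix stripper standing in for Porter
or KStem.  The sketch emits the original term first and then stacks the
stem on the same position (a position increment of 0), the way a
synonym would be stacked.

```python
def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_preserving_original(tokens):
    """Yield (term, position_increment) pairs for each input token.

    The original term advances the position by 1.  If stemming changes
    the term, the stem is emitted at the same position (increment 0).
    """
    for token in tokens:
        term = token.lower()
        yield term, 1
        stem = toy_stem(term)
        if stem != term:
            yield stem, 0


if __name__ == "__main__":
    for term, increment in stem_preserving_original(["Ahmed", "walked", "home"]):
        print(term, increment)
```

With both terms indexed, a prefix query such as ahme* can still match
the unstemmed "ahmed", which the stem "ahm" alone would not.  The cost
is the extra terms in the index mentioned earlier in the thread.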

Re: Stemmer Question

Posted by Jamie Johnson <je...@gmail.com>.

Thanks, the KeywordMarkerFilterFactory seems to be what I was looking
for.  I'm still wondering about keeping the unstemmed word as a token
though.  While I know that this would increase the index size slightly,
I wonder what the downside of doing such a thing would be?  It just seems
less destructive, since I would always store both the unstemmed version
and the stemmed version.  By not storing the unstemmed version there is no way
to go back without reindexing. If I wanted to implement this I'm
assuming a custom tokenizer would be most appropriate?  Does something
like this already exist?

On Thu, Mar 8, 2012 at 8:36 AM, Ahmet Arslan <io...@yahoo.com> wrote:
>> I was previously using the
>> PorterStemmer to do stemming and ran into
>> an issue where it was overly aggressive with some words or
>> abbreviations which I needed to stop.  I have recently
>> switched to
>> KStem and I believe the issue is less, but I was wondering
>> still if
>> there was a way to set a number of stop words for which you
>> didn't
>> want stemming to occur or if there was a way to tell the
>> Stemmer to
>> store the unstemmed version as well.  So for instance
>> if a query came
>> in for "Ahmed", the PorterStemmer would turn that into Ahm,
>> while in
>> this case Ahmed is a name and I want to search that
>> unstemmed.  If
>> there was a stop word list I could attempt to compile a list
>> of words
>> I didn't want stem or if there was a way to say also say
>> create a
>> token for the unstemmed word so what went into the index for
>> Ahmed
>> would be "ahmed" "ahm" so we'd cover both cases.  What
>> are the draw
>> backs of providing both.
>
> StemmerOverrideFilterFactory and KeywordMarkerFilterFactory are used for these kind of purposes.
> http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming

Re: Stemmer Question

Posted by Ahmet Arslan <io...@yahoo.com>.

> I was previously using the
> PorterStemmer to do stemming and ran into
> an issue where it was overly aggressive with some words or
> abbreviations which I needed to stop.  I have recently
> switched to
> KStem and I believe the issue is less, but I was wondering
> still if
> there was a way to set a number of stop words for which you
> didn't
> want stemming to occur or if there was a way to tell the
> Stemmer to
> store the unstemmed version as well.  So for instance
> if a query came
> in for "Ahmed", the PorterStemmer would turn that into Ahm,
> while in
> this case Ahmed is a name and I want to search that
> unstemmed.  If
> there was a stop word list I could attempt to compile a list
> of words
> I didn't want stem or if there was a way to say also say
> create a
> token for the unstemmed word so what went into the index for
> Ahmed
> would be "ahmed" "ahm" so we'd cover both cases.  What
> are the draw
> backs of providing both.

StemmerOverrideFilterFactory and KeywordMarkerFilterFactory are used for this kind of purpose.
http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming
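
As a rough illustration of what those two factories are for, the sketch
below mimics the idea in plain Python rather than through the Solr or
Lucene API.  A protected-word set plays the role that
KeywordMarkerFilterFactory serves (exempting terms from stemming), and
an override map plays the role that StemmerOverrideFilterFactory serves
(forcing a chosen stem).  The names PROTECTED, OVERRIDES, toy_stem, and
stem_with_rules, the sample entries, and the order in which the two
rules are checked are all assumptions made for the example.

```python
# Illustrative only: plain Python, not Lucene/Solr code.
PROTECTED = {"ahmed"}            # hypothetical protected-word list
OVERRIDES = {"running": "run"}   # hypothetical word -> stem overrides


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_with_rules(word):
    """Check overrides first, then the protected list, then stem."""
    token = word.lower()
    if token in OVERRIDES:
        return OVERRIDES[token]
    if token in PROTECTED:
        return token
    return toy_stem(token)


if __name__ == "__main__":
    for w in ("Ahmed", "running", "walked"):
        print(w, "->", stem_with_rules(w))
```

In this sketch "Ahmed" stays "ahmed" instead of being cut to "ahm",
which is the behavior asked for in the original question.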