Posted to java-user@lucene.apache.org by Vishnu Mishra <vd...@gmail.com> on 2015/08/25 10:42:36 UTC

Preserve Original Option In Stemming (EnglishMinimalStemFilterFactory).

Hi,

I am working with Lucene 5.2 and trying to index some documents. I am using
EnglishMinimalStemFilterFactory, and I found that it has no option for
keeping the original text as well as the analyzed term in the Lucene index.
WordDelimiterFilterFactory provides a preserveOriginal option to do this. Can
anyone tell me why this option is not provided for stemming? For example, if I
want to store both *Methods* and *Method* in my index, I think there is
no option available in Lucene to do this. I also noticed that if we
place EnglishMinimalStemFilterFactory after WordDelimiterFilterFactory with
the option preserveOriginal="1", then it stores both *Methods* and *Method*.





--
View this message in context: http://lucene.472066.n3.nabble.com/Preserve-Original-Option-In-Stemming-EnglishMinimalStemFilterFactory-tp4225116.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Preserve Original Option In Stemming (EnglishMinimalStemFilterFactory).

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,


> So the "usual" answer is either to use the KeywordRepeatFilterFactory, or
> use a copyField that doesn't stem and when exact matches are required,
> search on that field.

Or, even better, search on both fields (stemmed and unstemmed; I generally also have an ASCII-folded one) with SHOULD. An exact match gets a higher score (because it hits both clauses, the stemmed and the unstemmed field), while a stem-only match automatically gets a lower score (because only one Boolean clause matches).
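
A minimal sketch of the two-field setup, assuming a Solr schema (plain Lucene users would index the same text into two fields with different analyzers). All field and type names here are hypothetical, and the tokenizer and lowercase filter are assumptions, not something stated in this thread:

```xml
<!-- Hypothetical names throughout: text_en_exact, text_en_stem, body, body_exact. -->
<fieldType name="text_en_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Same chain plus the stemmer from the original question. -->
<fieldType name="text_en_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body"       type="text_en_stem"  indexed="true" stored="true"/>
<field name="body_exact" type="text_en_exact" indexed="true" stored="false"/>
<copyField source="body" dest="body_exact"/>

<!-- Example query with two optional (SHOULD) clauses, assuming the default
     operator is OR:
       body:methods body_exact:methods
     A document containing "Methods" can match both clauses; a document
     containing only "Method" can match just the stemmed clause. -->
```

When only exact matches are wanted, the query can be sent to body_exact alone.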

Best,
Uwe
 

Re: Preserve Original Option In Stemming (EnglishMinimalStemFilterFactory).

Posted by Erick Erickson <er...@gmail.com>.

It's actually a real pain to do this right considering all the different
analysis chains. As Modassar says, the KeywordRepeatFilterFactory
is often "good enough". It'll boost the exact match, but it won't actually
guarantee that only exact-match docs are returned.

Ideally, you'd want the option to turn unstemmed match on and off. But
to do that, you have to have some way to signal the analysis chain when
to emit only the original token at _query_ time. So say there's some rule
like "when you see a $ appended to the term, it shouldn't be stemmed
at query time".

Now if WordDelimiterFilterFactory comes before the stemmer (as it really
should), the $ is removed and the signal to not stem is lost. Oops. And
any of the ReplaceFilterFactories often remove such terms. And....

So the "usual" answer is either to use the KeywordRepeatFilterFactory,
or use a copyField that doesn't stem and when exact matches are
required, search on that field.

Best,
Erick


Re: Preserve Original Option In Stemming (EnglishMinimalStemFilterFactory).

Posted by Modassar Ather <mo...@gmail.com>.

> Can anyone tell me why this option is not provided for Stemming.

I am not sure why, but the original token can also be preserved by using
<filter class="solr.KeywordRepeatFilterFactory"/>.
To avoid duplicate tokens in the document, <filter
class="solr.RemoveDuplicatesTokenFilterFactory"/> can be used at the end of
the analysis chain.
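
As a rough sketch of how such a chain might look in a Solr schema: the field type name is hypothetical, and the tokenizer and lowercase filter are assumptions. The point is the ordering, with KeywordRepeatFilterFactory before the stemmer and RemoveDuplicatesTokenFilterFactory last:

```xml
<!-- Hypothetical field type name; tokenizer and lowercasing are assumptions. -->
<fieldType name="text_en_stem_keep_orig" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits every token twice at the same position: one copy flagged
         as a keyword, one copy not flagged. -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <!-- Skips keyword-flagged tokens, so only the unflagged copy is stemmed;
         "methods" yields both "methods" and "method". -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <!-- Drops the second copy when stemming left the term unchanged. -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain both forms end up in one field, so a search for either term can still match stemmed documents; it does not restrict results to exact matches.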

Hope this helps.

Regards,
Modassar
