Posted to java-user@lucene.apache.org by Donna L Gresh <gr...@us.ibm.com> on 2007/10/15 16:19:51 UTC

sanity check on how stemming, stopwords, and snowball analyzer works together

Could those "in the know" comment on my current understanding of stemming 
and stopwords using the snowball analyzer?

In my application, I am using the MoreLikeThis class to find similar 
documents to an input "text blob". There are words in the input text blob 
which are "uninteresting" for my application, so I create a list of these 
words. These words are "uninteresting" no matter what their tense or 
usage, for example, "develop", "developing", "developed", and "developer" 
are all uninteresting and I do not want them included in the search query 
created by the MoreLikeThis class.

My index documents are stemmed using the Snowball analyzer. I do not use 
any stopwords when the documents are indexed (as I would like the choice 
of stopwords to be under user control at search time).

I would like the user to be able to provide to the search application a 
list of "uninteresting" words, and for obvious reasons would like to force 
them to provide only, say, "developer" and have the application understand 
that all variants should be ignored (and I don't want to force them to try 
to guess what the stemmed version of "developer" is).

My first try was to use MoreLikeThis with the Snowball analyzer and a 
simple list of unstemmed stopwords (MoreLikeThis.setAnalyzer and 
MoreLikeThis.setStopWords). However, it appears that the stopwords 
provided to the MoreLikeThis class are compared in an exact way to the 
token stream output by the Snowball filter (where the words have been 
stemmed), so "developer" will not match anything, and all variants pass 
through. Even if I provide the list of unstemmed stopwords to the snowball 
analyzer instead, they are used "as-is" with no stemming performed, so 
"developer" will not remove "developed". 
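
To make the mismatch described above concrete, here is a minimal sketch in 
plain Java. It does not use Lucene at all: toyStem is a hypothetical suffix 
stripper standing in for the Snowball stemmer (it is not the real 
algorithm), and stemThenStop is an illustrative name for "stem first, then 
compare each token against the stop set exactly". Under those assumptions, 
an unstemmed stopword never equals a stemmed token, so nothing is removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ExactStopMatchDemo {

    // Hypothetical stand-in for the Snowball stemmer: strips a few suffixes only.
    static String toyStem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "er"}) {
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Stems every token first, then drops tokens that exactly equal a stopword.
    static List<String> stemThenStop(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String stemmed = toyStem(token);
            if (!stopWords.contains(stemmed)) {
                kept.add(stemmed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("developer", "developed", "search", "engine");
        // The stop set holds the unstemmed word, as a user would type it.
        Set<String> unstemmedStops = new HashSet<>(Arrays.asList("developer"));
        // Prints [develop, develop, search, engine]: "developer" != "develop".
        System.out.println(stemThenStop(input, unstemmedStops));
    }
}
```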

Apparently the following is necessary for my application:
Construct a snowball analyzer with no stopwords. Use the unstemmed 
stopword list with the analyzer to construct a stemmed version of the set 
of stopwords. Use this set of stemmed stopwords as the stopwords input to 
the MoreLikeThis class (where the tokens are compared to the stemmed 
versions after being output from the Snowball analyzer).
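
The steps above can be sketched in plain Java as follows. This is only an 
illustration under stated assumptions, not Lucene code: toyStem is a 
hypothetical stand-in for the Snowball stemmer, and stemStopWords and 
stemThenStop are made-up helper names. In the real application the stemmed 
set would presumably be what gets handed to MoreLikeThis.setStopWords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StemmedStopSetDemo {

    // Hypothetical stand-in for the Snowball stemmer: strips a few suffixes only.
    static String toyStem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "er"}) {
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Runs each user-supplied stopword through the same stemmer used for the text.
    static Set<String> stemStopWords(Collection<String> userStopWords) {
        Set<String> stemmed = new HashSet<>();
        for (String word : userStopWords) {
            stemmed.add(toyStem(word));
        }
        return stemmed;
    }

    // Stems the input tokens, then drops any whose stem is in the stemmed stop set.
    static List<String> stemThenStop(List<String> tokens, Set<String> stemmedStops) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String stemmed = toyStem(token);
            if (!stemmedStops.contains(stemmed)) {
                kept.add(stemmed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The user supplies a single surface form.
        Set<String> stops = stemStopWords(Arrays.asList("developer"));
        List<String> input =
            Arrays.asList("develop", "developing", "developed", "developer", "search");
        // All four variants share the stem "develop", so only "search" survives.
        System.out.println(stemThenStop(input, stops));
    }
}
```

Because both the text and the stop list pass through the same stemmer, the 
user never has to guess what the stemmed form looks like.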

Is my understanding correct?

Donna

Re: sanity check on how stemming, stopwords, and snowball analyzer works together

Posted by Mark Miller <ma...@gmail.com>.

It depends on the order of the filters in your Analyzer. You would want 
to be sure you put the stop word filter before the stemming filter. The 
reason the MoreLikeThis class does not do what you want is that it first 
applies the Analyzer (which stems) and THEN applies its own stop word 
removal. If you pass an Analyzer that removes stop words before stemming, 
you don't have to worry about the stemming at all: the stopword 
'uninteresting' would be removed before the stemming even occurred in the 
analyzer. The tokens from the analyzer would then be fed to the 
MoreLikeThis stop word removal, but you could just leave that list empty, 
as it's too late by then anyway. You would already have done your stop 
word removal in the Analyzer rather than in MoreLikeThis, whose stop word 
removal can only run after the Analyzer has been fully applied to the 
text.

Frankly, I don't know why MoreLikeThis supports its own stopword list. 
You can always do it in a custom analyzer that you pass to MoreLikeThis, 
which gives you more control over when the stopword removal is applied 
(say, before or after stemming). Sugar, I guess.
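
As a rough illustration of what "stop filter before stem filter" means for 
matching, here is a small plain-Java sketch. It is not Lucene code: toyStem 
is a hypothetical stand-in for the Snowball stemmer and stopThenStem is an 
illustrative name for a chain that removes stopwords from the raw tokens 
and only then stems the survivors. In this toy model the stop list is 
compared against the surface forms, so each form to be dropped has to 
appear in the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FilterOrderDemo {

    // Hypothetical stand-in for the Snowball stemmer: strips a few suffixes only.
    static String toyStem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "er"}) {
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Drops exact matches of the raw (lowercased) token first, then stems what is left.
    static List<String> stopThenStem(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase())) {
                kept.add(toyStem(token));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("developer", "developed", "search");

        // Only the listed surface form is removed; "developed" is stemmed and kept.
        Set<String> oneForm = new HashSet<>(Arrays.asList("developer"));
        System.out.println(stopThenStem(input, oneForm));

        // Listing each surface form removes them all before stemming happens.
        Set<String> allForms = new HashSet<>(Arrays.asList("developer", "developed"));
        System.out.println(stopThenStem(input, allForms));
    }
}
```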

- Mark

Donna L Gresh wrote:
> I wasn't sure this:
>> Instead add the stopwords to the analyzer that
>> you pass to MoreLikeThis. That way you can ensure that the analyzer
>> applies the stopword list before stemming
>
> would work, because I don't want to provide all the variants of the
> stopword list-- if I do this, only the one provided will be removed,
> correct?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: sanity check on how stemming, stopwords, and snowball analyzer works together

Posted by Donna L Gresh <gr...@us.ibm.com>.

I wasn't sure this:
> Instead add the stopwords to the analyzer that 
> you pass to MoreLikeThis. That way you can ensure that the analyzer 
> applies the stopword list before stemming 

would work, because I don't want to provide all the variants of the 
stopword list-- if I do this, only the one provided will be removed, 
correct?


Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com


Mark Miller <ma...@gmail.com> wrote on 10/15/2007 10:37:22 AM:

> Sounds right to me.
> 
> The other option I think you have is to not use the MoreLikeThis 
> stopword functionality. Instead add the stopwords to the analyzer that 
> you pass to MoreLikeThis. That way you can ensure that the analyzer 
> applies the stopword list before stemming (The MoreLikeThis stopword 
> removal is implemented so that stopwords are removed after stemming). 
> Then you just have to add 'developer' to the stop list, and you can 
> forget about handling stemmed forms.
> 
> Your method should also work though.
> 
> - Mark

Re: sanity check on how stemming, stopwords, and snowball analyzer works together

Posted by Mark Miller <ma...@gmail.com>.

Sounds right to me.

The other option I think you have is to not use the MoreLikeThis 
stopword functionality. Instead add the stopwords to the analyzer that 
you pass to MoreLikeThis. That way you can ensure that the analyzer 
applies the stopword list before stemming (The MoreLikeThis stopword 
removal is implemented so that stopwords are removed after stemming). 
Then you just have to add 'developer' to the stop list, and you can 
forget about handling stemmed forms.

Your method should also work though.

- Mark
