You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Jong Kim <jk...@sitescape.com> on 2007/07/08 23:12:08 UTC

Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project

Hi,
 
The MoreLikeThis class in Lucene's contrib/queries project performs noise
word filtering based on the case-sensitive comparison of the terms against
the user-supplied stopwords set. 
 
I need this comparison to be case-insensitive, but I don't see any way of
achieving it by extending this class. I would have created a subclass of
MoreLikeThis and override the isNoiseWord() method. However, the problem is
that, neither isNoiseWord() method nor the instance variables referenced
inside that method are declared protected. They are all private. Has anyone
solved this problem without modifying and building MoreLikeThis class
directly?
 
An alternative approach would be to supply a stopwords list containing all
variants of the stop words with all possible mixed cases. Needless to say,
that isn't likely to be a workable solution for many.
 
Ultimately it would be nice if those methods and variables would have been
made protected so that applications could override some of the default
behaviors without having to modify the class directly.
 
Any help would be appreciated.
 
Thanks
/Jong

Re: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project

Posted by Chris Hostetter <ho...@fucit.org>.
: I need this comparison to be case-insensitive, but I don't see any way of
: achieving it by extending this class. I would have created a subclass of
: MoreLikeThis and override the isNoiseWord() method. However, the problem is
: that, neither isNoiseWord() method nor the instance variables referenced
: inside that method are declared protected. They are all private. Has anyone

a more direct problem for you in this appraoch would be that the entire
class is final.

: solved this problem without modifying and building MoreLikeThis class
: directly?
:
: An alternative approach would be to supply a stopwords list containing all
: variants of the stop words with all possible mixed cases. Needless to say,

it looks like MLT doesn't do any processing of the Set you pass to
setStopWords ... the only method it calls is contains(String) so as long
as you only ever put lower case terms in your set you could easily do
something like...

   Set stopWords = new HashSet() {
     public boolean contains(Object o) {
       return super.contains(o.toString().toLowerCase());
     }
   }

: Ultimately it would be nice if those methods and variables would have been
: made protected so that applications could override some of the default
: behaviors without having to modify the class directly.

agreed ... patches welcome :)



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org