Posted to dev@lucene.apache.org by Daniel Naber <lu...@danielnaber.de> on 2005/07/15 13:50:07 UTC

getting Analyzer's stop words

Hi,

I'd like to add the following method to the abstract Analyzer class:

  public abstract Set getStopwords();

This method returns the stop words in use. Subclasses that don't use stop 
words at all will have to return an empty HashSet (or null?).

An interesting question is how PerFieldAnalyzerWrapper could implement this 
method. I think it should return the union of all its analyzers' stop words.
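
To make the proposal concrete, here is a minimal, self-contained sketch. It does not use the real Lucene classes: SketchAnalyzer, SketchStopAnalyzer, SketchPlainAnalyzer and SketchPerFieldWrapper are hypothetical stand-ins, and including the default analyzer's stop words in the union is an assumption. Generics are used for readability; the proposed signature returns a raw Set.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins only; none of these are the real Lucene classes.
class StopwordProposalSketch {

    // Stand-in for the abstract analyzer base class with the proposed method.
    static abstract class SketchAnalyzer {
        public abstract Set<String> getStopwords();
    }

    // An analyzer that removes a fixed set of stop words.
    static class SketchStopAnalyzer extends SketchAnalyzer {
        private final Set<String> stopwords;

        SketchStopAnalyzer(String... words) {
            Set<String> s = new HashSet<>();
            Collections.addAll(s, words);
            this.stopwords = Collections.unmodifiableSet(s);
        }

        @Override
        public Set<String> getStopwords() {
            return stopwords;
        }
    }

    // An analyzer without stop words returns an empty set rather than null,
    // so callers never need a null check.
    static class SketchPlainAnalyzer extends SketchAnalyzer {
        @Override
        public Set<String> getStopwords() {
            return Collections.emptySet();
        }
    }

    // Stand-in for a per-field wrapper: the union of all wrapped analyzers'
    // stop words (assumption: the default analyzer is included as well).
    static class SketchPerFieldWrapper extends SketchAnalyzer {
        private final SketchAnalyzer defaultAnalyzer;
        private final Map<String, SketchAnalyzer> perField = new HashMap<>();

        SketchPerFieldWrapper(SketchAnalyzer defaultAnalyzer) {
            this.defaultAnalyzer = defaultAnalyzer;
        }

        void addAnalyzer(String field, SketchAnalyzer analyzer) {
            perField.put(field, analyzer);
        }

        @Override
        public Set<String> getStopwords() {
            Set<String> union = new HashSet<>(defaultAnalyzer.getStopwords());
            for (SketchAnalyzer analyzer : perField.values()) {
                union.addAll(analyzer.getStopwords());
            }
            return union;
        }
    }

    public static void main(String[] args) {
        SketchPerFieldWrapper wrapper = new SketchPerFieldWrapper(new SketchPlainAnalyzer());
        wrapper.addAnalyzer("title", new SketchStopAnalyzer("the", "a"));
        wrapper.addAnalyzer("body", new SketchStopAnalyzer("a", "of"));
        // Sorted copy so the printed order is stable.
        System.out.println(new TreeSet<>(wrapper.getStopwords()));
    }
}
```

Note that the union loses the information about which field a stop word belongs to, so a word that is only a stop word for one field would be reported for all of them.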

Regards
 Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: getting Analyzer's stop words

Posted by Daniel Naber <lu...@danielnaber.de>.
On Friday 15 July 2005 14:33, Erik Hatcher wrote:

> > This method returns the stop words in use. Subclasses that don't use stop
> > words at all will have to return an empty HashSet (or null?).
> >
> > An interesting question is how PerFieldAnalyzerWrapper could implement this
> > method. I think it should return the union of all its analyzers' stop words.
>
> What use case do you have in mind for this feature?

I need to do some complicated query rewriting with an analyzer that leaves 
the words unchanged but uses the stop words from other analyzers. I've now 
introduced my own class locally that extends Analyzer, and that looks like 
the right solution.

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: getting Analyzer's stop words

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jul 15, 2005, at 7:50 AM, Daniel Naber wrote:
> I'd like to add the following extension to the abstract analyzer class:
>
>   public abstract Set getStopwords();
>
> This method returns the stop words in use. Subclasses that don't use stop
> words at all will have to return an empty HashSet (or null?).
>
> An interesting question is how PerFieldAnalyzerWrapper could implement this
> method. I think it should return the union of all its analyzers' stop words.

What use case do you have in mind for this feature?

I personally find this an extremely awkward proposal.  Stop words may
be field-specific, or may be dynamic.  For example, what about a
MinLengthFilter under an analyzer?  Would all words that get removed
by an analyzer be considered "stop words"?  The idea of removing
stop words is very questionable, especially in the scholarly
domain where I'm applying Lucene.  Just the idea of having words
removed from searching causes scholars to scream!  :)  So I don't see
stop words as a universal analyzer concept at all.

Perhaps there could be a subclass of Analyzer that is designed for
stop word removal, with StopAnalyzer and StandardAnalyzer extending
it.  If you're handed an Analyzer instance and need to know
whether it removes stop words or not, you could do an "instanceof
StopWordRemovalAnalyzer" check.  Perhaps an interface should be used
instead.  Either way, I don't see that method being appropriate at
the Analyzer base class level.
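
A rough sketch of the interface variant, assuming nothing about the real Lucene classes: SketchAnalyzer, SketchStopAnalyzer, SketchKeywordAnalyzer, the StopWordRemoval interface and the stopwordsOf helper are all hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins only; none of these are the real Lucene classes.
class StopwordInterfaceSketch {

    // The base class stays free of any stop word concept.
    static abstract class SketchAnalyzer {
    }

    // Opt-in interface for analyzers that remove a known set of stop words.
    interface StopWordRemoval {
        Set<String> getStopwords();
    }

    // Only analyzers that actually remove stop words implement the interface.
    static class SketchStopAnalyzer extends SketchAnalyzer implements StopWordRemoval {
        private final Set<String> stopwords;

        SketchStopAnalyzer(String... words) {
            Set<String> s = new HashSet<>();
            Collections.addAll(s, words);
            this.stopwords = Collections.unmodifiableSet(s);
        }

        @Override
        public Set<String> getStopwords() {
            return stopwords;
        }
    }

    // An analyzer with no stop word handling carries no extra method.
    static class SketchKeywordAnalyzer extends SketchAnalyzer {
    }

    // Caller-side check: ask for stop words only if the analyzer offers them.
    static Set<String> stopwordsOf(SketchAnalyzer analyzer) {
        if (analyzer instanceof StopWordRemoval) {
            return ((StopWordRemoval) analyzer).getStopwords();
        }
        return Collections.emptySet();
    }

    public static void main(String[] args) {
        System.out.println(stopwordsOf(new SketchStopAnalyzer("the", "of")).size());
        System.out.println(stopwordsOf(new SketchKeywordAnalyzer()).isEmpty());
    }
}
```

With this shape, analyzers that drop tokens for other reasons (a length filter, say) simply don't implement the interface, and nobody has to decide whether those tokens count as stop words.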

     Erik

