You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by eh...@apache.org on 2004/03/10 10:59:57 UTC

cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java

ehatcher    2004/03/10 01:59:57

  Modified:    src/java/org/apache/lucene/analysis StopFilter.java
  Log:
  another refinement to the ongoing StopFilter refactorings
  
  Revision  Changes    Path
  1.8       +7 -7      jakarta-lucene/src/java/org/apache/lucene/analysis/StopFilter.java
  
  Index: StopFilter.java
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/analysis/StopFilter.java,v
  retrieving revision 1.7
  retrieving revision 1.8
  diff -u -r1.7 -r1.8
  --- StopFilter.java	10 Mar 2004 00:18:02 -0000	1.7
  +++ StopFilter.java	10 Mar 2004 09:59:57 -0000	1.8
  @@ -65,7 +65,7 @@
   
   public final class StopFilter extends TokenFilter {
   
  -  private Set table;
  +  private Set stopWords;
   
     /**
      * Constructs a filter which removes words from the input
  @@ -73,7 +73,7 @@
      */
     public StopFilter(TokenStream in, String[] stopWords) {
       super(in);
  -    table = makeStopSet(stopWords);
  +    this.stopWords = makeStopSet(stopWords);
     }
   
     /**
  @@ -84,16 +84,16 @@
      */
     public StopFilter(TokenStream in, Hashtable stopTable) {
       super(in);
  -    table = stopTable.keySet();
  +    stopWords = new HashSet(stopTable.keySet());
     }
   
     /**
      * Constructs a filter which removes words from the input
      * TokenStream that are named in the Set.
      */
  -  public StopFilter(TokenStream in, Set stopTable) {
  +  public StopFilter(TokenStream in, Set stopWords) {
       super(in);
  -    table = stopTable;
  +    this.stopWords = new HashSet(stopWords);
     }
   
     /**
  @@ -114,7 +114,7 @@
     /**
      * Builds a Set from an array of stop words,
      * appropriate for passing into the StopFilter constructor.
  -   * This permits this table construction to be cached once when
  +   * This permits this stopWords construction to be cached once when
      * an Analyzer is constructed.
      */
     public static final Set makeStopSet(String[] stopWords) {
  @@ -130,7 +130,7 @@
     public final Token next() throws IOException {
       // return the first non-stop word found
       for (Token token = input.next(); token != null; token = input.next())
  -      if (!table.contains(token.termText))
  +      if (!stopWords.contains(token.termText))
           return token;
       // reached EOS -- return null
       return null;
  
  
  

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 10, 2004, at 1:08 PM, Doug Cutting wrote:
> ehatcher@apache.org wrote:
>>   -  public StopFilter(TokenStream in, Set stopTable) {
>>   +  public StopFilter(TokenStream in, Set stopWords) {
>>        super(in);
>>   -    table = stopTable;
>>   +    this.stopWords = new HashSet(stopWords);
>>      }
>
> This always allocates a new HashSet, which, if the stop list is large, 
> and documents are small, could impact performance.

Ok, after some more thinking on this, part of the dilemma is also that 
analyzers generally construct all of the tokenizers/tokenfilters in the 
tokenStream method.  It would seem better for them to keep instance 
variables for all the non-variant pieces.

With the change to HashSet, any custom analyzers (once the dust settles 
on this change, I'll convert the built-in code to use the new methods) 
will be using the Hashtable ctor thinking it is the most efficient one 
and now it is not.  Is this a problem?

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java

Posted by Otis Gospodnetic <ot...@yahoo.com>.
HashSet is fine, and the best we've currently got, but let the caller
of StopFilter choose that.  Other, better, Set implementations may be
widely available one day.

Otis

--- Erik Hatcher <er...@ehatchersolutions.com> wrote:
> This, too, is my preferred approach.... it just seemed such a strong 
> argument for HashSet everywhere.   I particularly agree with the lack
> 
> of need to enforce HashSet.
> 
> 	Erik
> 
> On Mar 11, 2004, at 8:29 AM, Otis Gospodnetic wrote:
> 
> > I agree, partially, with Doug about not copying thing and that use
> of
> > instanceof.
> >
> > The part where I don't agree is where I agree with what Scott Ganyo
> > said, and with Erik's initial approach: use interfaces.  I don't
> see a
> > need to epxose that HashSet.  Just use Set.
> >
> > Well, maybe not even an internal HashSet enforcement needs to be
> made.
> > Why not leave it up to the caller to pick the Set implementation
> that
> > it wants to use?  Why enforce it in StopFilter?
> >
> > I'm for:
> >
> > public StopFilter(TokenStream in, Set stopWords) {
> >     super(in);
> >     this.stopWords = stopWords;
> >
> >
> > Otis (didn't follow the discussion closely, sorry if I repeated
> > somebody else's words or if I'm way off) Gospodnetic
> >
> >
> >
> > --- Doug Cutting <cu...@apache.org> wrote:
> >> ehatcher@apache.org wrote:
> >>>   -  public StopFilter(TokenStream in, Set stopTable) {
> >>>   +  public StopFilter(TokenStream in, Set stopWords) {
> >>>        super(in);
> >>>   -    table = stopTable;
> >>>   +    this.stopWords = new HashSet(stopWords);
> >>>      }
> >>
> >> This always allocates a new HashSet, which, if the stop list is
> >> large,
> >> and documents are small, could impact performance.
> >>
> >> Perhaps we can replace this with something like:
> >>
> >> public StopFilter(TokenStream in, Set stopWords) {
> >>    this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
> >>             : new HashSet(stopWords));
> >> }
> >>
> >> and then add another constructor:
> >>
> >> private StopFilter(TokenStream in, HashSet stopWords) {
> >>    super(in);
> >>    this.stopWords = stopTable;
> >> }
> >>
> >> Also, if we want the implementation to always be a HashSet
> >> internally,
> >> for performance, we ought to declare the field to be a HashSet,
> no?
> >>
> >> The competing goals here are:
> >>    1. Not to expose publicly the implementation of the Set;
> >>    2. Not to copy the contents of the Set when folks pass the
> value
> >> of
> >> makeStopSet.
> >>    3. Use the most efficient implementation internally.
> >>
> >> I think the changes above meet all of these.
> >>
> >> Doug
> >
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
This, too, is my preferred approach.... it just seemed such a strong 
argument for HashSet everywhere.   I particularly agree with the lack 
of need to enforce HashSet.

	Erik

On Mar 11, 2004, at 8:29 AM, Otis Gospodnetic wrote:

> I agree, partially, with Doug about not copying thing and that use of
> instanceof.
>
> The part where I don't agree is where I agree with what Scott Ganyo
> said, and with Erik's initial approach: use interfaces.  I don't see a
> need to epxose that HashSet.  Just use Set.
>
> Well, maybe not even an internal HashSet enforcement needs to be made.
> Why not leave it up to the caller to pick the Set implementation that
> it wants to use?  Why enforce it in StopFilter?
>
> I'm for:
>
> public StopFilter(TokenStream in, Set stopWords) {
>     super(in);
>     this.stopWords = stopWords;
>
>
> Otis (didn't follow the discussion closely, sorry if I repeated
> somebody else's words or if I'm way off) Gospodnetic
>
>
>
> --- Doug Cutting <cu...@apache.org> wrote:
>> ehatcher@apache.org wrote:
>>>   -  public StopFilter(TokenStream in, Set stopTable) {
>>>   +  public StopFilter(TokenStream in, Set stopWords) {
>>>        super(in);
>>>   -    table = stopTable;
>>>   +    this.stopWords = new HashSet(stopWords);
>>>      }
>>
>> This always allocates a new HashSet, which, if the stop list is
>> large,
>> and documents are small, could impact performance.
>>
>> Perhaps we can replace this with something like:
>>
>> public StopFilter(TokenStream in, Set stopWords) {
>>    this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>>             : new HashSet(stopWords));
>> }
>>
>> and then add another constructor:
>>
>> private StopFilter(TokenStream in, HashSet stopWords) {
>>    super(in);
>>    this.stopWords = stopTable;
>> }
>>
>> Also, if we want the implementation to always be a HashSet
>> internally,
>> for performance, we ought to declare the field to be a HashSet, no?
>>
>> The competing goals here are:
>>    1. Not to expose publicly the implementation of the Set;
>>    2. Not to copy the contents of the Set when folks pass the value
>> of
>> makeStopSet.
>>    3. Use the most efficient implementation internally.
>>
>> I think the changes above meet all of these.
>>
>> Doug
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I agree, partially, with Doug about not copying thing and that use of
instanceof.

The part where I don't agree is where I agree with what Scott Ganyo
said, and with Erik's initial approach: use interfaces.  I don't see a
need to epxose that HashSet.  Just use Set.

Well, maybe not even an internal HashSet enforcement needs to be made. 
Why not leave it up to the caller to pick the Set implementation that
it wants to use?  Why enforce it in StopFilter?

I'm for:

public StopFilter(TokenStream in, Set stopWords) {
    super(in);
    this.stopWords = stopWords;


Otis (didn't follow the discussion closely, sorry if I repeated
somebody else's words or if I'm way off) Gospodnetic



--- Doug Cutting <cu...@apache.org> wrote:
> ehatcher@apache.org wrote:
> >   -  public StopFilter(TokenStream in, Set stopTable) {
> >   +  public StopFilter(TokenStream in, Set stopWords) {
> >        super(in);
> >   -    table = stopTable;
> >   +    this.stopWords = new HashSet(stopWords);
> >      }
> 
> This always allocates a new HashSet, which, if the stop list is
> large, 
> and documents are small, could impact performance.
> 
> Perhaps we can replace this with something like:
> 
> public StopFilter(TokenStream in, Set stopWords) {
>    this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>             : new HashSet(stopWords));
> }
> 
> and then add another constructor:
> 
> private StopFilter(TokenStream in, HashSet stopWords) {
>    super(in);
>    this.stopWords = stopTable;
> }
> 
> Also, if we want the implementation to always be a HashSet
> internally, 
> for performance, we ought to declare the field to be a HashSet, no?
> 
> The competing goals here are:
>    1. Not to expose publicly the implementation of the Set;
>    2. Not to copy the contents of the Set when folks pass the value
> of 
> makeStopSet.
>    3. Use the most efficient implementation internally.
> 
> I think the changes above meet all of these.
> 
> Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java

Posted by Doug Cutting <cu...@apache.org>.
ehatcher@apache.org wrote:
>   -  public StopFilter(TokenStream in, Set stopTable) {
>   +  public StopFilter(TokenStream in, Set stopWords) {
>        super(in);
>   -    table = stopTable;
>   +    this.stopWords = new HashSet(stopWords);
>      }

This always allocates a new HashSet, which, if the stop list is large, 
and documents are small, could impact performance.

Perhaps we can replace this with something like:

public StopFilter(TokenStream in, Set stopWords) {
   this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
            : new HashSet(stopWords));
}

and then add another constructor:

private StopFilter(TokenStream in, HashSet stopWords) {
   super(in);
   this.stopWords = stopTable;
}

Also, if we want the implementation to always be a HashSet internally, 
for performance, we ought to declare the field to be a HashSet, no?

The competing goals here are:
   1. Not to expose publicly the implementation of the Set;
   2. Not to copy the contents of the Set when folks pass the value of 
makeStopSet.
   3. Use the most efficient implementation internally.

I think the changes above meet all of these.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org