You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by eh...@apache.org on 2004/03/10 10:59:57 UTC
cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
ehatcher 2004/03/10 01:59:57
Modified: src/java/org/apache/lucene/analysis StopFilter.java
Log:
another refinement to the ongoing StopFilter refactorings
Revision Changes Path
1.8 +7 -7 jakarta-lucene/src/java/org/apache/lucene/analysis/StopFilter.java
Index: StopFilter.java
===================================================================
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/analysis/StopFilter.java,v
retrieving revision 1.7
retrieving revision 1.8
diff -u -r1.7 -r1.8
--- StopFilter.java 10 Mar 2004 00:18:02 -0000 1.7
+++ StopFilter.java 10 Mar 2004 09:59:57 -0000 1.8
@@ -65,7 +65,7 @@
public final class StopFilter extends TokenFilter {
- private Set table;
+ private Set stopWords;
/**
* Constructs a filter which removes words from the input
@@ -73,7 +73,7 @@
*/
public StopFilter(TokenStream in, String[] stopWords) {
super(in);
- table = makeStopSet(stopWords);
+ this.stopWords = makeStopSet(stopWords);
}
/**
@@ -84,16 +84,16 @@
*/
public StopFilter(TokenStream in, Hashtable stopTable) {
super(in);
- table = stopTable.keySet();
+ stopWords = new HashSet(stopTable.keySet());
}
/**
* Constructs a filter which removes words from the input
* TokenStream that are named in the Set.
*/
- public StopFilter(TokenStream in, Set stopTable) {
+ public StopFilter(TokenStream in, Set stopWords) {
super(in);
- table = stopTable;
+ this.stopWords = new HashSet(stopWords);
}
/**
@@ -114,7 +114,7 @@
/**
* Builds a Set from an array of stop words,
* appropriate for passing into the StopFilter constructor.
- * This permits this table construction to be cached once when
+ * This permits this stopWords construction to be cached once when
* an Analyzer is constructed.
*/
public static final Set makeStopSet(String[] stopWords) {
@@ -130,7 +130,7 @@
public final Token next() throws IOException {
// return the first non-stop word found
for (Token token = input.next(); token != null; token = input.next())
- if (!table.contains(token.termText))
+ if (!stopWords.contains(token.termText))
return token;
// reached EOS -- return null
return null;
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 10, 2004, at 1:08 PM, Doug Cutting wrote:
> ehatcher@apache.org wrote:
>> - public StopFilter(TokenStream in, Set stopTable) {
>> + public StopFilter(TokenStream in, Set stopWords) {
>> super(in);
>> - table = stopTable;
>> + this.stopWords = new HashSet(stopWords);
>> }
>
> This always allocates a new HashSet, which, if the stop list is large,
> and documents are small, could impact performance.
Ok, after some more thinking on this, part of the dilemma is also that
analyzers generally construct all of the tokenizers/tokenfilters in the
tokenStream method. It would seem better for them to keep instance
variables for all the non-variant pieces.
With the change to HashSet, any custom analyzers (once the dust settles
on this change, I'll convert the built-in code to use the new methods)
will be using the Hashtable ctor thinking it is the most efficient one
and now it is not. Is this a problem?
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
Posted by Otis Gospodnetic <ot...@yahoo.com>.
HashSet is fine, and the best we've currently got, but let the caller
of StopFilter choose that. Other, better, Set implementations may be
widely available one day.
Otis
--- Erik Hatcher <er...@ehatchersolutions.com> wrote:
> This, too, is my preferred approach.... it just seemed such a strong
> argument for HashSet everywhere. I particularly agree with the lack
>
> of need to enforce HashSet.
>
> Erik
>
> On Mar 11, 2004, at 8:29 AM, Otis Gospodnetic wrote:
>
> > I agree, partially, with Doug about not copying thing and that use
> of
> > instanceof.
> >
> > The part where I don't agree is where I agree with what Scott Ganyo
> > said, and with Erik's initial approach: use interfaces. I don't
> see a
> > need to epxose that HashSet. Just use Set.
> >
> > Well, maybe not even an internal HashSet enforcement needs to be
> made.
> > Why not leave it up to the caller to pick the Set implementation
> that
> > it wants to use? Why enforce it in StopFilter?
> >
> > I'm for:
> >
> > public StopFilter(TokenStream in, Set stopWords) {
> > super(in);
> > this.stopWords = stopWords;
> >
> >
> > Otis (didn't follow the discussion closely, sorry if I repeated
> > somebody else's words or if I'm way off) Gospodnetic
> >
> >
> >
> > --- Doug Cutting <cu...@apache.org> wrote:
> >> ehatcher@apache.org wrote:
> >>> - public StopFilter(TokenStream in, Set stopTable) {
> >>> + public StopFilter(TokenStream in, Set stopWords) {
> >>> super(in);
> >>> - table = stopTable;
> >>> + this.stopWords = new HashSet(stopWords);
> >>> }
> >>
> >> This always allocates a new HashSet, which, if the stop list is
> >> large,
> >> and documents are small, could impact performance.
> >>
> >> Perhaps we can replace this with something like:
> >>
> >> public StopFilter(TokenStream in, Set stopWords) {
> >> this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
> >> : new HashSet(stopWords));
> >> }
> >>
> >> and then add another constructor:
> >>
> >> private StopFilter(TokenStream in, HashSet stopWords) {
> >> super(in);
> >> this.stopWords = stopTable;
> >> }
> >>
> >> Also, if we want the implementation to always be a HashSet
> >> internally,
> >> for performance, we ought to declare the field to be a HashSet,
> no?
> >>
> >> The competing goals here are:
> >> 1. Not to expose publicly the implementation of the Set;
> >> 2. Not to copy the contents of the Set when folks pass the
> value
> >> of
> >> makeStopSet.
> >> 3. Use the most efficient implementation internally.
> >>
> >> I think the changes above meet all of these.
> >>
> >> Doug
> >
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
This, too, is my preferred approach.... it just seemed such a strong
argument for HashSet everywhere. I particularly agree with the lack
of need to enforce HashSet.
Erik
On Mar 11, 2004, at 8:29 AM, Otis Gospodnetic wrote:
> I agree, partially, with Doug about not copying thing and that use of
> instanceof.
>
> The part where I don't agree is where I agree with what Scott Ganyo
> said, and with Erik's initial approach: use interfaces. I don't see a
> need to epxose that HashSet. Just use Set.
>
> Well, maybe not even an internal HashSet enforcement needs to be made.
> Why not leave it up to the caller to pick the Set implementation that
> it wants to use? Why enforce it in StopFilter?
>
> I'm for:
>
> public StopFilter(TokenStream in, Set stopWords) {
> super(in);
> this.stopWords = stopWords;
>
>
> Otis (didn't follow the discussion closely, sorry if I repeated
> somebody else's words or if I'm way off) Gospodnetic
>
>
>
> --- Doug Cutting <cu...@apache.org> wrote:
>> ehatcher@apache.org wrote:
>>> - public StopFilter(TokenStream in, Set stopTable) {
>>> + public StopFilter(TokenStream in, Set stopWords) {
>>> super(in);
>>> - table = stopTable;
>>> + this.stopWords = new HashSet(stopWords);
>>> }
>>
>> This always allocates a new HashSet, which, if the stop list is
>> large,
>> and documents are small, could impact performance.
>>
>> Perhaps we can replace this with something like:
>>
>> public StopFilter(TokenStream in, Set stopWords) {
>> this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>> : new HashSet(stopWords));
>> }
>>
>> and then add another constructor:
>>
>> private StopFilter(TokenStream in, HashSet stopWords) {
>> super(in);
>> this.stopWords = stopTable;
>> }
>>
>> Also, if we want the implementation to always be a HashSet
>> internally,
>> for performance, we ought to declare the field to be a HashSet, no?
>>
>> The competing goals here are:
>> 1. Not to expose publicly the implementation of the Set;
>> 2. Not to copy the contents of the Set when folks pass the value
>> of
>> makeStopSet.
>> 3. Use the most efficient implementation internally.
>>
>> I think the changes above meet all of these.
>>
>> Doug
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
Posted by Otis Gospodnetic <ot...@yahoo.com>.
I agree, partially, with Doug about not copying thing and that use of
instanceof.
The part where I don't agree is where I agree with what Scott Ganyo
said, and with Erik's initial approach: use interfaces. I don't see a
need to epxose that HashSet. Just use Set.
Well, maybe not even an internal HashSet enforcement needs to be made.
Why not leave it up to the caller to pick the Set implementation that
it wants to use? Why enforce it in StopFilter?
I'm for:
public StopFilter(TokenStream in, Set stopWords) {
super(in);
this.stopWords = stopWords;
Otis (didn't follow the discussion closely, sorry if I repeated
somebody else's words or if I'm way off) Gospodnetic
--- Doug Cutting <cu...@apache.org> wrote:
> ehatcher@apache.org wrote:
> > - public StopFilter(TokenStream in, Set stopTable) {
> > + public StopFilter(TokenStream in, Set stopWords) {
> > super(in);
> > - table = stopTable;
> > + this.stopWords = new HashSet(stopWords);
> > }
>
> This always allocates a new HashSet, which, if the stop list is
> large,
> and documents are small, could impact performance.
>
> Perhaps we can replace this with something like:
>
> public StopFilter(TokenStream in, Set stopWords) {
> this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
> : new HashSet(stopWords));
> }
>
> and then add another constructor:
>
> private StopFilter(TokenStream in, HashSet stopWords) {
> super(in);
> this.stopWords = stopTable;
> }
>
> Also, if we want the implementation to always be a HashSet
> internally,
> for performance, we ought to declare the field to be a HashSet, no?
>
> The competing goals here are:
> 1. Not to expose publicly the implementation of the Set;
> 2. Not to copy the contents of the Set when folks pass the value
> of
> makeStopSet.
> 3. Use the most efficient implementation internally.
>
> I think the changes above meet all of these.
>
> Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis
StopFilter.java
Posted by Doug Cutting <cu...@apache.org>.
ehatcher@apache.org wrote:
> - public StopFilter(TokenStream in, Set stopTable) {
> + public StopFilter(TokenStream in, Set stopWords) {
> super(in);
> - table = stopTable;
> + this.stopWords = new HashSet(stopWords);
> }
This always allocates a new HashSet, which, if the stop list is large,
and documents are small, could impact performance.
Perhaps we can replace this with something like:
public StopFilter(TokenStream in, Set stopWords) {
this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
: new HashSet(stopWords));
}
and then add another constructor:
private StopFilter(TokenStream in, HashSet stopWords) {
super(in);
this.stopWords = stopTable;
}
Also, if we want the implementation to always be a HashSet internally,
for performance, we ought to declare the field to be a HashSet, no?
The competing goals here are:
1. Not to expose publicly the implementation of the Set;
2. Not to copy the contents of the Set when folks pass the value of
makeStopSet.
3. Use the most efficient implementation internally.
I think the changes above meet all of these.
Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org