Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2004/03/01 21:47:11 UTC
StrlenFilter contribution and discussion
Out of curiosity, does anyone use a Filter based on string (token)
length? The use case is, say, indexing email messages: if an
attachment is uuencoded into lines of 60 or so characters, you
don't want to index tokens so long that they can't possibly be
of use later and would just eat up disk space.
Please feel free to add this to the sandbox with whatever license is appropriate.
The code is easy:
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Removes words that are too long or too short from the stream.
 */
public final class StrlenFilter
    extends TokenFilter
{
    /**
     * Build a filter that removes words that are too long or too short
     * from the text. Both bounds are inclusive.
     */
    public StrlenFilter(TokenStream in, int min, int max)
    {
        input = in;
        this.min = min;
        this.max = max;
    }

    /**
     * Returns the next input Token whose termText() length is within
     * [min, max], or null at end of stream.
     */
    public final Token next() throws IOException
    {
        // return the first token whose length is in range
        for (Token token = input.next(); token != null; token = input.next())
        {
            final int len = token.termText().length();
            if (len >= min && len <= max)
                return token;
            // note: else we ignore it but should we index each part of it?
        }
        // reached EOS -- return null
        return null;
    }

    private final int min; // shortest token length to keep
    private final int max; // longest token length to keep
}
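
To see the effect of the length check without a Lucene index, here is a minimal standalone sketch of the same inclusive min/max test applied to a plain list of strings. It is not part of the posted filter: the class name LengthFilterSketch, the method keepByLength, and the sample tokens are made up for illustration, and the 60-character blob only stands in for one uuencoded line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LengthFilterSketch {

    /**
     * Keeps only the tokens whose length is within [min, max], inclusive.
     * This mirrors the test in StrlenFilter.next(), but on plain strings
     * (hypothetical helper, not a Lucene API).
     */
    public static List<String> keepByLength(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            int len = token.length();
            if (len >= min && len <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Builds a string of n copies of 'M', standing in for a uuencoded line. */
    public static String blob(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('M');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("meeting", "at", "noon", blob(60), "a");
        // With min=2 and max=20, the 60-character blob and the
        // one-character token are both dropped.
        System.out.println(keepByLength(tokens, 2, 20)); // prints [meeting, at, noon]
    }
}
```

The bounds 2 and 20 are arbitrary; in practice max would be chosen just above the longest token you expect to search for.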
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: StrlenFilter contribution and discussion
Posted by Otis Gospodnetic <ot...@yahoo.com>.
I remember you sending this once before, a long time ago.
This time I'll stick it in sandbox/contributions/analyzers/...
Thanks!
Otis