Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2004/03/01 21:47:11 UTC

StrlenFilter contribution and discussion

Out of curiosity - does anyone use a Filter based on string (token)
length? The use case: say you're indexing email messages, and an
attachment is uuencoded into lines of 60 or so characters. You don't
want to index tokens that long, since they can't possibly be of use
later and just eat up disk space.

Please feel free to add this to the sandbox with whatever license is appropriate.

The code is easy:

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Removes words that are too long or too short from the stream.
 */
public final class StrlenFilter
	extends TokenFilter
{
	private final int min;
	private final int max;

	/**
	 * Build a filter that removes from the text words shorter than
	 * min or longer than max.
	 */
	public StrlenFilter(TokenStream in, int min, int max)
	{
		super(in);
		this.min = min;
		this.max = max;
	}

	/**
	 * Returns the next input Token whose termText() has an
	 * acceptable length.
	 */
	public final Token next() throws IOException
	{
		// return the first token whose length falls within [min, max]
		for (Token token = input.next(); token != null; token = input.next())
		{
			final int len = token.termText().length();
			if (len >= min && len <= max)
				return token;
			// note: tokens outside the range are silently dropped --
			// should we index each part of an over-long token instead?
		}
		// reached end of stream -- return null
		return null;
	}
}
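The filtering logic itself is easy to exercise outside Lucene. Here is a plain-Java sketch of the same technique over a list of token strings (the class and method names here are invented for illustration and are not part of the contribution above):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StrlenDemo {
    /** Keep only tokens whose length falls within [min, max], inclusive. */
    static List<String> filterByLength(List<String> tokens, int min, int max) {
        return tokens.stream()
                .filter(t -> t.length() >= min && t.length() <= max)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A one-letter token, a normal word, and a 60-char uuencoded-style line.
        List<String> tokens = Arrays.asList(
                "a",
                "short",
                "M<F5A9&%B;&4@=&5X=\"!T:&%T('=A<R!U=65N8V]D960@:6YT;R!L:6YE<P#A");
        // With min=2 and max=20, only "short" survives.
        System.out.println(filterByLength(tokens, 2, 20));
    }
}
```

With min=2 and max=20 the single-character token and the uuencoded line are both dropped, which is exactly the disk-space argument from the use case above.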

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: StrlenFilter contribution and discussion

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I remember you sending this once before... a long time ago.
This time I'll stick it in sandbox/contributions/analyzers/...
Thanks!

Otis

--- David Spencer <da...@tropo.com> wrote:
> [...]

