Posted to general@lucene.apache.org by ayyanar <ay...@aspiresys.com> on 2009/01/05 18:23:35 UTC

Tokenizer Question

I need a tokenizer that tokenizes a keyword as follows: consider the example
"President day" - this should be tokenized as "President day", "President",
"day".
This seems to combine the functionality of a keyword tokenizer and a
whitespace tokenizer. Do we have any tokenizer that does this job, or do we
need to write a custom one?

-- 
View this message in context: http://www.nabble.com/Tokenizer-Question-tp21295325p21295325.html
Sent from the Lucene - General mailing list archive at Nabble.com.


Re: Tokenizer Question

Posted by Julio Oliveira <ju...@gmail.com>.
Do a while loop over a StringTokenizer:
new StringTokenizer(input, " ");  (this returns the tokens of the string,
split on spaces).
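A quick plain-Java sketch of that suggestion (class name corrected to java.util.StringTokenizer; the `splitOnSpaces` helper and class name are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Split an input string on spaces using java.util.StringTokenizer,
    // looping with hasMoreTokens()/nextToken() as suggested above.
    static List<String> splitOnSpaces(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, " ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitOnSpaces("President day")); // [President, day]
    }
}
```

Note this only gives the individual terms; it does not emit the whole keyword as one token, which is the other half of what was asked for.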

jOliveira

-- 
Saludos

Julio Oliveira - Buenos Aires

julio.julioOliveira@gmail.com

http://www.linkedin.com/in/juliomoliveira

RE: Tokenizer Question

Posted by Steven A Rowe <sa...@syr.edu>.
Hi ayyanar,

I should have mentioned in my previous email that the general@lucene.apache.org mailing list has very few subscribers - you'll get much better response on the java-user@l.a.o mailing list.

On 01/05/2009 at 3:07 PM, ayyanar wrote:
> My objective is to retain the keyword (input stream) as a single token,
> like a keyword tokenizer does, and also split the keyword on whitespace
> and keep those tokens, as a whitespace tokenizer does

Right, ShingleFilter won't do this for you.

The following, if used to filter WhitespaceTokenizer's output, is similar to what you want (note: untested, and also note that this assumes you're using Lucene v2.4.0, and not a recent trunk version, which includes the new TokenStream API introduced with LUCENE-1422: <https://issues.apache.org/jira/browse/LUCENE-1422>):

-----

/**
 * Extends CachingTokenFilter to output a space-separated-
 * concatenated-all-input-stream-terms token, followed by
 * all of the original input stream tokens.
 * One for all and (then) all for one!
 */
public class ThreeMusketeersFilter extends CachingTokenFilter {

  private boolean concatenatedTokenOutput = false;

  public ThreeMusketeersFilter(TokenStream input) {
    super(input);
  }
  
  public Token next(final Token reusableToken) throws IOException {
    assert reusableToken != null;
    if (concatenatedTokenOutput) {
      return super.next(reusableToken);
    } else {
      concatenatedTokenOutput = true;
      Token firstToken = super.next(reusableToken);
      if (firstToken == null) {
        return null;
      }
      StringBuffer buffer = new StringBuffer();
      // Append only the valid portion of the term buffer -
      // termBuffer() returns the internal array, which may be
      // longer than the term itself.
      buffer.append(firstToken.termBuffer(), 0, firstToken.termLength());
      int start = firstToken.startOffset();
      int end = firstToken.endOffset();
      for (Token nextToken = super.next(reusableToken) ;
           nextToken != null ;
           nextToken = super.next(reusableToken)) {
        end = nextToken.endOffset();
        buffer.append(' ');  // add a space between terms
        buffer.append(nextToken.termBuffer(), 0, nextToken.termLength());
      }
      reusableToken.clear();
      reusableToken.resizeTermBuffer(buffer.length());
      reusableToken.setTermLength(buffer.length());
      buffer.getChars(0, buffer.length(), reusableToken.termBuffer(), 0);
      reusableToken.setStartOffset(start);
      reusableToken.setEndOffset(end);
      super.reset(); // Rewind input stream to get the individual tokens
      return reusableToken;
    }
  }
  
  public void reset() throws IOException {
    super.reset();
    concatenatedTokenOutput = false;
  }
}
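For reference, the token sequence the filter above is meant to produce can be sketched in plain Java, outside of Lucene (the `allForOneTokens` helper and class name are illustrative, not part of the filter):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThreeMusketeersDemo {
    // Mimics the filter's intended output order without Lucene:
    // first the whole input as one concatenated token, then each
    // whitespace-separated term in original order.
    static List<String> allForOneTokens(String input) {
        List<String> tokens = new ArrayList<>();
        String trimmed = input.trim();
        tokens.add(trimmed);                                  // "one for all" token
        tokens.addAll(Arrays.asList(trimmed.split("\\s+")));  // then the originals
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(allForOneTokens("President day"));
        // [President day, President, day]
    }
}
```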

RE: Tokenizer Question

Posted by ayyanar <ay...@aspiresys.com>.
My objective is to retain the keyword (input stream) as a single token, like
a keyword tokenizer does, and also split the keyword on whitespace and keep
those tokens, as a whitespace tokenizer does
-- 
View this message in context: http://www.nabble.com/Tokenizer-Question-tp21295325p21298291.html
Sent from the Lucene - General mailing list archive at Nabble.com.


RE: Tokenizer Question

Posted by Steven A Rowe <sa...@syr.edu>.
Hi ayyanar,

On 01/05/2009 at 12:23 PM, ayyanar wrote:
> I need a tokenizer that tokenizes a keyword as follows: consider the
> example "President day" - this should be tokenized as "President day",
> "President", "day". This seems to combine the functionality of a keyword
> tokenizer and a whitespace tokenizer. Do we have any tokenizer that does
> this job, or do we need to write a custom one?

A ShingleFilter <http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/shingle/ShingleFilter.html> over a whitespace tokenizer should do the trick.  By default, unigrams (individual terms) are output in addition to shingles (token n-grams).
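What ShingleFilter produces over whitespace tokens can be illustrated in plain Java (a sketch of unigrams plus space-joined bigram shingles, i.e. max shingle size 2; the `shingles` helper is illustrative, not the Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleDemo {
    // Emits each unigram followed by the bigram shingle it starts,
    // roughly the order ShingleFilter uses when unigrams are enabled.
    static List<String> shingles(String input) {
        String[] terms = input.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            out.add(terms[i]);                          // unigram
            if (i + 1 < terms.length) {
                out.add(terms[i] + " " + terms[i + 1]); // bigram shingle
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("President day"));
        // [President, President day, day]
    }
}
```

For a two-term input like "President day" the bigram shingle happens to equal the whole keyword, but for longer inputs the shingles cover only adjacent pairs, not the full string - which is why the custom filter later in this thread was needed.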

Steve