Posted to general@lucene.apache.org by ayyanar <ay...@aspiresys.com> on 2009/01/05 18:23:35 UTC
Tokenizer Question
I need a tokenizer that tokenizes a keyword as follows. Consider the example
"President day": it should be tokenized as "President day", "President", and
"day".
This seems to combine the functionality of a keyword tokenizer and a
whitespace tokenizer.
Is there an existing tokenizer that does this job, or do we need to write a
custom one?
--
View this message in context: http://www.nabble.com/Tokenizer-Question-tp21295325p21295325.html
Sent from the Lucene - General mailing list archive at Nabble.com.
Re: Tokenizer Question
Posted by Julio Oliveira <ju...@gmail.com>.
Do a while loop over a StringTokenizer:
new StringTokenizer(varToTokenize, " "); (this returns the tokens one at a
time, with the words split on spaces.)
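Julio's suggestion can be sketched as follows: keep the whole keyword as one token, then add each whitespace-split word. (The class and method names here are illustrative, not from Lucene; this is plain-JDK code, untested against any particular analyzer chain.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class KeywordPlusWords {
    // Keep the full keyword as one token (as a KeywordTokenizer would emit),
    // then append each whitespace-split word (as a WhitespaceTokenizer would).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        tokens.add(input); // the whole keyword, unsplit
        StringTokenizer st = new StringTokenizer(input, " ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken()); // each individual word
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("President day"));
        // [President day, President, day]
    }
}
```

Note that this builds a plain list of strings rather than a Lucene TokenStream, so it loses offsets and position increments; it only illustrates the splitting logic.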
jOliveira
On Mon, Jan 5, 2009 at 7:58 PM, Steven A Rowe <sa...@syr.edu> wrote:
> [quoted text trimmed - the full message appears below]
--
Saludos
Julio Oliveira - Buenos Aires
julio.julioOliveira@gmail.com
http://www.linkedin.com/in/juliomoliveira
RE: Tokenizer Question
Posted by Steven A Rowe <sa...@syr.edu>.
Hi ayyanar,
I should have mentioned in my previous email that the general@lucene.apache.org mailing list has very few subscribers - you'll get much better response on the java-user@l.a.o mailing list.
On 01/05/2009 at 3:07 PM, ayyanar wrote:
> My objective is to retain the keyword (input stream) as a single token,
> like a keyword tokenizer does, and also to split the keyword on whitespace
> and keep those tokens, as a whitespace tokenizer does
Right, ShingleFilter won't do this for you.
The following, if used to filter WhitespaceTokenizer's output, is similar to what you want (note: untested, and also note that this assumes you're using Lucene v2.4.0, and not a recent trunk version, which includes the new TokenStream API introduced with LUCENE-1422: <https://issues.apache.org/jira/browse/LUCENE-1422>):
-----
/**
 * Extends CachingTokenFilter to output a space-separated-
 * concatenated-all-input-stream-terms token, followed by
 * all of the original input stream tokens.
 * One for all and (then) all for one!
 */
public class ThreeMusketeersFilter extends CachingTokenFilter {

  private boolean concatenatedTokenOutput = false;

  public ThreeMusketeersFilter(TokenStream input) {
    super(input);
  }

  public Token next(final Token reusableToken) throws IOException {
    assert reusableToken != null;
    if (concatenatedTokenOutput) {
      return super.next(reusableToken);
    } else {
      concatenatedTokenOutput = true;
      Token firstToken = super.next(reusableToken);
      if (firstToken == null) {
        return null;
      }
      StringBuffer buffer = new StringBuffer();
      // Append only termLength() chars - termBuffer() may be over-allocated
      buffer.append(firstToken.termBuffer(), 0, firstToken.termLength());
      int start = firstToken.startOffset();
      int end = firstToken.endOffset();
      for (Token nextToken = super.next(reusableToken) ;
           nextToken != null ;
           nextToken = super.next(reusableToken)) {
        end = nextToken.endOffset();
        buffer.append(' '); // add a space between terms
        buffer.append(nextToken.termBuffer(), 0, nextToken.termLength());
      }
      reusableToken.clear();
      reusableToken.resizeTermBuffer(buffer.length());
      reusableToken.setTermLength(buffer.length());
      buffer.getChars(0, buffer.length(), reusableToken.termBuffer(), 0);
      reusableToken.setStartOffset(start);
      reusableToken.setEndOffset(end);
      super.reset(); // Rewind input stream to get the individual tokens
      return reusableToken;
    }
  }

  public void reset() throws IOException {
    super.reset();
    concatenatedTokenOutput = false;
  }
}
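Stripped of the Lucene API, the filter's two-pass idea can be sketched in plain JDK Java (the class and method names below are illustrative, not Lucene's): cache the tokens, emit the space-joined concatenation first, then replay the originals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConcatThenReplay {
    // Emit the space-joined concatenation of all tokens, then each original.
    static List<String> process(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        if (tokens.isEmpty()) {
            return out;
        }
        StringBuilder buffer = new StringBuilder();
        for (String tok : tokens) {
            if (buffer.length() > 0) {
                buffer.append(' '); // a space between terms, as in the filter
            }
            buffer.append(tok);
        }
        out.add(buffer.toString()); // the "one for all" token
        out.addAll(tokens);         // then "all for one": the originals
        return out;
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList("President", "day")));
        // [President day, President, day]
    }
}
```

The list here stands in for CachingTokenFilter's token cache, which is what lets the real filter make a second pass over the input after emitting the concatenated token.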
RE: Tokenizer Question
Posted by ayyanar <ay...@aspiresys.com>.
My objective is to retain the keyword (input stream) as a single token, like a
keyword tokenizer does, and also to split the keyword on whitespace and keep
those tokens, as a whitespace tokenizer does
--
View this message in context: http://www.nabble.com/Tokenizer-Question-tp21295325p21298291.html
RE: Tokenizer Question
Posted by Steven A Rowe <sa...@syr.edu>.
Hi ayyanar,
On 01/05/2009 at 12:23 PM, ayyanar wrote:
> I need a tokenizer that tokenizes a keyword as follows: consider the
> example "President day" - it should be tokenized as "President day",
> "President", and "day". This seems to combine the functionality of a
> keyword tokenizer and a whitespace tokenizer. Is there an existing
> tokenizer that does this job, or do we need to write a custom one?
A ShingleFilter <http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/shingle/ShingleFilter.html> over a whitespace tokenizer should do the trick. By default, unigrams (individual terms) are output in addition to shingles (token n-grams).
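For the two-word example, unigrams plus bigram shingles do happen to cover all three desired tokens. A plain-Java sketch of word-level shingling (illustrative names, not the Lucene implementation) shows why:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {
    // Emit unigrams plus space-joined token n-grams up to maxShingleSize.
    static List<String> shingles(List<String> tokens, int maxShingleSize) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            StringBuilder sb = new StringBuilder(tokens.get(i));
            out.add(sb.toString()); // the unigram
            for (int n = 2; n <= maxShingleSize && i + n <= tokens.size(); n++) {
                sb.append(' ').append(tokens.get(i + n - 1));
                out.add(sb.toString()); // the n-gram shingle
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(Arrays.asList("President", "day"), 2));
        // [President, President day, day]
    }
}
```

Note the caveat: for a keyword of three or more words, a maximum shingle size of 2 never emits the full concatenation, which is the limitation noted elsewhere in this thread.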
Steve