Posted to dev@lucene.apache.org by Wolfgang Hoschek <wh...@lbl.gov> on 2005/05/03 22:26:33 UTC
contrib: keywordTokenStream
Here's a convenience add-on method to MemoryIndex. If it turns out that
this could be of wider use, it could be moved into the core analysis
package. For the moment the MemoryIndex might be a better home.
Opinions, anyone?
Wolfgang.
/**
 * Convenience method; Creates and returns a token stream that generates a
 * token for each keyword in the given collection, "as is", without any
 * transforming text analysis. The resulting token stream can be fed into
 * {@link #addField(String, TokenStream)}, perhaps wrapped into another
 * {@link org.apache.lucene.analysis.TokenFilter}, as desired.
 *
 * @param keywords
 *            the keywords to generate tokens for
 * @return the corresponding token stream
 */
public TokenStream keywordTokenStream(final Collection keywords) {
    if (keywords == null)
        throw new IllegalArgumentException("keywords must not be null");

    return new TokenStream() {
        Iterator iter = keywords.iterator();
        int pos = 0;
        int start = 0;

        public Token next() {
            if (!iter.hasNext()) return null;

            Object obj = iter.next();
            if (obj == null)
                throw new IllegalArgumentException("keyword must not be null");

            String term = obj.toString();
            Token token = new Token(term, start, start + term.length());
            start += term.length() + 1; // separate words by 1 (blank) character
            pos++;
            return token;
        }
    };
}
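
To make the offset bookkeeping above easier to follow, here is a minimal sketch that runs without Lucene on the classpath. SimpleToken and KeywordOffsetsSketch are hypothetical stand-ins invented for this illustration, not Lucene classes; only the arithmetic (each keyword emitted unchanged, offsets advanced as if keywords were joined by a single blank) mirrors the method above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

final class KeywordOffsetsSketch {

    /** Hypothetical stand-in for a token: term text plus start/end offsets. */
    static final class SimpleToken {
        final String term;
        final int start;
        final int end;

        SimpleToken(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "[" + start + "," + end + ")";
        }
    }

    /** Emits one token per keyword, "as is", with the same offset rule as above. */
    static List<SimpleToken> keywordTokens(Collection<?> keywords) {
        if (keywords == null)
            throw new IllegalArgumentException("keywords must not be null");

        List<SimpleToken> tokens = new ArrayList<>();
        int start = 0;
        for (Object obj : keywords) {
            if (obj == null)
                throw new IllegalArgumentException("keyword must not be null");
            String term = obj.toString();
            tokens.add(new SimpleToken(term, start, start + term.length()));
            start += term.length() + 1; // separate words by 1 (blank) character
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "New York" stays a single token because no analysis is applied.
        for (SimpleToken t : keywordTokens(Arrays.asList("foo", "New York", "42"))) {
            System.out.println(t); // foo[0,3) then New York[4,12) then 42[13,15)
        }
    }
}
```

Note that a multi-word value such as "New York" comes out as one token, which is the point of skipping the transforming analysis.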
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: contrib: keywordTokenStream
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
On May 3, 2005, at 5:26 PM, Erik Hatcher wrote:
> Wolfgang,
>
> I've now added this.
Thanks :-)
> I'm not seeing how this could be generally useful. I'm curious how
> you are using it and why it is better suited for what you're doing
> than any other analyzer.
>
> "keyword tokenizer" is a bit overloaded terminology-wise, though -
> look in the contrib/analyzers/src/java area to see what I mean.
>
> Erik
The difference between this and the KeywordTokenizer from
contrib/analyzers is that it
- can operate on multiple keywords rather than just a single one, so
it's slightly more general.
- takes a collection (typically of String values) as input rather
than a Reader. I can see the java.io.Reader scalability rationale used
throughout the analysis APIs, but for many use cases (including my own)
Strings are a lot handier (and more efficient to deal with), since the
string values are small anyway.
So it's a convenient way to add terms (keywords, if you like) to an
index "as is", without any further transforming analysis, when they
have already been parsed/massaged into string(s) by some existing
external means (e.g. grouped regex scanning of legacy formatted text
files into various fields). Most folks could write such a
(non-essential) utility themselves, but it's handy in the same way
that the Field.Keyword convenience infrastructure is.
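
As a rough illustration of the "grouped regex scanning" use case, the sketch below splits one legacy record into per-field keyword lists using only java.util.regex. The record layout (host, level, comma-separated tags), the field names, and the LegacyRecordScanner class are all assumptions made up for this example; they are not part of MemoryIndex or of my actual data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LegacyRecordScanner {

    // Assumed legacy line format: "<host> <level> <tag>,<tag>,..."
    private static final Pattern RECORD =
        Pattern.compile("^(\\S+)\\s+(\\S+)\\s+(\\S+)$");

    /** Scans one record and returns field name -> keywords, in field order. */
    static Map<String, List<String>> scan(String line) {
        Matcher m = RECORD.matcher(line.trim());
        if (!m.matches())
            throw new IllegalArgumentException("unrecognized record: " + line);

        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("host", Collections.singletonList(m.group(1)));
        fields.put("level", Collections.singletonList(m.group(2)));

        // The third group holds several keywords for a single field.
        List<String> tags = new ArrayList<>();
        for (String tag : m.group(3).split(",")) {
            if (!tag.isEmpty()) tags.add(tag);
        }
        fields.put("tags", tags);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = scan("node-17 WARN disk,raid,smart");
        fields.forEach((name, keywords) ->
            System.out.println(name + " -> " + keywords));
    }
}
```

Each resulting list is already in its final form, so it could be handed to keywordTokenStream and the returned stream passed to addField(String, TokenStream) under the matching field name, with no Reader or analyzer in between.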
> "keyword tokenizer" is a bit overloaded terminology-wise, though
If you come up with a better name feel free to rename it.
Wolfgang.
Re: contrib: keywordTokenStream
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Wolfgang,
I've now added this. I'm not seeing how this could be generally
useful. I'm curious how you are using it and why it is better suited
for what you're doing than any other analyzer.
"keyword tokenizer" is a bit overloaded terminology-wise, though -
look in the contrib/analyzers/src/java area to see what I mean.
Erik
On May 3, 2005, at 4:26 PM, Wolfgang Hoschek wrote:
> Here's a convenience add-on method to MemoryIndex. If it turns out
> that this could be of wider use, it could be moved into the core
> analysis package. For the moment the MemoryIndex might be a better
> home. Opinions, anyone?
>
> Wolfgang.
>
> /**
>  * Convenience method; Creates and returns a token stream that generates a
>  * token for each keyword in the given collection, "as is", without any
>  * transforming text analysis. The resulting token stream can be fed into
>  * {@link #addField(String, TokenStream)}, perhaps wrapped into another
>  * {@link org.apache.lucene.analysis.TokenFilter}, as desired.
>  *
>  * @param keywords
>  *            the keywords to generate tokens for
>  * @return the corresponding token stream
>  */
> public TokenStream keywordTokenStream(final Collection keywords) {
>     if (keywords == null)
>         throw new IllegalArgumentException("keywords must not be null");
>
>     return new TokenStream() {
>         Iterator iter = keywords.iterator();
>         int pos = 0;
>         int start = 0;
>
>         public Token next() {
>             if (!iter.hasNext()) return null;
>
>             Object obj = iter.next();
>             if (obj == null)
>                 throw new IllegalArgumentException("keyword must not be null");
>
>             String term = obj.toString();
>             Token token = new Token(term, start, start + term.length());
>             start += term.length() + 1; // separate words by 1 (blank) character
>             pos++;
>             return token;
>         }
>     };
> }