Posted to dev@lucene.apache.org by Wolfgang Hoschek <wh...@lbl.gov> on 2005/05/03 22:26:33 UTC

contrib: keywordTokenStream

Here's a convenience add-on method to MemoryIndex. If it turns out that 
this could be of wider use, it could be moved into the core analysis 
package. For the moment the MemoryIndex might be a better home. 
Opinions, anyone?

Wolfgang.

	/**
	 * Convenience method; Creates and returns a token stream that generates a
	 * token for each keyword in the given collection, "as is", without any
	 * transforming text analysis. The resulting token stream can be fed into
	 * {@link #addField(String, TokenStream)}, perhaps wrapped into another
	 * {@link org.apache.lucene.analysis.TokenFilter}, as desired.
	 *
	 * @param keywords
	 *            the keywords to generate tokens for
	 * @return the corresponding token stream
	 */
	public TokenStream keywordTokenStream(final Collection keywords) {
		if (keywords == null)
			throw new IllegalArgumentException("keywords must not be null");
		
		return new TokenStream() {
			private Iterator iter = keywords.iterator();
			private int start = 0; // start offset of the next token
			
			public Token next() {
				if (!iter.hasNext()) return null; // end of stream
				
				Object obj = iter.next();
				if (obj == null)
					throw new IllegalArgumentException("keyword must not be null");
				
				// emit the keyword unchanged as a single token
				String term = obj.toString();
				Token token = new Token(term, start, start + term.length());
				start += term.length() + 1; // separate words by 1 (blank) character
				return token;
			}
		};
	}
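
The offset bookkeeping in the method above can be tried without a Lucene 
build. What follows is only a minimal sketch, not Lucene code: the class 
KeywordOffsets and its offsets method are hypothetical stand-ins that 
reproduce just the start/end arithmetic, assuming each keyword is 
separated from the next by one (blank) character as in the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for illustration only; not part of Lucene.
class KeywordOffsets {

    // Returns one {start, end} pair per keyword, mirroring the offset
    // arithmetic of keywordTokenStream: each keyword starts one character
    // after the end of the previous one.
    static List<int[]> offsets(Collection<?> keywords) {
        if (keywords == null)
            throw new IllegalArgumentException("keywords must not be null");

        List<int[]> result = new ArrayList<int[]>();
        int start = 0;
        for (Object obj : keywords) {
            if (obj == null)
                throw new IllegalArgumentException("keyword must not be null");
            String term = obj.toString();
            result.add(new int[] { start, start + term.length() });
            start += term.length() + 1; // separate words by 1 (blank) character
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("lucene", "memory index", "2005");
        List<int[]> offs = offsets(keywords);
        for (int i = 0; i < offs.size(); i++) {
            System.out.println(keywords.get(i)
                + " -> [" + offs.get(i)[0] + ", " + offs.get(i)[1] + ")");
        }
    }
}
```

For the three keywords in main, the pairs come out as [0, 6), [7, 19) and 
[20, 24). Note that "memory index" stays a single entry even though it 
contains a blank, which is the "as is" behaviour described in the Javadoc.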


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: contrib: keywordTokenStream

Posted by Wolfgang Hoschek <wh...@lbl.gov>.

On May 3, 2005, at 5:26 PM, Erik Hatcher wrote:

> Wolfgang,
>
> I've now added this.

Thanks :-)

> I'm not seeing how this could be generally useful.  I'm curious how 
> you are using it and why it is better suited for what you're doing 
> than any other analyzer.
>
> "keyword tokenizer" is a bit overloaded terminology-wise, though - 
> look in the contrib/analyzers/src/java area to see what I mean.
>
>     Erik

The difference between this and the KeywordTokenizer from 
contrib/analyzers is that it

- can operate on multiple keywords rather than just a single one, so 
it's slightly more general.
- takes a collection (typically of String values) as input rather 
than a Reader. I can see the java.io.Reader scalability rationale used 
throughout the analysis APIs, but for many use cases (including my own) 
Strings are a lot handier (and more efficient to deal with), since the 
string values are small anyway.

So it's a convenient way to add terms (keywords, if you like) that have 
been parsed/massaged into string(s) by some existing external means 
(e.g. grouped regex scanning of legacy formatted text files into 
various fields, etc.) into an index "as is", without any further 
transforming analysis. Most folks could write such a (non-essential) 
utility themselves, but it's handy in much the same way as the 
Field.Keyword convenience infrastructure...
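
To make the regex-scanning use case concrete, here is a rough sketch 
under stated assumptions. The record layout ("<id> <yyyy-mm-dd> 
<comma-separated tags>"), the class LegacyRecordScanner and its tags 
method are all hypothetical and not taken from any real data or from 
Lucene. The sketch only shows how already-parsed strings end up in a 
collection of the kind keywordTokenStream accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example; the record layout and class name are assumptions.
class LegacyRecordScanner {

    // Assumed layout: "<id> <yyyy-mm-dd> <comma-separated tags>"
    private static final Pattern RECORD =
        Pattern.compile("^(\\S+)\\s+(\\d{4}-\\d{2}-\\d{2})\\s+(.*)$");

    // Returns the tags of one record as ready-made keywords, or an empty
    // list if the line does not match the assumed layout.
    static List<String> tags(String line) {
        List<String> keywords = new ArrayList<String>();
        Matcher m = RECORD.matcher(line);
        if (!m.matches()) return keywords;
        for (String tag : m.group(3).split(",")) {
            String trimmed = tag.trim();
            if (trimmed.length() > 0) keywords.add(trimmed); // skip empty tags
        }
        return keywords;
    }

    public static void main(String[] args) {
        System.out.println(tags("REC-17 2005-05-03 New York, search engine,lucene"));
    }
}
```

The resulting list keeps "New York" and "search engine" as single 
keywords. Running them through a typical analyzer would split them on 
whitespace, whereas a keyword token stream would index them unchanged.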

> "keyword tokenizer" is a bit overloaded terminology-wise, though

If you come up with a better name feel free to rename it.

Wolfgang.


Re: contrib: keywordTokenStream

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Wolfgang,

I've now added this.  I'm not seeing how this could be generally  
useful.  I'm curious how you are using it and why it is better suited  
for what you're doing than any other analyzer.

"keyword tokenizer" is a bit overloaded terminology-wise, though -  
look in the contrib/analyzers/src/java area to see what I mean.

     Erik


