Posted to java-user@lucene.apache.org by Vinicius Carvalho <vi...@gmail.com> on 2010/06/23 04:49:45 UTC

Stop words filter

Hello there! I've been using Lucene as a full-text search solution for some
time, and although I'm familiar with Analyzers and Stemmers, I have never used
them directly.

I'm running a few experiments on sentiment analysis, and our implementation
needs to perform stemming and stop word removal. I thought of using Lucene's
built-in support to spare me some coding time.

Is there an example anywhere? So far I'm trying:

TokenStream stream = analyzer.tokenStream("", new StringReader(inputStr));

The problem is that I could not find a way to get the resulting tokens. I was
expecting something like stream.getTokens(): Token[] :P

Could someone point me in the right direction?

Regards

-- 
The intuitive mind is a sacred gift and the
rational mind is a faithful servant. We have
created a society that honors the servant and
has forgotten the gift.

Re: Stop words filter

Posted by Erick Erickson <er...@gmail.com>.

On the chance that this is an XY problem
(http://people.apache.org/~hossman/#xyproblem),
why can't you use StopFilter and PorterStemFilter in
your filter chain rather than try to do this yourself?
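
To make "filter chain" concrete, here is a minimal sketch of the idea in plain
Java, with no Lucene dependency. Every type in it (TermSource,
LowerCaseSplitter, StopWords, TrailingS) is a hypothetical stand-in invented
for illustration, not a Lucene class, and the stop list is made up. TrailingS
is only a placeholder for the stemming step and is not the Porter algorithm. In
a real analyzer the StopFilter and PorterStemFilter mentioned above would take
the place of the two wrappers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stand-in model of a filter chain; none of these types are Lucene classes.
final class FilterChainSketch {

    /** Hypothetical minimal token source: returns the next term, or null at the end. */
    interface TermSource {
        String next();
    }

    /** Plays the tokenizer role: lower-cases the text and splits on non-letters. */
    static final class LowerCaseSplitter implements TermSource {
        private final Iterator<String> parts;

        LowerCaseSplitter(String text) {
            List<String> list = new ArrayList<String>();
            for (String p : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!p.isEmpty()) {
                    list.add(p);
                }
            }
            parts = list.iterator();
        }

        public String next() {
            return parts.hasNext() ? parts.next() : null;
        }
    }

    /** Plays the stop filter role: skips every term found in the stop set. */
    static final class StopWords implements TermSource {
        private final TermSource in;
        private final Set<String> stop;

        StopWords(TermSource in, Set<String> stop) {
            this.in = in;
            this.stop = stop;
        }

        public String next() {
            String t;
            while ((t = in.next()) != null) {
                if (!stop.contains(t)) {
                    return t;
                }
            }
            return null;
        }
    }

    /** Placeholder for the stemming step: strips one trailing "s". NOT Porter. */
    static final class TrailingS implements TermSource {
        private final TermSource in;

        TrailingS(TermSource in) {
            this.in = in;
        }

        public String next() {
            String t = in.next();
            if (t == null) {
                return null;
            }
            return (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;
        }
    }

    /** Builds the chain (tokenizer, then stop filter, then stemmer) and drains it. */
    static List<String> analyze(String text) {
        // Made-up stop list, for illustration only.
        Set<String> stop = new HashSet<String>(
                Arrays.asList("the", "a", "an", "is", "are", "and", "of"));
        TermSource chain = new TrailingS(new StopWords(new LowerCaseSplitter(text), stop));
        List<String> out = new ArrayList<String>();
        for (String t = chain.next(); t != null; t = chain.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The cats are chasing the birds")); // [cat, chasing, bird]
    }
}
```

Each stage only knows about the stage it wraps, so the order of the wrappers is
the order of processing. Here the stop words are removed before stemming, so the
stop list has to contain unstemmed forms.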

Best
Erick

Re: Stop words filter

Posted by Rebecca Watson <be...@gmail.com>.

i guess you are using lucene 2.9 or below if you're still talking about
Tokens...

here's some old code i used to use (not sure if i wrote it or grabbed it from
online examples - it's been a while since i used it!) that returned the array
of tokens for a given field name + text to analyse. any class that extended it
got the method too, e.g. you can use it for a per-field analyzer:

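import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public abstract class GenAnalyzer extends Analyzer {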
	
	/**
	 * lucene Analyzer object
	 * @see org.apache.lucene.analysis.Analyzer
	 */
	protected Analyzer gan;
	
	/**
	 * A method to split text into tokens which are returned in the form of
	 * a TokenStream object. The text is read in using the java.io.Reader
	 * object. As analysers can be field specific the name of the field
	 * is also provided to the method.
	 *
	 * @see org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader)
	 * @param fieldName the name of the lucene field
	 * @param reader A Reader object containing string to split into tokens
	 * @return a TokenStream that represents the string split into tokens
	 * based on the field name (maybe field specific analyser).
	 */
	@Override
	public TokenStream tokenStream(String fieldName, Reader reader) {
		return gan.tokenStream(fieldName, reader);
	}
	
	/**
	 * A method to split text into tokens which are returned in the form of
	 * a Token[]. The text is read in as a string.
	 * As analysers can be field specific the name of the field
	 * is also provided to the method.
	 *
	 * Similar to the tokenStream method, except that the parameters
	 * and return type differ.
	 *
	 * @param fieldName the name of the lucene field
	 * @param text the text to be split into tokens
	 * @return a Token[] which represents the split text tokens.
	 * @throws IOException may be thrown by the stream.next(token) call.
	 *
	 * @see org.apache.lucene.analysis.Token
	 */
	public Token[] getTokens(String fieldName, String text)
	throws IOException {
		TokenStream stream = gan.tokenStream(fieldName, new StringReader(text));
		ArrayList<Token> tokenList = new ArrayList<Token>();
		Token token = new Token();
		while(true){
			token = stream.next(token);
			if (token == null) break;
			tokenList.add((Token) token.clone());
		}
		//stream.end();
		return tokenList.toArray(new Token[0]);
	}
}

hope that helps, i haven't used this code for a while but it worked
when i used it last!

in lucene 2.9 the stream.next(token) method is deprecated... and
if you move to lucene 3 i think that's where attributes (AttributeSource)
replace Tokens, so all this code will need to be ported...

thanks :)

bec

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Stop words filter

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Vinicius,

You should read the Package-Level Docs:
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/package-summary.html

To get the Token attributes, you have to add Attributes to your TokenStream
using addAttribute() and then you have easy access to the various attributes
of each token, when iterating with incrementToken().
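
The shape of that loop can be sketched in plain Java without Lucene on the
classpath. TermHolder and WhitespaceStream below are hypothetical stand-ins
invented for illustration, not Lucene classes. termHolder() plays the role of
addAttribute(), and the stand-in incrementToken() advances the stream the way
the real method is described above. The point the sketch tries to show is that
the attribute object is fetched once and then overwritten for every token, so
the value has to be copied out inside the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in model of attribute-based iteration; these are NOT Lucene classes.
final class AttributeLoopSketch {

    /** Hypothetical mutable holder, playing the role of a term attribute. */
    static final class TermHolder {
        private final StringBuilder buf = new StringBuilder();

        void set(String s) {
            buf.setLength(0);
            buf.append(s);
        }

        /** Returns a copy of the current term text. */
        String term() {
            return buf.toString();
        }
    }

    /** Hypothetical stream that reuses ONE TermHolder for every token. */
    static final class WhitespaceStream {
        private final String[] words;
        private final TermHolder holder = new TermHolder();
        private int pos = 0;

        WhitespaceStream(String text) {
            String trimmed = text.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        /** Stand-in for registering an attribute: always the same instance. */
        TermHolder termHolder() {
            return holder;
        }

        /** Advances to the next token, overwriting the holder; false at the end. */
        boolean incrementToken() {
            if (pos >= words.length) {
                return false;
            }
            holder.set(words[pos++]);
            return true;
        }
    }

    /** Drains the stream and returns the term text of every token. */
    static List<String> collectTerms(String text) {
        WhitespaceStream stream = new WhitespaceStream(text);
        TermHolder term = stream.termHolder();   // fetched once, outside the loop
        List<String> terms = new ArrayList<String>();
        while (stream.incrementToken()) {
            terms.add(term.term());              // copy the value out on every step
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(collectTerms("stop words filter")); // [stop, words, filter]
    }
}
```

Keeping a reference to the holder itself instead of calling term() inside the
loop would leave you with only the last token's text.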

If you want to write your own Tokenizer, start by inspecting one of the
provided ones and follow the same pattern. The test cases for the existing
analyzers are also a good way to see how they are used. Useful methods for
testing TokenStreams/Analyzers are in the test package's class
BaseTokenStreamTestCase: assertTokenStreamContents() and assertAnalyzesTo().

The second edition of the Lucene in Action book also gives good examples.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


