Posted to java-user@lucene.apache.org by Larry Hendrix <la...@wisc.edu> on 2010/05/18 20:05:11 UTC

Stemming Problem

Hi,

Right now I'm using Lucene with a basic Whitespace Analyzer, but I'm having problems with stemming. Does anyone have a recommendation for other text analyzers that handle stemming and also keep capitalization, stop words, and punctuation?

Thanks,
Larry


Larry A. Hendrix, Graduate Student 
Computer Science Department 
University of Wisconsin-Madison 
1300 University Ave Rm 6749 
Madison, WI 53711 
Office: (608) 263-7624 
lhendrix@cs.wisc.edu 
Grambling State University Alum 


RE: Stemming Problem

Posted by Christopher Condit <co...@sdsc.edu>.
Hi Larry-
 
> Right now I'm using Lucene with a basic Whitespace Analyzer, but I'm having
> problems with stemming. Does anyone have a recommendation for other
> text analyzers that handle stemming and also keep capitalization, stop words,
> and punctuation?

Have you tried the SnowballFilter? You could write your own analyzer that combines a WhitespaceTokenizer with a SnowballFilter, which should have the desired effect.
See: http://lucene.apache.org/java/3_0_1/api/contrib-snowball/org/apache/lucene/analysis/snowball/SnowballFilter.html

Good luck,
-Chris



Re: Stemming Problem

Posted by Larry Hendrix <la...@wisc.edu>.
Thanks for the advice. I want to keep the capitalization because in our application we are mining specific contact and company names from news articles. About 99% of the time, if we match a contact or company name and it's capitalized, we avoid false matches.

--Larry

On May 18, 2010, at 7:46 PM, Erick Erickson wrote:

> You can construct your own analyzer by creating
> it from a pre-existing Tokenizer
> (e.g. WhitespaceTokenizer) and any number
> of TokenFilters (e.g. TokenFilter). You can
> string any number of TokenFilters together
> to get many different effects.
> 
> But I have to ask: why do you want to keep capitalization
> and punctuation? Do you really want to fail to match
> text indexed with "Erickson, Erick" with the query
> "erick erickson"? That's often a source of frustration
> instead of goodness.
> 
> HTH
> Erick
> 



Re: Stemming Problem

Posted by Erick Erickson <er...@gmail.com>.
You can construct your own analyzer by creating
it from a pre-existing Tokenizer
(e.g. WhitespaceTokenizer) and any number
of TokenFilters (e.g. TokenFilter). You can
string any number of TokenFilters together
to get many different effects.
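
The tokenizer-plus-filters idea can be sketched in plain Java. This is only an illustration of the chaining pattern, not Lucene's API: the class TokenChainSketch and the methods whitespaceTokenize, toyStem, and analyze are hypothetical names, and toyStem is a deliberately crude stand-in for a real stemmer such as Snowball.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Plain-Java illustration of one tokenizer feeding a chain of token filters (not Lucene API). */
final class TokenChainSketch {

    /** Tokenizer stage: split on runs of whitespace; case and punctuation are left untouched. */
    static List<String> whitespaceTokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Toy filter (assumption, not Snowball): strips a trailing "ing" or a plural "s". */
    static String toyStem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    /** Runs every token through each filter in order, like a chain of token filters. */
    @SafeVarargs
    static List<String> analyze(String text, UnaryOperator<String>... filters) {
        List<String> out = new ArrayList<>();
        for (String token : whitespaceTokenize(text)) {
            for (UnaryOperator<String> filter : filters) {
                token = filter.apply(token);
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        // Capitalization survives because no lowercasing filter is in the chain.
        System.out.println(analyze("Acme Systems is testing engines,", TokenChainSketch::toyStem));
    }
}
```

Note that in this sketch the token "engines," is left unstemmed because the trailing comma hides the suffix. Keeping punctuation attached to tokens can interfere with stemming in a similar way in a real analyzer, so it is worth checking the output of whatever chain you build.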

But I have to ask: why do you want to keep capitalization
and punctuation? Do you really want to fail to match
text indexed with "Erickson, Erick" with the query
"erick erickson"? That's often a source of frustration
instead of goodness.
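
The mismatch described above can be reproduced with a small plain-Java sketch. It does not use Lucene; TermMatchSketch and its methods are hypothetical names, and "matching" is simplified here to mean that the indexed text contains every query term.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Compares term matching with and without case/punctuation normalization (not Lucene API). */
final class TermMatchSketch {

    /** Whitespace-only analysis: case and punctuation stay attached to each term. */
    static Set<String> exactTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Normalizing analysis: lowercase each term and trim punctuation from both ends. */
    static Set<String> normalizedTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : exactTerms(text)) {
            String term = token.toLowerCase(Locale.ROOT)
                    .replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Simplified match: the indexed text must contain every query term, in any order. */
    static boolean matches(Set<String> indexedTerms, Set<String> queryTerms) {
        return indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        String indexed = "Erickson, Erick";
        String query = "erick erickson";
        // Exact terms: "Erickson," and "Erick" vs. "erick" and "erickson".
        System.out.println(matches(exactTerms(indexed), exactTerms(query)));           // false
        // Normalized terms: both sides reduce to "erick" and "erickson".
        System.out.println(matches(normalizedTerms(indexed), normalizedTerms(query))); // true
    }
}
```

Even a correctly capitalized query such as "Erick Erickson" fails against the exact terms in this sketch, because the indexed term "Erickson," still carries its comma.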

HTH
Erick
