Posted to java-user@lucene.apache.org by Larry Hendrix <la...@wisc.edu> on 2010/05/18 20:05:11 UTC
Stemming Problem
Hi,
Right now I'm using Lucene with a basic WhitespaceAnalyzer, but I'm having problems with stemming. Does anyone have a recommendation for other text analyzers that handle stemming and also keep capitalization, stop words, and punctuation?
Thanks,
Larry
Larry A. Hendrix, Graduate Student
Computer Science Department
University of Wisconsin-Madison
1300 University Ave Rm 6749
Madison, WI 53711
Office: (608) 263-7624
lhendrix@cs.wisc.edu
Grambling State University Alum
RE: Stemming Problem
Posted by Christopher Condit <co...@sdsc.edu>.
Hi Larry-
> Right now I'm using Lucene with a basic WhitespaceAnalyzer but I'm having
> problems with stemming. Does anyone have a recommendation for other
> text analyzers that handle stemming and also keep capitalization, stop words,
> and punctuation?
Have you tried the SnowballFilter? You could make your own analyzer combining a WhitespaceTokenizer and a SnowballFilter, which should have the desired effect.
See: http://lucene.apache.org/java/3_0_1/api/contrib-snowball/org/apache/lucene/analysis/snowball/SnowballFilter.html
Good luck,
-Chris
Re: Stemming Problem
Posted by Larry Hendrix <la...@wisc.edu>.
Thanks for the advice. I want to keep the capitalization because in our application we are mining specific contact and company names from news articles. About 99% of the time, if we match a contact or company name and it's capitalized, we avoid false matches.
--Larry
On May 18, 2010, at 7:46 PM, Erick Erickson wrote:
> You can construct your own analyzer by creating
> it from a pre-existing Tokenizer
> (e.g. WhitespaceTokenizer) and any number
> of TokenFilters (e.g. SnowballFilter). You can
> string any number of TokenFilters together
> to get many different effects.
>
> But I have to ask: why do you want to keep capitalization
> and punctuation? Do you really want to fail to match
> text indexed with "Erickson, Erick" with the query
> "erick erickson"? That's often a source of frustration
> instead of goodness.
>
> HTH
> Erick
Re: Stemming Problem
Posted by Erick Erickson <er...@gmail.com>.
You can construct your own analyzer by creating
it from a pre-existing Tokenizer
(e.g. WhitespaceTokenizer) and any number
of TokenFilters (e.g. SnowballFilter). You can
string any number of TokenFilters together
to get many different effects.
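To make the tokenizer-plus-filters idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: whitespaceTokenize, toyStem, and analyze are hypothetical names invented for illustration, and toyStem is a crude suffix stripper standing in for a real stemming filter such as SnowballFilter, not the Snowball algorithm itself. It only shows the shape of the pipeline: one tokenizer feeding a chain of per-token filters, with case and punctuation left alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class AnalyzerChainSketch {

    // Stand-in for a whitespace tokenizer: splits on runs of whitespace only,
    // so capitalization and punctuation survive untouched.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy stand-in for a stemming filter (assumption: NOT the Snowball algorithm).
    // Strips a trailing "ing" or a plural "s" and never changes letter case.
    static String toyStem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Runs every token through each filter in order, like a chain of token filters.
    @SafeVarargs
    static List<String> analyze(String text, UnaryOperator<String>... filters) {
        List<String> out = new ArrayList<>();
        for (String token : whitespaceTokenize(text)) {
            for (UnaryOperator<String> filter : filters) {
                token = filter.apply(token);
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Acme, Widget, is, open, office]
        System.out.println(analyze("Acme Widgets is opening offices",
                AnalyzerChainSketch::toyStem));
    }
}
```

Because nothing in the chain lowercases or strips punctuation, "Acme" stays capitalized. Adding another filter is just one more argument to analyze.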
But I have to ask: why do you want to keep capitalization
and punctuation? Do you really want to fail to match
text indexed with "Erickson, Erick" with the query
"erick erickson"? That's often a source of frustration
instead of goodness.
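The mismatch described above can be reproduced with a small plain-Java sketch. This is an illustration only, not Lucene code: rawTerms, normalizedTerms, and allQueryTermsIndexed are hypothetical helpers, and "matching" here is simplified to checking that every query term appears among the indexed terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class CaseSensitiveMatchSketch {

    // Whitespace-only terms: case and attached punctuation are preserved.
    static Set<String> rawTerms(String text) {
        return new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
    }

    // Normalized terms: lowercased, with leading/trailing punctuation stripped.
    static Set<String> normalizedTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String t : text.trim().split("\\s+")) {
            String cleaned = t.toLowerCase(Locale.ROOT)
                    .replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!cleaned.isEmpty()) {
                terms.add(cleaned);
            }
        }
        return terms;
    }

    // Simplified notion of a match: every query term must be an indexed term.
    static boolean allQueryTermsIndexed(Set<String> indexed, Set<String> query) {
        return indexed.containsAll(query);
    }

    public static void main(String[] args) {
        String indexedText = "Erickson, Erick";
        String query = "erick erickson";

        // Raw terms are {"Erickson,", "Erick"} vs {"erick", "erickson"}: prints false
        System.out.println(allQueryTermsIndexed(rawTerms(indexedText), rawTerms(query)));

        // Normalized terms are {"erickson", "erick"} on both sides: prints true
        System.out.println(allQueryTermsIndexed(
                normalizedTerms(indexedText), normalizedTerms(query)));
    }
}
```

If capitalization has to be preserved for precision, as with name mining, the same analysis must be applied to queries, and users have to type names exactly as they were indexed.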
HTH
Erick