Posted to java-user@lucene.apache.org by Scott Smith <ss...@mainstreamdata.com> on 2012/11/14 19:55:14 UTC
Which stemmer?
Does anyone have any experience with the stemmers? I know that Porter is what "everyone" uses. Am I better off with KStemFilter (better performance) or ?? Does anyone understand the differences between the various stemmers and how to choose one over another?
Re: Which stemmer?
Posted by Jack Krupansky <ja...@basetechnology.com>.
Great! For my favorite example of "invest", "invests", etc. it shows:
SnowballEnglish:
• investment
• invest
• invests
• investing
• invested
kStem:
• investors
• invest
• investor
• invests
• investing
• invested
minimalStem:
• invest
• invests
That highlights the distinctions between these stemmers quite well, without
highlighting the actual indexed term, which can be quite ugly.
-- Jack Krupansky
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Which stemmer?
Posted by Elmer van Chastelet <ev...@gmail.com>.
I've just created a small web application which you might find useful.
You can see which words are matched by a query word when using different
analyzers (phonetic and stemming analyzers).
These include snowball, kstem and minimal stem (the ones on the right).
http://dutieq.st.ewi.tudelft.nl/wordsearch/
I can extend the app with more analyzers. Please let me know :)
--Elmer
Example
On 11/14/2012 07:55 PM, Scott Smith wrote:
> Does anyone have any experience with the stemmers? I know that Porter is what "everyone" uses. Am I better off with KStemFilter (better performance) or ?? Does anyone understand the differences between the various stemmers and how to choose one over another?
>
Re: Which stemmer?
Posted by Michael Sokolov <so...@ifactory.com>.
>
> Does anyone have any experience with the stemmers? I know that Porter
> is what "everyone" uses. Am I better off with KStemFilter (better
> performance) or ?? Does anyone understand the differences between the
> various stemmers and how to choose one over another?
We started off using Porter, then switched to KStem since Porter is way
too aggressive for us (you get a lot of false matches), but KStem seemed
a little bit too conservative, so we've had to augment it with synonyms.
For example, KStem doesn't seem to reduce plurals in some cases where it
seems it should - like "bounds" was a problem - it won't match "bound,"
even though many (most) other plurals will match their singular form,
and verbs get reduced to their stems as well. I thought maybe this was
because there is also a heteronym (spelled same, different word) that is
*not* a plural or verb ("bounds" as boundary as in "out of bounds"??),
but I'm not really sure how KStem's word lists were put together or what
the goal was. Maybe this was just an oversight?
YMMV; it depends a lot on what you are trying to achieve.
-Mike
Re: Which stemmer?
Posted by Jack Krupansky <ja...@basetechnology.com>.
One other factor to keep in mind is that the customer should never "look" at
the actual stem term, such as "countri" or "gener", because it can freak
them out a little, for no good reason. I mean, the goal of stemming is to
show what set of words/terms will be treated as equivalent on a query, and
this is independent of what gets returned for a stored field. The stem is
simply the means to THAT end.
The fact that "dog" and "dogs" are not equivalent in KStem is in fact
disheartening, at least to me, but it may not be problematic in some use
cases.
-- Jack Krupansky
RE: Which stemmer?
Posted by Scott Smith <ss...@mainstreamdata.com>.
Thanks for the suggestions. I think Erick is correct as well. I'll let the customer decide.
Here's an updated list. Fyi--the minStem was the English Minimal Stemmer--I changed the label. Interesting to see where the minimal stemmer and porter agree (and KStemmer doesn't). You may also find the "dog" examples interesting. I also found the "invest*" list entertaining.
original      porter    kstem         EngMinStem
------------  --------  ------------  ------------
country       countri   country       country
countries     countri   country       country
country's     country'  country's     country'
run           run       run           run
runs          run       runs          run
running       run       running       running
read          read      read          read
reading       read      reading       reading
reader        reader    reader        reader
association   associ    association   association
associate     associ    associate     associate
listing       list      list          listing
water         water     water         water
watered       water     water         watered
sure          sure      sure          sure
surely        sure      surely        surely
invest        invest    invest        invest
investing     invest    invest        investing
investment    invest    investment    investment
investments   invest    investment    investment
invests       invest    invest        invest
investor      investor  invest        investor
invester      invest    invest        invester
investors     investor  invest        investor
investers     invest    invest        invester
organization  organ     organization  organization
organize      organ     organize      organize
organic       organ     organic       organic
generous      gener     generous      generous
generic       gener     generic       generic
dog           dog       dog           dog
dog's         dog'      dog's         dog'
dogs          dog       dogs          dog
dogs'         dog       dogs          dog
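One way to read a table like this is to turn it around and list, for each stemmer, which original words end up sharing a stem, since those are the words a query would treat as equivalent. The following is only a rough sketch: the class name StemGroups and the method groupByStem are made up for illustration, and the rows are a handful copied from the table rather than output from a real Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StemGroups {
    // A few rows copied from the table: original, porter, kstem, EngMinStem.
    static final String[][] ROWS = {
        {"organization", "organ", "organization", "organization"},
        {"organize", "organ", "organize", "organize"},
        {"organic", "organ", "organic", "organic"},
        {"invest", "invest", "invest", "invest"},
        {"investing", "invest", "invest", "investing"},
        {"investor", "investor", "invest", "investor"},
        {"dog", "dog", "dog", "dog"},
        {"dogs", "dog", "dogs", "dog"},
    };

    // Collects the original words (column 0) under the stem found in the given column.
    static Map<String, List<String>> groupByStem(String[][] rows, int column) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[column], k -> new ArrayList<>()).add(row[0]);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] names = {"porter", "kstem", "EngMinStem"};
        for (int col = 1; col <= 3; col++) {
            System.out.println(names[col - 1] + ": " + groupByStem(ROWS, col));
        }
    }
}
```

Grouped this way, the porter column puts "organization", "organize" and "organic" in one bucket, while the kstem column keeps "dog" and "dogs" apart.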
Now, if someone would answer my question on the Solr list ("Custom Solr Indexer/Search"), my day would be complete ;-).
Thanks for the continued help.
Scott
Re: Which stemmer?
Posted by Dmitri Mamrukov <dy...@att.net>.
Sent from my iPhone
On Nov 16, 2012, at 7:18 PM, "Igal @ getRailo.org" <ig...@getrailo.org> wrote:
Re: Which stemmer?
Posted by Lance Norskog <go...@gmail.com>.
Nope! This slang term only exists in the plural. The kind of prose with this usage may not follow standard grammatical and spelling rules anyway. Historically, text search has been funded mostly by the US intelligence agencies because they want to analyze formal and technical prose. And, it is coded by people who think in good grammar, and are perfect spellers.
If you find 'too aggressive' and 'too mild' to be a problem, what you want is 'lemmatization' where you work from a dictionary of word forms. Solr supports using Wordnet for this purpose.
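To make the dictionary idea concrete, here is a minimal sketch of lookup-based lemmatization. It is not how Solr or WordNet do it; TinyLemmatizer and its tiny word list are invented for illustration, and a real dictionary would be loaded from a much larger resource.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class TinyLemmatizer {
    private final Map<String, String> lemmaByForm = new HashMap<>();

    // Registers a lemma together with its inflected forms.
    TinyLemmatizer add(String lemma, String... forms) {
        lemmaByForm.put(lemma, lemma);
        for (String form : forms) {
            lemmaByForm.put(form, lemma);
        }
        return this;
    }

    // Returns the dictionary lemma, or the lower-cased word itself if it is unknown.
    String lemma(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        return lemmaByForm.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        TinyLemmatizer lem = new TinyLemmatizer()
            .add("foot", "feet")
            .add("invest", "invests", "investing", "invested");
        System.out.println(lem.lemma("Feet"));
        System.out.println(lem.lemma("investing"));
        System.out.println(lem.lemma("organic"));
    }
}
```

Because only listed forms are mapped, irregular plurals such as "feet" are handled, and unlisted words such as "organic" are left alone instead of being cut down by a suffix rule.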
Lance
Re: Which stemmer?
Posted by "Igal @ getRailo.org" <ig...@getrailo.org>.
but if "dogs" are feet (and I guess I fall into the not-perfect group
here)... and "feet" is the plural form of "foot", then shouldn't "dogs"
be stemmed to "dog" as a base, singular form?
Re: Which stemmer?
Posted by Tom Burton-West <tb...@umich.edu>.
Hi Mike,
>>Honestly I've never heard of anyone using "dogs" to mean feet either, but
hey nobody's perfect.
This is really off topic but I couldn't resist. This usage of "dogs" to
mean feet occurs in old blues lyrics such as Blind Lemon Jefferson's "Hot
Dogs"
http://www.youtube.com/watch?v=v670qVwzm9c
(Hard to make out what he's singing on the old 78, but he says his "dogs"
is red hot, meaning he can run really fast.)
http://jasobrecht.com/blind-lemon-jefferson-star-blues-guitar/
Tom
Re: Which stemmer?
Posted by Michael Sokolov <so...@ifactory.com>.
On 11/15/2012 1:06 PM, Tom Burton-West wrote:
> This paper on the Kstem stemmer lists cases where the Porter stemmer
> understems or overstems and explains the logic of Kstem: "Viewing
> Morphology as an Inference Process" (Krovetz, R., Proceedings of the
> Sixteenth Annual International ACM SIGIR Conference on Research and
> Development in Information Retrieval, 191-203, 1993).
>
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf
>
Thanks for the reference - that was very enlightening. The paper
explains why many terms are not stemmed as one might expect by KStem -
words that are found in the dictionary, by which I think they mean have
their own senses whose definitions do not include the stem word, are not
stemmed by KStem since it assumes that they have their own particular
meanings, and are not derived *purely by inflection*.
The dictionary they used is the Longman dictionary, which is available
for free online. I looked up "dog"
http://www.ldoceonline.com/dictionary/dog_1 and found that there is a
sense there (sense 13) whose definition reads:
dogs
[plural] American English, informal: feet
this sense doesn't mention the stem word "dog" - it clearly has a
different meaning than the main dog entry, so I guess the thinking
behind this is: if the person was searching for "dogs" (meaning feet)
they wouldn't want to find text with "dog" (meaning man's best friend).
Of course in this case, "dog" singular presumably could mean foot as
well, so the inference seems faulty, although perhaps that never
occurs? Honestly I've never heard of anyone using "dogs" to mean feet
either, but hey nobody's perfect.
This entry: http://www.ldoceonline.com/dictionary/bound_4 probably
explains the reason "bounds" doesn't stem to "bound".
In the Lucene KStemmer code, this translates into the word appearing in
one of the dictionary data files. If a word appears there (as "dogs"
and "bounds" do), it won't be stemmed. I suppose a possible approach
here would be to send the client the dictionary of non-stemming words
and let them remove some, but then you'd have to compile your own
KStemmer variant.
Perhaps a nice feature to add to KStemmer would be to have it read a
list of exception words at run-time that would be removed from its
dictionary in order to allow them to be stemmed.
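A rough sketch of that idea, with nothing taken from the actual KStemmer source: ProtectedStemmer, its constructor arguments and the plural rule are all stand-ins chosen for illustration. Words in the dictionary are left alone, and an exception list supplied at run time removes entries from that dictionary so they get stemmed after all.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ProtectedStemmer {
    private final Set<String> protectedWords;

    // dictionary: words that are never stemmed.
    // exceptions: words taken back out of the dictionary at run time.
    ProtectedStemmer(Set<String> dictionary, Set<String> exceptions) {
        protectedWords = new HashSet<>(dictionary);
        protectedWords.removeAll(exceptions);
    }

    String stem(String word) {
        if (protectedWords.contains(word)) {
            return word; // dictionary entry, assumed to carry its own meaning
        }
        // Stand-in plural rule; a real stemmer does far more than this.
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("dogs", "bounds"));
        Set<String> exceptions = new HashSet<>(Arrays.asList("bounds"));
        ProtectedStemmer stemmer = new ProtectedStemmer(dictionary, exceptions);
        System.out.println(stemmer.stem("dogs"));   // still protected
        System.out.println(stemmer.stem("bounds")); // exception, so it is stemmed
        System.out.println(stemmer.stem("cats"));   // never in the dictionary
    }
}
```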
-Mike
Re: Which stemmer?
Posted by Tom Burton-West <tb...@umich.edu>.
I agree with Erick that you probably need to give your client a list of
concrete examples, and perhaps to explain the trade-offs.
All stemmers both overstem and understem. Understemming means that some
forms of a word won’t get searched. For example, without stemming, searching
for “dogs” would not retrieve documents containing the word “dog”.
Generally there is a precision/recall tradeoff where reducing understemming
increases overstemming. The problem with aggressive stemmers like the
Porter stemmer is that they overstem.
The original Porter stemmer, for example, would stem “organization” and
“organic” both to “organ”, and “generalization”, “generous” and “generic” to
“gener”.
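The two kinds of error can be counted on a small word list. The sketch below is only an illustration: StemErrors is a made-up name, the stems are the ones reported for Porter and KStem elsewhere in this thread, and the concept labels (which words "should" match) are an editorial assumption, not something any stemmer provides.

```java
final class StemErrors {
    // Counts word pairs a stemmer gets wrong relative to hand-assigned concepts.
    // Returns {overstemmed pairs, understemmed pairs}.
    static int[] count(int[] concept, String[] stem) {
        int over = 0;
        int under = 0;
        for (int i = 0; i < stem.length; i++) {
            for (int j = i + 1; j < stem.length; j++) {
                boolean sameConcept = concept[i] == concept[j];
                boolean sameStem = stem[i].equals(stem[j]);
                if (sameStem && !sameConcept) over++;   // merged although unrelated
                if (!sameStem && sameConcept) under++;  // related but kept apart
            }
        }
        return new int[] {over, under};
    }

    public static void main(String[] args) {
        // Words, in order: organization, organize, organic, dog, dogs.
        // Assumed concepts: 1 = organizing, 2 = organic, 3 = dog.
        int[] concept = {1, 1, 2, 3, 3};
        String[] porter = {"organ", "organ", "organ", "dog", "dog"};
        String[] kstem = {"organization", "organize", "organic", "dog", "dogs"};
        int[] p = count(concept, porter);
        int[] k = count(concept, kstem);
        System.out.println("porter: over=" + p[0] + " under=" + p[1]);
        System.out.println("kstem:  over=" + k[0] + " under=" + k[1]);
    }
}
```

On this tiny sample the Porter stems produce only overstemming errors and the KStem stems only understemming errors, which is the precision/recall trade-off in miniature.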
For background on the Porter stemmers and lots of examples see these pages:
http://snowball.tartarus.org/algorithms/porter/stemmer.html
http://snowball.tartarus.org/algorithms/english/stemmer.html
This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing
Morphology as an Inference Process" (Krovetz, R., Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 191-203, 1993).
http://ciir.cs.umass.edu/pubfiles/ir-35.pdf
Tom
http://www.hathitrust.org/blogs/large-scale-search
Re: Which stemmer?
Posted by Erick Erickson <er...@gmail.com>.
I'd make it easy for myself. Generate (programmatically) a list like the one
you showed for a _lot_ more terms, send it to your customer, and let _them_
pick. Unfortunately, the customer has no idea what "aggressive" means (for
that matter, I don't know how Porter handles specific words either; I
always have to try it). By putting concrete examples in front of them, and
framing it as "all the words that reduce to the same stem will be
considered matches and returned", you can give them enough info to make a
choice.
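Generating such a list could look roughly like the sketch below. StemTable and render are hypothetical names, and the two "stemmers" are toy stand-ins; in a real run each function would wrap a Lucene analysis chain (Porter, KStem, English minimal), which is left out here so the example needs nothing beyond the JDK.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

final class StemTable {
    // Renders one row per word and one column per named stemmer.
    static String render(List<String> words, Map<String, UnaryOperator<String>> stemmers) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("%-14s", "original"));
        for (String name : stemmers.keySet()) {
            sb.append(String.format("%-14s", name));
        }
        sb.append('\n');
        for (String word : words) {
            sb.append(String.format("%-14s", word));
            for (UnaryOperator<String> stemmer : stemmers.values()) {
                sb.append(String.format("%-14s", stemmer.apply(word)));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Toy stand-ins, not real stemmers.
        Map<String, UnaryOperator<String>> stemmers = new LinkedHashMap<>();
        stemmers.put("identity", w -> w);
        stemmers.put("strip-s",
            w -> (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w);
        System.out.print(render(Arrays.asList("dog", "dogs", "invests"), stemmers));
    }
}
```

Feeding a long word list through this kind of loop gives the customer the concrete examples without anyone having to predict what each stemmer does.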
FWIW,
Erick
Re: Which stemmer?
Posted by Jack Krupansky <ja...@basetechnology.com>.
Another word set to try: invest, investing, investment, investments,
invests, investor, invester, investors, investers.
Also, take a look at EnglishMinimalStemmer (EnglishMinimalStemFilterFactory)
for minimal stemming.
See:
http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemFilterFactory.html
http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemmer.html
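For a feel of what "minimal" stemming means, here is a sketch of a plural-only stripper in the style of an "S-stemmer". The rules below are written from memory as an approximation and are not copied from Lucene's EnglishMinimalStemmer, so the real filter may well differ on some words; the class name PluralStripper is made up.

```java
final class PluralStripper {
    // Removes simple English plural endings and leaves everything else alone.
    static String stem(String w) {
        int n = w.length();
        if (n < 3 || w.charAt(n - 1) != 's') {
            return w; // too short, or not ending in "s"
        }
        char prev = w.charAt(n - 2);
        if (prev == 'u' || prev == 's') {
            return w; // "bus", "class"
        }
        if (prev == 'e') {
            char third = w.charAt(n - 3);
            if (n > 3 && third == 'i') {
                char fourth = w.charAt(n - 4);
                if (fourth != 'a' && fourth != 'e') {
                    return w.substring(0, n - 3) + "y"; // "countries" -> "country"
                }
            }
            if (third == 'i' || third == 'a' || third == 'o' || third == 'e') {
                return w; // leave "-ies", "-aes", "-oes", "-ees" leftovers untouched
            }
        }
        return w.substring(0, n - 1); // plain "-s"
    }

    public static void main(String[] args) {
        for (String w : new String[] {"countries", "runs", "roses", "class", "running"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Note that nothing here touches "-ing" or "-ed", which is why a minimal stemmer leaves forms like "running" and "watered" as they are.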
-- Jack Krupansky
RE: Which stemmer?
Posted by Scott Smith <ss...@mainstreamdata.com>.
Unfortunately, my "use case" is a customer who wants stemming, but has very little knowledge of what that means except they think they want it.
I agree with your last comment. So, here's my contribution:
Original      porter    kstem         minStem
------------  --------  ------------  ------------
country       countri   country       country
run           run       run           run
runs          run       runs          run
running       run       running       running
read          read      read          read
reading       read      reading       reading
reader        reader    reader        reader
association   associ    association   association
associate     associ    associate     associate
listing       list      list          listing
water         water     water         water
watered       water     water         watered
sure          sure      sure          sure
surely        sure      surely        surely
fred's        fred'     fred's        fred'
roses         rose      rose          rose
Still not sure which one to pick. Porter is more aggressive. Min stemmer is pretty minimal. Perhaps the kstemmer is "just right" :-)
Cheers
Scott
Re: Which stemmer?
Posted by Jack Krupansky <ja...@basetechnology.com>.
What is your use case? If you don't have a specific use case in mind, try
each of them with some common words that you expect will or won't be
stemmed. If you have Solr, you can experiment interactively using the Solr
Admin Analysis web page.
It would be nice if the javadoc for each stemmer gave a handful of examples
that illustrated how some common words are stemmed.
-- Jack Krupansky