Posted to java-user@lucene.apache.org by Scott Smith <ss...@mainstreamdata.com> on 2012/11/14 19:55:14 UTC

Which stemmer?

Does anyone have any experience with the stemmers?  I know that Porter is what "everyone" uses.  Am I better off with KStemFilter (better performance) or ??  Does anyone understand the differences between the various stemmers and how to choose one over another?

Re: Which stemmer?

Posted by Jack Krupansky <ja...@basetechnology.com>.
Great! For my favorite example of "invest", "invests", etc. it shows:

SnowballEnglish:
•investment
•invest
•invests
•investing
•invested

kStem:
•investors
•invest
•investor
•invests
•investing
•invested

minimalStem:
•invest
•invests

That highlights the distinctions between these stemmers quite well, without 
highlighting the actual indexed term, which can be quite ugly.

-- Jack Krupansky

-----Original Message----- 
From: Elmer van Chastelet
Sent: Wednesday, November 21, 2012 8:49 AM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

I've just created a small web application which you might find useful.
You can see which words are matched by a query word when using different
analyzers  (phonetic and stemming analyzers).
These include snowball, kstem and minimal stem (the ones on the right).

http://dutieq.st.ewi.tudelft.nl/wordsearch/

I can extend the app with more analyzers. Please let me know :)

--Elmer

Example

On 11/14/2012 07:55 PM, Scott Smith wrote:
> Does anyone have any experience with the stemmers?  I know that Porter is 
> what "everyone" uses.  Am I better off with KStemFilter (better 
> performance) or ??  Does anyone understand the differences between the 
> various stemmers and how to choose one over another?
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org 




Re: Which stemmer?

Posted by Elmer van Chastelet <ev...@gmail.com>.
I've just created a small web application which you might find useful.
You can see which words are matched by a query word when using different 
analyzers  (phonetic and stemming analyzers).
These include snowball, kstem and minimal stem (the ones on the right).

http://dutieq.st.ewi.tudelft.nl/wordsearch/

I can extend the app with more analyzers. Please let me know :)

--Elmer

Example

On 11/14/2012 07:55 PM, Scott Smith wrote:
> Does anyone have any experience with the stemmers?  I know that Porter is what "everyone" uses.  Am I better off with KStemFilter (better performance) or ??  Does anyone understand the differences between the various stemmers and how to choose one over another?
>




Re: Which stemmer?

Posted by Michael Sokolov <so...@ifactory.com>.
>
> Does anyone have any experience with the stemmers?  I know that Porter 
> is what "everyone" uses.  Am I better off with KStemFilter (better 
> performance) or ??  Does anyone understand the differences between the 
> various stemmers and how to choose one over another?
We started off using Porter, then switched to KStem since Porter is way 
too aggressive for us (you get a lot of false matches), but KStem seemed 
a little bit too conservative, so we've had to augment it with synonyms.

For example, KStem doesn't seem to reduce plurals in some cases where it 
seems it should - like "bounds" was a problem - it won't match "bound," 
even though many (most) other plurals will match their singular form, 
and verbs get reduced to their stems as well. I thought maybe this was 
because there is also a heteronym (spelled same, different word) that is 
*not* a plural or verb ("bounds" as boundary as in "out of bounds"??), 
but I'm not really sure how KStem's word lists were put together or what 
the goal was.  Maybe this was just an oversight?

YMMV; it depends a lot on what you are trying to achieve.

-Mike



Re: Which stemmer?

Posted by Jack Krupansky <ja...@basetechnology.com>.
One other factor to keep in mind is that the customer should never "look" at 
the actual stem term - such as "countri" or "gener" - because it can freak 
them out a little, for no good reason. I mean, the goal of stemming is to 
show what set of words/terms will be treated as equivalent on a query, and 
this is independent of what gets returned for a stored field. The stem is 
simply the means to THAT end.

The fact that "dog" and "dogs" are not equivalent in KStem is in fact 
disheartening, at least to me, but it may not be problematic in some use 
cases.
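
To make the "set of equivalent words" view concrete, here is a minimal Python sketch (plain Python, not Lucene code). It groups words by the stem each stemmer produced and then throws the stems away, so only the equivalence classes are shown. The stems are copied by hand from the "invest" rows of Scott's table quoted below; the names ROWS, STEMMERS and equivalence_classes are made up for this illustration.

```python
from collections import defaultdict

# Stems copied by hand from the "invest" rows of Scott's table
# (columns: porter, kstem, EngMinStem). Nothing here calls Lucene.
ROWS = {
    "invest":      ("invest",   "invest",     "invest"),
    "investing":   ("invest",   "invest",     "investing"),
    "investment":  ("invest",   "investment", "investment"),
    "investments": ("invest",   "investment", "investment"),
    "invests":     ("invest",   "invest",     "invest"),
    "investor":    ("investor", "invest",     "investor"),
    "invester":    ("invest",   "invest",     "invester"),
    "investors":   ("investor", "invest",     "investor"),
    "investers":   ("invest",   "invest",     "invester"),
}
STEMMERS = ("porter", "kstem", "EngMinStem")


def equivalence_classes(column):
    """Return the groups of words that share a stem in the given column."""
    by_stem = defaultdict(set)
    for word, stems in ROWS.items():
        by_stem[stems[column]].add(word)
    # Sort for stable output; the stems themselves are deliberately dropped.
    return sorted(sorted(words) for words in by_stem.values())


if __name__ == "__main__":
    for i, name in enumerate(STEMMERS):
        print(name)
        for words in equivalence_classes(i):
            print("   ", ", ".join(words))
```

A list in this form can be shown to a customer without ever exposing a stem like "countri".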

-- Jack Krupansky

-----Original Message----- 
From: Scott Smith
Sent: Thursday, November 15, 2012 11:57 AM
To: java-user@lucene.apache.org
Subject: RE: Which stemmer?

Thanks for the suggestions.  I think Erick is correct as well.  I'll let the 
customer decide.

Here's an updated list.  Fyi--the minStem was the English Minimal Stemmer--I 
changed the label.  Interesting to see where the minimal stemmer and porter 
agree (and KStemmer doesn't).  You may also find the "dog" examples 
interesting.  I also found the "invest*" list entertaining.

   original       porter        kstem   EngMinStem
-----------  -----------  -----------  -----------
    country      countri      country      country
  countries      countri      country      country
  country's     country'    country's     country'
        run          run          run          run
       runs          run         runs          run
    running          run      running      running
       read         read         read         read
    reading         read      reading      reading
     reader       reader       reader       reader
association       associ  association  association
  associate       associ    associate    associate
    listing         list         list      listing
      water        water        water        water
    watered        water        water      watered
       sure         sure         sure         sure
     surely         sure       surely       surely
     invest       invest       invest       invest
  investing       invest       invest    investing
 investment       invest   investment   investment
investments       invest   investment   investment
    invests       invest       invest       invest
   investor     investor       invest     investor
   invester       invest       invest     invester
  investors     investor       invest     investor
  investers       invest       invest     invester
organization        organ  organization  organization
   organize        organ     organize     organize
    organic        organ      organic      organic
   generous        gener     generous     generous
    generic        gener      generic      generic
        dog          dog          dog          dog
      dog's         dog'        dog's         dog'
       dogs          dog         dogs          dog
      dogs'          dog         dogs          dog

Now, if someone would answer my question on the Solr list ("Custom Solr 
Indexer/Search"), my day would be complete ;-).

Thanks for the continued help.

Scott

-----Original Message-----
From: Tom Burton-West [mailto:tburtonw@umich.edu]
Sent: Thursday, November 15, 2012 11:06 AM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

I agree with Erick that you probably need to give your client a list of 
concrete examples, and perhaps to explain the trade-offs.

All stemmers both overstem and understem.   Understemming means that some
forms of a word won't get searched.  For example, without stemming, 
searching for "dogs" would not retrieve documents containing the word "dog".
Generally there is a precision/recall tradeoff where reducing understemming 
increases overstemming.  The problem with aggressive stemmers like the 
Porter stemmer, is that they overstem.

The original Porter stemmer for example would stem "organization" and 
"organic" both to "organ" and "generalization", "generous" and "generic" to 
"gener".

For background on the Porter stemmers and lots of examples see these pages:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

This paper on the Kstem stemmer lists cases where the Porter stemmer 
understems or overstems and explains the logic of Kstem: "Viewing Morphology 
as an Inference Process"  (*Krovetz*, R., Proceedings of the Sixteenth 
Annual International ACM SIGIR Conference on Research and Development in 
Information Retrieval, 191-203, 1993).

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Tom

http://www.hathitrust.org/blogs/large-scale-search



RE: Which stemmer?

Posted by Scott Smith <ss...@mainstreamdata.com>.
Thanks for the suggestions.  I think Erick is correct as well.  I'll let the customer decide.

Here's an updated list.  Fyi--the minStem was the English Minimal Stemmer--I changed the label.  Interesting to see where the minimal stemmer and porter agree (and KStemmer doesn't).  You may also find the "dog" examples interesting.  I also found the "invest*" list entertaining.

   original       porter        kstem   EngMinStem
-----------  -----------  -----------  -----------
    country      countri      country      country
  countries      countri      country      country
  country's     country'    country's     country'
        run          run          run          run
       runs          run         runs          run
    running          run      running      running
       read         read         read         read
    reading         read      reading      reading
     reader       reader       reader       reader
association       associ  association  association
  associate       associ    associate    associate
    listing         list         list      listing
      water        water        water        water
    watered        water        water      watered
       sure         sure         sure         sure
     surely         sure       surely       surely
     invest       invest       invest       invest
  investing       invest       invest    investing
 investment       invest   investment   investment
investments       invest   investment   investment
    invests       invest       invest       invest
   investor     investor       invest     investor
   invester       invest       invest     invester
  investors     investor       invest     investor
  investers       invest       invest     invester
organization        organ  organization  organization
   organize        organ     organize     organize
    organic        organ      organic      organic
   generous        gener     generous     generous
    generic        gener      generic      generic
        dog          dog          dog          dog
      dog's         dog'        dog's         dog'
       dogs          dog         dogs          dog
      dogs'          dog         dogs          dog

Now, if someone would answer my question on the Solr list ("Custom Solr Indexer/Search"), my day would be complete ;-).

Thanks for the continued help.

Scott

-----Original Message-----
From: Tom Burton-West [mailto:tburtonw@umich.edu] 
Sent: Thursday, November 15, 2012 11:06 AM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

I agree with Erick that you probably need to give your client a list of concrete examples, and perhaps to explain the trade-offs.

All stemmers both overstem and understem.   Understemming means that some
forms of a word won't get searched.  For example, without stemming, searching for "dogs" would not retrieve documents containing the word "dog".
Generally there is a precision/recall tradeoff where reducing understemming increases overstemming.  The problem with aggressive stemmers like the Porter stemmer, is that they overstem.

The original Porter stemmer for example would stem "organization" and "organic" both to "organ" and "generalization", "generous" and "generic" to "gener".

For background on the Porter stemmers and lots of examples see these pages:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

This paper on the Kstem stemmer lists cases where the Porter stemmer understems or overstems and explains the logic of Kstem: "Viewing Morphology as an Inference Process"  (*Krovetz*, R., Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 191-203, 1993).

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Tom

http://www.hathitrust.org/blogs/large-scale-search





Re: Which stemmer?

Posted by Lance Norskog <go...@gmail.com>.
Nope! This slang term only exists in the plural. The kind of prose with this usage may not follow standard grammatical and spelling rules anyway. Historically, text search has been funded mostly by the US intelligence agencies because they want to analyze formal and technical prose. And, it is coded by people who think in good grammar, and are perfect spellers.

If you find 'too aggressive' and 'too mild' to be a problem, what you want is 'lemmatization' where you work from a dictionary of word forms. Solr supports using Wordnet for this purpose.
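
As a rough illustration of the dictionary idea, here is a minimal Python sketch of lookup-based lemmatization. It is not Solr or WordNet code: the LEMMAS table is a hand-made assumption standing in for a real dictionary of word forms, and lemmatize is a name invented for the example.

```python
# Hand-made table of word forms -> base form. A real lemmatizer would load
# a dictionary resource instead; these entries are illustrative only.
LEMMAS = {
    "feet": "foot",
    "dogs": "dog",
    "ran": "run",
    "running": "run",
    "countries": "country",
}


def lemmatize(word):
    """Look the lowercased word up; unknown words pass through unchanged."""
    key = word.lower()
    return LEMMAS.get(key, key)


if __name__ == "__main__":
    for w in ("Feet", "dogs", "organic"):
        print(w, "->", lemmatize(w))
```

Because nothing is derived by suffix rules, irregular forms such as "feet" are handled, while a word missing from the table (here "organic") is left alone rather than being cut down to "organ".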

Lance

----- Original Message -----
| From: "Igal @ getRailo.org" <ig...@getrailo.org>
| To: java-user@lucene.apache.org
| Sent: Friday, November 16, 2012 4:18:20 PM
| Subject: Re: Which stemmer?
| 
| but if "dogs" are feet (and I guess I fall into the not-perfect group
| here)...  and "feet" is the plural form of "foot", then shouldn't
| "dogs"
| be stemmed to "dog" as a base, singular form?
| 
| 
| 
| On 11/16/2012 2:32 PM, Tom Burton-West wrote:
| > Hi Mike,
| >
| >>> Honestly I've never heard of anyone using "dogs" to mean feet
| >>> either, but
| > hey nobody's perfect.
| >
| > This is really off topic but I couldn't resist.  This usage of
| > "dogs" to
| > mean feet occurs in old blues lyrics such as Blind Lemon
| > Jefferson's "Hot
| > Dogs"
| > http://www.youtube.com/watch?v=v670qVwzm9c
| > (Hard to make out what he's singing on the old 78, but he says
| > his "dogs"
| > is red hot, meaning he can run really fast.)
| > http://jasobrecht.com/blind-lemon-jefferson-star-blues-guitar/
| >
| > Tom
| >
| 
| 


Re: Which stemmer?

Posted by "Igal @ getRailo.org" <ig...@getrailo.org>.
but if "dogs" are feet (and I guess I fall into the not-perfect group 
here)...  and "feet" is the plural form of "foot", then shouldn't "dogs" 
be stemmed to "dog" as a base, singular form?



On 11/16/2012 2:32 PM, Tom Burton-West wrote:
> Hi Mike,
>
>>> Honestly I've never heard of anyone using "dogs" to mean feet either, but
> hey nobody's perfect.
>
> This is really off topic but I couldn't resist.  This usage of "dogs" to
> mean feet occurs in old blues lyrics such as Blind Lemon Jefferson's "Hot
> Dogs"
> http://www.youtube.com/watch?v=v670qVwzm9c
> (Hard to make out what he's singing on the old 78, but he says his "dogs"
> is red hot, meaning he can run really fast.)
> http://jasobrecht.com/blind-lemon-jefferson-star-blues-guitar/
>
> Tom
>




Re: Which stemmer?

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Mike,

>>Honestly I've never heard of anyone using "dogs" to mean feet either, but
hey nobody's perfect.

This is really off topic but I couldn't resist.  This usage of "dogs" to
mean feet occurs in old blues lyrics such as Blind Lemon Jefferson's "Hot
Dogs"
http://www.youtube.com/watch?v=v670qVwzm9c
(Hard to make out what he's singing on the old 78, but he says his "dogs"
is red hot, meaning he can run really fast.)
http://jasobrecht.com/blind-lemon-jefferson-star-blues-guitar/

Tom

Re: Which stemmer?

Posted by Michael Sokolov <so...@ifactory.com>.
On 11/15/2012 1:06 PM, Tom Burton-West wrote:
> This paper on the Kstem stemmer lists cases where the Porter stemmer
> understems or overstems and explains the logic of Kstem: "Viewing
> Morphology as an Inference Process"  (*Krovetz*, R., Proceedings of the
> Sixteenth Annual International ACM SIGIR Conference on Research and
> Development in Information Retrieval, 191-203, 1993).
>
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf
>
Thanks for the reference - that was very enlightening.  The paper 
explains why many terms are not stemmed as one might expect by KStem - 
words that are found in the dictionary, by which I think they mean have 
their own senses whose definitions do not include the stem word, are not 
stemmed by KStem since it assumes that they have their own particular 
meanings, and are not derived *purely by inflection*.

The dictionary they used is the Longman dictionary, which is available 
for free online.  I looked up "dog" 
http://www.ldoceonline.com/dictionary/dog_1 and found that there is a 
sense there (sense 13) whose definition reads:


    dogs

[plural] American English, informal: feet

this sense doesn't mention the stem word "dog" - it clearly has a 
different meaning than the main dog entry, so I guess the thinking 
behind this is: if the person was searching for "dogs" (meaning feet) 
they wouldn't want to find text with "dog" (meaning man's best friend).  
Of course in this case, "dog" singular presumably could mean foot as 
well, so the inference seems faulty, although perhaps that never 
occurs?  Honestly I've never heard of anyone using "dogs" to mean feet 
either, but hey nobody's perfect.

This entry: http://www.ldoceonline.com/dictionary/bound_4 probably 
explains the reason "bounds" doesn't stem to "bound".

In the Lucene KStemmer code, this translates into the word appearing in 
one of the dictionary data files.  If a word appears there (as "dogs" 
and "bounds" do), it won't be stemmed.  I suppose a possible approach 
here would be to send the client the dictionary of non-stemming words 
and let them remove some, but then you'd have to compile your own 
KStemmer variant.

Perhaps a nice feature to add to KStemmer would be to have it read a 
list of exception words at run-time that would be removed from its 
dictionary in order to allow them to be stemmed.
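
A minimal Python sketch of that idea, under loud assumptions: this is not the Lucene KStemmer, the plural rule is deliberately crude, and make_stemmer, DICTIONARY and the unprotect parameter are hypothetical names. It only shows the shape of "protected dictionary words, minus a run-time exception list".

```python
def make_stemmer(protected, unprotect=()):
    """Build a toy stemmer that leaves dictionary words alone.

    protected: words that are normally never stemmed (a stand-in for
               KStem's dictionary data files).
    unprotect: run-time exception list; these words are dropped from the
               protected set so the suffix rule applies to them again.
    """
    keep = set(protected) - set(unprotect)

    def stem(word):
        if word in keep:
            return word
        # Crude plural rule, for illustration only.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            return word[:-1]
        return word

    return stem


DICTIONARY = {"dogs", "bounds"}  # the two examples from this thread

default = make_stemmer(DICTIONARY)
relaxed = make_stemmer(DICTIONARY, unprotect=["dogs"])

if __name__ == "__main__":
    for w in ("dogs", "bounds", "cats"):
        print(w, default(w), relaxed(w))
```

With the exception list, "dogs" stems to "dog" again while "bounds" stays protected, and no recompilation of the word lists is needed.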

-Mike


Re: Which stemmer?

Posted by Tom Burton-West <tb...@umich.edu>.
I agree with Erick that you probably need to give your client a list of
concrete examples, and perhaps to explain the trade-offs.

All stemmers both overstem and understem.   Understemming means that some
forms of a word won’t get searched.  For example, without stemming, searching
for “dogs” would not retrieve documents containing the word “dog”.
Generally there is a precision/recall tradeoff where reducing understemming
increases overstemming.  The problem with aggressive stemmers like the
Porter stemmer, is that they overstem.

The original Porter stemmer for example would stem “organization” and
“organic” both to “organ” and “generalization”, “generous” and “generic” to
“gener”.
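
One way to put numbers on overstemming and understemming is to compare each stemmer's output against a hand-made list of which words should match. The Python sketch below does that for a few words. The PORTER and KSTEM stems are copied from the table Scott posted in this thread; the GOLD grouping is purely my assumption about what a user would want, and conflation_errors is a name invented for the example.

```python
from itertools import combinations

# Stems copied from Scott's table in this thread.
PORTER = {"organization": "organ", "organize": "organ", "organic": "organ",
          "generous": "gener", "generic": "gener", "dog": "dog", "dogs": "dog"}
KSTEM = {"organization": "organization", "organize": "organize",
         "organic": "organic", "generous": "generous", "generic": "generic",
         "dog": "dog", "dogs": "dogs"}

# Assumed "desired" groups: words in the same set should match each other.
GOLD = [{"organization", "organize"}, {"organic"}, {"generous"},
        {"generic"}, {"dog", "dogs"}]


def conflation_errors(stems, gold_groups):
    """Return (overstemmed_pairs, understemmed_pairs) for a word -> stem map."""
    group_of = {w: i for i, group in enumerate(gold_groups) for w in group}
    over, under = [], []
    for a, b in combinations(sorted(stems), 2):
        same_stem = stems[a] == stems[b]
        same_gold = group_of[a] == group_of[b]
        if same_stem and not same_gold:
            over.append((a, b))    # merged, but should not match
        elif same_gold and not same_stem:
            under.append((a, b))   # should match, but kept apart
    return over, under


if __name__ == "__main__":
    for name, stems in (("porter", PORTER), ("kstem", KSTEM)):
        over, under = conflation_errors(stems, GOLD)
        print(name, "overstemmed:", over, "understemmed:", under)
```

On this tiny sample Porter only overstems and KStem only understems, which is the precision/recall tradeoff in miniature; a different gold grouping would of course shift the counts.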

For background on the Porter stemmers and lots of examples see these pages:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing
Morphology as an Inference Process"  (*Krovetz*, R., Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 191-203, 1993).

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Tom

http://www.hathitrust.org/blogs/large-scale-search

Re: Which stemmer?

Posted by Erick Erickson <er...@gmail.com>.
I'd make it easy for myself. Generate (programmatically) a list like the one
you showed for a _lot_ more terms, send it to your customer, and let _them_
pick. Unfortunately, the customer has no idea what "aggressive" means (for
that matter, I don't know how Porter handles specific words either; I always
have to try it). By putting concrete examples in front of them, and framing
it with "all the words that reduce to the same stem will be considered
matches and returned", you can give them enough info to make a choice.
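
A minimal Python sketch of such a generator, with the caveat that it does not call Lucene: the "stemmers" are plain functions, and s_stem is a plural-only rule set written in the style of an S-stemmer (my assumption; it may not match EnglishMinimalStemmer in every case). In practice each column would wrap whichever analyzer is being compared. The names format_table, s_stem and STEMMERS are hypothetical.

```python
def s_stem(word):
    """Plural-only stemming in the style of an S-stemmer (assumed rules)."""
    if len(word) < 3 or not word.endswith("s"):
        return word
    prev = word[-2]
    if prev in "us":
        return word                      # e.g. "glass", "focus"
    if prev == "e":
        if len(word) > 3 and word[-3] == "i" and word[-4] not in "ae":
            return word[:-3] + "y"       # "countries" -> "country"
        if word[-3] in "iaoe":
            return word                  # e.g. "goes" is left alone
    return word[:-1]                     # "runs" -> "run"


def format_table(words, stemmers):
    """Return an aligned text table: one row per word, one column per stemmer."""
    header = ["original"] + list(stemmers)
    rows = [[w] + [fn(w) for fn in stemmers.values()] for w in words]
    width = max(len(cell) for row in [header] + rows for cell in row)
    lines = ["  ".join(cell.rjust(width) for cell in header),
             "  ".join("-" * width for _ in header)]
    lines += ["  ".join(cell.rjust(width) for cell in row) for row in rows]
    return "\n".join(lines)


# Column name -> function. Real use would plug in the analyzers under test.
STEMMERS = {"none": lambda w: w, "s_stem": s_stem}

if __name__ == "__main__":
    print(format_table(["country", "countries", "runs", "running", "roses"],
                       STEMMERS))
```

Feeding it a long word list from the customer's own documents is probably more persuasive than any hand-picked examples.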

FWIW,
Erick


On Wed, Nov 14, 2012 at 9:11 PM, Jack Krupansky <ja...@basetechnology.com> wrote:

> Another word set to try: invest, investing, investment, investments,
> invests, investor, invester, investors, investers.
>
> Also, take a look at EnglishMinimalStemmer
> (EnglishMinimalStemFilterFactory) for minimal stemming.
>
> See:
> http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemFilterFactory.html
> http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemmer.html
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Scott Smith
> Sent: Wednesday, November 14, 2012 5:17 PM
> To: java-user@lucene.apache.org
> Subject: RE: Which stemmer?
>
>
> Unfortunately, my "use case" is a customer who wants stemming, but has
> very little knowledge of what that means except they think they want it.
>
> I agree with your last comment.  So, here's my contribution:
>
>  Original      porter       kstem     minStem
>   -------     -------     -------     -------
>   country     countri     country     country
>       run         run         run         run
>      runs         run        runs         run
>   running         run     running     running
>      read        read        read        read
>   reading        read     reading     reading
>    reader      reader      reader      reader
> association     associ association association
> associate      associ   associate   associate
>   listing        list        list     listing
>     water       water       water       water
>   watered       water       water     watered
>      sure        sure        sure        sure
>    surely        sure      surely      surely
>    fred's       fred'      fred's       fred'
>     roses        rose        rose        rose
>
> Still not sure which one to pick.  Porter is more aggressive.  Min stemmer
> is pretty minimal.  Perhaps the kstemmer is "just right" :-)
>
> Cheers
>
> Scott
>
> -----Original Message-----
> From: Jack Krupansky [mailto:jack@basetechnology.com]
> Sent: Wednesday, November 14, 2012 4:14 PM
> To: java-user@lucene.apache.org
> Subject: Re: Which stemmer?
>
> What is your use case? If you don't have a specific use case in mind, try
> each of them with some common words that you expect will or won't be
> stemmed. If you have Solr, you can experiment interactively using the Solr
> Admin Analysis web page.
>
> It would be nice if the javadoc for each stemmer gave a handful of
> examples that illustrated how some common words are stemmed.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Scott Smith
> Sent: Wednesday, November 14, 2012 10:55 AM
> To: java-user@lucene.apache.org
> Subject: Which stemmer?
>
> Does anyone have any experience with the stemmers?  I know that Porter is
> what "everyone" uses.  Am I better off with KStemFilter (better
> performance) or ??  Does anyone understand the differences between the
> various stemmers and how to choose one over another?
>
>
>
>

Re: Which stemmer?

Posted by Jack Krupansky <ja...@basetechnology.com>.
Another word set to try: invest, investing, investment, investments, 
invests, investor, invester, investors, investers.

Also, take a look at EnglishMinimalStemmer (EnglishMinimalStemFilterFactory) 
for minimal stemming.

See:
http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemFilterFactory.html
http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemmer.html

-- Jack Krupansky

-----Original Message----- 
From: Scott Smith
Sent: Wednesday, November 14, 2012 5:17 PM
To: java-user@lucene.apache.org
Subject: RE: Which stemmer?

Unfortunately, my "use case" is a customer who wants stemming, but has very 
little knowledge of what that means except they think they want it.

I agree with your last comment.  So, here's my contribution:

  Original      porter       kstem     minStem
   -------     -------     -------     -------
   country     countri     country     country
       run         run         run         run
      runs         run        runs         run
   running         run     running     running
      read        read        read        read
   reading        read     reading     reading
    reader      reader      reader      reader
association     associ association association
 associate      associ   associate   associate
   listing        list        list     listing
     water       water       water       water
   watered       water       water     watered
      sure        sure        sure        sure
    surely        sure      surely      surely
    fred's       fred'      fred's       fred'
     roses        rose        rose        rose

Still not sure which one to pick.  Porter is more aggressive.  Min stemmer 
is pretty minimal.  Perhaps the kstemmer is "just right" :-)

Cheers

Scott

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Wednesday, November 14, 2012 4:14 PM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

What is your use case? If you don't have a specific use case in mind, try 
each of them with some common words that you expect will or won't be 
stemmed. If you have Solr, you can experiment interactively using the Solr 
Admin Analysis web page.

It would be nice if the javadoc for each stemmer gave a handful of examples 
that illustrated how some common words are stemmed.

-- Jack Krupansky



RE: Which stemmer?

Posted by Scott Smith <ss...@mainstreamdata.com>.
Unfortunately, my "use case" is a customer who wants stemming, but has very little knowledge of what that means except they think they want it.  

I agree with your last comment.  So, here's my contribution:

  Original      porter       kstem     minStem
   -------     -------     -------     -------
   country     countri     country     country
       run         run         run         run
      runs         run        runs         run
   running         run     running     running
      read        read        read        read
   reading        read     reading     reading
    reader      reader      reader      reader
association     associ association association
 associate      associ   associate   associate
   listing        list        list     listing
     water       water       water       water
   watered       water       water     watered
      sure        sure        sure        sure
    surely        sure      surely      surely
    fred's       fred'      fred's       fred'
     roses        rose        rose        rose

Still not sure which one to pick.  Porter is more aggressive.  Min stemmer is pretty minimal.  Perhaps the kstemmer is "just right" :-)
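
One rough way to put numbers on "more aggressive" is to tally the table above. The sketch below is only an illustration: the rows are copied from the table, while the class name StemTally and its two helper methods are made up for this example and are not part of Lucene. It counts how many of the 16 words each stemmer changed and how many distinct terms it left behind.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Tallies the stemmer comparison table posted in this thread (hypothetical helper). */
class StemTally {
    // Rows copied from the table: original, porter, kstem, minStem.
    static final String[][] ROWS = {
        {"country", "countri", "country", "country"},
        {"run", "run", "run", "run"},
        {"runs", "run", "runs", "run"},
        {"running", "run", "running", "running"},
        {"read", "read", "read", "read"},
        {"reading", "read", "reading", "reading"},
        {"reader", "reader", "reader", "reader"},
        {"association", "associ", "association", "association"},
        {"associate", "associ", "associate", "associate"},
        {"listing", "list", "list", "listing"},
        {"water", "water", "water", "water"},
        {"watered", "water", "water", "watered"},
        {"sure", "sure", "sure", "sure"},
        {"surely", "sure", "surely", "surely"},
        {"fred's", "fred'", "fred's", "fred'"},
        {"roses", "rose", "rose", "rose"},
    };
    static final String[] NAMES = {"Original", "porter", "kstem", "minStem"};

    /** Number of rows where the given column differs from the original word. */
    static int changed(int col) {
        int n = 0;
        for (String[] row : ROWS) {
            if (!row[col].equals(row[0])) n++;
        }
        return n;
    }

    /** Number of distinct terms the given column produces for the 16 words. */
    static int distinct(int col) {
        Set<String> terms = new LinkedHashSet<>();
        for (String[] row : ROWS) terms.add(row[col]);
        return terms.size();
    }

    public static void main(String[] args) {
        // Column 0 is the original word, so start at column 1.
        for (int col = 1; col < NAMES.length; col++) {
            System.out.printf("%-8s changed=%2d distinct=%2d%n",
                    NAMES[col], changed(col), distinct(col));
        }
    }
}
```

On this sample, porter changes 11 of the 16 words and leaves 10 distinct terms. kstem and minStem each change only 3 words and leave 15 distinct terms, although they change different words (kstem: listing, watered, roses; minStem: runs, fred's, roses). Sixteen words is a small sample, so treat this as a hint rather than a benchmark.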

Cheers

Scott




Re: Which stemmer?

Posted by Jack Krupansky <ja...@basetechnology.com>.
What is your use case? If you don't have a specific use case in mind, try 
each of them with some common words that you expect will or won't be 
stemmed. If you have Solr, you can experiment interactively using the Solr 
Admin Analysis web page.

It would be nice if the javadoc for each stemmer gave a handful of examples 
that illustrated how some common words are stemmed.
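
If Solr is not at hand, the same "try each of them" experiment can be sketched as a small harness that runs a word list through several stemmers and prints the results side by side. The version below is a minimal sketch, not Lucene code: each stemmer is modeled as a plain UnaryOperator<String>, and the two suffix strippers are toy stand-ins invented for this example so that it runs without the library. In a real test you would replace them with functions that push each word through an analyzer chain ending in PorterStemFilter, KStemFilter, or EnglishMinimalStemFilter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Side-by-side stemmer comparison harness (hypothetical; stemmers are pluggable). */
class StemCompare {
    /** Toy stand-in, NOT a real stemmer: drops one trailing "s" from longer words. */
    static String stripPlural(String w) {
        return (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss"))
                ? w.substring(0, w.length() - 1) : w;
    }

    /** Toy stand-in, NOT Porter or KStem: drops "ing" or "ed", else falls back to stripPlural. */
    static String stripSuffixes(String w) {
        for (String suffix : new String[] {"ing", "ed"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return stripPlural(w);
    }

    /** Builds a header row, then one row per word with each stemmer's output. */
    static String table(Map<String, UnaryOperator<String>> stemmers, String... words) {
        StringBuilder sb = new StringBuilder(String.format("%12s", "Original"));
        for (String name : stemmers.keySet()) sb.append(String.format("%12s", name));
        sb.append('\n');
        for (String w : words) {
            sb.append(String.format("%12s", w));
            for (UnaryOperator<String> s : stemmers.values()) {
                sb.append(String.format("%12s", s.apply(w)));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the columns in insertion order.
        Map<String, UnaryOperator<String>> stemmers = new LinkedHashMap<>();
        stemmers.put("plural", StemCompare::stripPlural);
        stemmers.put("suffixes", StemCompare::stripSuffixes);
        System.out.print(table(stemmers, "run", "runs", "running", "watered", "roses"));
    }
}
```

The toy strippers are deliberately crude (the second one turns "running" into "runn"). The point is the harness: pick words you expect to be stemmed and words you expect to be left alone, then compare the columns.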

-- Jack Krupansky

