Posted to java-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2004/04/22 23:14:42 UTC

Stemmer Benefits/Costs

I've been experimenting with the Porter and Snowball stemmers.  It seems to me that one of the most valuable benefits these provide is the capability to generalize phrase terms.  As a very simple example, without the stemmer, I might need to include three phrase terms in my query: "north korea", "north korean", "north koreans".  But with the stemmer only one will suffice.  To me, that's a huge advantage.  (For non-phrases, the advantage doesn't seem to be so great, because much the same effect can be achieved with wildcards.)

But there seems to be a price that you also pay, in that discrimination may be adversely affected.  If you want to discriminate between two terms that the stemmer views as derived from the same root, you're out of luck (I think).  The problem with this is that you may start with a set of terms that don't have this problem, but over time as new content is added to the index, such problems may gradually get introduced - often unpredictably.  And to the best of my (admittedly limited) knowledge, once you've indexed using a stemmer, there's no way to override it in specific instances.

Appreciate any comments, thoughts on the above.

Regards,

Terry

Re: Stemmer Benefits/Costs

Posted by Terry Steichen <te...@net-frame.com>.

Andrzej,

Sorry for misspelling your name.  My Polish sucks.

Terry


Re: Stemmer Benefits/Costs

Posted by Andrzej Bialecki <ab...@getopt.org>.

Terry Steichen wrote:

> So, Andrez - Thank you for your comments - what you say makes a good deal of
> sense.  When you have lots of different inflections that all share the same
> root, stemming can clearly provide significant (recall) benefits (in terms
> of catching hidden words and/or simplifying the query).
> 
> However, would you say that "from the perspective of English" ("with its
> minimal inflection") the points I raise are correct?  (You seem to say so
> with the statement that stemming "usually improves recall, but lowers
> precision.")
> 
> And, would you expect significant benefits from the Egothor project code
> (versus Snowball/Porter) when the text is in English (as opposed to a highly
> inflectional language like Polish)?

I did only minimal testing with English (and somewhat more extensive 
testing with Scandinavian languages). Results were also promising, if 
somewhat unexpected.

The unexpected part comes from the realization that stemming doesn't 
have to produce any real "root" as long as it provides you with a unique 
key for all inflected forms derived from the same base form (lemma). So, 
from this point of view it's perfectly ok if you get "blurfl" as a stem 
of "give", as long as you get the same "blurfl" for "gave, given, gives, 
giving", and for nothing else. Think of it like a hashCode() ...
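
A minimal sketch of the hash-key idea above, assuming a hand-written lookup table in place of a real stemmer. The table STEM_KEYS and the helpers stem_key, build_index and search are hypothetical names for illustration only; they are not part of Lucene, Egothor, Snowball or Porter.

```python
from collections import defaultdict

# Hypothetical lookup table: every inflected form of "give" maps to the
# same opaque key. The key does not have to be a real word.
STEM_KEYS = {
    "give": "blurfl",
    "gave": "blurfl",
    "given": "blurfl",
    "gives": "blurfl",
    "giving": "blurfl",
}


def stem_key(token):
    """Return the opaque key for a token, or the lowercased token if unknown."""
    token = token.lower()
    return STEM_KEYS.get(token, token)


def build_index(docs):
    """Map each stem key to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem_key(token)].add(doc_id)
    return index


def search(index, word):
    """Look up a query word through the same key function used at index time."""
    return sorted(index.get(stem_key(word), set()))


docs = {
    1: "she gave a talk",
    2: "giving feedback is hard",
    3: "a gift for you",
}
index = build_index(docs)
print(search(index, "give"))  # [1, 2]
```

The query "give" finds both "gave" and "giving" even though "blurfl" means nothing; all that matters is that indexing and searching apply the same key function.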

Egothor's stemmer package is not as abstract as in this example - 
usually (> 90% for English? ~70% for Polish) the stems that it produces 
correspond to real base forms (lemmas). Its algorithm is based on state 
machines with memory, stored in a trie, in the form of patch commands. 
This allows it to handle not only suffix-based inflection but also 
prefix/infix. Stemming tables are learned from training corpora, which 
consist of base and inflected forms. The resulting state machine binary 
for Polish weighs around 300kB. For English it's smaller - somewhere 
around 100kB. In my experiments the state machine for English was able 
to find correct lemmas more often than other types of stemmers (sorry 
for such a poorly qualified statement - as I said, my testing for 
English was not very systematic). Over/under-stemming didn't occur as 
often. So, my cautious advice would be to give it a try.. :-)

Now, if you again use the analogy of hash code, inevitably some 
collisions will occur during stemming. That is, some inflected forms 
that correspond to different lemmas (roots) will be reduced to the same 
stem. This is where you lose precision. Also, in some cases stemmers will 
produce two stems from a group of words having the same lemma. This is 
where you lose recall (because now the "stem" covers only a subset of 
all possible inflections).
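
The two failure modes can be put into numbers with a toy example. This is only a sketch under stated assumptions: the FORMS table is hypothetical data (each form paired with its true lemma and with the stem an imperfect stemmer might assign), and every "document" is a single word.

```python
# Hypothetical data: surface form -> (true lemma, stem from a toy stemmer).
FORMS = {
    "universe":   ("universe",   "univers"),
    "universes":  ("universe",   "univers"),
    "university": ("university", "univers"),  # collision with "universe"
    "run":        ("run",        "run"),
    "running":    ("run",        "run"),
    "ran":        ("run",        "ran"),      # split off from "run"
}


def retrieve(query):
    """Forms an index would return: everything sharing the query's stem."""
    stem = FORMS[query][1]
    return {form for form, (_, form_stem) in FORMS.items() if form_stem == stem}


def relevant(query):
    """Forms that should be returned: everything sharing the query's lemma."""
    lemma = FORMS[query][0]
    return {form for form, (form_lemma, _) in FORMS.items() if form_lemma == lemma}


def precision_recall(query):
    """Precision and recall of stem-based retrieval, rounded to 2 places."""
    got, want = retrieve(query), relevant(query)
    hits = len(got & want)
    return round(hits / len(got), 2), round(hits / len(want), 2)


print(precision_recall("universe"))  # (0.67, 1.0)
print(precision_recall("run"))       # (1.0, 0.67)
```

The collision costs precision for "universe" and the split costs recall for "run". A lemma hit by both defects at once would lose on both measures.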

This leads to an interesting conclusion: by blindly using stemmers 
(which may not be suitable for your corpus, or the document's language), 
in one sweep you can nicely lower BOTH the precision and recall 
measures! Clearly, not what one would expect or desire ... :-)

To summarize: IMHO indiscriminate use of stemming, soundexing and other 
language-specific techniques in general is more likely to reduce the 
quality of your results. However, when applied correctly, for a 
well-known corpus and use cases, it can bring a significant increase in 
recall, and only a minimal penalty in precision.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Stemmer Benefits/Costs

Posted by Terry Steichen <te...@net-frame.com>.

So, Andrez - Thank you for your comments - what you say makes a good deal of
sense.  When you have lots of different inflections that all share the same
root, stemming can clearly provide significant (recall) benefits (in terms
of catching hidden words and/or simplifying the query).

However, would you say that "from the perspective of English" ("with its
minimal inflection") the points I raise are correct?  (You seem to say so
with the statement that stemming "usually improves recall, but lowers
precision.")

And, would you expect significant benefits from the Egothor project code
(versus Snowball/Porter) when the text is in English (as opposed to a highly
inflectional language like Polish)?

Regards,

Terry


Re: Stemmer Benefits/Costs

Posted by Andrzej Bialecki <ab...@getopt.org>.

Terry Steichen wrote:

> I've been experimenting with the Porter and Snowball stemmers.  It
> seems to me that one of the most valuable benefits these provide is
> the capability to generalize phrase terms.  As a very simple example,
> without the stemmer, I might need to include three phrase terms in my
> query: "north korea", "north korean", "north koreans".  But with the
> stemmer only one will suffice.  To me, that's a huge advantage.  (For
> non-phrases, the advantage doesn't seem to be so great, because much
> the same effect can be achieved with wildcards.)

That's because you look at it from the perspective of the English 
language with its minimal inflection... My mother tongue is Polish - a 
highly inflectional language from the Slavic family of languages. It is 
normal for a single Polish word to have as many as 20+ different 
inflected forms (plural/singular/dual, tense, gender, mood, case, 
infinitive... enough? ;-) ). For this type of language, studies show 
that stemming (or rather lemmatization - bringing words to their base 
grammatical forms) significantly improves recall in IR systems.

> 
> But there seems to be a price that you also pay, in that
> discrimination may be adversely affected.  If you want to
> discriminate between two terms that the stemmer views as derived from
> the same root, you're out of luck (I think).  The problem with this

Stemming usually improves recall, but lowers precision. For some systems 
it is more desirable to provide any results, even if they are not quite 
correct, than to provide none.

> is that you may start with a set of terms that don't have this
> problem, but over time as new content is added to the index, such
> problems may gradually get introduced - often unpredictably.  And to
> the best of my (admittedly limited) knowledge, once you've indexed
> using a stemmer, there's no way to override it in specific instances.

You can always store stemmed and non-stemmed terms alongside each other 
in your index.
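
A rough sketch of that idea, assuming a tiny in-memory index with two fields instead of a real Lucene index. TOY_STEMS, toy_stem, build_index and search are hypothetical names for illustration; the stem table stands in for whatever stemmer is actually used.

```python
from collections import defaultdict

# Hypothetical stem table standing in for a real stemmer.
TOY_STEMS = {"korean": "korea", "koreans": "korea"}


def toy_stem(token):
    """Return the toy stem of a token, or the token itself if unknown."""
    return TOY_STEMS.get(token, token)


def build_index(docs):
    """Index every token twice: once as written, once stemmed."""
    index = {"exact": defaultdict(set), "stemmed": defaultdict(set)}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index["exact"][token].add(doc_id)
            index["stemmed"][toy_stem(token)].add(doc_id)
    return index


def search(index, word, exact=False):
    """Search the stemmed field by default, or the exact field on request."""
    word = word.lower()
    if exact:
        return sorted(index["exact"].get(word, set()))
    return sorted(index["stemmed"].get(toy_stem(word), set()))


docs = {
    1: "north korea",
    2: "north korean border",
    3: "north koreans",
}
index = build_index(docs)
print(search(index, "korean"))              # [1, 2, 3]
print(search(index, "korean", exact=True))  # [2]
```

The stemmed field gives the broad match, and the exact field lets a query override the stemmer in specific instances, at the cost of a larger index.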

> 
> Appreciate any comments, thoughts on the above.

For highly inflectional languages I had _very_ good results with 
stemmers built using the code from the Egothor project 
(http://www.egothor.org) - much more sophisticated than simple 
rule-based stemmers like Snowball or Porter. In fact, after proper 
training on a large corpus I was getting ~70% correct lemmas for 
previously unseen words, and over 90% correct (unique) stems.

-- 
Best regards,
Andrzej Bialecki

