Posted to java-user@lucene.apache.org by Adrien Gallou <ad...@gmail.com> on 2019/07/23 12:53:57 UTC

Question about the light and minimal French stemmers

Hi,

I'm using both light and minimal French stemmers and encountered an issue
when using the minimal stemmer.

The light stemmer removes the last character of a word if the last two
characters are identical.
We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
The light stemmer guards this rule with a check, so that numeric tokens
are left unaltered.

The minimal stemmer also removes the last character of a word if the last
two characters are identical.
We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77

But the minimal stemmer does not check whether the character is a letter.
As a result, numeric tokens whose last two characters are identical are
altered: they lose their final digit.
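
To make the difference concrete, here is a minimal standalone sketch of the doubled-character rule described above, with and without a letter check. It is not the Lucene source: the class DoubledCharDemo and its methods are hypothetical names made up for this illustration, and it leaves out every other stemming rule. I am assuming that a Character.isLetter guard is the kind of check that would be wanted here.

```java
// Standalone illustration only: NOT the Lucene source.
// Class and method names are hypothetical.
class DoubledCharDemo {

  // Mimics only the rule discussed above: drop the last character
  // when the last two characters are identical, with no letter check.
  static int stripWithoutLetterCheck(char[] s, int len) {
    if (len >= 2 && s[len - 1] == s[len - 2]) {
      len--;
    }
    return len;
  }

  // Same rule, guarded by Character.isLetter so that digits are left alone.
  static int stripWithLetterCheck(char[] s, int len) {
    if (len >= 2 && s[len - 1] == s[len - 2] && Character.isLetter(s[len - 1])) {
      len--;
    }
    return len;
  }

  // Convenience wrapper: applies one of the two variants to a token.
  static String apply(String token, boolean letterCheck) {
    char[] s = token.toCharArray();
    int len = letterCheck
        ? stripWithLetterCheck(s, s.length)
        : stripWithoutLetterCheck(s, s.length);
    return new String(s, 0, len);
  }

  public static void main(String[] args) {
    System.out.println(apply("1234555", false)); // prints 123455 (digit lost)
    System.out.println(apply("1234555", true));  // prints 1234555 (unchanged)
    System.out.println(apply("personn", true));  // prints person
  }
}
```

With the guard in place, a token such as 1234555 is passed through unchanged, while a word ending in a doubled letter is still shortened.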

Is there a reason for this?
Should I file an issue on Jira to add this check?

Thanks,

Adrien Gallou