Posted to dev@lucene.apache.org by "Adrien Gallou (JIRA)" <ji...@apache.org> on 2019/07/28 19:46:00 UTC

[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

     [ https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Gallou updated LUCENE-8937:
----------------------------------
       Attachment: 0002-check-if-the-last-character-is-a-letter-before-remov.patch
                   0001-adds-test-cases-on-french-minimal-stemmer.patch
                   SOLR-8937.patch
    Lucene Fields: New,Patch Available  (was: New)
           Status: Open  (was: Open)

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ----------------------------------------------------------------
>
>                 Key: LUCENE-8937
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8937
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Gallou
>            Priority: Major
>         Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch, 0002-check-if-the-last-character-is-a-letter-before-remov.patch, SOLR-8937.patch
>
>
> Here is the discussion on the mailing list: [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
> characters are identical.
> We can see that here:
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> In the light stemmer, a check prevents the token from being altered if it
> is a number.
> The minimal stemmer also removes the last character of a word if the last
> two characters are identical.
> We can see that here:
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> But the minimal stemmer does not check whether that last character is a
> letter.
> So numeric tokens whose last two characters are identical are altered.
> For example, "1234567899" is stemmed to "123456789".
> It would be great if such tokens were left unaltered.
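
The behaviour described above can be illustrated with a small standalone
sketch. This is not the code of FrenchMinimalStemmer or of the attached
patches: the class and method names below are hypothetical, and only the
"last two characters identical" rule is modelled. The guard uses
Character.isLetter, which is one plausible way to implement the check named
in the second patch title.

```java
// Standalone sketch: models only the doubled-last-character rule.
// Class and method names are hypothetical, not part of Lucene.
class DoubledSuffixSketch {

    // Rule as described in the issue: drop the last character when the
    // last two characters are identical, with no check on what it is.
    static int stripUnguarded(char[] s, int len) {
        if (len > 1 && s[len - 1] == s[len - 2]) {
            len--;
        }
        return len;
    }

    // Same rule, but applied only when the last character is a letter,
    // so tokens ending in repeated digits keep their full length.
    static int stripGuarded(char[] s, int len) {
        if (len > 1 && Character.isLetter(s[len - 1]) && s[len - 1] == s[len - 2]) {
            len--;
        }
        return len;
    }

    // Convenience wrappers that work on strings instead of char buffers.
    static String unguarded(String token) {
        char[] buf = token.toCharArray();
        return new String(buf, 0, stripUnguarded(buf, buf.length));
    }

    static String guarded(String token) {
        char[] buf = token.toCharArray();
        return new String(buf, 0, stripGuarded(buf, buf.length));
    }

    public static void main(String[] args) {
        System.out.println(unguarded("1234567899")); // 123456789
        System.out.println(guarded("1234567899"));   // 1234567899
        System.out.println(guarded("abb"));          // ab
    }
}
```

With the guard in place, a token ending in a doubled letter is still
shortened, while the numeric token from the example keeps all ten digits.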



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
