Posted to issues@lucene.apache.org by "Xavier Sanchez Loro (Jira)" <ji...@apache.org> on 2021/11/22 09:30:00 UTC

[jira] [Created] (LUCENE-10248) Add SpanishPluralStemFilter

Xavier Sanchez Loro created LUCENE-10248:
--------------------------------------------

             Summary: Add SpanishPluralStemFilter
                 Key: LUCENE-10248
                 URL: https://issues.apache.org/jira/browse/LUCENE-10248
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
    Affects Versions: 9.0
            Reporter: Xavier Sanchez Loro


We propose a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. Our goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.

In the following [article|https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373] we compare several Spanish stemmers across different use cases and describe the value our contribution adds.

Our solution is an algorithmic approach based on the Spanish rules for forming plurals, as defined in [wikilengua|http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n)].

Some characteristics:
 * Designed to stem just plural to singular form
 * Distinguishes between masculine and feminine forms
 * It will increase recall but precision can be reduced depending on the use case/information need
 * Stems plural words of foreign origin: e.g. complots, bits, punks, robots
 * Support for invariant words, whose plural and singular forms are identical or for which a plural makes no sense: e.g. crisis, jueves, lapsus, abrebotellas, etc.
 * Support for special cases: e.g. yoes, clubes, itemes, faralaes
 * Use it when the distinction between singular and plural is not relevant but gender is relevant
 * Produces meaningful tokens in singular form
 ** No odd stems like “amig”: it’s true that stemmers need not generate grammatically correct tokens, but generating correct stems reduces the chance of collisions with other words
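
To make the characteristics above concrete, here is a minimal sketch of how such rule ordering could look. It is not the proposed SpanishPluralStemmer: the class name PluralStemSketch, the tiny word lists, and the suffix heuristic are all assumptions made for illustration, and the heuristic is deliberately naive (it does not restore accents, and it will mis-stem some words, for example "cines"). Input is assumed to be already lowercased.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the stemmer proposed in this issue.
final class PluralStemSketch {

    // Assumed sample of invariant words (taken from the examples in this issue).
    private static final Set<String> INVARIANT =
        Set.of("crisis", "jueves", "lapsus", "abrebotellas");

    // Assumed sample of special cases that the general rules below would get wrong.
    private static final Map<String, String> SPECIAL =
        Map.of("yoes", "yo", "clubes", "club");

    private static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    static String stem(String word) {
        int len = word.length();
        // Very short words and words not ending in 's' are left untouched.
        if (len < 3 || word.charAt(len - 1) != 's') {
            return word;
        }
        // 1. Invariant words keep their form.
        if (INVARIANT.contains(word)) {
            return word;
        }
        // 2. Special cases are resolved by lookup.
        String special = SPECIAL.get(word);
        if (special != null) {
            return special;
        }
        // 3. Naive heuristic: vowel + (d|l|n|r|j) + "es" drops "es"
        //    (ciudades -> ciudad, flores -> flor), while padres -> padre
        //    falls through to rule 4 because 'r' follows a consonant.
        if (len >= 5 && word.endsWith("es")
                && "dlnrj".indexOf(word.charAt(len - 3)) >= 0
                && isVowel(word.charAt(len - 4))) {
            return word.substring(0, len - 2);
        }
        // 4. Otherwise drop the final 's'. This keeps gender (amigos -> amigo,
        //    amigas -> amiga) and covers foreign plurals (robots -> robot).
        return word.substring(0, len - 1);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"amigos", "amigas", "robots", "crisis", "clubes", "ciudades"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Checking invariant words and special cases before any suffix rule is what keeps "crisis" intact and maps "clubes" to "club" rather than "clube". The output tokens stay real singular words ("amigo", "amiga") instead of a shared stem like "amig".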


