Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/22 09:32:57 UTC

[GitHub] [lucene] xaviersanchez opened a new pull request #461: LUCENE-10248: Spanish Plural Stemmer

xaviersanchez opened a new pull request #461:
URL: https://github.com/apache/lucene/pull/461


   
   
   # Description
   [Jira ticket](https://issues.apache.org/jira/browse/LUCENE-10248)
   
   This PR adds the `SpanishPluralStemmer`, a new Spanish stemmer that only reduces plural forms to the singular while preserving gender. Our goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.
   
   
   In this [article](https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373) we compare different Spanish stemmers across several use cases and explain what value our contribution adds.
   
   # Solution
   
   
   
   An algorithmic approach that reverses the Spanish rules for building plural forms,
   based on the rules defined in Wikilengua (http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n)).
   ## Characteristics
   * Designed to stem just plural to singular form
   * Distinguishes between masculine and feminine forms
   * Increases recall, although precision can drop depending on the use case or information need
   * Stems plural words of foreign origin
     * e.g. complots, bits, punks, robots
   * Supports invariant words, where the plural and singular forms are the same or a plural does not make sense
     * crisis, jueves, lapsus, abrebotellas, etc
   * Support for special cases
     * yoes, clubes, itemes, faralaes
   * Use it when the distinction between singular and plural is not relevant but gender is
   * Produces meaningful tokens in singular form
     * It avoids odd stems like “amig”. Stemmers are not required to generate grammatically correct tokens, but producing real words lowers the chance of collisions with other words
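
The characteristics above can be illustrated with a small, self-contained sketch. It is not the rule set of the `SpanishPluralStemmer` in this PR: the class name `ToyPluralStemmer`, the three suffix rules, and the singular targets chosen for the special cases are all assumptions made for illustration. It only shows the general shape of the approach: check exception lists first, then undo a few plural-building rules, and keep the gender vowel intact.

```java
import java.util.Map;
import java.util.Set;

/** Toy plural-to-singular reducer. NOT the rule set of Lucene's SpanishPluralStemmer. */
final class ToyPluralStemmer {
    // Invariant words taken from the list above: returned unchanged.
    private static final Set<String> INVARIANT =
        Set.of("crisis", "jueves", "lapsus", "abrebotellas");

    // Special cases from the list above; the singular targets are this sketch's assumption.
    private static final Map<String, String> SPECIAL =
        Map.of("yoes", "yo", "clubes", "club", "itemes", "item");

    private static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0; // accented vowels are ignored in this sketch
    }

    static String stem(String word) {
        if (INVARIANT.contains(word)) return word;
        String special = SPECIAL.get(word);
        if (special != null) return special;

        int len = word.length();
        if (len < 4 || word.charAt(len - 1) != 's') return word;

        // vowel + "ces" -> vowel + "z" (luces -> luz)
        if (word.endsWith("ces") && isVowel(word.charAt(len - 4))) {
            return word.substring(0, len - 3) + "z";
        }
        // vowel + d/l/n/r + "es" -> drop "es" (flores -> flor)
        if (word.endsWith("es")
                && "dlnr".indexOf(word.charAt(len - 3)) >= 0
                && isVowel(word.charAt(len - 4))) {
            return word.substring(0, len - 2);
        }
        // Default: drop only the final "s", so the gender vowel survives
        // (amigos -> amigo, amigas -> amiga) and loanwords work (robots -> robot).
        return word.substring(0, len - 1);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"amigos", "amigas", "robots", "crisis", "clubes", "luces"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

A rule set this small gets many words wrong (it turns "cines" into "cin", for instance), which is why a real implementation needs a much longer list of rules and exceptions.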
   
   # Tests
   
   * TestSpanishPluralStemFilter: tests the stemmer filter itself
     * random strings
     * empty terms
     * a selection of more than 28k Spanish words
   * TestSpanishPluralStemFilterFactory: tests the filter factory
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [ ] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #461: LUCENE-10248: Spanish Plural Stemmer

Posted by GitBox <gi...@apache.org>.
rmuir commented on pull request #461:
URL: https://github.com/apache/lucene/pull/461#issuecomment-975876699


   Hi @xaviersanchez, this contribution looks great.
   
   I'll do another pass on review and give some time for others to review as well.
   
   I did a little investigation at a glance, and I think it is confusing that the current `SpanishMinimalStemmer` is doing aggressive conversions such as `ñ -> n`. I think, as a followup issue, we should `@deprecate` the `SpanishMinimalStemmer` and point users to this one instead? 
   
   `SpanishMinimalStemmer` is not a typical "upstream" algorithm, with academic papers/study from snowball or savoy, and there doesn't seem to be any reason to keep it anymore, except for a legacy index. So we could keep it around for another major release or so but not forever, IMO.
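
To make the concern about `ñ -> n` concrete, here is a minimal sketch of how such folding can merge unrelated words. It uses generic diacritic stripping via `java.text.Normalizer`, which is an assumption chosen for illustration and not how `SpanishMinimalStemmer` is implemented; the class name `FoldingCollisionDemo` is hypothetical.

```java
import java.text.Normalizer;

/** Illustration only: generic diacritic folding, not SpanishMinimalStemmer's code. */
final class FoldingCollisionDemo {
    // Decompose to NFD (so "\u00f1" becomes 'n' + U+0303), then drop combining marks.
    static String foldDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String rock = "pe\u00f1a"; // "peña": rock, crag
        String sorrow = "pena";    // "pena": sorrow
        // After folding, both words map to the same token and can no longer be told apart.
        System.out.println(foldDiacritics(rock).equals(foldDiacritics(sorrow)));
    }
}
```

In Spanish, `ñ` is a separate letter rather than an accented `n`, so folding it changes which word is indexed.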
   




[GitHub] [lucene] rmuir merged pull request #461: LUCENE-10248: Spanish Plural Stemmer

Posted by GitBox <gi...@apache.org>.
rmuir merged pull request #461:
URL: https://github.com/apache/lucene/pull/461


   




[GitHub] [lucene] xaviersanchez commented on pull request #461: LUCENE-10248: Spanish Plural Stemmer

Posted by GitBox <gi...@apache.org>.
xaviersanchez commented on pull request #461:
URL: https://github.com/apache/lucene/pull/461#issuecomment-981735988


   > Hi @xaviersanchez, this contribution looks great.
   > 
   > I'll do another pass on review and give some time for others to review as well.
   > 
   > I did a little investigation at a glance, and I think it is confusing that the current `SpanishMinimalStemmer` is doing aggressive conversions such as `ñ -> n`. I think, as a followup issue, we should `@deprecate` the `SpanishMinimalStemmer` and point users to this one instead?
   > 
   > `SpanishMinimalStemmer` is not a typical "upstream" algorithm, with academic papers/study from snowball or savoy, and there doesn't seem to be any reason to keep it anymore, except for a legacy index. So we could keep it around for another major release or so but not forever, IMO.
   
   Thanks @rmuir for the comment! 
   
   Yes, I agree we could deprecate `SpanishMinimalStemmer` and point users to this implementation, since it can cover the same use cases. We implemented this a while ago, so before contributing the code we analyzed how the different Spanish stemmers behave, to check that we could provide some added value. Our analysis shows that `SpanishMinimalStemmer` has some issues and applies quite aggressive text normalization.
   

