Posted to dev@lucene.apache.org by "Andriy Rysin (JIRA)" <ji...@apache.org> on 2016/06/21 03:52:57 UTC

[jira] [Commented] (LUCENE-7348) Add dynamic stemmer for Ukrainian

    [ https://issues.apache.org/jira/browse/LUCENE-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341014#comment-15341014 ] 

Andriy Rysin commented on LUCENE-7348:
--------------------------------------

[~mikemccand] Hey Michael,
I've analyzed the inflection rules we have in the dict_uk project (https://github.com/arysin/dict_uk): there are ~4500 of them (most are simple matches, but some are regexps). Those rules cover almost all possible affixes. I can probably drop the rare and homonymous ones to get below 4k, but then the question is where to go next:
1) Keeping all the rules would be nice, as it would provide high accuracy and a high level of compatibility with the dictionary-based lemmatizer created in LUCENE-7287 (we could probably even make a hybrid solution).
2) A smaller/simpler rule set would benefit performance (but to simplify it properly we would have to analyze the frequency/importance of each rule).
3) Is lemmatization good enough, or is stemming preferred? For real stemming we would have to do more work on the rules to find the (pseudo)root for each inflection rule.
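
To make the trade-off above concrete, here is a minimal sketch of what rule-based lemmatization without a dictionary could look like. It is written in Python only for brevity. The four rules, the `RULES` table, and the `guess_lemma` function are all hypothetical and invented for illustration; they are not taken from dict_uk and do not reflect its rule format. The sketch assumes an ordered list of rules where the first match wins.

```python
import re

# Velar alternation in the dative/locative singular: г->з, к->ц, х->с.
# This maps the alternated consonant back to the original one.
_VELAR = {"з": "г", "ц": "к", "с": "х"}

# Hypothetical, illustrative rules (NOT from dict_uk).
# Each rule is (compiled pattern, replacement); rules are tried in order
# and the first one that matches wins.
RULES = [
    (re.compile(r"ами$"), "а"),  # книгами -> книга (simple match)
    (re.compile(r"ою$"), "а"),   # книгою  -> книга (simple match)
    # Regexp rule with a computed replacement: руці -> рука, нозі -> нога
    (re.compile(r"([зцс])і$"), lambda m: _VELAR[m.group(1)] + "а"),
    (re.compile(r"и$"), "а"),    # книги   -> книга (simple match)
]


def guess_lemma(word: str) -> str:
    """Return the lemma produced by the first matching rule, or the word itself."""
    for pattern, replacement in RULES:
        lemma, count = pattern.subn(replacement, word)
        if count:
            return lemma
    return word


if __name__ == "__main__":
    for form in ["книгами", "книгою", "книзі", "руці", "книги", "стіл"]:
        print(form, "->", guess_lemma(form))
```

Even this toy rule set over-generates: the alternation rule would turn "козі" into "кога" although the correct lemma is "коза". That is the kind of conflict where knowing the frequency/importance of each rule (point 2) or falling back to the dictionary (the hybrid idea in point 1) would matter.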

I looked at the existing light stemmers, and many are very basic. It looks like we would be going in reverse, and I am trying to understand whether, already having a complex solution, we want to make it simpler (it looks like the only benefit would be performance). I also tried to google how to do stemming "right", but nothing serious turned up, especially nothing applicable to Slavic languages.

Thanks.


> Add dynamic stemmer for Ukrainian
> ---------------------------------
>
>                 Key: LUCENE-7348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7348
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Andriy Rysin
>            Priority: Minor
>              Labels: analysis, language
>
> We're adding a dictionary-based lemmatizing analyzer for Ukrainian in https://issues.apache.org/jira/browse/LUCENE-7287.
> It would be nice to have a dynamic stemmer that can handle words that are not in the dictionary.


