Posted to dev@lucene.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2014/02/17 15:20:20 UTC

[jira] [Commented] (LUCENE-5421) MorfologicFilter doesn't stem legitimate uppercase terms (surnames, proper nouns, etc.)

    [ https://issues.apache.org/jira/browse/LUCENE-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903263#comment-13903263 ] 

Dawid Weiss commented on LUCENE-5421:
-------------------------------------

Sorry for the delay. I looked at your patch. The problem here is not in the code but in the dictionary. Morfologik dictionaries contain only "correct" word forms, so they are not directly suitable for spell checking, which is what you're trying to do here. The test case you commented out is exactly about that. Consider this:

     assertAnalyzesTo(a, "Aarona",   new String[] { "Aaron" });
     assertAnalyzesTo(a, "aarona",   new String[] { "aarona" });

In the first case, the dictionary indeed contains a word form "Aarona", which then gets stemmed to the root form "Aaron". Here's a dictionary excerpt:

Aarona	Aaron	subst:sg:acc:m1+subst:sg:gen:m1

In the second case, the word form "aarona" is *not* in the dictionary (because Aaron is a proper name and should be capitalized in Polish), so the lookup finds no entries. Because this component is a filter, it simply passes the token through unchanged ("aarona").
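To make the two cases concrete, here is a minimal sketch of the lookup order described in the issue (the token as given, then its lowercase form, otherwise pass-through). It is a toy model, not the real MorfologikFilter: the class name ToyStemLookup, the method analyze and the one-entry map are hypothetical, and the map stands in for the FSA dictionary using only the excerpt above.

```java
import java.util.Locale;
import java.util.Map;

// Toy stand-in for the dictionary lookup; NOT the real MorfologikFilter.
class ToyStemLookup {
    // Hypothetical one-entry dictionary: word form -> root form (from the excerpt).
    static final Map<String, String> DICT = Map.of("Aarona", "Aaron");

    // Try the token as given, then lowercased; return it unchanged on a miss.
    static String analyze(String token) {
        String stem = DICT.get(token);
        if (stem == null) {
            stem = DICT.get(token.toLowerCase(Locale.ROOT));
        }
        return stem != null ? stem : token;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Aarona")); // dictionary hit
        System.out.println(analyze("aarona")); // miss, token passes through
    }
}
```

Neither lookup ever produces the capitalized key "Aarona" from the input "aarona", which is why the lowercase token is left as-is.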

Having reconsidered this issue, I think the filter should not be changed the way you proposed (by also checking the form with an uppercase first letter). "poznania" is a good example of where this goes wrong: there is a clear grammatical difference between "Poznania" and "poznania", and this filter shouldn't blur it.

You have two choices:

1) You can recompile Morfologik's FSA dictionary with all word forms lowercased (or otherwise normalized); the filter will then find "sienkiewicza" and other words that are capitalized in the dictionary. This isn't as complicated as it seems: check out morfologik-stemming from GitHub, build it, dump the pl.dict data in raw format, edit it and recompile the FSA.

2) You can write a filter that tries to correct misspelled words. Solutions here range from very simple unigram analysis (morfologik-stemming actually has a class called Speller, which helps here out of the box) to context-aware analysis, which is probably better and will yield more sensible corrections.
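For option 1, the "edit" step could look roughly like the sketch below. It assumes the raw dump is tab-separated with the inflected form in the first column, as in the excerpt above; the actual dump layout should be checked before relying on this. The class name LowercaseForms and its methods are hypothetical, and the dump and recompile steps themselves (done with the morfologik-stemming tools) are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of the editing step only: lowercase the word-form column of a raw dump.
class LowercaseForms {
    // Lowercase the first tab-separated column; root form and tags stay untouched.
    static String normalizeLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return line; // no columns found, leave the line as it is
        }
        return line.substring(0, tab).toLowerCase(Locale.ROOT) + line.substring(tab);
    }

    static List<String> normalize(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(normalizeLine(line));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(normalizeLine("Aarona\tAaron\tsubst:sg:acc:m1+subst:sg:gen:m1"));
    }
}
```

Lowercasing can make previously distinct forms collide (the "Poznania"/"poznania" case), so the normalized dictionary deliberately gives up that distinction. The edited lines may also need re-sorting or de-duplication before the FSA is recompiled.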

> MorfologicFilter doesn't stem legitimate uppercase terms (surnames, proper nouns, etc.)
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5421
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5421
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.6.1
>            Reporter: Grzegorz Sobczyk
>            Assignee: Dawid Weiss
>            Priority: Minor
>
> Morfologic filter search stems in input or lowercase format:
> org.apache.lucene.analysis.morfologik.MorfologikFilter.incrementToken()
> {code}
> if (!keywordAttr.isKeyword() && (lookupSurfaceForm(termAtt) || lookupSurfaceForm(toLowercase(termAtt)))) {
>   [...]
> }
> {code}
> In this situation, if input token is *sienkiewicza* - it isn't stemmed
> but: *Sienkiewicza* --> *Sienkiewicz*
> for comparison:
> *pRoDuKtY* --> *produkt*
> It should stem also input token with capitalized first letter


