Posted to dev@lucene.apache.org by "Lukas Vlcek (JIRA)" <ji...@apache.org> on 2013/07/16 19:44:59 UTC

[jira] [Commented] (LUCENE-4311) HunspellStemFilter returns another values than Hunspell in console / command line with same dictionaries.

    [ https://issues.apache.org/jira/browse/LUCENE-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709978#comment-13709978 ] 

Lukas Vlcek commented on LUCENE-4311:
-------------------------------------

Hi Chris,

I have been experimenting with this Czech dictionary, and it seems to me that it yields the best results with RECURSION_CAP = 0. Seriously! Double folding brings no advantage for this particular dictionary. In fact, the dictionary is in such good shape that all word forms can be generated directly from the words in the .dic file, so applying a single affix rule to an input word is enough to see whether it matches any of the root forms; no folding is needed at all.

With RECURSION_CAP set to 1 or 2, it can generate a lot of incorrect words. The shorter the input word, the higher the chance of getting incorrect (i.e. completely misleading) results, up to the point where the output is not useful for Lucene indexing at all.
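
To illustrate the effect, here is a minimal sketch of affix stripping with a configurable recursion cap. It is not the Lucene HunspellStemmer code: the class name, the toy roots, and the suffix list are all invented for this example, and real Hunspell rules also carry flags and conditions that are ignored here. The only assumption carried over is that a cap of 0 means a single affix rule is applied to the input word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RecursionCapSketch {
    // Toy data invented for illustration; NOT taken from cs_CZ.dic / cs_CZ.aff.
    static final Set<String> ROOTS = new HashSet<>(Arrays.asList("cat", "cater"));
    static final List<String> SUFFIXES = Arrays.asList("s", "er");

    /** Returns every root reachable from the word with the given recursion cap. */
    static List<String> stem(String word, int recursionCap) {
        Set<String> stems = new LinkedHashSet<>();
        if (ROOTS.contains(word)) {
            stems.add(word); // the word itself is a root form
        }
        strip(word, 0, recursionCap, stems);
        return new ArrayList<>(stems);
    }

    // Strips one suffix at a time; recurses only while depth is below the cap.
    private static void strip(String word, int depth, int cap, Set<String> out) {
        for (String suffix : SUFFIXES) {
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                String candidate = word.substring(0, word.length() - suffix.length());
                if (ROOTS.contains(candidate)) {
                    out.add(candidate);
                }
                if (depth < cap) {
                    // "Folding": strip a further affix from the already stripped form.
                    strip(candidate, depth + 1, cap, out);
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(stem("caters", 0)); // one affix rule only
        System.out.println(stem("caters", 1)); // a second affix may be stripped
    }
}
```

In this toy setup, a cap of 0 maps "caters" only to "cater", while a cap of 1 strips a second suffix and also reports the unrelated root "cat". That is the kind of spurious extra stem described above.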

Please, can we have this fixed? I believe all that is needed now is to have a look at #LUCENE-4542 and make sure the recursion level is configurable. This would be a really great enhancement.
                
> HunspellStemFilter returns another values than Hunspell in console / command line with same dictionaries.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4311
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4311
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/other
>    Affects Versions: 3.5, 4.0-ALPHA, 3.6.1
>         Environment: Apache Solr 3.5 - 4.0, Apache Tomcat 7.0
>            Reporter: Jan Rieger
>         Attachments: cs_CZ.aff, cs_CZ.dic
>
>
> When I used HunspellStemFilter to stem Czech text, it returned bad results.
> For example, the word "praha" returns "praha" and "prahnout", which is not correct.
> So I tried the same in my console (Hunspell command line) with exactly the same dictionaries, and it returns only "praha", which is correct.
> Can somebody help me?
