Posted to dev@lucene.apache.org by "Lukas Vlcek (JIRA)" <ji...@apache.org> on 2014/03/02 13:35:19 UTC

[jira] [Comment Edited] (LUCENE-5468) Hunspell very high memory use when loading dictionary

    [ https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917384#comment-13917384 ] 

Lukas Vlcek edited comment on LUCENE-5468 at 3/2/14 12:34 PM:
--------------------------------------------------------------

Hi Robert,

I created a new ticket, LUCENE-5484, for distinct recursion levels per prefix/suffix rules.

{quote}
There may not be, but its about where the responsibility should be. Its more than the first token in sentences: named entities etc are involved too. If you want to get this right, yes, you need a more sophisticated analysis chain! That being said, I'm not against your 80/20 heuristic, I'm just not sure how 80/20 it is :)
{quote}

I fully understand your point of view. The question of responsibility is important. But if a workaround like lowercasing for an optional second pass could be easier than telling the user to set up a complicated analysis chain (or employ an external system), then I believe it might make sense to make a qualified exception. :) Heh...
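To sketch what I mean by the second pass (just the heuristic, not real Lucene API; the {{stem()}} helper below is a made-up stand-in for the lookup the Hunspell filter already does internally):

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class LowercaseFallback {

  // Placeholder for the dictionary lookup the stem filter performs;
  // not an actual Lucene method.
  static List<String> stem(String token) {
    return Collections.emptyList();
  }

  static List<String> stemWithFallback(String token, Locale locale) {
    List<String> stems = stem(token);
    String lower = token.toLowerCase(locale);
    if (stems.isEmpty() && !lower.equals(token)) {
      // Optional second pass: the token may be capitalized only because it
      // starts a sentence (or is title-cased), so retry the lowercased form.
      stems = stem(lower);
    }
    return stems;
  }
}
{code}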

But seriously: how about I open a ticket for this so the idea can fly around a bit? WDYT?

I would like to try to implement it as well (if no one else does), though I will not get to it soon. As for the 80/20 aspect, the good thing about this feature is that it could be measured (precision, recall, ...). And maybe only implementing this feature could tell us whether it is useful or not.

{quote}
\[...\] if you are smart enough to make a custom dictionary, I don't think I need to baby such users around and make them comfortable by duplicating command line tools they can install themselves in java \[...\]
{quote}

Short: I agree.

Long: Creating a new dictionary is very hard; it is for wizards... but the thing here, Robert, is that *creating* a new dictionary from scratch is something completely different from *extending* an existing one. Average users (like me) can hardly do the former, but they can relatively easily do the latter: the former involves creating the affix rules, while the latter means taking the given affix rules and building on top of them.

When I was trying to extend an existing dictionary, I in fact had to do the following:
1) identify the words that were missing in the dic file (or files)
2) assign some of the existing rules to each of them
3) verify that #2 was done right

As for 1), that is easy (the only trick when creating a new file with the missing words is to stick to the encoding defined in the aff file; see the example below).
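For reference, the encoding is declared by the {{SET}} directive near the top of the aff file; {{ISO8859-2}} here is just an example value:

{code}
SET ISO8859-2
{code}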
As for 2), that is harder, but in my case I was building on top of a relatively large dictionary, so I could bet on the language morphology already being covered well by the affix rules (that is, I assumed I was not introducing words with new/unique morphology to the dictionary). So instead of trying to understand the rules (see my note about this below), I searched for words that should have similar morphological features and reused their rules (for example, if I were to add the word "fail", I would search for "sail" and use the same rules).
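To make that concrete: if the existing dic file contained an entry like {{sail/DGS}} (the flag letters here are made up), the extension file would simply reuse those flags, with the entry count on the first line and the encoding taken from the aff file:

{code}
1
fail/DGS
{code}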
As for 3), this in fact means expanding the token in root form according to all possible valid rules and checking that it all makes sense. As you pointed out, there are command-line tools for this, but I simply did not want to learn them (I did not feel like a wizard). And the good question is whether Lucene should provide an API that could be used for this task. At the end of the day, Lucene is said to be an IR library with language analysis capabilities, so why not? But I am fine with leaving this feature out for now; I just wanted to explain some of my motivations for it.
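For what it is worth, there is no expansion (unmunch-style) API in Lucene today, but one can at least approximate the check in the opposite direction: run candidate inflected forms through the existing stem filter and verify they map back to the added root. A rough sketch against the 4.x Hunspell API (file names and word forms are made up):

{code:java}
import java.io.FileInputStream;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.hunspell.Dictionary;
import org.apache.lucene.analysis.hunspell.HunspellStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StemCheck {
  public static void main(String[] args) throws Exception {
    // Load the affix rules and the extended dictionary.
    Dictionary dictionary = new Dictionary(
        new FileInputStream("custom.aff"),
        new FileInputStream("custom.dic"));

    // Inflected forms the new entry for "fail" is expected to cover.
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_48,
        new StringReader("fails failed failing"));
    ts = new HunspellStemFilter(ts, dictionary);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Each emitted token is a stem; "fail" should be among them.
      System.out.println(term);
    }
    ts.end();
    ts.close();
  }
}
{code}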

{color:gray}
Note:
As for understanding the affix rules: this is probably a complex topic, and I have not had time to dig deep enough to say anything qualified about it yet. However, as far as I understand, the various \*spell systems have various limitations. For example, the Czech dictionary is an ispell dictionary, and ispell allowed only a limited number of affix rules (that is what I understood from a conversation with the author of the Czech dictionary). This means that if the number of rules is limited, then what we see shipped in the aff file is more the result of some preprocessing that takes a set of rules understandable to humans and produces a more compact set that may not be easily understood by humans.

But this is an unrelated topic, except that it illustrates the situation of the average user who just wants to add some new words to an existing dictionary and does not have the capacity to become an expert on ispell (or myspell, or aspell, or... you name it).
{color}



> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
>                 Key: LUCENE-5468
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5468
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Maciej Lisiewski
>            Priority: Minor
>             Fix For: 4.8, 5.0
>
>         Attachments: LUCENE-5468.patch, patch.txt
>
>
> Hunspell stemmer requires gigantic (for the task) amounts of memory to load dictionary/rules files. 
> For example loading a 4.5 MB polish dictionary (with empty index!) will cause whole core to crash with various out of memory errors unless you set max heap size close to 2GB or more.
> By comparison Stempel using the same dictionary file works just fine with 1/8 of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z


