Posted to issues@opennlp.apache.org by "Alessandro Depase (JIRA)" <ji...@apache.org> on 2017/07/10 07:54:00 UTC

[jira] [Updated] (OPENNLP-1115) Document Categorizer all events dropped

     [ https://issues.apache.org/jira/browse/OPENNLP-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Depase updated OPENNLP-1115:
---------------------------------------
    Description: 
Hi all,
I'm trying to perform my first (newbie) document categorization in Italian.
I'm using the attached training file and I got this output:

{{$ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\source_gen\MrJEditor\sandbox\Train1.train" -encoding UTF-8
Indexing events using cutoff of 5

        Computing event counts...  done. 12 events
        Indexing...  Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=tutto, bow=bene]
Dropped event Ok:[bow=decisamente, bow=non, bow=male]
Dropped event Ok:[bow=fantastica, bow=scelta]
Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere, bow=così, bow=contento]
Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
Dropped event no:[bow=per, bow=nulla]
Dropped event no:[bow=niente, bow=affatto, bow=divertente]
Dropped event no:[bow=va, bow=malissimo]
Dropped event no:[bow=va, bow=decisamente, bow=male]
Dropped event no:[bow=sono, bow=molto, bow=triste]
done.
Sorting and merging events...

ERROR: Not enough training data
The provided training data is not sufficient to create enough events to train a model.
To resolve this error use more training data, if this doesn't help there might
be some fundamental problem with the training data itself.}}

I already found a couple of similar issues. They say either that there are not enough lines (but I have 6 lines for each category and a cutoff of 5), or that categorization quality is poor with fewer than 100 lines (fine, but that is a quality matter: training should still work, even if the results are bad). The training data is reported as insufficient because all the events are dropped.
I also tried with the Java API, with the same result.
But why? What did I miss? I cannot find useful documentation...
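
One possible reading of the log, offered as a sketch rather than a confirmed diagnosis: it assumes the cutoff of 5 is applied to each bag-of-words feature counted across all events, not to the number of lines per category, and that an event is dropped once none of its features survive. The class and method names below (CutoffSketch, sampleEvents, countFeatures, applyCutoff) are hypothetical and are not part of the OpenNLP API; the actual indexer may differ in detail. The word lists are copied from the "Dropped event" lines above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: NOT OpenNLP code.
public class CutoffSketch {

    // Bag-of-words features of the 12 events in the log (outcome labels omitted).
    static List<List<String>> sampleEvents() {
        return Arrays.asList(
            Arrays.asList("ok"),
            Arrays.asList("tutto", "bene"),
            Arrays.asList("decisamente", "non", "male"),
            Arrays.asList("fantastica", "scelta"),
            // "cos\u00ec" is the accented word from the log, written as a Unicode escape.
            Arrays.asList("non", "pensavo", "di", "poter", "essere", "cos\u00ec", "contento"),
            Arrays.asList("certamente", "un'ottimo", "risultato"),
            Arrays.asList("non", "va", "affatto", "bene"),
            Arrays.asList("per", "nulla"),
            Arrays.asList("niente", "affatto", "divertente"),
            Arrays.asList("va", "malissimo"),
            Arrays.asList("va", "decisamente", "male"),
            Arrays.asList("sono", "molto", "triste"));
    }

    // Count how many times each feature occurs across all events.
    static Map<String, Integer> countFeatures(List<List<String>> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Assumed rule: keep a feature only if it occurs at least `cutoff` times,
    // then drop every event that has no features left.
    static List<List<String>> applyCutoff(List<List<String>> events, int cutoff) {
        Map<String, Integer> counts = countFeatures(events);
        List<List<String>> kept = new ArrayList<>();
        for (List<String> event : events) {
            List<String> surviving = new ArrayList<>();
            for (String feature : event) {
                if (counts.get(feature) >= cutoff) {
                    surviving.add(feature);
                }
            }
            if (!surviving.isEmpty()) {
                kept.add(surviving);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> events = sampleEvents();
        for (int cutoff : new int[] {5, 3, 1}) {
            System.out.println("cutoff " + cutoff + ": "
                + applyCutoff(events, cutoff).size() + " of "
                + events.size() + " events kept");
        }
    }
}
```

Under that assumption, the most frequent words in this data ("non" and "va") occur only three times each, so with a cutoff of 5 no feature survives and every event ends up empty, which would match the twelve "Dropped event" lines.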

Thank you in advance
Kind Regards
    Alessandro


> Document Categorizer all events dropped
> ---------------------------------------
>
>                 Key: OPENNLP-1115
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1115
>             Project: OpenNLP
>          Issue Type: Question
>          Components: Doccat
>    Affects Versions: 1.7.2
>            Reporter: Alessandro Depase
>         Attachments: Train1.train


