Posted to users@opennlp.apache.org by Alessandro Depase <al...@gmail.com> on 2017/07/11 13:44:47 UTC

Document Categorizer all events dropped

Hi all,
I'm trying to perform my first (newbie) document categorization using the
Italian language.
I'm using a very simple file with this content:

Ok ok
Ok tutto bene
Ok decisamente non male
Ok fantastica scelta
Ok non pensavo di poter essere così contento
Ok certamente un'ottimo risultato
no non va affatto bene
no per nulla
no niente affatto divertente
no va malissimo
no va decisamente male
no sono molto triste

(no lines before or after the quoted ones - and, yes, I know that in
Italian "un'ottimo" is an error, but it was part of my list :) ) and I got
this output:

$ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data
"C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\sourcegen\MrJEditor\sandbox\Train1.train"
-encoding UTF-8
Indexing events using cutoff of 5

    Computing event counts... done. 12 events
    Indexing... Dropped event Ok:[bow=ok]

Dropped event Ok:[bow=tutto, bow=bene]
Dropped event Ok:[bow=decisamente, bow=non, bow=male]
Dropped event Ok:[bow=fantastica, bow=scelta]
Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere,
bow=così, bow=contento]
Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
Dropped event no:[bow=per, bow=nulla]
Dropped event no:[bow=niente, bow=affatto, bow=divertente]
Dropped event no:[bow=va, bow=malissimo]
Dropped event no:[bow=va, bow=decisamente, bow=male]
Dropped event no:[bow=sono, bow=molto, bow=triste]
done.
Sorting and merging events...

ERROR: Not enough training data
The provided training data is not sufficient to create enough events to
train a model.
To resolve this error use more training data, if this doesn't help there
might
be some fundamental problem with the training data itself.

I already found a couple of similar issues on the Internet. They just say
that there are not enough lines (but I have 6 lines for each category and
a cutoff of 5), or that without at least 100 lines the categorization
quality is not sufficient (ok, but that is only a quality matter: it
should still work, even if with bad results). The actual reason for the
insufficient data is that all the lines are dropped. Some people seem to
succeed with as few as 10 lines.
But why? What did I miss? I cannot find useful documentation...

Please note that my question is about *why* the lines are dropped - about
the reason, the logic behind dropping them.
I tried to understand the code (I stopped when it started to require too
much time without downloading and debugging it), and this is what I
understood:
* the AbstractDataIndexer throws the exception in the method sortAndMerge
because it "thinks" there isn't enough data, but it uses the List
eventsToCompare, which is the result of a previous computation in the same
class, in the method index(ObjectStream<Event> events, Map<String, Integer>
predicateIndex)
* there the code builds an int[] from each line in a way I could not
completely understand (my question, at the very end, is: what is the logic
behind the construction of this array?). If the array has more than one
element, then ok, we have elements to compare (and sortAndMerge will not
throw this exception); otherwise the line is dropped. So: what is the logic
behind dropping the line?
The documentation just talks about the cutoff value, but I wrote more
lines than the cutoff requires.
So, to complete the question: is there a way to quantify the minimum
number of lines or words or whatever is needed? Why do the examples
available online work with 10 lines while my example does not? I don't
mind the quality here, and I completely understand that it will not
produce a meaningful result in a real case, but why do I get an Exception
while others do not?

In the meantime I tried with roughly 15 lines and it threw no exception.
The quality of the categorization was very low, as expected (it almost
always returned "ok", even for sentences from the training set - is that
related to the fact that the corresponding lines were dropped and training
happened only on the few others?). With 29 lines it begins to give
meaningful answers; nonetheless, the questions remain.
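
(For completeness, this is roughly how I query the model - a simplified
sketch of the standard doccat API, where the file name and the naive
whitespace tokenization are just for illustration, and the exact
categorize signature varies a bit across OpenNLP versions:)

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class AskModel {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("it-doccat.bin")) {
            DoccatModel model = new DoccatModel(in);
            DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
            // naive whitespace tokenization, just for this test
            double[] outcomes = categorizer.categorize("va malissimo".split("\\s+"));
            System.out.println(categorizer.getBestCategory(outcomes));
        }
    }
}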

Thank you in advance for your support
Kind Regards
Alessandro

Re: Document Categorizer all events dropped

Posted by Alessandro Depase <al...@gmail.com>.
Thank you for the further detail.
Your answer does clarify the messages and the terminology, but I still
don't completely understand.
I had already tried with a cutoff of 1 after Joern's answer, and it worked
the same.
Moreover, I tried other cutoff values (and also the bigger file attached).

Here are some extracts from the different outputs (with my notes, which
build on each other incrementally - search for ******** ):

Cutoff=0 or Cutoff=1 (no lines dropped and everything clear to me)
Indexing events using cutoff of 1

        Computing event counts...  done. 48 events
        Indexing...  done.
Sorting and merging events... done. Reduced 48 events to 48.
Done indexing.
Incorporating indexed data for training...
done.
        Number of Event Tokens: 48
            Number of Outcomes: 3
          Number of Predicates: 94
------------------------------------------------------------------------------
Indexing events using cutoff of 2

        Computing event counts...  done. 48 events
        Indexing...  Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=certamente, bow=un, bow=ottimo, bow=risultato]
******** WHY????
Dropped event Ok:[bow=positiva]
Dropped event Ok:[bow=splendido]
Dropped event Ok:[bow=affascinante]
Dropped event no:[bow=per, bow=nulla] ******** UHMMM (stop words? Is this
the reason why there is the language parameter?)
Dropped event no:[bow=negativa]
Dropped event no:[bow=terribile]
Dropped event no:[bow=orribile]
Dropped event no:[bow=orrendo]
Dropped event no:[bow=servizio, bow=inaccettabile] ******** ??? Here I
can't understand it: there are no plausible stop words, so the previous
hypothesis doesn't apply
Dropped event no:[bow=sconsiglio, bow=a, bow=tutti]  ******** WHY? I
misunderstood something, I think
Dropped event Insomma:[bow=insomma]
done.
Sorting and merging events... done. Reduced 35 events to 33.
Done indexing.
Incorporating indexed data for training...
done.
        Number of Event Tokens: 33
            Number of Outcomes: 3
          Number of Predicates: 24

------------------------------------------------------------------------------
Indexing events using cutoff of 3

        Computing event counts...  done. 48 events
        Indexing...  Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=tutto, bow=bene]
Dropped event Ok:[bow=mi, bow=sono, bow=molto, bow=divertito]  ********
WHY? This wasn't in the set of dropped lines in the previous run
Dropped event Ok:[bow=fantastica, bow=scelta]
Dropped event Ok:[bow=certamente, bow=un, bow=ottimo, bow=risultato]
Dropped event Ok:[bow=fantastica, bow=soluzione]
Dropped event Ok:[bow=positiva]
Dropped event Ok:[bow=ritornare, bow=con, bow=amici, bow=e, bow=parenti]
******** WHY? This wasn't in the set of dropped lines in the previous run
Dropped event Ok:[bow=splendido]
Dropped event Ok:[bow=affascinante]
Dropped event no:[bow=per, bow=nulla]
Dropped event no:[bow=negativa]
Dropped event no:[bow=terribile]
Dropped event no:[bow=orribile]
Dropped event no:[bow=orrendo]
Dropped event no:[bow=sono, bow=molto, bow=triste]
Dropped event no:[bow=servizio, bow=inaccettabile]
Dropped event no:[bow=sconsiglio, bow=a, bow=tutti]
Dropped event Insomma:[bow=ma, bow=anche, bow=meglio]
Dropped event Insomma:[bow=nè, bow=caldo, bow=nè, bow=freddo]
Dropped event Insomma:[bow=mi, bow=lascia, bow=assolutamente,
bow=indifferente]
Dropped event Insomma:[bow=insomma]
done.
Sorting and merging events... done. Reduced 26 events to 23.
Done indexing.
Incorporating indexed data for training...
done.
        Number of Event Tokens: 23
            Number of Outcomes: 3
          Number of Predicates: 11

------------------------------------------------------------------------------

Indexing events using cutoff of 4

        Computing event counts...  done. 48 events
        Indexing...  Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=tutto, bow=bene]
Dropped event Ok:[bow=mi, bow=sono, bow=molto, bow=divertito]
Dropped event Ok:[bow=fantastica, bow=scelta]
Dropped event Ok:[bow=certamente, bow=un, bow=ottimo, bow=risultato]
Dropped event Ok:[bow=fantastica, bow=soluzione]
Dropped event Ok:[bow=positiva]
Dropped event Ok:[bow=ritornare, bow=con, bow=amici, bow=e, bow=parenti]
Dropped event Ok:[bow=esperienza, bow=sicuramente, bow=entusiasmante]
Dropped event Ok:[bow=splendido]
Dropped event Ok:[bow=affascinante]
Dropped event no:[bow=per, bow=nulla]
Dropped event no:[bow=negativa]
Dropped event no:[bow=terribile]
Dropped event no:[bow=orribile]
Dropped event no:[bow=orrendo]
Dropped event no:[bow=va, bow=malissimo] ******** WHY???? This is a real
surprise: it wasn't dropped in any previous run, and while in the previous
cases I seemed to have merely missed some details, this dropped event,
compared with the previous runs, makes me think I didn't understand
anything at all
Dropped event no:[bow=sono, bow=molto, bow=triste]
Dropped event no:[bow=come, bow=fare, bow=ad, bow=essere, bow=contenti]
Dropped event no:[bow=servizio, bow=inaccettabile]
Dropped event no:[bow=sconsiglio, bow=a, bow=tutti]
Dropped event Insomma:[bow=così, bow=così] ******** WHY??? Same surprise
Dropped event Insomma:[bow=tutto, bow=sommato, bow=poteva, bow=essere,
bow=peggio] ******** WHY?
Dropped event Insomma:[bow=ma, bow=anche, bow=meglio]
Dropped event Insomma:[bow=nè, bow=caldo, bow=nè, bow=freddo]
Dropped event Insomma:[bow=mi, bow=lascia, bow=assolutamente,
bow=indifferente]
Dropped event Insomma:[bow=insomma]
done.
Sorting and merging events... done. Reduced 21 events to 14.
Done indexing.
Incorporating indexed data for training...
done.
        Number of Event Tokens: 14
            Number of Outcomes: 3
          Number of Predicates: 6

------------------------------------------------------------------------------
Indexing events using cutoff of 5

        Computing event counts...  done. 48 events
        Indexing...  Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=tutto, bow=bene]
Dropped event Ok:[bow=mi, bow=sono, bow=molto, bow=divertito]
Dropped event Ok:[bow=fantastica, bow=scelta]
Dropped event Ok:[bow=certamente, bow=un, bow=ottimo, bow=risultato]
Dropped event Ok:[bow=fantastica, bow=soluzione]
Dropped event Ok:[bow=positiva]
Dropped event Ok:[bow=niente, bow=affatto, bow=male] ******** WHY???
Another surprise
Dropped event Ok:[bow=ritornare, bow=con, bow=amici, bow=e, bow=parenti]
Dropped event Ok:[bow=esperienza, bow=sicuramente, bow=entusiasmante]
Dropped event Ok:[bow=splendido]
Dropped event Ok:[bow=affascinante]
Dropped event no:[bow=per, bow=nulla]
Dropped event no:[bow=negativa]
Dropped event no:[bow=terribile]
Dropped event no:[bow=orribile]
Dropped event no:[bow=orrendo]
Dropped event no:[bow=niente, bow=affatto, bow=divertente]
Dropped event no:[bow=va, bow=malissimo]
Dropped event no:[bow=va, bow=decisamente, bow=male]
Dropped event no:[bow=sono, bow=molto, bow=triste]
Dropped event no:[bow=come, bow=fare, bow=ad, bow=essere, bow=contenti]
Dropped event no:[bow=servizio, bow=inaccettabile]
Dropped event no:[bow=male, bow=davvero]
Dropped event no:[bow=sconsiglio, bow=a, bow=tutti]
Dropped event Insomma:[bow=così, bow=così]
Dropped event Insomma:[bow=tutto, bow=sommato, bow=poteva, bow=essere,
bow=peggio]
Dropped event Insomma:[bow=ma, bow=anche, bow=meglio]
Dropped event Insomma:[bow=nè, bow=caldo, bow=nè, bow=freddo]
Dropped event Insomma:[bow=mi, bow=lascia, bow=assolutamente,
bow=indifferente]
Dropped event Insomma:[bow=insomma]
done.
Sorting and merging events... done. Reduced 17 events to 6.
Done indexing.
Incorporating indexed data for training...
done.
        Number of Event Tokens: 6
            Number of Outcomes: 3
          Number of Predicates: 2

As you can see, it's not a matter of making it work (from a "user" point
of view it now seems to work), but a matter of understanding (also so I
can give final users hints on how to compile the training set).
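
To check whether I have understood Daniel's rule, this is the check I
would run over the training file (a sketch only: it assumes plain
whitespace tokenization and that a feature is kept when its corpus-wide
count reaches the cutoff - if that is not the actual indexer logic, then
that is exactly my question):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PredictDrops {
    public static void main(String[] args) throws Exception {
        int cutoff = 4; // try the same values as in the runs above
        List<String> lines =
                Files.readAllLines(Paths.get("Train1.train"), StandardCharsets.UTF_8);

        // Corpus-wide token counts; the first token of a line is the
        // category, not a feature, so it is skipped.
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            for (int i = 1; i < tokens.length; i++)
                counts.merge(tokens[i], 1, Integer::sum);
        }

        // A line survives if at least one of its tokens meets the cutoff.
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            boolean survives = false;
            for (int i = 1; i < tokens.length; i++)
                if (counts.get(tokens[i]) >= cutoff) { survives = true; break; }
            if (!survives)
                System.out.println("Dropped: " + line);
        }
    }
}

If this matches the real behaviour, then, for instance, "va" occurring
exactly 3 times in the bigger file would explain why no:[bow=va,
bow=malissimo] survives at cutoff 3 but is dropped at cutoff 4.
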
Thank you again
    Alessandro



Re: Document Categorizer all events dropped

Posted by Daniel Russ <dr...@apache.org>.
Hello Alessandro,
  Jörn is correct, you don’t have enough data.  But let’s force it to work.

Every line in your training file is a Document and also a training EVENT.
An event is an Outcome and a set of features [also known as the context].

so the first line of your data is

Ok ok  <- this is the event with outcome=Ok and context ok.  The DoccatTrainer makes it [bow=ok] because it is a Bag Of Words model.
Ok tutto bene <- the next event, with outcome Ok and context = [tutto and bene]
…

When all is done, the features ok, tutto, bene, and the others each occur once in your training set.  I think only “non” occurs more than once (it occurs 3 times).
OpenNLP counts the number of times each feature occurs; if the number is less than the cutoff, it removes the feature.  If an event ends up with no features, OpenNLP drops the event.
With the default cutoff of 5, all of your events are dropped.
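
In other words, the indexer effectively does this (my paraphrase in Java,
not the actual AbstractDataIndexer source):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Paraphrase of the cutoff behavior described above, not the real code.
class CutoffSketch {
    static void report(List<String[]> eventContexts, int cutoff) {
        // 1. Count every feature across the WHOLE training set.
        Map<String, Integer> counts = new HashMap<>();
        for (String[] context : eventContexts)
            for (String feature : context)
                counts.merge(feature, 1, Integer::sum);

        // 2. An event keeps only the features seen at least 'cutoff' times;
        //    if nothing survives, the whole event is dropped.
        for (String[] context : eventContexts) {
            boolean anySurvives = false;
            for (String feature : context)
                if (counts.get(feature) >= cutoff) { anySurvives = true; break; }
            if (!anySurvives)
                System.out.println("Dropped event " + Arrays.toString(context));
        }
    }

    public static void main(String[] args) {
        report(Arrays.asList(
                new String[] {"bow=ok"},
                new String[] {"bow=tutto", "bow=bene"}), 5);  // both dropped
    }
}

Note that the counts are corpus-wide, not per line and not per category.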

I ran it with 

opennlp  DoccatTrainer -params it-params.txt -lang it -model modelIt.bin -data Train1.train 
where it-params.txt is a file containing
Cutoff=0

Indexing events using cutoff of 0

	Computing event counts...  done. 12 events
	Indexing...  done.
Sorting and merging events... done. Reduced 12 events to 12.
Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 12
	    Number of Outcomes: 2
	  Number of Predicates: 27
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-8.317766166719343	0.5
  2:  ... loglikelihood=-7.093654793495284	1.0
  3:  ... loglikelihood=-6.256360369219492	1.0
 =======================================
 99:  ... loglikelihood=-0.6058512089422219	1.0
100:  ... loglikelihood=-0.6002320355445091	1.0
Writing document categorizer model ... done (0.048s)

Wrote document categorizer model to
path: /Users/druss/modelIt.bin
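
If you would rather do this from Java than from the CLI, the same Cutoff
parameter can be set programmatically. A sketch against the current API
(file names are placeholders, and exact signatures can shift between
OpenNLP versions):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainIt {
    public static void main(String[] args) throws Exception {
        MarkableFileInputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("Train1.train"));
        ObjectStream<DocumentSample> samples =
                new DocumentSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.CUTOFF_PARAM, "0");  // same effect as Cutoff=0 above

        DoccatModel model =
                DocumentCategorizerME.train("it", samples, params, new DoccatFactory());
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("modelIt.bin"))) {
            model.serialize(out);
        }
    }
}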

Did I answer your question?
Daniel



Re: Document Categorizer all events dropped

Posted by Alessandro Depase <al...@gmail.com>.
Thank you.
So, if I understand correctly, the cutoff is related to features and not to
lines, as I wrongly gathered from some Internet examples (and from not
reading the documentation carefully either... my bad).

Reading the documentation, I find that if no feature generator has been
specified, "Bag of words" is used.

It's not so clear that this means "tested on every single training line"
and not "on every category" (after your answer it is somewhat clearer,
indeed). Roughly speaking, this means that a training line with more words
than the cutoff (more than 5 words with the default) is kept, and one with
fewer than the cutoff is dropped - is that correct?

If that is correct, then with the defaults it should suffice to lower the
cutoff to 1, not to zero (a line with no words is meaningless anyway - and
I think it is already checked earlier as part of the format validation).

I'll make some tests in this direction, thank you so much
Alessandro




Re: Document Categorizer all events dropped

Posted by Joern Kottmann <ko...@gmail.com>.
An event is dropped when the cutoff is so high that all of its features are
removed.
I recommend training with more data or decreasing the cutoff value to zero.

Jörn
