Posted to dev@opennlp.apache.org by "william.colen@gmail.com" <wi...@gmail.com> on 2011/08/17 18:51:29 UTC

Detailed FMeasure output

Hi,

Would it be useful to have detailed output from FMeasure when using spans
with types? For example, we could use it to see the individual precision
and recall for person, organization, and date in a NameFinder model, or for
the Chunker. Something like the output from
CONLL2000 <http://www.cnts.ua.ac.be/conll2000/chunking/output.html>:

   processed 961 tokens with 459 phrases; found: 539 phrases; correct: 371.
   accuracy:  84.08%; precision:  68.83%; recall:  80.83%; FB1:  74.35
                ADJP: precision:   0.00%; recall:   0.00%; FB1:   0.00
                ADVP: precision:  45.45%; recall:  62.50%; FB1:  52.63
                  NP: precision:  64.98%; recall:  78.63%; FB1:  71.16
                  PP: precision:  83.18%; recall:  98.89%; FB1:  90.36
                SBAR: precision:  66.67%; recall:  33.33%; FB1:  44.44
                  VP: precision:  69.00%; recall:  79.31%; FB1:  73.80

I will need something like that for my master's dissertation. If it is
useful, I would like to add it to OpenNLP.
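
To make the idea concrete, the per-type counts could be accumulated roughly
like this (only a sketch; FMeasureByType and its methods are placeholder
names I made up, not existing OpenNLP API):

import java.util.HashMap;
import java.util.Map;

import opennlp.tools.util.Span;

// Sketch: accumulates per-type counts; updateScores is called once per
// evaluated sample with the reference and the predicted spans.
public class FMeasureByType {

    private final Map<String, Integer> truePositives = new HashMap<String, Integer>();
    private final Map<String, Integer> predictedCounts = new HashMap<String, Integer>();
    private final Map<String, Integer> referenceCounts = new HashMap<String, Integer>();

    public void updateScores(Span[] references, Span[] predictions) {
        for (Span reference : references) {
            increment(referenceCounts, reference.getType());
        }

        boolean[] matched = new boolean[references.length];
        for (Span prediction : predictions) {
            increment(predictedCounts, prediction.getType());
            for (int i = 0; i < references.length; i++) {
                // an exact match on boundaries and type is a true positive;
                // each reference span can only be matched once
                if (!matched[i] && sameSpan(prediction, references[i])) {
                    matched[i] = true;
                    increment(truePositives, prediction.getType());
                    break;
                }
            }
        }
    }

    public double precision(String type) {
        int found = count(predictedCounts, type);
        return found > 0 ? (double) count(truePositives, type) / found : 0;
    }

    public double recall(String type) {
        int expected = count(referenceCounts, type);
        return expected > 0 ? (double) count(truePositives, type) / expected : 0;
    }

    public double f1(String type) {
        double p = precision(type);
        double r = recall(type);
        return p + r > 0 ? 2 * p * r / (p + r) : 0;
    }

    private static boolean sameSpan(Span a, Span b) {
        return a.getStart() == b.getStart() && a.getEnd() == b.getEnd()
            && a.getType() != null && a.getType().equals(b.getType());
    }

    private static void increment(Map<String, Integer> counts, String type) {
        Integer value = counts.get(type);
        counts.put(type, value == null ? 1 : value + 1);
    }

    private static int count(Map<String, Integer> counts, String type) {
        Integer value = counts.get(type);
        return value == null ? 0 : value;
    }
}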

Thanks,
William

Re: Detailed FMeasure output

Posted by Jörn Kottmann <ko...@gmail.com>.
On 8/18/11 12:10 AM, william.colen@gmail.com wrote:
> I thought it would be useful for a listener like the one for standard
> deviation, which needs to know when a sample stream finishes. I will
> implement it without the finished method, and if it is needed we can add
> it later.

I am a little skeptical, because that cannot be supported by the current
API: a client which just calls evaluateSample has no way to indicate that
it is now finished.

>> Since we are now doing a little more work here, we should make the abstract
>> Evaluator class handle the calls to the EvaluationSampleListeners.
>>
>> As far as I have seen on my last look at the code, we would need to provide
>> a default implementation for the Evaluator.evaluateSample method and call a
>> method which an implementer overrides (maybe Evaluator.processSample) from
>> it. It will return the predicted sample; this way an equals test inside
>> Evaluator.evaluateSample can figure out whether it needs to call
>> missclassified or correctlyClassified on the registered listeners.
>>
> Won't it break backward compatibility? If that is OK, +1 to change it.
>

Yes, but only for classes which extend the Evaluator class, so it might be
better to leave this intact. We can still do it, but then we need to declare
the new method as non-abstract.
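
Something like this, just as a sketch (processSample is the tentative name
from the earlier mail; the default body is only a placeholder):

protected T processSample(T reference) {
    // Default does nothing, so existing subclasses that override
    // evaluateSample directly keep compiling unchanged. Subclasses
    // that want the listener support override this method instead.
    return null;
}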

Jörn

Re: Detailed FMeasure output

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
On Wed, Aug 17, 2011 at 6:36 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 8/17/11 11:22 PM, william.colen@gmail.com wrote:
>
>> Here is what I am planning to do:
>>
>> I will rename the new listener to EvaluationSampleListener and add two new
>> methods. The method list would be:
>>
>>   void correctlyClassified(T reference, T prediction);
>>
>>   void missclassified(T reference, T prediction);
>>
>>   void evaluationFinished();
>>
>
> Why do we need the one for finished? The calling code knows when it is
> finished, and indicates that by retrieving the results or making a call to
> print them out, as it is currently implemented.
>

I thought it would be useful for a listener like the one for standard
deviation, which needs to know when a sample stream finishes. I will
implement it without the finished method, and if it is needed we can add it
later.


> Since we are now doing a little more work here, we should make the abstract
> Evaluator class handle the calls to the EvaluationSampleListeners.
>
> As far as I have seen on my last look at the code, we would need to provide
> a default implementation for the Evaluator.evaluateSample method and call a
> method which an implementer overrides (maybe Evaluator.processSample) from
> it. It will return the predicted sample; this way an equals test inside
> Evaluator.evaluateSample can figure out whether it needs to call
> missclassified or correctlyClassified on the registered listeners.
>

Won't it break backward compatibility? If that is OK, +1 to change it.

Re: Detailed FMeasure output

Posted by Jörn Kottmann <ko...@gmail.com>.
On 8/17/11 11:22 PM, william.colen@gmail.com wrote:
> Here is what I am planning to do:
>
> I will rename the new listener to EvaluationSampleListener and add two new
> methods. The method list would be:
>
>    void correctlyClassified(T reference, T prediction);
>
>    void missclassified(T reference, T prediction);
>
>    void evaluationFinished();

Why do we need the one for finished? The calling code knows when it is
finished, and indicates that by retrieving the results or making a call to
print them out, as it is currently implemented.

Since we are now doing a little more work here, we should make the abstract
Evaluator class handle the calls to the EvaluationSampleListeners.

As far as I have seen on my last look at the code, we would need to provide
a default implementation for the Evaluator.evaluateSample method and call a
method which an implementer overrides (maybe Evaluator.processSample) from
it. It will return the predicted sample; this way an equals test inside
Evaluator.evaluateSample can figure out whether it needs to call
missclassified or correctlyClassified on the registered listeners.
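
As a rough sketch of what I mean (EvaluationSampleListener is the proposed
rename; none of this is committed yet):

import java.util.Collections;
import java.util.List;

// Sketch: evaluateSample gets a default implementation that delegates the
// prediction to processSample and uses an equals test to decide which
// listener callback to fire.
public abstract class Evaluator<T> {

    private final List<EvaluationSampleListener<T>> listeners;

    public Evaluator(List<EvaluationSampleListener<T>> listeners) {
        this.listeners = listeners != null ? listeners
            : Collections.<EvaluationSampleListener<T>>emptyList();
    }

    public void evaluateSample(T reference) {
        T prediction = processSample(reference);

        for (EvaluationSampleListener<T> listener : listeners) {
            if (reference.equals(prediction)) {
                listener.correctlyClassified(reference, prediction);
            } else {
                listener.missclassified(reference, prediction);
            }
        }
    }

    // implementers override this instead of evaluateSample; it returns
    // the predicted sample
    protected abstract T processSample(T reference);
}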

Jörn


Re: Detailed FMeasure output

Posted by Jörn Kottmann <ko...@gmail.com>.
On 8/17/11 11:22 PM, william.colen@gmail.com wrote:
> Should I make these changes under the same Jira I used to create the
> missclassified listener, or should I create a new one? It is only a
> refactoring.
Oops, forgot to answer this one. Let's just edit the existing Jira; then it's
also not confusing in the release notes.

Jörn

Re: Detailed FMeasure output

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
Here is what I am planning to do:

I will rename the new listener to EvaluationSampleListener and add two new
methods. The method list would be:

  void correctlyClassified(T reference, T prediction);

  void missclassified(T reference, T prediction);

  void evaluationFinished();

I will make EvaluationErrorPrinter abstract and have it implement
EvaluationSampleListener, adding default implementations for
correctlyClassified and evaluationFinished. The missclassified method should
be implemented by the subclass, like ChunkEvaluationErrorListener.

The abstract class Evaluator will be able to receive a list of listeners in
its constructor, and will get three new methods:

  void notifyCorrectlyClassified(T reference, T prediction);

  void notifyMissclassified(T reference, T prediction);

  void notifyEvaluationFinished();

These methods will call each registered listener.

Evaluator.evaluate(ObjectStream<T> samples) will call
notifyEvaluationFinished() when it finishes processing the samples.
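
In other words, the loop would look roughly like this (a sketch; the method
would sit in the abstract Evaluator, and exception handling is left out):

// Sketch: read every sample, evaluate it, and notify the listeners
// once the stream is exhausted.
public void evaluate(ObjectStream<T> samples) throws IOException {
    T sample;
    while ((sample = samples.read()) != null) {
        evaluateSample(sample);
    }
    notifyEvaluationFinished();
}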

These changes will allow us to add more evaluators, including the one that
details FMeasure results.

Should I make these changes under the same Jira I used to create the
missclassified listener, or should I create a new one? It is only a
refactoring.


Thanks,
William

On Wed, Aug 17, 2011 at 2:18 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> Only slightly related: for cross validation one might also want to
> calculate the standard deviation; then it's easy to see if there are big
> outliers in the individual computations. They might not be noticeable when
> only the average is printed.
>
>
> Jörn
>
> On 8/17/11 6:51 PM, william.colen@gmail.com wrote:
>
>> Hi,
>>
>> Would it be useful to have detailed output from FMeasure when using spans
>> with types? For example, we could use it to see the individual precision
>> and recall for person, organization, and date in a NameFinder model, or
>> for the Chunker. Something like the output from
>> CONLL2000 <http://www.cnts.ua.ac.be/conll2000/chunking/output.html>:
>>
>>    processed 961 tokens with 459 phrases; found: 539 phrases; correct: 371.
>>    accuracy:  84.08%; precision:  68.83%; recall:  80.83%; FB1:  74.35
>>                 ADJP: precision:   0.00%; recall:   0.00%; FB1:   0.00
>>                 ADVP: precision:  45.45%; recall:  62.50%; FB1:  52.63
>>                   NP: precision:  64.98%; recall:  78.63%; FB1:  71.16
>>                   PP: precision:  83.18%; recall:  98.89%; FB1:  90.36
>>                 SBAR: precision:  66.67%; recall:  33.33%; FB1:  44.44
>>                   VP: precision:  69.00%; recall:  79.31%; FB1:  73.80
>>
>> I will need something like that for my master's dissertation. If it is
>> useful, I would like to add it to OpenNLP.
>>
>> Thanks,
>> William
>>
>>
>

Re: Detailed FMeasure output

Posted by Jörn Kottmann <ko...@gmail.com>.
Only slightly related: for cross validation one might also want to
calculate the standard deviation; then it's easy to see if there are big
outliers in the individual computations. They might not be noticeable when
only the average is printed.
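
For example, a helper along these lines could do it (just a sketch; it
expects at least two folds):

// Sketch: sample standard deviation over the per-fold F-measure scores.
public static double standardDeviation(double[] foldScores) {
    double mean = 0;
    for (double score : foldScores) {
        mean += score;
    }
    mean /= foldScores.length;

    double squaredDiffSum = 0;
    for (double score : foldScores) {
        squaredDiffSum += (score - mean) * (score - mean);
    }

    // divide by n - 1 for the sample standard deviation
    return Math.sqrt(squaredDiffSum / (foldScores.length - 1));
}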

Jörn

On 8/17/11 6:51 PM, william.colen@gmail.com wrote:
> Hi,
>
> Would it be useful to have detailed output from FMeasure when using spans
> with types? For example, we could use it to see the individual precision
> and recall for person, organization, and date in a NameFinder model, or for
> the Chunker. Something like the output from
> CONLL2000 <http://www.cnts.ua.ac.be/conll2000/chunking/output.html>:
>
>     processed 961 tokens with 459 phrases; found: 539 phrases; correct: 371.
>     accuracy:  84.08%; precision:  68.83%; recall:  80.83%; FB1:  74.35
>                  ADJP: precision:   0.00%; recall:   0.00%; FB1:   0.00
>                  ADVP: precision:  45.45%; recall:  62.50%; FB1:  52.63
>                    NP: precision:  64.98%; recall:  78.63%; FB1:  71.16
>                    PP: precision:  83.18%; recall:  98.89%; FB1:  90.36
>                  SBAR: precision:  66.67%; recall:  33.33%; FB1:  44.44
>                    VP: precision:  69.00%; recall:  79.31%; FB1:  73.80
>
> I will need something like that for my master's dissertation. If it is
> useful, I would like to add it to OpenNLP.
>
> Thanks,
> William
>


Re: Detailed FMeasure output

Posted by Jörn Kottmann <ko...@gmail.com>.
Yes, I think that would be useful. We could create a command line
reporter which can print these statistics, at least to start with.

Can that be done with the new listener interface we just created for
our evaluators? If not, I suggest we rename it and also add a method for
correctly classified samples, or indicate that with a flag.
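
Roughly, the two alternatives would be (only a sketch; the names are
tentative):

// Alternative 1: separate callbacks for the two outcomes
interface EvaluationSampleListener<T> {
    void correctlyClassified(T reference, T prediction);
    void missclassified(T reference, T prediction);
}

// Alternative 2: a single callback with a flag
interface FlaggedEvaluationSampleListener<T> {
    void sampleEvaluated(T reference, T prediction, boolean correctlyClassified);
}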

Jörn

On 8/17/11 6:51 PM, william.colen@gmail.com wrote:
> Hi,
>
> Would it be useful to have detailed output from FMeasure when using spans
> with types? For example, we could use it to see the individual precision
> and recall for person, organization, and date in a NameFinder model, or for
> the Chunker. Something like the output from
> CONLL2000 <http://www.cnts.ua.ac.be/conll2000/chunking/output.html>:
>
>     processed 961 tokens with 459 phrases; found: 539 phrases; correct: 371.
>     accuracy:  84.08%; precision:  68.83%; recall:  80.83%; FB1:  74.35
>                  ADJP: precision:   0.00%; recall:   0.00%; FB1:   0.00
>                  ADVP: precision:  45.45%; recall:  62.50%; FB1:  52.63
>                    NP: precision:  64.98%; recall:  78.63%; FB1:  71.16
>                    PP: precision:  83.18%; recall:  98.89%; FB1:  90.36
>                  SBAR: precision:  66.67%; recall:  33.33%; FB1:  44.44
>                    VP: precision:  69.00%; recall:  79.31%; FB1:  73.80
>
> I will need something like that for my master's dissertation. If it is
> useful, I would like to add it to OpenNLP.
>
> Thanks,
> William
>