You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@opennlp.apache.org by Peter Thygesen <th...@gmail.com> on 2017/03/12 22:19:57 UTC

Codec classes (BioCodec and BilouCodec)

Hi OpenNLP Developers



After I some weeks ago added a PR for NameFinderSequenceValidator test, I
spent some time looking into the BilouNameFinderSequenceValidator and the
Codec classes, trying to understand how they work and I wrote some more
tests that were missing (OPENNLP-1000)



I have a few comments and questions I hope someone here can answer.



General question:

Why are NameFinder.START and NameFinder.CONTINUE used I the codec classes
and namefindervalidator classes when BioCodec/BilouCodec have their own
tags?

NameFinder should not have them after my opinion. They belong to the Codec
classes.



The BIO doesn’t really match the “Start”, “Continue”, “Other” tags.. even
though its “similar”. And for the BILOU its mixed opennlp+bilou:
Start,Continue,Last,Unit,Other… Some what messy I think. But not BIG thing
for me, and yes BILOU also have many names.. (like BMEWO)



Class: BioCodec

Method: areOutcomesCompatible(String[] outcomes)



I assume that this method is checking that tags in the model are compatible
with opennlp tags (from the codec). And that it receives an array for
unique tags in the model.

Is this true?



The comment in the beginning of the method says: ”… To validate the model
we check if we have one outcome named ”other”…

However, this is not directly checked. The current test will pass outcomes
without “other” if the outcomes have other valid tags.

So, can a model be compatible without an “other” tag?



E.g.

Are outcomes=[“person-start”], [“person-start”, “person-continue”] or
[“person-start”, “location-start”] compatible models?



The method could use Set instead of Lists (yes.. the lists are probably
small for single class models)



Class: BilouCodec

Method: areOutcomesCompatiple(String[] outcomes) implementation is missing.

I made one which you can have and review.



Like the BioCodec I need to know whether a model is compatible without the
“other” tag.



Class: BilouNameFinderValidator

I have written a unit test (OPENNLP-1000) for this class and it shows its
buggy. I will soon push a PR for this.

Failed tests: (should not be valid)

“Start” followed by “Start”

“Start” followed by “Other”

“Start” followed by “Unit”

“Continue” followed by “Start”

“Continue” followed by “Unit”

“Last” followed by “Continue”

“Last” followed by “Last”



I can create some JIRA and PR for my test if you like. (I hope you like :-)


Bgdrs,

Peter Thygesen

Re: Codec classes (BioCodec and BilouCodec)

Posted by Peter Thygesen <th...@gmail.com>.

The references in BioCodec and BilouCodec to NameFinderME tags are easy  to
replace, but it looks difficult to get rid of the NameFinderME.OTHER tags
in DefaultNameContextGenerator.getContext(..) because it is not obvious to
me from which Codec the tags would come from.
Also NameFinderEventStream.genereateOutcomes(..) has references to
NameFinderME tags. Here it looks lige the tags comes from BioCodec. The
method is however deprecated and should probably just be as it is untill
"removed".

/Peter

2017-03-15 9:53 GMT+01:00 Joern Kottmann <ko...@gmail.com>:

> Yes, it would be good to refactor that, adding a base class sounds good.
>
> Jörn
>
> On Tue, Mar 14, 2017 at 11:08 PM, Peter Thygesen <
> thygesen.opennlp@gmail.com
> > wrote:
>
> > Thx for the reply.
> >
> > BTW
> > BioCodec also contains a static method: String extractNameType(...) this
> > method should be move out. Perhaps to a base class called Codec (or
> > CodecBase... or how you normally named base classes) The method is also
> > called by BilouCodec, which then is calling directly to BioCodec...
> (which
> > smells)
> >
> > Perhaps another refactoring task?
> >
> > /Peter
> >
> > 2017-03-14 12:48 GMT+01:00 Joern Kottmann <ko...@gmail.com>:
> >
> > > On Sun, Mar 12, 2017 at 11:19 PM, Peter Thygesen <
> > > thygesen.opennlp@gmail.com
> > > > wrote:
> > >
> > > > Hi OpenNLP Developers
> > > >
> > > >
> > > >
> > > > After I some weeks ago added a PR for NameFinderSequenceValidator
> > test, I
> > > > spent some time looking into the BilouNameFinderSequenceValidator
> and
> > > the
> > > > Codec classes, trying to understand how they work and I wrote some
> more
> > > > tests that were missing (OPENNLP-1000)
> > > >
> > > >
> > > >
> > > > I have a few comments and questions I hope someone here can answer.
> > > >
> > > >
> > > >
> > > > General question:
> > > >
> > > > Why are NameFinder.START and NameFinder.CONTINUE used I the codec
> > classes
> > > > and namefindervalidator classes when BioCodec/BilouCodec have their
> own
> > > > tags?
> > > >
> > > > NameFinder should not have them after my opinion. They belong to the
> > > Codec
> > > > classes.
> > > >
> > >
> > > Yes, that is how it should be. The used to be part of the NameFinder
> but
> > > then were moved to their own class. It would be good to refactor that.
> > >
> > >
> > > >
> > > > The BIO doesn’t really match the “Start”, “Continue”, “Other” tags..
> > even
> > > > though its “similar”. And for the BILOU its mixed opennlp+bilou:
> > > > Start,Continue,Last,Unit,Other… Some what messy I think. But not BIG
> > > thing
> > > > for me, and yes BILOU also have many names.. (like BMEWO)
> > > >
> > >
> > >
> > > The tags were never renamed because that would break backward
> > compatibility
> > > with existing models.
> > >
> > >
> > >
> > > >
> > > > Class: BioCodec
> > > >
> > > > Method: areOutcomesCompatible(String[] outcomes)
> > > >
> > > >
> > > >
> > > > I assume that this method is checking that tags in the model are
> > > compatible
> > > > with opennlp tags (from the codec). And that it receives an array for
> > > > unique tags in the model.
> > > >
> > > > Is this true?
> > > >
> > > >
> > > Yes
> > >
> > >
> > > >
> > > > The comment in the beginning of the method says: ”… To validate the
> > model
> > > > we check if we have one outcome named ”other”…
> > > >
> > > > However, this is not directly checked. The current test will pass
> > > outcomes
> > > > without “other” if the outcomes have other valid tags.
> > > >
> > > > So, can a model be compatible without an “other” tag?
> > > >
> > > >
> > > > E.g.
> > > >
> > > > Are outcomes=[“person-start”], [“person-start”, “person-continue”] or
> > > > [“person-start”, “location-start”] compatible models?
> > > >
> > > >
> > > This could be the case, in practice this all probably doesn't matter
> that
> > > much.
> > > This check should fail if a user somehow ends up with a model which
> > doesn't
> > > come from the name finder, e.g. a  pos tagger maxent model shouldn't be
> > > validated as valid by this method.
> > >
> > >
> > > >
> > > > The method could use Set instead of Lists (yes.. the lists are
> probably
> > > > small for single class models)
> > > >
> > > >
> > > >
> > > > Class: BilouCodec
> > > >
> > > > Method: areOutcomesCompatiple(String[] outcomes) implementation is
> > > missing.
> > > >
> > > > I made one which you can have and review.
> > > >
> > > >
> > > >
> > > This would be great if you could contribute it.
> > >
> > >
> > >
> > > >
> > > > Like the BioCodec I need to know whether a model is compatible
> without
> > > the
> > > > “other” tag.
> > > >
> > > >
> > > I think having a valid model with out other is ok.
> > >
> > >
> > >
> > > >
> > > > Class: BilouNameFinderValidator
> > > >
> > > > I have written a unit test (OPENNLP-1000) for this class and it shows
> > its
> > > > buggy. I will soon push a PR for this.
> > > >
> > > > Failed tests: (should not be valid)
> > > >
> > > > “Start” followed by “Start”
> > > >
> > > > “Start” followed by “Other”
> > > >
> > > > “Start” followed by “Unit”
> > > >
> > > > “Continue” followed by “Start”
> > > >
> > > > “Continue” followed by “Unit”
> > > >
> > > > “Last” followed by “Continue”
> > > >
> > > > “Last” followed by “Last”
> > > >
> > > >
> > > I reviewed your PR and this looks good. I need to run the evaluation
> > tests
> > > once with that PR and then we can merge it.
> > >
> > >
> > >
> > > >
> > > > I can create some JIRA and PR for my test if you like. (I hope you
> like
> > > :-)
> > > >
> > > >
> > >
> > > Please do that and send us more PRs, all you contributions are very
> > welcome
> > > and especially writing tests helps us a lot to improve code quality.
> > >
> > > Jörn
> > >
> >
>

Re: Codec classes (BioCodec and BilouCodec)

Posted by Joern Kottmann <ko...@gmail.com>.

Yes, it would be good to refactor that, adding a base class sounds good.

Jörn

On Tue, Mar 14, 2017 at 11:08 PM, Peter Thygesen <thygesen.opennlp@gmail.com
> wrote:

> Thx for the reply.
>
> BTW
> BioCodec also contains a static method: String extractNameType(...) this
> method should be move out. Perhaps to a base class called Codec (or
> CodecBase... or how you normally named base classes) The method is also
> called by BilouCodec, which then is calling directly to BioCodec... (which
> smells)
>
> Perhaps another refactoring task?
>
> /Peter
>
> 2017-03-14 12:48 GMT+01:00 Joern Kottmann <ko...@gmail.com>:
>
> > On Sun, Mar 12, 2017 at 11:19 PM, Peter Thygesen <
> > thygesen.opennlp@gmail.com
> > > wrote:
> >
> > > Hi OpenNLP Developers
> > >
> > >
> > >
> > > After I some weeks ago added a PR for NameFinderSequenceValidator
> test, I
> > > spent some time looking into the BilouNameFinderSequenceValidator and
> > the
> > > Codec classes, trying to understand how they work and I wrote some more
> > > tests that were missing (OPENNLP-1000)
> > >
> > >
> > >
> > > I have a few comments and questions I hope someone here can answer.
> > >
> > >
> > >
> > > General question:
> > >
> > > Why are NameFinder.START and NameFinder.CONTINUE used I the codec
> classes
> > > and namefindervalidator classes when BioCodec/BilouCodec have their own
> > > tags?
> > >
> > > NameFinder should not have them after my opinion. They belong to the
> > Codec
> > > classes.
> > >
> >
> > Yes, that is how it should be. The used to be part of the NameFinder but
> > then were moved to their own class. It would be good to refactor that.
> >
> >
> > >
> > > The BIO doesn’t really match the “Start”, “Continue”, “Other” tags..
> even
> > > though its “similar”. And for the BILOU its mixed opennlp+bilou:
> > > Start,Continue,Last,Unit,Other… Some what messy I think. But not BIG
> > thing
> > > for me, and yes BILOU also have many names.. (like BMEWO)
> > >
> >
> >
> > The tags were never renamed because that would break backward
> compatibility
> > with existing models.
> >
> >
> >
> > >
> > > Class: BioCodec
> > >
> > > Method: areOutcomesCompatible(String[] outcomes)
> > >
> > >
> > >
> > > I assume that this method is checking that tags in the model are
> > compatible
> > > with opennlp tags (from the codec). And that it receives an array for
> > > unique tags in the model.
> > >
> > > Is this true?
> > >
> > >
> > Yes
> >
> >
> > >
> > > The comment in the beginning of the method says: ”… To validate the
> model
> > > we check if we have one outcome named ”other”…
> > >
> > > However, this is not directly checked. The current test will pass
> > outcomes
> > > without “other” if the outcomes have other valid tags.
> > >
> > > So, can a model be compatible without an “other” tag?
> > >
> > >
> > > E.g.
> > >
> > > Are outcomes=[“person-start”], [“person-start”, “person-continue”] or
> > > [“person-start”, “location-start”] compatible models?
> > >
> > >
> > This could be the case, in practice this all probably doesn't matter that
> > much.
> > This check should fail if a user somehow ends up with a model which
> doesn't
> > come from the name finder, e.g. a  pos tagger maxent model shouldn't be
> > validated as valid by this method.
> >
> >
> > >
> > > The method could use Set instead of Lists (yes.. the lists are probably
> > > small for single class models)
> > >
> > >
> > >
> > > Class: BilouCodec
> > >
> > > Method: areOutcomesCompatiple(String[] outcomes) implementation is
> > missing.
> > >
> > > I made one which you can have and review.
> > >
> > >
> > >
> > This would be great if you could contribute it.
> >
> >
> >
> > >
> > > Like the BioCodec I need to know whether a model is compatible without
> > the
> > > “other” tag.
> > >
> > >
> > I think having a valid model with out other is ok.
> >
> >
> >
> > >
> > > Class: BilouNameFinderValidator
> > >
> > > I have written a unit test (OPENNLP-1000) for this class and it shows
> its
> > > buggy. I will soon push a PR for this.
> > >
> > > Failed tests: (should not be valid)
> > >
> > > “Start” followed by “Start”
> > >
> > > “Start” followed by “Other”
> > >
> > > “Start” followed by “Unit”
> > >
> > > “Continue” followed by “Start”
> > >
> > > “Continue” followed by “Unit”
> > >
> > > “Last” followed by “Continue”
> > >
> > > “Last” followed by “Last”
> > >
> > >
> > I reviewed your PR and this looks good. I need to run the evaluation
> tests
> > once with that PR and then we can merge it.
> >
> >
> >
> > >
> > > I can create some JIRA and PR for my test if you like. (I hope you like
> > :-)
> > >
> > >
> >
> > Please do that and send us more PRs, all you contributions are very
> welcome
> > and especially writing tests helps us a lot to improve code quality.
> >
> > Jörn
> >
>

Re: Codec classes (BioCodec and BilouCodec)

Posted by Peter Thygesen <th...@gmail.com>.

Thx for the reply.

BTW
BioCodec also contains a static method: String extractNameType(...) this
method should be move out. Perhaps to a base class called Codec (or
CodecBase... or how you normally named base classes) The method is also
called by BilouCodec, which then is calling directly to BioCodec... (which
smells)

Perhaps another refactoring task?

/Peter

2017-03-14 12:48 GMT+01:00 Joern Kottmann <ko...@gmail.com>:

> On Sun, Mar 12, 2017 at 11:19 PM, Peter Thygesen <
> thygesen.opennlp@gmail.com
> > wrote:
>
> > Hi OpenNLP Developers
> >
> >
> >
> > After I some weeks ago added a PR for NameFinderSequenceValidator test, I
> > spent some time looking into the BilouNameFinderSequenceValidator and
> the
> > Codec classes, trying to understand how they work and I wrote some more
> > tests that were missing (OPENNLP-1000)
> >
> >
> >
> > I have a few comments and questions I hope someone here can answer.
> >
> >
> >
> > General question:
> >
> > Why are NameFinder.START and NameFinder.CONTINUE used I the codec classes
> > and namefindervalidator classes when BioCodec/BilouCodec have their own
> > tags?
> >
> > NameFinder should not have them after my opinion. They belong to the
> Codec
> > classes.
> >
>
> Yes, that is how it should be. The used to be part of the NameFinder but
> then were moved to their own class. It would be good to refactor that.
>
>
> >
> > The BIO doesn’t really match the “Start”, “Continue”, “Other” tags.. even
> > though its “similar”. And for the BILOU its mixed opennlp+bilou:
> > Start,Continue,Last,Unit,Other… Some what messy I think. But not BIG
> thing
> > for me, and yes BILOU also have many names.. (like BMEWO)
> >
>
>
> The tags were never renamed because that would break backward compatibility
> with existing models.
>
>
>
> >
> > Class: BioCodec
> >
> > Method: areOutcomesCompatible(String[] outcomes)
> >
> >
> >
> > I assume that this method is checking that tags in the model are
> compatible
> > with opennlp tags (from the codec). And that it receives an array for
> > unique tags in the model.
> >
> > Is this true?
> >
> >
> Yes
>
>
> >
> > The comment in the beginning of the method says: ”… To validate the model
> > we check if we have one outcome named ”other”…
> >
> > However, this is not directly checked. The current test will pass
> outcomes
> > without “other” if the outcomes have other valid tags.
> >
> > So, can a model be compatible without an “other” tag?
> >
> >
> > E.g.
> >
> > Are outcomes=[“person-start”], [“person-start”, “person-continue”] or
> > [“person-start”, “location-start”] compatible models?
> >
> >
> This could be the case, in practice this all probably doesn't matter that
> much.
> This check should fail if a user somehow ends up with a model which doesn't
> come from the name finder, e.g. a  pos tagger maxent model shouldn't be
> validated as valid by this method.
>
>
> >
> > The method could use Set instead of Lists (yes.. the lists are probably
> > small for single class models)
> >
> >
> >
> > Class: BilouCodec
> >
> > Method: areOutcomesCompatiple(String[] outcomes) implementation is
> missing.
> >
> > I made one which you can have and review.
> >
> >
> >
> This would be great if you could contribute it.
>
>
>
> >
> > Like the BioCodec I need to know whether a model is compatible without
> the
> > “other” tag.
> >
> >
> I think having a valid model with out other is ok.
>
>
>
> >
> > Class: BilouNameFinderValidator
> >
> > I have written a unit test (OPENNLP-1000) for this class and it shows its
> > buggy. I will soon push a PR for this.
> >
> > Failed tests: (should not be valid)
> >
> > “Start” followed by “Start”
> >
> > “Start” followed by “Other”
> >
> > “Start” followed by “Unit”
> >
> > “Continue” followed by “Start”
> >
> > “Continue” followed by “Unit”
> >
> > “Last” followed by “Continue”
> >
> > “Last” followed by “Last”
> >
> >
> I reviewed your PR and this looks good. I need to run the evaluation tests
> once with that PR and then we can merge it.
>
>
>
> >
> > I can create some JIRA and PR for my test if you like. (I hope you like
> :-)
> >
> >
>
> Please do that and send us more PRs, all you contributions are very welcome
> and especially writing tests helps us a lot to improve code quality.
>
> Jörn
>

Re: Codec classes (BioCodec and BilouCodec)

Posted by Joern Kottmann <ko...@gmail.com>.

On Sun, Mar 12, 2017 at 11:19 PM, Peter Thygesen <thygesen.opennlp@gmail.com
> wrote:

> Hi OpenNLP Developers
>
>
>
> After I some weeks ago added a PR for NameFinderSequenceValidator test, I
> spent some time looking into the BilouNameFinderSequenceValidator and the
> Codec classes, trying to understand how they work and I wrote some more
> tests that were missing (OPENNLP-1000)
>
>
>
> I have a few comments and questions I hope someone here can answer.
>
>
>
> General question:
>
> Why are NameFinder.START and NameFinder.CONTINUE used I the codec classes
> and namefindervalidator classes when BioCodec/BilouCodec have their own
> tags?
>
> NameFinder should not have them after my opinion. They belong to the Codec
> classes.
>

Yes, that is how it should be. The used to be part of the NameFinder but
then were moved to their own class. It would be good to refactor that.


>
> The BIO doesn’t really match the “Start”, “Continue”, “Other” tags.. even
> though its “similar”. And for the BILOU its mixed opennlp+bilou:
> Start,Continue,Last,Unit,Other… Some what messy I think. But not BIG thing
> for me, and yes BILOU also have many names.. (like BMEWO)
>


The tags were never renamed because that would break backward compatibility
with existing models.



>
> Class: BioCodec
>
> Method: areOutcomesCompatible(String[] outcomes)
>
>
>
> I assume that this method is checking that tags in the model are compatible
> with opennlp tags (from the codec). And that it receives an array for
> unique tags in the model.
>
> Is this true?
>
>
Yes


>
> The comment in the beginning of the method says: ”… To validate the model
> we check if we have one outcome named ”other”…
>
> However, this is not directly checked. The current test will pass outcomes
> without “other” if the outcomes have other valid tags.
>
> So, can a model be compatible without an “other” tag?
>
>
> E.g.
>
> Are outcomes=[“person-start”], [“person-start”, “person-continue”] or
> [“person-start”, “location-start”] compatible models?
>
>
This could be the case, in practice this all probably doesn't matter that
much.
This check should fail if a user somehow ends up with a model which doesn't
come from the name finder, e.g. a  pos tagger maxent model shouldn't be
validated as valid by this method.


>
> The method could use Set instead of Lists (yes.. the lists are probably
> small for single class models)
>
>
>
> Class: BilouCodec
>
> Method: areOutcomesCompatiple(String[] outcomes) implementation is missing.
>
> I made one which you can have and review.
>
>
>
This would be great if you could contribute it.



>
> Like the BioCodec I need to know whether a model is compatible without the
> “other” tag.
>
>
I think having a valid model with out other is ok.



>
> Class: BilouNameFinderValidator
>
> I have written a unit test (OPENNLP-1000) for this class and it shows its
> buggy. I will soon push a PR for this.
>
> Failed tests: (should not be valid)
>
> “Start” followed by “Start”
>
> “Start” followed by “Other”
>
> “Start” followed by “Unit”
>
> “Continue” followed by “Start”
>
> “Continue” followed by “Unit”
>
> “Last” followed by “Continue”
>
> “Last” followed by “Last”
>
>
I reviewed your PR and this looks good. I need to run the evaluation tests
once with that PR and then we can merge it.



>
> I can create some JIRA and PR for my test if you like. (I hope you like :-)
>
>

Please do that and send us more PRs, all you contributions are very welcome
and especially writing tests helps us a lot to improve code quality.

Jörn