Posted to issues@opennlp.apache.org by "Aliaksandr Autayeu (Created) (JIRA)" <ji...@apache.org> on 2011/11/12 16:34:51 UTC

[jira] [Created] (OPENNLP-371) Confusing error message in tokenizer training

Confusing error message in tokenizer training
---------------------------------------------

                 Key: OPENNLP-371
                 URL: https://issues.apache.org/jira/browse/OPENNLP-371
             Project: OpenNLP
          Issue Type: Improvement
          Components: Tokenizer
    Affects Versions: tools-1.5.3-incubating
            Reporter: Aliaksandr Autayeu
            Priority: Minor


The following error message

java.lang.IllegalArgumentException: The maxent model is not compatible with the tokenizer!
	at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:275)
	at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:73)
	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:267)
	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:231)
	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:293)
	at opennlp.tools.tokenize.TokenizerTestUtil.createMaxentTokenModel(TokenizerTestUtil.java:67)
	at opennlp.tools.tokenize.TokenizerMETest.testTokenizer(TokenizerMETest.java:54)
... cut

might be confusing. 

Due to an error in my conversion tool, I tried to train a tokenizer model on data without <SPLIT> tags, which resulted in a model with only one outcome. This model did not pass validation in ModelUtil.validateOutcomes(), which is correct. However, the error message is a bit confusing, and it took me some time to understand what was going on.

I would agree that a model with different outcomes than expected is incompatible with the tool, but what about one with fewer outcomes? Is a model with fewer outcomes than expected really incompatible? For example, for the POS tagger I have corpora and models that use a subset of the PTB tagset.

However, in the case of the tokenizer this incompatibility makes sense (a model with one outcome does not work), and here the message could be improved to indicate the cause better. Something like: "The maxent model is not compatible with the tokenizer: outcome XXX is not found".
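
To illustrate the kind of message suggested above, here is a minimal sketch that lists the expected outcomes a model lacks. It is not OpenNLP code: the class OutcomeCheck and its methods are hypothetical, the outcome labels "T" (split) and "F" (no split) are assumed for illustration, and the message wording is slightly adapted so that it can name more than one missing outcome.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutcomeCheck {

    // Hypothetical helper, not part of OpenNLP: returns the expected
    // outcomes that are absent from the model's outcome list.
    static List<String> missingOutcomes(List<String> modelOutcomes, String... expected) {
        List<String> missing = new ArrayList<>();
        for (String outcome : expected) {
            if (!modelOutcomes.contains(outcome)) {
                missing.add(outcome);
            }
        }
        return missing;
    }

    // Builds a descriptive message, or returns null when nothing is missing.
    static String describe(List<String> modelOutcomes, String... expected) {
        List<String> missing = missingOutcomes(modelOutcomes, expected);
        if (missing.isEmpty()) {
            return null;
        }
        return "The maxent model is not compatible with the tokenizer: outcome(s) not found: "
            + String.join(", ", missing);
    }

    public static void main(String[] args) {
        // Assumed labels: a model trained on data without <SPLIT> knows only "F".
        System.out.println(describe(Arrays.asList("F"), "T", "F"));
    }
}
```

A message of this shape would point directly at the missing split outcome instead of leaving the user to inspect the model.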

Please advise. Thank you!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-371) Confusing error message in tokenizer training

Posted by "James Kosin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149125#comment-13149125 ] 

James Kosin commented on OPENNLP-371:
-------------------------------------

Actually, better documentation, or getting the trainer to catch situations like this, would be nice.

                
> Confusing error message in tokenizer training
> ---------------------------------------------
>
>                 Key: OPENNLP-371
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-371
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Tokenizer
>    Affects Versions: tools-1.5.3-incubating
>            Reporter: Aliaksandr Autayeu
>            Priority: Minor
>              Labels: model, tokenizer, training


[jira] [Commented] (OPENNLP-371) Confusing error message in tokenizer training

Posted by "Aliaksandr Autayeu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149095#comment-13149095 ] 

Aliaksandr Autayeu commented on OPENNLP-371:
--------------------------------------------

To reproduce the error message: remove the <SPLIT> tags from token.train and run TokenizerMETest.
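
A quick way to spot such data before training is to count the <SPLIT> tags in it. The following is only a sketch under assumptions: SplitTagCheck is a hypothetical helper, not part of OpenNLP, and the sample sentence merely illustrates the one-sentence-per-line training format.

```java
import java.util.Arrays;
import java.util.List;

public class SplitTagCheck {

    static final String SPLIT_TAG = "<SPLIT>";

    // Hypothetical helper: counts every occurrence of the <SPLIT> tag
    // across all training lines.
    static long countSplitTags(List<String> lines) {
        long count = 0;
        for (String line : lines) {
            int from = 0;
            while ((from = line.indexOf(SPLIT_TAG, from)) >= 0) {
                count++;
                from += SPLIT_TAG.length();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> good = Arrays.asList(
            "Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board<SPLIT>.");
        // Same sentence after a faulty conversion that dropped the tags.
        List<String> broken = Arrays.asList(
            "Pierre Vinken, 61 years old, will join the board.");

        System.out.println("good: " + countSplitTags(good));
        System.out.println("broken: " + countSplitTags(broken));
    }
}
```

A count of zero means the data can produce only one outcome, which is the situation that triggers the exception.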
                

[jira] [Commented] (OPENNLP-371) Confusing error message in tokenizer training

Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149517#comment-13149517 ] 

Joern Kottmann commented on OPENNLP-371:
----------------------------------------

The trained model will not work at all, because it will always make the same decision. As James said, the trainer should recognize that the training data has only one outcome and report an appropriate error message.
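
A check along these lines could look roughly like the sketch below. It is an assumption-laden illustration, not the actual trainer code: SingleOutcomeGuard is a hypothetical class, the training events are reduced to a plain list of outcome labels, and the labels "T" and "F" are assumed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SingleOutcomeGuard {

    // Collects the distinct outcome labels seen in the training events.
    static Set<String> distinctOutcomes(List<String> eventOutcomes) {
        return new TreeSet<>(eventOutcomes);
    }

    // Hypothetical pre-training check: rejects data with fewer than two outcomes.
    static void requireAtLeastTwoOutcomes(List<String> eventOutcomes) {
        Set<String> seen = distinctOutcomes(eventOutcomes);
        if (seen.size() < 2) {
            throw new IllegalArgumentException(
                "Training data contains only the outcome(s) " + seen
                + "; at least two different outcomes are needed to train a useful model.");
        }
    }

    public static void main(String[] args) {
        // Mixed outcomes: the check passes silently.
        requireAtLeastTwoOutcomes(Arrays.asList("F", "T", "F"));

        // Only one outcome, as with data that has no <SPLIT> tags.
        try {
            requireAtLeastTwoOutcomes(Arrays.asList("F", "F", "F"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at this point, before a model is built, would report the problem in terms of the training data rather than model compatibility.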
                