Posted to issues@opennlp.apache.org by "Aliaksandr Autayeu (Commented) (JIRA)" <ji...@apache.org> on 2011/11/12 16:36:51 UTC
[jira] [Commented] (OPENNLP-371) Confusing error message in tokenizer training

    [ https://issues.apache.org/jira/browse/OPENNLP-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149095#comment-13149095 ] 

Aliaksandr Autayeu commented on OPENNLP-371:
--------------------------------------------

To reproduce the error message: remove <SPLIT>s from token.train and run TokenizerMETest
                
> Confusing error message in tokenizer training
> ---------------------------------------------
>
>                 Key: OPENNLP-371
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-371
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Tokenizer
>    Affects Versions: tools-1.5.3-incubating
>            Reporter: Aliaksandr Autayeu
>            Priority: Minor
>              Labels: model, tokenizer, training
>
> The following error message
> java.lang.IllegalArgumentException: The maxent model is not compatible with the tokenizer!
> 	at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:275)
> 	at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:73)
> 	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:267)
> 	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:231)
> 	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:293)
> 	at opennlp.tools.tokenize.TokenizerTestUtil.createMaxentTokenModel(TokenizerTestUtil.java:67)
> 	at opennlp.tools.tokenize.TokenizerMETest.testTokenizer(TokenizerMETest.java:54)
> ... cut
> might be confusing. 
> Due to an error in my conversion tool, I tried to train a tokenizer model on data without <SPLIT>s, which resulted in a model with only one outcome. This model did not pass validation in ModelUtil.validateOutcomes(), which is correct. However, the error message is a bit confusing, and it took some time to understand what was going on. 
> I would agree that a model with different outcomes than expected is incompatible with the tool, but what about one with fewer outcomes? Is a model with fewer outcomes than expected really incompatible? For example, for the POS tagger I have corpora and models that use a subset of the PTB tagset. 
> However, in the case of the tokenizer this incompatibility makes sense (a model with 1 outcome does not work), and here the message could be improved to indicate the cause better. Something like: "The maxent model is not compatible with the tokenizer: outcome XXX is not found". 
> Please advise. Thank you!
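
To illustrate the kind of message suggested above, here is a minimal standalone sketch. It is not the actual OpenNLP code: the class OutcomeCheck and the methods missingOutcomes and requireOutcomes are hypothetical names, and the outcome labels "T" (split) and "F" (no split) are assumed for illustration only. The idea is simply to collect the expected outcomes that the trained model lacks and name them in the exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP.
public class OutcomeCheck {

    /** Returns the expected outcomes that the model does not contain. */
    public static List<String> missingOutcomes(Set<String> modelOutcomes,
                                               String... expectedOutcomes) {
        List<String> missing = new ArrayList<>();
        for (String expected : expectedOutcomes) {
            if (!modelOutcomes.contains(expected)) {
                missing.add(expected);
            }
        }
        return missing;
    }

    /** Throws an exception that names the missing outcomes, if there are any. */
    public static void requireOutcomes(Set<String> modelOutcomes,
                                       String... expectedOutcomes) {
        List<String> missing = missingOutcomes(modelOutcomes, expectedOutcomes);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                "The maxent model is not compatible with the tokenizer: outcome(s) "
                + missing + " not found, model contains " + modelOutcomes);
        }
    }

    public static void main(String[] args) {
        // Assumed labels: a model trained on data without <SPLIT>s
        // ends up with the no-split outcome only.
        Set<String> modelOutcomes = new LinkedHashSet<>(Arrays.asList("F"));
        try {
            requireOutcomes(modelOutcomes, "T", "F");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A check of this shape would also leave room for the POS tagger case: a caller that accepts a subset of the expected outcomes could inspect the list returned by missingOutcomes instead of failing outright.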
