Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2014/01/06 17:01:52 UTC

[jira] [Commented] (OPENNLP-371) Confusing error message in tokenizer training

    [ https://issues.apache.org/jira/browse/OPENNLP-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863059#comment-13863059 ] 

Joern Kottmann commented on OPENNLP-371:
----------------------------------------

We should implement some mechanism to validate the training data and, if it is not valid, throw an exception right away. Most of the components fail in a strange way when they are trained on incorrect training data. Additionally, we could add some kind of warnings which tell the user that there might be something wrong, e.g. not enough data, not enough annotations, etc.
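As a rough illustration (not existing OpenNLP code), such a validator could scan the token samples once before training, fail fast with a descriptive message when no split annotations are present, and warn when the amount of data looks too small. The class name and threshold below are hypothetical; only ObjectStream, TokenSample and Span are assumed from the OpenNLP API.

    import java.io.IOException;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.Span;

    // Hypothetical pre-training check, not part of OpenNLP.
    public class TokenSampleValidator {

        // Assumed threshold below which a "not enough data" warning is printed.
        private static final int MIN_SAMPLES = 100;

        public static void validate(ObjectStream<TokenSample> samples) throws IOException {
            int sampleCount = 0;
            int samplesWithSplits = 0;

            TokenSample sample;
            while ((sample = samples.read()) != null) {
                sampleCount++;

                // Two adjacent token spans with no whitespace between them correspond
                // to a <SPLIT> annotation in the training data.
                Span[] spans = sample.getTokenSpans();
                for (int i = 1; i < spans.length; i++) {
                    if (spans[i - 1].getEnd() == spans[i].getStart()) {
                        samplesWithSplits++;
                        break;
                    }
                }
            }

            if (sampleCount == 0) {
                throw new IllegalArgumentException("The training data is empty!");
            }

            if (samplesWithSplits == 0) {
                throw new IllegalArgumentException(
                    "The training data contains no token splits; the resulting model "
                    + "would have only one outcome. Check that the data contains <SPLIT> tags.");
            }

            if (sampleCount < MIN_SAMPLES) {
                System.err.println("Warning: only " + sampleCount
                    + " training samples found, the model might perform poorly.");
            }
        }
    }

The caller would run this on the sample stream (and reset() the stream afterwards) before handing it to the trainer.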

> Confusing error message in tokenizer training
> ---------------------------------------------
>
>                 Key: OPENNLP-371
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-371
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Tokenizer
>    Affects Versions: tools-1.5.3
>            Reporter: Aliaksandr Autayeu
>            Priority: Minor
>              Labels: model, tokenizer, training
>
> The following error message
> java.lang.IllegalArgumentException: The maxent model is not compatible with the tokenizer!
> 	at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:275)
> 	at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:73)
> 	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:267)
> 	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:231)
> 	at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:293)
> 	at opennlp.tools.tokenize.TokenizerTestUtil.createMaxentTokenModel(TokenizerTestUtil.java:67)
> 	at opennlp.tools.tokenize.TokenizerMETest.testTokenizer(TokenizerMETest.java:54)
> ... cut
> might be confusing. 
> Due to an error in my conversion tool, I tried to train a tokenizer model on data without <SPLIT>s, which resulted in a model with only one outcome. This model did not pass validation in ModelUtil.validateOutcomes(), which is correct; however, the error message is a bit confusing and it took some time to understand what was going on. 
> I would agree that a model with outcomes different from the expected ones is incompatible with the tool, but with fewer outcomes? Is a model with fewer outcomes than expected really incompatible? For example, with the POS tagger I have corpora and models which use a subset of the PTB tagset. 
> However, in the case of the tokenizer this incompatibility makes sense (a model with only one outcome does not work), and here the message might be improved to better indicate the cause. Something like: "The maxent model is not compatible with the tokenizer: outcome XXX is not found". 
> Please advise. Thank you!
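A minimal sketch of the kind of check the reporter suggests (class and method names assumed, not the actual ModelUtil.validateOutcomes() implementation; the MaxentModel package is the one used by OpenNLP 1.5.x and moved in later releases). It relies only on MaxentModel exposing getNumOutcomes() and getOutcome(int):

    import opennlp.model.MaxentModel;

    // Hypothetical variant of the outcome check that names the missing outcome
    // instead of only reporting a generic incompatibility.
    final class OutcomeCheck {

        private OutcomeCheck() {
        }

        static void checkOutcomes(MaxentModel model, String... expectedOutcomes) {
            for (String expected : expectedOutcomes) {
                boolean found = false;
                for (int i = 0; i < model.getNumOutcomes(); i++) {
                    if (model.getOutcome(i).equals(expected)) {
                        found = true;
                        break;
                    }
                }
                if (!found) {
                    throw new IllegalArgumentException(
                        "The maxent model is not compatible with the tokenizer: outcome "
                        + expected + " is not found");
                }
            }
        }
    }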



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)