Posted to dev@opennlp.apache.org by Richard Eckart de Castilho <re...@apache.org> on 2016/03/01 23:13:52 UTC

OpenNLP maxent model trained with wrong encoding

Hi all,

I noticed that the OpenNLP German POS Tagger maxent model available from Sourceforge has been trained using the wrong encoding setting. Apparently the input data was UTF-8, but it was read as ISO8859-1. The perceptron model is not affected. I only examined NER and POS models, not tokenizer or sentence splitter models.
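For illustration (this sketch is mine, not part of the original report), the mismatch can be reproduced with plain Java: text written as UTF-8 but decoded as ISO-8859-1 turns each umlaut into a two-character mojibake sequence, which is the kind of pattern one would expect to see among the affected model's feature strings.

    import java.nio.charset.StandardCharsets;

    public class EncodingMismatchDemo {
        public static void main(String[] args) {
            // A German token as it appears in the UTF-8 training data.
            String original = "über";

            // Encode as UTF-8 (how the data was written) ...
            byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

            // ... then decode as ISO-8859-1 (how the trainer apparently read it).
            String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

            System.out.println(original); // über
            System.out.println(misread);  // Ã¼ber
        }
    }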

Best,

-- Richard

Re: OpenNLP maxent model trained with wrong encoding

Posted by Richard Eckart de Castilho <re...@apache.org>.
Hi again,

The Spanish and Dutch NER models are also affected; it was just a bit more difficult to figure out because the models internally lower-case the features.
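To show why the lower-casing makes this harder to spot (again a sketch of my own, not from the thread): the usual tell-tale of UTF-8 read as Latin-1, the capital "Ã", is itself lower-cased to "ã", so the corrupted feature strings look much less obviously broken.

    import java.nio.charset.StandardCharsets;

    public class LowerCasedMojibake {
        public static void main(String[] args) {
            String token = "für";

            // UTF-8 bytes misread as ISO-8859-1, as in the affected models.
            String misread = new String(token.getBytes(StandardCharsets.UTF_8),
                                        StandardCharsets.ISO_8859_1);

            System.out.println(misread);               // fÃ¼r  -> the capital "Ã" gives it away
            System.out.println(misread.toLowerCase()); // fã¼r  -> after lower-casing the marker is gone
        }
    }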

Cheers,

-- Richard

> On 01.03.2016, at 23:13, Richard Eckart de Castilho <re...@apache.org> wrote:
> 
> Hi all,
> 
> I noticed that the OpenNLP German POS Tagger maxent model available from Sourceforge has been trained using the wrong encoding setting. Apparently the input data was UTF-8, but it was read as ISO8859-1. The perceptron model is not affected. I only examined NER and POS models, not tokenizer or sentence splitter models.
> 
> Best,
> 
> -- Richard