Posted to issues@opennlp.apache.org by "Robert (Jira)" <ji...@apache.org> on 2022/01/17 15:36:00 UTC

[jira] [Created] (OPENNLP-1353) DictionaryLemmatizer missing charset

Robert created OPENNLP-1353:
-------------------------------

             Summary: DictionaryLemmatizer missing charset
                 Key: OPENNLP-1353
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1353
             Project: OpenNLP
          Issue Type: Bug
          Components: Lemmatizer
    Affects Versions: 1.9.3
         Environment: Windows 10
            Reporter: Robert


When a DictionaryLemmatizer is initialized, it does not decode the InputStream correctly because no charset is specified.

My dictionary file for the lemmatizer is UTF-8 encoded. When the DictionaryLemmatizer is initialized, the platform default encoding is used because no charset is specified for the InputStream; in my case that is windows-1252. As a result, the correct lemmas are not found for words that contain non-ASCII characters.

For example, my {{lemma.dict}} file contains the following line:
mäuse      NN     maus
which is decoded as:
mÃ¤use    NN    maus
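
The effect can be reproduced without OpenNLP. The following is only a minimal sketch using the Java standard library: the class name CharsetDemo and the helper firstLine are hypothetical names made up for this illustration and are not part of OpenNLP. It decodes the same UTF-8 bytes once with an explicit UTF-8 charset and once with windows-1252, which stands in for the platform default on my machine. The non-ASCII characters are written as unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Hypothetical helper: reads the first line of the stream,
    // decoding the bytes with the given charset.
    static String firstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // One dictionary line ("maeuse" with a-umlaut), encoded as UTF-8
        // like the lemma.dict file described in this issue.
        byte[] utf8Bytes = "m\u00e4use\tNN\tmaus".getBytes(StandardCharsets.UTF_8);

        // Explicit UTF-8: the a-umlaut survives as a single character.
        String correct = firstLine(
            new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);

        // windows-1252 (assumed here as the platform default): each of the
        // two UTF-8 bytes of the a-umlaut becomes a separate character.
        String garbled = firstLine(
            new ByteArrayInputStream(utf8Bytes), Charset.forName("windows-1252"));

        System.out.println("decoded as UTF-8:        " + correct);
        System.out.println("decoded as windows-1252: " + garbled);
        System.out.println("same string: " + correct.equals(garbled));
    }
}
```

A dictionary lookup for the correctly spelled word cannot match the garbled key, which would explain why the lemma is not found. Passing the charset explicitly wherever the dictionary stream is wrapped in a reader, as firstLine does here, avoids the dependency on the platform default.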


