Posted to issues@opennlp.apache.org by "Martin Wiesner (Jira)" <ji...@apache.org> on 2023/09/01 15:54:00 UTC
[jira] [Updated] (OPENNLP-1512) Fix incorrect encoding used in Conll02NameSampleStream
[ https://issues.apache.org/jira/browse/OPENNLP-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martin Wiesner updated OPENNLP-1512:
------------------------------------
Description:
While working on OPENNLP-1190, I tested the example from the OpenNLP documentation that converts the esp.train file to the OpenNLP format, see: [https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002]
I ran
{{opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt}}
When I checked the output corpus (txt) file, I noticed incorrect symbols being written there.
A quick debugging session revealed that the original files were ISO_8859_1 encoded. However, line 94 of Conll02NameSampleStream assumed UTF-8 encoding. As a result, accented letters and other special characters of the Spanish alphabet were decoded incorrectly and ended up as garbage in the resulting UTF-8 encoded file (reason: the input was interpreted with the wrong character set).
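To illustrate the mismatch, here is a minimal sketch that is independent of the OpenNLP code base. The class and method names are hypothetical and the sample word is made up; it only shows what happens when ISO_8859_1 bytes are decoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenNLP.
public class EncodingMismatchDemo {

    // Decodes the bytes as UTF-8, i.e. the wrong assumption described above.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes the bytes with the charset the file was actually written in.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Málaga" written with a Unicode escape so the source file encoding does not matter.
        byte[] raw = "M\u00e1laga".getBytes(StandardCharsets.ISO_8859_1);

        // The single byte 0xE1 is not a valid UTF-8 sequence on its own,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println("decoded as UTF-8:      " + decodeAsUtf8(raw));
        System.out.println("decoded as ISO_8859_1: " + decodeAsLatin1(raw));
    }
}
```

Once the replacement character is in the string, the original letter cannot be recovered when the text is written out again, which matches the garbage seen in the converted corpus.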
Therefore, _Conll02NameSampleStream_ needs a fix to read the original files in ISO_8859_1.
With this measure in place, the accents á, é, ... are correctly written to the resulting converted training corpus file.
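The intended behaviour can be sketched with plain java.nio, assuming the input really is ISO_8859_1 encoded. This is not the actual change to Conll02NameSampleStream; the class name, method name and file arguments are hypothetical and only illustrate reading with the correct charset before writing UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of OpenNLP.
public class Latin1ToUtf8 {

    // Reads 'in' as ISO_8859_1 and writes the same lines to 'out' as UTF-8.
    static void transcode(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.ISO_8859_1);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Latin1ToUtf8 <input-file> <output-file>");
            return;
        }
        transcode(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Because the decoding step uses the charset the file was written in, characters such as á and é survive the round trip into the UTF-8 output.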
> Fix incorrect encoding used in Conll02NameSampleStream
> ------------------------------------------------------
>
> Key: OPENNLP-1512
> URL: https://issues.apache.org/jira/browse/OPENNLP-1512
> Project: OpenNLP
> Issue Type: Improvement
> Components: Formats, Name Finder
> Affects Versions: 2.3.0
> Reporter: Martin Wiesner
> Assignee: Martin Wiesner
> Priority: Minor
> Fix For: 2.3.1