Posted to issues@opennlp.apache.org by "James Kosin (Commented) (JIRA)" <ji...@apache.org> on 2012/01/20 05:31:39 UTC

[jira] [Commented] (OPENNLP-367) File Encoding Issues

    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189617#comment-13189617 ] 

James Kosin commented on OPENNLP-367:
-------------------------------------

I've worked hard on this to be sure everything is covered.

Sorry the last one took so long.  The German data seems to be in UTF-8, and the English data for CoNLL 03 seems to work with both the ISO flavor and the UTF-8 flavor.  I've changed both to default to UTF-8.

Going forward, platform default encodings just don't cut it in our business.  Windows uses one encoding, Mac another, and some IDEs yet another when debugging, so everyone needs to watch this.
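To illustrate why relying on the platform default is risky, here is a minimal sketch.  It is not OpenNLP code: the class name ExplicitEncodingSketch and the helper readFirstLine are made up for this example, and the sample word is arbitrary.  It decodes the same UTF-8 bytes twice, once with the matching charset and once with ISO-8859-1, which is roughly what happens when a platform default does not match the data.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of OpenNLP.
public class ExplicitEncodingSketch {

    // Reads the first line of a stream, decoding the bytes with the charset
    // passed in rather than with the JVM's platform default.
    public static String readFirstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // "Strasse" with a sharp s; in UTF-8 the sharp s takes two bytes (0xC3 0x9F).
        String original = "Stra\u00dfe";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the data was written in round-trips cleanly.
        String correct = readFirstLine(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);

        // Decoding the same bytes as ISO-8859-1 turns each byte into one char,
        // so the two-byte sharp s becomes two unrelated characters.
        String garbled = readFirstLine(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(correct.equals(original)); // true
        System.out.println(garbled.equals(original)); // false
        System.out.println(garbled.length());         // 7 chars instead of 6
    }
}
```

Passing the Charset explicitly means the result no longer depends on the OS, the IDE, or a -Dfile.encoding setting.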

I currently have scripts set up to train and test on the data I've been able to find for CoNLL X and 02 and, thanks to Jorn, the complete 03 datasets.

I'll be posting new performance measurements for all these for the next release.

                
> File Encoding Issues
> --------------------
>
>                 Key: OPENNLP-367
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-367
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Command Line Interface
>    Affects Versions: tools-1.5.2-incubating
>         Environment: All
>            Reporter: James Kosin
>            Assignee: James Kosin
>              Labels: encoding, rework, training
>             Fix For: tools-1.5.3-incubating
>
>         Attachments: encoding.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The input and output encodings are not working correctly or are not properly handled.  A good example is the CoNLL 2002 data: even when correctly encoded in UTF-8, it does not work for training unless -Dfile.encoding=UTF-8 is specified on the Java command line.
> We already specify the input and expected output encoding on the command-line interface with the -encoding parameter.  For some reason this isn't being followed.
> I'll work on fixing this for the next major release...  :-)
