Posted to issues@opennlp.apache.org by "James Kosin (Created) (JIRA)" <ji...@apache.org> on 2011/11/10 05:09:52 UTC

[jira] [Created] (OPENNLP-367) File Encoding Issues

File Encoding Issues
--------------------

                 Key: OPENNLP-367
                 URL: https://issues.apache.org/jira/browse/OPENNLP-367
             Project: OpenNLP
          Issue Type: Bug
          Components: Command Line Interface
    Affects Versions: tools-1.5.2-incubating
         Environment: All
            Reporter: James Kosin
            Assignee: James Kosin


The input and output encodings are not handled correctly.  A good example is the CoNLL 2002 data: even when correctly encoded in UTF-8, it does not work for training unless -Dfile.encoding=UTF-8 is specified on the Java command line.

We already specify the input and expected output encoding on the command line interface with the -encoding parameter.  For some reason this isn't being followed.

I'll work on fixing this for the next major release...  :-)
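
To make the failure mode concrete, here is a minimal sketch using only the JDK (Java 7 or later for StandardCharsets).  The class and method names are hypothetical and are not taken from OpenNLP.  It shows what happens when UTF-8 bytes are decoded with a different charset, which is presumably what occurs when the tools fall back to the platform default instead of honouring -encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical class, not part of OpenNLP.
class DefaultCharsetDemo {

    // Decodes bytes with an explicitly named charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // A six-character word containing one non-ASCII character (n with tilde).
        byte[] utf8Bytes = "Espa\u00f1a".getBytes(StandardCharsets.UTF_8);

        // The charset the JVM uses whenever none is specified.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // Correct: the bytes are decoded with the encoding they were written in (6 characters).
        String good = decode(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println("Decoded as UTF-8:      " + good.length() + " characters");

        // Wrong: each of the two UTF-8 bytes of the accented letter becomes its own character (7 characters).
        String bad = decode(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println("Decoded as ISO-8859-1: " + bad.length() + " characters");
    }
}
```

The sketch prints lengths rather than the strings themselves, so the result does not depend on the console encoding.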


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148167#comment-13148167 ] 

James Kosin commented on OPENNLP-367:
-------------------------------------

Joern,
That is one of the issues.  I was looking at the code, and it looks like someone has taken out the calls that get the encoding from the CLI.  The CoNLL 2002 and CoNLL 2003 code now have hard-coded encodings when opening the files.  I also think that specifying -Dfile.encoding=UTF-8 may have fixed the System.out encoding issue you had with the CoNLL 2002 data; I just didn't realize it at the time.
Anyway, I just want to put this issue to bed once and for all by encapsulating the file opening, reading, etc. into a class and refactoring, so we don't have to remember to do this and that for every new addition.
I was planning on first determining why everything isn't working, which may just be a Windows thing, since Linux is leaning more these days toward UTF-8 as the encoding for the entire OS.
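
One possible shape for such a class is sketched below.  This is only an assumption about what the refactoring could look like: EncodedFiles, openReader and openWriter are hypothetical names, not existing OpenNLP API.  The point is that the charset is a required argument, so no caller can fall back to the platform default by accident.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Hypothetical helper, not part of OpenNLP: every file is opened with an explicit charset.
final class EncodedFiles {

    private EncodedFiles() {
        // Static helpers only.
    }

    // Opens a file for reading and decodes its bytes with the given charset.
    static BufferedReader openReader(File file, Charset charset) throws IOException {
        requireCharset(charset);
        return new BufferedReader(new InputStreamReader(new FileInputStream(file), charset));
    }

    // Opens a file for writing and encodes text with the given charset.
    static BufferedWriter openWriter(File file, Charset charset) throws IOException {
        requireCharset(charset);
        return new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), charset));
    }

    // Rejects a missing charset instead of silently using the platform default.
    private static void requireCharset(Charset charset) {
        if (charset == null) {
            throw new IllegalArgumentException("charset must be given explicitly");
        }
    }
}
```

A caller would pass the value of the -encoding parameter, for example EncodedFiles.openReader(file, Charset.forName(encodingName)).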

Also, I always convert from the original sources whenever possible when doing my tests.  For example, I have the unconverted eng.train, eng.testa and eng.testb files for the English 2003 data.  I've added the CoNLL 2002 data, which hasn't been converted either.  This way I can test most of the system for the NameFinder.

                
> File Encoding Issues
> --------------------
>
>                 Key: OPENNLP-367
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-367
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Command Line Interface
>    Affects Versions: tools-1.5.2-incubating
>         Environment: All
>            Reporter: James Kosin
>            Assignee: James Kosin
>              Labels: encoding, rework, training
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The input and output encodings are not handled correctly.  A good example is the CoNLL 2002 data: even when correctly encoded in UTF-8, it does not work for training unless -Dfile.encoding=UTF-8 is specified on the Java command line.
> We already specify the input and expected output encoding on the command line interface with the -encoding parameter.  For some reason this isn't being followed.
> I'll work on fixing this for the next major release...  :-)


        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151773#comment-13151773 ] 

James Kosin commented on OPENNLP-367:
-------------------------------------

I did testing with the CoNLL 02 data and the encoding is working now without the -Dfile.encoding=UTF-8 ... we can document that as a possible workaround until it is fixed.

I also have to research the areas where we accept the file piped or redirected to the parsers and tokenizers on the CLI.

                

        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148170#comment-13148170 ] 

James Kosin commented on OPENNLP-367:
-------------------------------------

When I took out the line with -Dfile.encoding, the latest code also breaks with the same errors when training.  I'll have to pull a file before the system deletes the training data to be sure that System.out is what is causing this; however, we also use System.in when the user inputs files.  So, I was going to take care of all these situations.
                

        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189617#comment-13189617 ] 

James Kosin commented on OPENNLP-367:
-------------------------------------

I've worked hard on this to be sure everything is covered.

Sorry it took so long on the last one.  The German data seems to be in UTF-8 and the English data for CONLL 03 seems to like both the ISO flavor and the UTF-8 flavor.  I've changed to default both to UTF-8.

Future... platform default encodings just don't cut it in our business.  Windows uses one encoding, Mac another, and some IDEs yet another when debugging; so, everyone needs to watch this.

I currently have scripts set up to train and test with the data I've been able to find for CONLL X and 02 and, thanks to Joern, the complete 03 datasets.

I'll be posting new performance measurements for all these for the next release.

                

        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152547#comment-13152547 ] 

James Kosin commented on OPENNLP-367:
-------------------------------------

I think I have all the converters...  If anyone sees any that are using the default system encoding for input or output, let me know, or submit a patch to this issue.

I'm going to ask on the dev list now whether we need encoding on the input / output streams for the tools that expect to pipe their output to a file or to another model, as in the examples.  It would have been nice to get a class set up, but for now we just have the System.setOut() and System.setIn() functions to change the encoding.


                

        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148916#comment-13148916 ] 

James Kosin commented on OPENNLP-367:
-------------------------------------

Maybe then all we need is to go through the places we use the < (input) and > (output) redirectors on the command prompt and start considering using:

  System.setOut(new PrintStream(System.out, true, "our-favorite encoding"));

and

  System.setIn(...) // haven't figured this one yet...
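
A hedged sketch of both halves, using only the JDK.  System.setIn() takes an InputStream, which carries raw bytes and has no encoding of its own, so for input the encoding would instead be applied by wrapping System.in in an InputStreamReader.  The class and method names here are hypothetical, and taking the encoding from the first argument is only an assumption for the demo.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Illustration only: hypothetical class, not part of OpenNLP.
class StdStreamEncoding {

    // Wraps an output stream so that printed text is encoded with the named charset.
    static PrintStream encodedOut(OutputStream out, String encoding) {
        try {
            return new PrintStream(out, true, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
    }

    // Wraps an input stream in a reader that decodes its bytes with the named charset.
    static BufferedReader encodedIn(InputStream in, String encoding) {
        try {
            return new BufferedReader(new InputStreamReader(in, encoding));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption for this demo: the encoding is the first argument, defaulting to UTF-8.
        String encoding = args.length > 0 ? args[0] : "UTF-8";

        // Output side: replace System.out with a stream that uses the chosen encoding.
        System.setOut(encodedOut(System.out, encoding));

        // Input side: decode redirected or piped input with the same encoding.
        BufferedReader stdin = encodedIn(System.in, encoding);

        // Echo every input line back out.
        String line;
        while ((line = stdin.readLine()) != null) {
            System.out.println(line);
        }
    }
}
```

Run as "java StdStreamEncoding UTF-8 < input.txt > output.txt", this echoes the input while decoding and encoding with the named charset rather than the platform default.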
                

        

[jira] [Updated] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Kosin updated OPENNLP-367:
--------------------------------

    Remaining Estimate: 672h
     Original Estimate: 672h

Added an estimate of the work needed.
                

        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151769#comment-13151769 ] 

James Kosin commented on OPENNLP-367:
-------------------------------------

I found the CoNLL-X data is all in UTF-8; so I removed the parameter for this in the factory.
I also added a comment in the ConllX stream.  The encoding is being set in the factory for this conversion... for some reason, and not in the same place as the other classes.
I also downloaded the data for the CoNLL-X free data set to test and implement the models at some point.  It covers 4 languages.

I'm not done on this one..
                

        

[jira] [Closed] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Kosin closed OPENNLP-367.
-------------------------------

       Resolution: Fixed
    Fix Version/s: tools-1.5.3-incubating

I'm closing this as fixed now.
                

        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147599#comment-13147599 ] 

Joern Kottmann commented on OPENNLP-367:
----------------------------------------

Can you be more specific about why this training does not work? To train with UTF-8 you need to specify this parameter: "-encoding UTF-8". Doesn't that work?
                

        

[jira] [Updated] (OPENNLP-367) File Encoding Issues

Posted by "James Kosin (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Kosin updated OPENNLP-367:
--------------------------------

    Attachment: encoding.patch

I've attached a small patch that touches only a few files, to get everyone's opinion on the problem.  I say small because some of the converters are either relying on the default system encoding...
Anyway, the 3-4 files patched here show two things: (1) they set a new System.out printer with an explicit encoding; I've specified the same encoding as the input encoding described for the class.  (2) You will notice that one of the files, ConllXPOS..., uses the default system encoding by calling PlainTextByLine(in) instead of the other form, PlainTextByLine(in, "encoding").

Basically, I need to review all the encoding usages and try to determine whether they are all proper.  Some may be, and some may need to be adjusted.

Just trying to give everyone a heads up on the issue.

                

        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148350#comment-13148350 ] 

Joern Kottmann commented on OPENNLP-367:
----------------------------------------

The CONLL02 data always has the same encoding, which can be considered part of the data format. I think that was the reason why we hard-coded it.
                

        

[jira] [Commented] (OPENNLP-367) File Encoding Issues

Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151118#comment-13151118 ] 

Joern Kottmann commented on OPENNLP-367:
----------------------------------------

The transformed data from the formats package should be written to an output file, where the user can also specify the encoding. The command line interface might work slightly differently on different platforms, and might also be confusing to use with data that cannot be encoded in the platform default encoding.

The encoding in ConllXPOSSampleStream should either be hard-coded or passed in, but we should not use the platform default.
Hard-coding the encoding that the CONLL-X data is distributed in should be OK.
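
As a rough sketch of that suggestion (not the actual OpenNLP converter code), the JDK-only example below reads input with an encoding fixed by the data format and writes to an output file with an encoding chosen by the user.  The class name Transcoder, the FORMAT_CHARSET constant and the argument order are all hypothetical.  UTF-8 is used as the format encoding only because the thread reports that the CoNLL-X data is distributed in UTF-8.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical class, not part of OpenNLP.
class Transcoder {

    // Assumption: the input format is always distributed in this encoding.
    static final Charset FORMAT_CHARSET = StandardCharsets.UTF_8;

    // Copies text from in to out, decoding with inputCharset and encoding with outputCharset.
    static void transcode(InputStream in, Charset inputCharset,
                          OutputStream out, Charset outputCharset) throws IOException {
        Reader reader = new InputStreamReader(in, inputCharset);
        Writer writer = new OutputStreamWriter(out, outputCharset);
        char[] buffer = new char[8192];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, count);
        }
        // Push any buffered characters through to the underlying stream.
        writer.flush();
    }

    // Usage (hypothetical): java Transcoder <input file> <output file> <output encoding>
    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("Usage: java Transcoder <input> <output> <output encoding>");
            return;
        }
        Charset outputCharset = Charset.forName(args[2]);
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            transcode(in, FORMAT_CHARSET, out, outputCharset);
        }
    }
}
```

Neither side of the copy consults the platform default encoding, so the result should be the same on Windows, Mac and Linux.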
                