Posted to dev@opennlp.apache.org by James Kosin <ja...@gmail.com> on 2011/11/18 02:26:40 UTC

encoding...

As everyone may know, I'm on an encoding head hunt... but would like some
feedback on some changes coming soon.

For the CLI only... really.  The CLI often uses the platform's default
encoding, which may or may not be desirable.  One of the reasons is that
the input or output may become corrupted, causing training issues or even
usage issues for the operator.  I'm not sure if the | pipe operator has
the same issues; however, a recent check of some converted files proved
that the platform encoding may be undesirable, especially if the output
encoding is unable to handle the input characters from another encoding.
Internally, the classes opening and reading files don't have this issue,
so the libraries themselves are safe.
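To make this concrete, here is a minimal Java sketch (illustration only,
not OpenNLP code; the file name is made up) of platform-dependent versus
explicit decoding:

import java.io.*;
import java.nio.charset.Charset;

public class EncodingDemo {
    public static void main(String[] args) throws IOException {
        File f = new File("training.txt"); // assumed to exist

        // Platform-dependent: FileReader always decodes with the
        // platform default encoding, so the same file may read
        // differently on different machines.
        Reader platform = new FileReader(f);

        // Explicit: the charset is fixed regardless of the platform,
        // which is what the CLI should do when the user asks for it.
        Reader explicit = new InputStreamReader(
                new FileInputStream(f), Charset.forName("UTF-8"));

        platform.close();
        explicit.close();
    }
}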

Any ideas on how we can test these things in our JUnit tests would be a
definite +++
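
One possibility is a round-trip test that pins the charset explicitly, so
the build never depends on the platform default. A rough JUnit 4 sketch
(the sample string and charset are arbitrary choices):

import static org.junit.Assert.assertEquals;

import java.io.*;
import java.nio.charset.Charset;
import org.junit.Test;

public class EncodingRoundTripTest {

    // Non-ASCII characters expose corruption quickly.
    private static final String SAMPLE = "Grüße, Jörn, Привет";

    @Test
    public void utf8SurvivesRoundTrip() throws IOException {
        Charset utf8 = Charset.forName("UTF-8");

        // Write with an explicit charset ...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, utf8);
        writer.write(SAMPLE);
        writer.close();

        // ... and read back with the same charset; a mismatch anywhere
        // in between shows up as a failed comparison.
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(out.toByteArray()), utf8));
        assertEquals(SAMPLE, reader.readLine());
    }
}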



Re: encoding...

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/18/11 3:56 PM, Aliaksandr Autayeu wrote:
> Please, see my comments below.
>
> On Fri, Nov 18, 2011 at 10:18 AM, Jörn Kottmann<ko...@gmail.com>  wrote:
>
>> On 11/18/11 2:26 AM, James Kosin wrote:
>>
>>> As everyone may know, I'm on an encoding head hunt... but would like
>>> some feedback on some changes coming soon.
>>>
>>> For the CLI only... really.  The CLI often uses the platform's default
>>> encoding, which may or may not be desirable.  One of the reasons
>>>
> +1. I think that keeping the platform encoding as the default is fine,
> given that:
> 1) the user is aware which encoding is being used (might be difficult
> with pipes)
> 2) there is a way to specify exactly which encoding to use
> 3) there is a choice of input and output
>
>
>>> is that the input or output may become corrupted, causing training
>>> issues or even usage issues for the operator.  I'm not sure if the |
>>> pipe operator has the same issues; however, a recent check of some
>>> converted files proved that the platform encoding may be undesirable,
>>> especially if the output encoding is unable to handle the input
>>> characters from another encoding.  Internally, the classes opening and
>>> reading files don't have this issue, so the libraries themselves are
>>> safe.
>>>
>> In my opinion it was just a bad decision to let the formats package
>> write the transformed text to standard out.
>> I suggest that we change it and always write to an output file instead.
>>
> I would make that an option; console output might be useful to somebody,
> for example it allows easy combination of tools without having to write
> to disk. But there should always be an option to write to a file
> explicitly (not via >) and to specify the encoding, also explicitly.
> This might be needed to handle difficult cases. It might also be needed
> to specify different encodings for input and output, but then we might
> have to deal with encoding incompatibility, and I would avoid taking
> this heavy burden on our shoulders.


The formats package is there to transform a corpus into the OpenNLP
training format. Since our trainers only accept files as input and the
formats package can only write to the console, a user always ends up
redirecting the output of the formats package to a file.

For this reason I think we should change the formats package to always
write to a file. Usually you don't want to see thousands of lines of
training data on your console anyway, and if you do, it's better to use
the less tool.

>> We should maybe also echo the encoding to the console, so the user
>> knows which one was used.
>>
> This might interfere with pipes, wouldn't it? The same issue that we had
> with the banner.
>
Well, these I would like to remove from the formats package. And it
depends a bit: you have stdout and stderr, and both are usually
redirected separately.
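
Something like this (a sketch, not actual OpenNLP code) would keep such a
notice out of the data stream:

import java.nio.charset.Charset;

public class EncodingNotice {
    public static void main(String[] args) {
        // The notice goes to stderr, so piping or redirecting stdout
        // never mixes it into the data.
        System.err.println("Using encoding: " + Charset.defaultCharset().name());
        System.out.println("actual tool output goes here");
    }
}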

>> Should we also change our small demo tools? There, I believe, it is
>> confusing when the user uses an encoding and then cannot see the result
>> on the console.
> My understanding is that there should be:
> 1) an option to use pipes, for input or for output
> 2) an option to specify the input or output file explicitly
> 3) an option to specify the input and output encodings explicitly
>
> IMO, this should give enough flexibility. For example, it allows one to
> load a file from disk into the first tool (let's say, the tokenizer)
> using some custom encoding, then pass it via pipe (without slowing down
> the process by using the disk) to the POS tagger, then to the parser,
> and then to disk. And if one uses a system with a wide enough encoding,
> say Linux with UTF-8, one can avoid specifying the encoding; in other
> cases one always has the option to specify an encoding to be safe. A
> couple of examples:
>
> 1) via files: opennlp Tokenizer -input file-1251.txt -input-encoding
> win1251 -output file-utf8.txt -output-encoding utf8
> 2) via pipes; one should take care of the system encoding in the middle,
> so that the pipe does not corrupt anything. Suppose the system encoding
> is utf8: opennlp Tokenizer -input file-1251.txt -input-encoding win1251
> -output-encoding utf8 | opennlp POSTagger | opennlp Parser -output
> file-utf8.txt
>
> Something like this... What do you think?

The whole CLI interface for processing data doesn't really work out; it
was never designed or intended to do more than a bit of testing. The only
reason it is there is that it served as our documentation on how to use
OpenNLP for quite some time.

Once in a while there is a user who wants to use OpenNLP with standard
shell commands to process a directory which contains quite a few files.
The shell then starts one process per file, which is extremely slow
because OpenNLP needs to load the models for every file; if you have a
pipeline with a couple of models, this can easily take 30 seconds per
file.

I really would like to keep the CLI demo tools as they are. The demo
tools are the ones which can load a model and then process text input
from stdin and write the output to stdout.

The formats package should be changed to work with files only, and then
all our issues are more or less solved.

When people want to do more advanced processing they should use our API.
OpenNLP is a Java library, not a tool with a user interface.
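
For the record, loading a model once through the API and then reusing it
looks roughly like this (a minimal sketch; the model file name is just an
example and error handling is omitted):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class ApiExample {
    public static void main(String[] args) throws Exception {
        // Load the model once ...
        InputStream modelIn = new FileInputStream("en-token.bin");
        TokenizerModel model = new TokenizerModel(modelIn);
        modelIn.close();

        Tokenizer tokenizer = new TokenizerME(model);

        // ... then reuse it for every input, instead of paying the
        // model loading cost once per shell invocation.
        for (String line : args) {
            String[] tokens = tokenizer.tokenize(line);
            System.out.println(java.util.Arrays.toString(tokens));
        }
    }
}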

Jörn

Re: encoding...

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
Please, see my comments below.

On Fri, Nov 18, 2011 at 10:18 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 11/18/11 2:26 AM, James Kosin wrote:
>
>> As everyone may know, I'm on an encoding head hunt... but would like
>> some feedback on some changes coming soon.
>>
>> For the CLI only... really.  The CLI often uses the platform's default
>> encoding, which may or may not be desirable.  One of the reasons
>>
+1. I think that keeping the platform encoding as the default is fine,
given that:
1) the user is aware which encoding is being used (might be difficult
with pipes)
2) there is a way to specify exactly which encoding to use
3) there is a choice of input and output
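
Regarding point 1, the JVM at least makes it easy to tell the user what
the default is; a tiny sketch:

import java.nio.charset.Charset;

public class ShowDefaultEncoding {
    public static void main(String[] args) {
        // What Java falls back to when no charset is given explicitly.
        System.out.println("file.encoding  = "
                + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = "
                + Charset.defaultCharset().name());
    }
}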


>> is that the input or output may become corrupted, causing training
>> issues or even usage issues for the operator.  I'm not sure if the |
>> pipe operator has the same issues; however, a recent check of some
>> converted files proved that the platform encoding may be undesirable,
>> especially if the output encoding is unable to handle the input
>> characters from another encoding.  Internally, the classes opening and
>> reading files don't have this issue, so the libraries themselves are
>> safe.
>>
>
> In my opinion it was just a bad decision to let the formats package
> write the transformed text to standard out.
> I suggest that we change it and always write to an output file instead.
>
I would make that an option; console output might be useful to somebody,
for example it allows easy combination of tools without having to write
to disk. But there should always be an option to write to a file
explicitly (not via >) and to specify the encoding, also explicitly. This
might be needed to handle difficult cases. It might also be needed to
specify different encodings for input and output, but then we might have
to deal with encoding incompatibility, and I would avoid taking this
heavy burden on our shoulders.


>
> We should maybe also echo the encoding to the console, so the user
> knows which one was used.
>
This might interfere with pipes, wouldn't it? The same issue that we had
with the banner.


>
> Should we also change our small demo tools? There, I believe, it is
> confusing when the user uses an encoding and then cannot see the result
> on the console.

My understanding is that there should be:
1) an option to use pipes, for input or for output
2) an option to specify the input or output file explicitly
3) an option to specify the input and output encodings explicitly

IMO, this should give enough flexibility. For example, it allows one to
load a file from disk into the first tool (let's say, the tokenizer)
using some custom encoding, then pass it via pipe (without slowing down
the process by using the disk) to the POS tagger, then to the parser, and
then to disk. And if one uses a system with a wide enough encoding, say
Linux with UTF-8, one can avoid specifying the encoding; in other cases
one always has the option to specify an encoding to be safe. A couple of
examples:

1) via files: opennlp Tokenizer -input file-1251.txt -input-encoding
win1251 -output file-utf8.txt -output-encoding utf8
2) via pipes; one should take care of the system encoding in the middle,
so that the pipe does not corrupt anything. Suppose the system encoding
is utf8: opennlp Tokenizer -input file-1251.txt -input-encoding win1251
-output-encoding utf8 | opennlp POSTagger | opennlp Parser -output
file-utf8.txt
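
In code, the proposed option handling might look roughly like this (a
sketch only; the helper names and the mapping from the flags above are
hypothetical, not existing OpenNLP code):

import java.io.*;
import java.nio.charset.Charset;

public class IoOptions {

    // -input / -input-encoding: fall back to stdin and the platform
    // default when the options are absent.
    static Reader openInput(String file, String encoding) throws IOException {
        InputStream in = (file != null) ? new FileInputStream(file) : System.in;
        Charset cs = (encoding != null) ? Charset.forName(encoding)
                                        : Charset.defaultCharset();
        return new InputStreamReader(in, cs);
    }

    // -output / -output-encoding: fall back to stdout and the platform
    // default when the options are absent.
    static Writer openOutput(String file, String encoding) throws IOException {
        OutputStream out = (file != null) ? new FileOutputStream(file) : System.out;
        Charset cs = (encoding != null) ? Charset.forName(encoding)
                                        : Charset.defaultCharset();
        return new OutputStreamWriter(out, cs);
    }

    // Demo: copy input to output, re-encoding on the way.
    // Usage: java IoOptions [inFile [inEnc [outFile [outEnc]]]]
    public static void main(String[] args) throws IOException {
        Reader in = openInput(args.length > 0 ? args[0] : null,
                args.length > 1 ? args[1] : null);
        Writer out = openOutput(args.length > 2 ? args[2] : null,
                args.length > 3 ? args[3] : null);
        for (int c; (c = in.read()) != -1; ) {
            out.write(c);
        }
        in.close();
        out.close();
    }
}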

Something like this... What do you think?

Aliaksandr

Re: encoding...

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/18/11 2:26 AM, James Kosin wrote:
> As everyone may know, I'm on an encoding head hunt... but would like
> some feedback on some changes coming soon.
>
> For the CLI only... really.  The CLI often uses the platform's default
> encoding, which may or may not be desirable.  One of the reasons is that
> the input or output may become corrupted, causing training issues or
> even usage issues for the operator.  I'm not sure if the | pipe operator
> has the same issues; however, a recent check of some converted files
> proved that the platform encoding may be undesirable, especially if the
> output encoding is unable to handle the input characters from another
> encoding.  Internally, the classes opening and reading files don't have
> this issue, so the libraries themselves are safe.

In my opinion it was just a bad decision to let the formats package write
the transformed text to standard out.
I suggest that we change it and always write to an output file instead.

We should maybe also echo the encoding to the console, so the user
knows which one was used.

Should we also change our small demo tools? There, I believe, it is
confusing when the user uses an encoding and then cannot see the result
on the console.

Jörn