Posted to dev@opennlp.apache.org by James Kosin <ja...@gmail.com> on 2011/11/10 05:36:35 UTC

Encoding Issues

Everyone,

Me again.  I'm going to be refactoring a lot of the file handling to
abstract away the encoding and make it a bit more seamless, so that
nobody has to remember to do this or do that every time.  Basically,
what I'm proposing is something like this.

1)  A new class called EncodedFile that everyone will have to use when
opening and reading data from a file.  Much like a Stream object or what
we already do... Only it will be one class handling the input/output for
the files.

2) This class will also provide methods to get output and input streams
like the stdio System.out and System.in variables; or be able to replace
them with new ones that have the correct encoding specified.

3) We may also want to be able to specify the input and output encoding
separately... So, I'll be adding some of that; however, the first
version may support only a single encoding for both.
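
A rough idea of what the three points could look like, as a minimal
sketch only. EncodedFile is the name from the proposal; every method
name below (readAll, writeAll, stdin, stdout) is an assumption made for
illustration and not existing OpenNLP API. It uses one charset for both
directions, as in the "first version" of point 3.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch only: EncodedFile is the class name from the proposal,
// every method name below is an assumption made for illustration.
final class EncodedFile {
    private final Path path;
    private final Charset charset; // one charset for reading and writing (point 3, first version)

    EncodedFile(Path path, Charset charset) {
        this.path = path;
        this.charset = charset;
    }

    // Point 1: a single place that decodes the file's bytes into a String.
    String readAll() {
        try {
            return new String(Files.readAllBytes(path), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Point 1: a single place that encodes a String into the file's bytes.
    void writeAll(String text) {
        try {
            Files.write(path, text.getBytes(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Point 2: stand-ins for System.in and System.out with an explicit charset.
    static BufferedReader stdin(Charset charset) {
        return new BufferedReader(new InputStreamReader(System.in, charset));
    }

    static PrintWriter stdout(Charset charset) {
        return new PrintWriter(new OutputStreamWriter(System.out, charset), true);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("encoded-file", ".txt");
        EncodedFile file = new EncodedFile(tmp, StandardCharsets.ISO_8859_1);
        file.writeAll("J\u00f6rn");                      // stored as 4 bytes in ISO-8859-1
        stdout(StandardCharsets.UTF_8).println(file.readAll());
        Files.delete(tmp);
    }
}
```

Separate input and output encodings (the later part of point 3) would
mean holding two Charset fields instead of one.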

Let me know if anyone wants anything else added to this list.

Thanks,
James

Re: Encoding Issues

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/10/11 9:35 PM, william.colen@gmail.com wrote:
> Also, do you think it would be interesting if we used Apache Commons
> libraries? Maybe there is a ready-to-use solution there.
>
Then we need to depend on it, and the current solution is good,
especially after your refactoring for 1.5.2.

Jörn

Re: Encoding Issues

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
Also, do you think it would be interesting if we used Apache Commons
libraries? Maybe there is a ready-to-use solution there.

On Thu, Nov 10, 2011 at 3:59 PM, Aliaksandr Autayeu <al...@autayeu.com> wrote:

> Sounds interesting. And I would be cautious to avoid reinventing the wheel
> - the standard Java way is quite good. But maybe I don't understand the
> code or your proposal well enough yet. James, maybe before jumping into
> it, you can make a small before-after sample piece of code to better
> illustrate your idea? A snap of code before, a snap of code after. And a
> snap of "client" code before and after? What do you think?
>
> regards,
> Aliaksandr

Re: Encoding Issues

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/10/11 6:59 PM, Aliaksandr Autayeu wrote:
> Sounds interesting. And I would be cautious to avoid reinventing the wheel
> - the standard Java way is quite good. But maybe I don't understand the
> code or your proposal well enough yet. James, maybe before jumping into
> it, you can make a small before-after sample piece of code to better
> illustrate your idea? A snap of code before, a snap of code after. And a
> snap of "client" code before and after? What do you think?

We don't really have encoding issues in OpenNLP, because the whole API
relies on strings, and strings are always UTF-16 in Java. The only place
where we need to deal with encoding is our command line interface: the
tools that read in training data, the data transformation tools, the
evaluation tools, and the demo tools that read plain text from the console.
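
To illustrate the point, here is a small sketch (the class and method
names are made up for this example): the charset only matters at the
moment bytes are decoded into a String. The same bytes decoded with the
wrong charset give a different String, and once decoded the String
carries no encoding of its own.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenNLP.
final class BoundaryDecodingDemo {

    // The only encoding-sensitive step: turning raw bytes into a String.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Jörn" encoded as UTF-8 takes 5 bytes, because ö needs two.
        byte[] utf8Bytes = "J\u00f6rn".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4 characters
        System.out.println(wrong.length()); // 5 characters, ö became two wrong ones
    }
}
```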

Jörn


Re: Encoding Issues

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
Sounds interesting. And I would be cautious to avoid reinventing the wheel
- the standard Java way is quite good. But maybe I don't understand the
code or your proposal well enough yet. James, maybe before jumping into
it, you can make a small before-after sample piece of code to better
illustrate your idea? A snap of code before, a snap of code after. And a
snap of "client" code before and after? What do you think?

regards,
Aliaksandr


Re: Encoding Issues

Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/10/11 5:36 AM, James Kosin wrote:
> Everyone,
>
> Me again.  I'm going to be refactoring a lot of the file handling to
> abstract away the encoding and make it a bit more seamless, so that
> nobody has to remember to do this or do that every time.
> Basically, what I'm proposing is something like this.
>

Isn't it already very simple, and doesn't it follow the standard Java way
of doing it? When you want to read or write a String,
you need to provide an encoding. I can't see how we can make this easier.

The only issue I see is that we don't do that for the tools in the
format package, and for the tagging tools,
which read directly from stdin and write directly to stdout.

The format package should be changed to write to a file directly instead
of to stdout, and to use an encoding parameter for that.
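
As a sketch of what that change could look like (FormatOutputSketch and
writeSamples are invented names, not the actual format package code),
the converter would take the target and the encoding as parameters
instead of printing to System.out:

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the real OpenNLP format package.
final class FormatOutputSketch {

    // Encodes each sample with the given charset, one sample per line.
    static void writeSamples(List<String> samples, OutputStream out, Charset encoding) {
        try {
            Writer writer = new BufferedWriter(new OutputStreamWriter(out, encoding));
            for (String sample : samples) {
                writer.write(sample);
                writer.write('\n');
            }
            writer.flush(); // flush only; closing the stream is left to the caller
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // File variant: the encoding comes from a parameter, not from the platform default.
    static void writeSamples(List<String> samples, File target, Charset encoding) {
        try (OutputStream out = new FileOutputStream(target)) {
            writeSamples(samples, out, encoding);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeSamples(Arrays.asList("caf\u00e9"), buffer, StandardCharsets.UTF_8);
        System.out.println(buffer.size()); // 6 bytes: é takes two in UTF-8, plus the newline
    }
}
```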

I am not sure whether we should update our tagging tools: if we use an
encoding other than the default, the console will fail to display the
output text correctly.

> 1)  A new class called EncodedFile that everyone will have to use when
> opening and reading data from a file.  Much like a Stream object or
> what we already do... Only it will be one class handling the
> input/output for the files.
>
> 2) This class will also provide methods to get output and input
> streams like the stdio System.out and System.in variables; or be able
> to replace them with new ones that have the correct encoding specified.
>
> 3) We may also want to be able to specify the input and output
> encoding separately... So, I'll be adding some of that; however, the
> first version may support only a single encoding for both.

With the cmd line parsing tool, these things are already easier, because
it can directly give you a File or Charset object
for a command line argument. Maybe there is even more we can do to
simplify that further.
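
For readers who have not seen the idea, here is a minimal stand-alone
sketch of turning a command line argument into a Charset. It is not the
actual cmd line parsing code; the class name, the method name and the
"-encoding" flag used here are assumptions for illustration. Only
Charset.forName and Charset.defaultCharset are standard Java.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not the real OpenNLP argument parser.
final class EncodingArgumentSketch {

    // Returns the charset named after the (assumed) "-encoding" flag,
    // or the platform default when the flag is absent.
    static Charset encodingFrom(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-encoding".equals(args[i])) {
                // Throws an IllegalArgumentException subclass for unknown or malformed names.
                return Charset.forName(args[i + 1]);
            }
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        Charset charset = encodingFrom(new String[] {"-encoding", "UTF-8"});
        System.out.println(charset.name());
    }
}
```

Falling back to the default charset keeps console tools readable on the
user's terminal when no encoding is given.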

Jörn