Posted to dev@opennlp.apache.org by James Kosin <ja...@gmail.com> on 2011/11/10 05:36:35 UTC
Encoding Issues
Everyone,
Me again. I'm going to be refactoring a lot of the file handling to
abstract away the encoding and make it a bit more seamless, so everyone
doesn't have to always remember to do this or do that. Basically, what
I'm proposing is something like this.
1) A new class called EncodedFile that everyone will have to use when
opening and reading data from a file. Much like a Stream object or what
we already do, only it will be one class handling the input/output for
the files.
2) This class will also provide methods to get output and input streams
like the stdio System.out and System.in variables, or to replace
them with new ones that have the correct encoding specified.
3) We may also want to be able to specify the input and output encoding
separately, so I'll be adding some of that; however, the first
version may only support a single encoding for both.
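To make the three points concrete, here is a minimal sketch of what such a
class might look like. Only the name EncodedFile comes from the proposal;
every constructor and method name below is an assumption made for
illustration, and nothing here is existing OpenNLP code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the proposed class; not part of OpenNLP.
final class EncodedFile {
    private final Path path;
    private final Charset inputCharset;
    private final Charset outputCharset;

    // Point 3: input and output encodings may be specified separately.
    EncodedFile(Path path, Charset inputCharset, Charset outputCharset) {
        this.path = path;
        this.inputCharset = inputCharset;
        this.outputCharset = outputCharset;
    }

    // "First version" convenience: one encoding for both directions.
    EncodedFile(Path path, Charset charset) {
        this(path, charset, charset);
    }

    // Point 1: a single entry point for reading the file as text.
    BufferedReader openReader() throws IOException {
        return Files.newBufferedReader(path, inputCharset);
    }

    // Point 1: a single entry point for writing the file as text.
    BufferedWriter openWriter() throws IOException {
        return Files.newBufferedWriter(path, outputCharset);
    }

    // Point 2: a stand-in for System.out that encodes with the given charset.
    static PrintStream stdout(Charset charset) {
        try {
            return new PrintStream(System.out, true, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Not expected: the name comes from an existing Charset object.
            throw new IllegalStateException(e);
        }
    }
}
```

Client code would then construct an EncodedFile once and never mention the
encoding again when opening readers or writers.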
Let me know if anyone wants anything else added to this list.
Thanks,
James
Re: Encoding Issues
Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/10/11 9:35 PM, william.colen@gmail.com wrote:
> Also, do you think it would be interesting to use the Apache Commons
> libraries? Maybe there is a ready-to-use solution there.
>
Then we need to depend on it, and the current solution is good,
especially after your refactoring for 1.5.2.
Jörn
Re: Encoding Issues
Posted by "william.colen@gmail.com" <wi...@gmail.com>.
Also, do you think it would be interesting to use the Apache Commons
libraries? Maybe there is a ready-to-use solution there.
On Thu, Nov 10, 2011 at 3:59 PM, Aliaksandr Autayeu
<al...@autayeu.com> wrote:
> Sounds interesting. And I would be cautious to avoid reinventing the wheel
> - the standard Java way is quite good. But maybe I don't understand the
> code or your proposal well enough yet. James, maybe before jumping into
> it, you can make a small before-after sample of code to better illustrate
> your idea? A snippet of code before, a snippet of code after. And a snippet
> of "client" code before and after? What do you think?
>
> regards,
> Aliaksandr
Re: Encoding Issues
Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/10/11 6:59 PM, Aliaksandr Autayeu wrote:
> Sounds interesting. And I would be cautious to avoid reinventing the wheel
> - the standard Java way is quite good. But maybe I don't understand the
> code or your proposal well enough yet. James, maybe before jumping into
> it, you can make a small before-after sample of code to better illustrate
> your idea? A snippet of code before, a snippet of code after. And a snippet
> of "client" code before and after? What do you think?
We don't really have encoding issues in OpenNLP because the whole API
relies on Strings, and Strings are always UTF-16 in Java. The only place
where we need to deal with encoding is our command line interface, where
we read in training data and have data transformation tools, evaluation
tools, and demo tools that read plain text from the console.
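A small illustration of that boundary, using only the standard library (the
class and method names are made up for this example): once bytes have been
decoded, the String is the same regardless of where it came from, so the
only decision that matters is the charset chosen at the moment of decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows that encoding matters where bytes become Strings.
final class BoundaryDecodingDemo {

    // Decode raw bytes into a String using an explicit charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Jörn" encoded as UTF-8: the o-umlaut takes two bytes, so 5 bytes in total.
        byte[] utf8Bytes = "J\u00f6rn".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4 characters, as intended
        System.out.println(wrong.length()); // 5 characters: each byte became one char
    }
}
```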
Jörn
Re: Encoding Issues
Posted by Aliaksandr Autayeu <al...@autayeu.com>.
Sounds interesting. And I would be cautious to avoid reinventing the wheel
- the standard Java way is quite good. But maybe I don't understand the
code or your proposal well enough yet. James, maybe before jumping into
it, you can make a small before-after sample of code to better illustrate
your idea? A snippet of code before, a snippet of code after. And a snippet
of "client" code before and after? What do you think?
regards,
Aliaksandr
Re: Encoding Issues
Posted by Jörn Kottmann <ko...@gmail.com>.
On 11/10/11 5:36 AM, James Kosin wrote:
> Everyone,
>
> Me again. I'm going to be refactoring a lot of the file handling to
> abstract away the encoding and make it a bit more seamless, so
> everyone doesn't have to always remember to do this or do that.
> Basically, what I'm proposing is something like this.
>
Isn't it already very simple, and doesn't it follow the standard Java way
of doing it? When you want to read or write a String,
you need to provide an encoding. I can't see how we can make this easier.
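For reference, the standard Java way looks roughly like this. The class and
method names are invented for the example; the stream and reader classes are
plain java.io.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: the caller always names the encoding explicitly.
final class ExplicitEncodingIo {

    // Write each String as one line, encoded with the given charset.
    static void writeLines(File file, Charset charset, List<String> lines) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), charset))) {
            for (String line : lines) {
                out.write(line);
                out.newLine();
            }
        }
    }

    // Read all lines, decoding the file's bytes with the given charset.
    static List<String> readLines(File file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```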
The only issue I see is that we don't do that for the tools in the
format package and for the tagging tools,
which read from stdin and write to stdout directly.
The format package should be changed to write to a file directly instead
of to stdout, and use an encoding parameter for that.
I am not sure if we should update our tagging tools: if we use an
encoding different from the default, the console will fail to display
the output text correctly.
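A sketch of what wrapping the console streams would involve, and why a
mismatch garbles the display. All names here are hypothetical; the helper
that captures bytes only demonstrates that the same text produces different
byte sequences under different encodings, which a console expecting the
default encoding would then misread.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Illustrative only: console streams with an explicit encoding.
final class ConsoleEncodingDemo {

    // stdin decoded with the given charset instead of the platform default.
    static BufferedReader encodedStdin(Charset charset) {
        return new BufferedReader(new InputStreamReader(System.in, charset));
    }

    // stdout encoded with the given charset instead of the platform default.
    static PrintStream encodedStdout(Charset charset) {
        try {
            return new PrintStream(System.out, true, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Not expected: the name comes from an existing Charset object.
            throw new IllegalStateException(e);
        }
    }

    // Capture the bytes a PrintStream with this charset would send to the console.
    static byte[] bytesSentToConsole(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charset.name())) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
        return sink.toByteArray();
    }
}
```

For the single character U+00F6 (o-umlaut), bytesSentToConsole returns two
bytes under UTF-8 and one byte under ISO-8859-1, so a console configured for
one will not render output produced for the other.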
> 1) A new class called EncodedFile that everyone will have to use when
> opening and reading data from a file. Much like a Stream object or
> what we already do, only it will be one class handling the
> input/output for the files.
>
> 2) This class will also provide methods to get output and input
> streams like the stdio System.out and System.in variables, or to
> replace them with new ones that have the correct encoding specified.
>
> 3) We may also want to be able to specify the input and output
> encoding separately... So, I'll be adding some of that; however, the
> first version may only be able to support one for both initially.
With the command line parsing tool, these things are already easier,
because it can directly give you a File or Charset object
for a command line argument. Maybe there is even more we can do to
simplify that further.
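As a rough idea of the conversion involved, here is a hand-written stand-in,
not the actual parsing tool. The "-encoding" flag name and the class are
assumptions for this sketch; Charset.forName and Charset.defaultCharset are
standard library calls.

```java
import java.nio.charset.Charset;

// Illustrative stand-in for a command line parser; not the OpenNLP tool.
final class EncodingArgDemo {

    // Return the Charset named after a (hypothetical) "-encoding" flag,
    // or the platform default when the flag is absent.
    // Charset.forName throws an IllegalArgumentException subclass
    // if the name is malformed or the charset is not supported.
    static Charset parseEncoding(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-encoding".equals(args[i])) {
                return Charset.forName(args[i + 1]);
            }
        }
        return Charset.defaultCharset();
    }
}
```

Handing tools a Charset object rather than a raw name means an unsupported
encoding is rejected once, at argument parsing time, instead of inside each
tool.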
Jörn