Posted to dev@harmony.apache.org by Charles Lee <li...@gmail.com> on 2009/07/14 11:50:33 UTC

Shall we change our file.encoding

Hi guys:

I am running some of the Ant JUnit test cases and hitting encoding
problems. I think they are caused by the default encoding differing
between the RI and Harmony. My locale is en_US.UTF-8; the RI default is
UTF-8, but Harmony's is ISO-8859-1. I then came across
HARMONY-3736 <https://issues.apache.org/jira/browse/HARMONY-3736>
and the two diffs attached to that issue. It seems we always get
ISO-8859-1, because (correct me if I am wrong :-)

1. We removed the code that sets it in the VM, so we always get null when
we call the VM method.
2. We set file.encoding in luniglob.c; if we get null from the VM, we set
it to ISO-8859-1.
3. We cannot set file.encoding at run time.

Ant uses UTF-8 to encode file names that contain non-ASCII characters.
So why do we use ISO-8859-1 as our unchangeable default?
The Wikipedia article http://en.wikipedia.org/wiki/ISO8859-1 says: "In computing
applications, encodings that provide full UCS support (such as
UTF-8 <http://en.wikipedia.org/wiki/UTF-8> and
UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing favor
over encodings based on ISO 8859-1." Should we simply change ISO-8859-1 to
UTF-8?

-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Regis <xu...@gmail.com>.

Charles Lee wrote:
> Hi guys:
> 
> I am doing some test cases on the ant junit test case and meeting some
> encoding problems. I find they are maybe caused by the different default
> encoding from RI and harmony. My local is en_US.UTF-8, RI default is UTF-8
> but harmony is 8859-1. And then I have encountered
> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
> and the two diffs attached on that issue. It seems we always get 8859-1.
> Because: (correct me if wrong :-)
> 
> 1. we remove the set code in the vm. we will always get null if we call vm
> method
> 2. we set the file.encode in the libglob.c, if we got null from vm, we set
> 8859-1.
> 3. we can not set file.encode on the run time.
> 
> ant use UTF-8 to encode filename which contains the non-ascii character.
> So why we use iso8859-1 as our unchangeable default?
> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says "In computing
> applications, encodings that provide full UCS support (such as
> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing favor
> over encodings based on ISO 8859-1." Should we simply change iso8859-1 to
> utf-8?
> 

I think the default encoding should be derived from the system locale; simply
changing ISO-8859-1 to UTF-8 doesn't resolve the problem. If the VM doesn't do
this, I think classlib could get the locale info from the OS and set the
"file.encoding" property properly.
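
To make the idea concrete, here is a rough C sketch of how the class library
might read the encoding from the environment on Linux. It is only an
illustration, not the Harmony code: the function names
codeset_from_locale_name and default_file_encoding are made up for this
example, and it assumes the usual POSIX precedence of LC_ALL, then LC_CTYPE,
then LANG.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies the codeset part of a POSIX locale name
 * such as "en_US.UTF-8" or "de_DE.ISO-8859-15@euro" into out.
 * Returns 1 on success, 0 if the name carries no codeset (e.g. "C")
 * or the buffer is too small. */
int codeset_from_locale_name(const char *name, char *out, size_t out_size)
{
    const char *dot;
    const char *end;
    size_t len;

    if (name == NULL || out == NULL || out_size == 0)
        return 0;
    dot = strchr(name, '.');
    if (dot == NULL)
        return 0;
    dot++;                          /* skip the '.' itself */
    end = strchr(dot, '@');         /* optional "@modifier" suffix */
    len = (end != NULL) ? (size_t)(end - dot) : strlen(dot);
    if (len == 0 || len >= out_size)
        return 0;
    memcpy(out, dot, len);
    out[len] = '\0';
    return 1;
}

/* Hypothetical helper: picks the encoding from the environment using the
 * precedence LC_ALL > LC_CTYPE > LANG. The first non-empty variable
 * decides; if it has no codeset, or none is set, fallback is returned. */
const char *default_file_encoding(char *buf, size_t buf_size,
                                  const char *fallback)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && *value != '\0')
            return codeset_from_locale_name(value, buf, buf_size)
                       ? buf : fallback;
    }
    return fallback;
}
```

With LANG=en_US.UTF-8 and neither LC_ALL nor LC_CTYPE set, a call such as
default_file_encoding(buf, sizeof buf, "ISO-8859-1") would hand back "UTF-8".
On POSIX systems, calling setlocale(LC_CTYPE, "") and then
nl_langinfo(CODESET) is the library route to the same information.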


-- 
Best Regards,
Regis.

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

Hi,

The original error msg is: HMYEXEL054E Internal VM error: Failed to create
java/lang/String for class name FileEncoding

On Sat, Jul 18, 2009 at 11:10 AM, Nathan Beyer <nd...@apache.org> wrote:

> On Fri, Jul 17, 2009 at 6:03 AM, Alexey
> Varlamov<al...@gmail.com> wrote:
> > 2009/7/17, Nathan Beyer <nd...@apache.org>:
> >> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> >> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> >> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> >> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com>
> wrote:
> >> >>>> Hi Nathan,
> >> >>>>
> >> >>>> What I got is 936, the code page identifier. Is there a api for us
> to map
> >> >>>> 936 to the gb2312?
> >> >>>
> >> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> >> >>> that into a name of some sort. I'll poke around a bit and see what I
> >> >>> can find.
> >> >>
> >> >> We'll probably just have to put in a mapping ourselves based on the
> >> >> documentation. We'd call GetACP [1] and map that to a known alias in
> >> >> java.nio.charset that matches the definitions[2] of the identifiers.
> >> >>
> >> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> >> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
> >> >
> >> > This may be better - APR has a function for getting the OS default
> >> > encoding. This would work across all platforms that APR supports and I
> >> > believe we already use APR.
> >> >
> >> >
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
> >>
> >> However, the Windows version of this is simply - return
> >> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
> >> "CP" + codePageId.
> >>
> >> And the Unix version of this method doesn't look very good for our
> purposes.
> >> >
> >> > -Nathan
> >
> > Yep - that's why APR was not used here initially. I guess your idea of
> > GetACP() + hardcoded mapping is the most suitable approach. We already
> > have similar solution for timezone detection, see
> > working_vm\vm\port\src\misc\win\timezone.c (which also should be moved
> > to classlib eventually, HARMONY-2053).
>
> I'd be inclined to combine these all together into the portlib
> (luni?). Perhaps in some sort of OS environment portion, which can be
> used by the rest of the class library.
>
> -Nathan
>
> >
> > --
> > Alexey
> >
>



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

Thanks Nathan,

It finally passes on Windows, and I have created a JIRA issue:
https://issues.apache.org/jira/browse/HARMONY-6279.

Would anyone like to try this new feature? :-)

On Tue, Jul 21, 2009 at 7:44 AM, Nathan Beyer <nd...@apache.org> wrote:

> I don't think the Windows logic will be quite that simple - I think
> we'll have to recreate the mapping defined by the Windows API [1]. In
> the case of 936, we'd convert to gb2312, per [1].
>
> The default value is going to vary on each platform. On Windows, if
> we can't determine locale information, then we'll default to "en"
> and encoding of "Windows-1252"
>
> -Nathan
>
> [1] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
>
> On Mon, Jul 20, 2009 at 5:05 AM, Charles Lee<li...@gmail.com> wrote:
> > Hi guys,
> >
> > A new patch is attached but still fail on the windows. It *seems* VM do
> not
> > support CP936.
> >
> > 1. I have tried to hard code "CP936" in the luniglob.c, make the
> > file.encoding always be CP936. The vm failed to launch with the message
> > "HMYEXEL054E vm inner fault: can not create java/lang/String, FAILED to
> > invoke JVM" (The original msg is Chinese, I am translating it)
> > 2. I have tried to hard code "UTF-8" in the luniglob.c, make the
> > file.encoding always be UTF-8. The vm successfully launch and tests have
> been
> > passed.
> >
> > Does somebody know where the vm load the String? And what does
> "HMYEXEL054E"
> > mean?
> >
> > On Sat, Jul 18, 2009 at 11:10 AM, Nathan Beyer <nd...@apache.org>
> wrote:
> >>
> >> On Fri, Jul 17, 2009 at 6:03 AM, Alexey
> >> Varlamov<al...@gmail.com> wrote:
> >> > 2009/7/17, Nathan Beyer <nd...@apache.org>:
> >> >> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org>
> >> >> wrote:
> >> >> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org>
> >> >> > wrote:
> >> >> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org>
> >> >> >> wrote:
> >> >> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<
> littlee1032@gmail.com>
> >> >> >>> wrote:
> >> >> >>>> Hi Nathan,
> >> >> >>>>
> >> >> >>>> What I got is 936, the code page identifier. Is there a api for
> us
> >> >> >>>> to map
> >> >> >>>> 936 to the gb2312?
> >> >> >>>
> >> >> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to
> >> >> >>> translate
> >> >> >>> that into a name of some sort. I'll poke around a bit and see
> what
> >> >> >>> I
> >> >> >>> can find.
> >> >> >>
> >> >> >> We'll probably just have to put in a mapping ourselves based on
> the
> >> >> >> documentation. We'd call GetACP [1] and map that to a known alias
> in
> >> >> >> java.nio.charset that matches the definitions[2] of the
> identifiers.
> >> >> >>
> >> >> >> [1]
> http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> >> >> >> [2]
> http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
> >> >> >
> >> >> > This may be better - APR has a function for getting the OS default
> >> >> > encoding. This would work across all platforms that APR supports
> and
> >> >> > I
> >> >> > believe we already use APR.
> >> >> >
> >> >> >
> >> >> >
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
> >> >>
> >> >> However, the Windows version of this is simply - return
> >> >> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is
> essentially
> >> >> "CP" + codePageId.
> >> >>
> >> >> And the Unix version of this method doesn't look very good for our
> >> >> purposes.
> >> >> >
> >> >> > -Nathan
> >> >
> >> > Yep - that's why APR was not used here initially. I guess your idea of
> >> > GetACP() + hardcoded mapping is the most suitable approach. We already
> >> > have similar solution for timezone detection, see
> >> > working_vm\vm\port\src\misc\win\timezone.c (which also should be moved
> >> > to classlib eventually, HARMONY-2053).
> >>
> >> I'd be inclined to combine these all together into the portlib
> >> (luni?). Perhaps in some sort of OS environment portion, which can be
> >> used by the rest of the class library.
> >>
> >> -Nathan
> >>
> >> >
> >> > --
> >> > Alexey
> >> >
> >
> >
> >
> > --
> > Yours sincerely,
> > Charles Lee
> >
> >
>



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nd...@apache.org>.

I don't think the Windows logic will be quite that simple - I think
we'll have to recreate the mapping defined by the Windows API [1]. In
the case of 936, we'd convert to gb2312, per [1].

The default value is going to vary on each platform. On Windows, if
we can't determine the locale information, then we'll default to "en"
and an encoding of "Windows-1252".

-Nathan

[1] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
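
A minimal sketch of what such a mapping could look like in C follows. The
entries are a small sample of the names listed in the code page identifiers
table at [1]; the struct and the function name charset_for_code_page are
invented for this example, and whether java.nio.charset accepts every one of
these names as an alias has not been checked here.

```c
#include <stddef.h>

/* One row of the (partial) mapping: Windows code page id -> charset name. */
struct cp_alias {
    unsigned int code_page;
    const char *charset;
};

/* Sample rows only; a real table would cover the full list in [1]. */
static const struct cp_alias CP_ALIASES[] = {
    {   437, "IBM437" },
    {   932, "shift_jis" },
    {   936, "gb2312" },
    {   949, "ks_c_5601-1987" },
    {   950, "big5" },
    {  1250, "windows-1250" },
    {  1251, "windows-1251" },
    {  1252, "windows-1252" },
    { 20127, "us-ascii" },
    { 28591, "iso-8859-1" },
    { 65001, "utf-8" }
};

/* Hypothetical helper: returns the charset name for a code page id, or
 * "windows-1252" when the id is not in the table (the Windows default
 * suggested in this message). */
const char *charset_for_code_page(unsigned int code_page)
{
    size_t i;

    for (i = 0; i < sizeof CP_ALIASES / sizeof CP_ALIASES[0]; i++) {
        if (CP_ALIASES[i].code_page == code_page)
            return CP_ALIASES[i].charset;
    }
    return "windows-1252";
}
```

On Windows the caller would pass the result of GetACP(), so a system that
reports 936 would end up with "gb2312".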

On Mon, Jul 20, 2009 at 5:05 AM, Charles Lee<li...@gmail.com> wrote:
> Hi guys,
>
> A new patch is attached but still fail on the windows. It *seems* VM do not
> support CP936.
>
> 1. I have tried to hard code "CP936" in the luniglob.c, make the
> file.encoding always be CP936. The vm failed to launch with the message
> "HMYEXEL054E vm inner fault: can not create java/lang/String, FAILED to
> invoke JVM" (The original msg is Chinese, I am translating it)
> 2. I have tried to hard code "UTF-8" in the luniglob.c, make the
> file.encoding always be UTF-8. The vm successfully launch and tests have been
> passed.
>
> Does somebody know where the vm load the String? And what does "HMYEXEL054E"
> mean?
>
> On Sat, Jul 18, 2009 at 11:10 AM, Nathan Beyer <nd...@apache.org> wrote:
>>
>> On Fri, Jul 17, 2009 at 6:03 AM, Alexey
>> Varlamov<al...@gmail.com> wrote:
>> > 2009/7/17, Nathan Beyer <nd...@apache.org>:
>> >> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org>
>> >> wrote:
>> >> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org>
>> >> > wrote:
>> >> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org>
>> >> >> wrote:
>> >> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com>
>> >> >>> wrote:
>> >> >>>> Hi Nathan,
>> >> >>>>
>> >> >>>> What I got is 936, the code page identifier. Is there a api for us
>> >> >>>> to map
>> >> >>>> 936 to the gb2312?
>> >> >>>
>> >> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to
>> >> >>> translate
>> >> >>> that into a name of some sort. I'll poke around a bit and see what
>> >> >>> I
>> >> >>> can find.
>> >> >>
>> >> >> We'll probably just have to put in a mapping ourselves based on the
>> >> >> documentation. We'd call GetACP [1] and map that to a known alias in
>> >> >> java.nio.charset that matches the definitions[2] of the identifiers.
>> >> >>
>> >> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
>> >> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
>> >> >
>> >> > This may be better - APR has a function for getting the OS default
>> >> > encoding. This would work across all platforms that APR supports and
>> >> > I
>> >> > believe we already use APR.
>> >> >
>> >> >
>> >> > http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
>> >>
>> >> However, the Windows version of this is simply - return
>> >> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
>> >> "CP" + codePageId.
>> >>
>> >> And the Unix version of this method doesn't look very good for our
>> >> purposes.
>> >> >
>> >> > -Nathan
>> >
>> > Yep - that's why APR was not used here initially. I guess your idea of
>> > GetACP() + hardcoded mapping is the most suitable approach. We already
>> > have similar solution for timezone detection, see
>> > working_vm\vm\port\src\misc\win\timezone.c (which also should be moved
>> > to classlib eventually, HARMONY-2053).
>>
>> I'd be inclined to combine these all together into the portlib
>> (luni?). Perhaps in some sort of OS environment portion, which can be
>> used by the rest of the class library.
>>
>> -Nathan
>>
>> >
>> > --
>> > Alexey
>> >
>
>
>
> --
> Yours sincerely,
> Charles Lee
>
>

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

Hi guys,

A new patch is attached, but it still fails on Windows. It *seems* the VM
does not support CP936.

1. I tried hard-coding "CP936" in luniglob.c so that file.encoding is
always CP936. The VM failed to launch with the message
"HMYEXEL054E vm inner fault: can not create java/lang/String, FAILED to
invoke JVM" (the original message is in Chinese; this is my translation).
2. I tried hard-coding "UTF-8" in luniglob.c so that file.encoding is
always UTF-8. The VM launched successfully and the tests passed.

Does anybody know where the VM loads the String? And what does "HMYEXEL054E"
mean?

On Sat, Jul 18, 2009 at 11:10 AM, Nathan Beyer <nd...@apache.org> wrote:

> On Fri, Jul 17, 2009 at 6:03 AM, Alexey
> Varlamov<al...@gmail.com> wrote:
> > 2009/7/17, Nathan Beyer <nd...@apache.org>:
> >> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> >> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> >> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> >> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com>
> wrote:
> >> >>>> Hi Nathan,
> >> >>>>
> >> >>>> What I got is 936, the code page identifier. Is there a api for us
> to map
> >> >>>> 936 to the gb2312?
> >> >>>
> >> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> >> >>> that into a name of some sort. I'll poke around a bit and see what I
> >> >>> can find.
> >> >>
> >> >> We'll probably just have to put in a mapping ourselves based on the
> >> >> documentation. We'd call GetACP [1] and map that to a known alias in
> >> >> java.nio.charset that matches the definitions[2] of the identifiers.
> >> >>
> >> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> >> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
> >> >
> >> > This may be better - APR has a function for getting the OS default
> >> > encoding. This would work across all platforms that APR supports and I
> >> > believe we already use APR.
> >> >
> >> >
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
> >>
> >> However, the Windows version of this is simply - return
> >> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
> >> "CP" + codePageId.
> >>
> >> And the Unix version of this method doesn't look very good for our
> purposes.
> >> >
> >> > -Nathan
> >
> > Yep - that's why APR was not used here initially. I guess your idea of
> > GetACP() + hardcoded mapping is the most suitable approach. We already
> > have similar solution for timezone detection, see
> > working_vm\vm\port\src\misc\win\timezone.c (which also should be moved
> > to classlib eventually, HARMONY-2053).
>
> I'd be inclined to combine these all together into the portlib
> (luni?). Perhaps in some sort of OS environment portion, which can be
> used by the rest of the class library.
>
> -Nathan
>
> >
> > --
> > Alexey
> >
>



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nd...@apache.org>.

On Fri, Jul 17, 2009 at 6:03 AM, Alexey
Varlamov<al...@gmail.com> wrote:
> 2009/7/17, Nathan Beyer <nd...@apache.org>:
>> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org> wrote:
>> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org> wrote:
>> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org> wrote:
>> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com> wrote:
>> >>>> Hi Nathan,
>> >>>>
>> >>>> What I got is 936, the code page identifier. Is there a api for us to map
>> >>>> 936 to the gb2312?
>> >>>
>> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
>> >>> that into a name of some sort. I'll poke around a bit and see what I
>> >>> can find.
>> >>
>> >> We'll probably just have to put in a mapping ourselves based on the
>> >> documentation. We'd call GetACP [1] and map that to a known alias in
>> >> java.nio.charset that matches the definitions[2] of the identifiers.
>> >>
>> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
>> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
>> >
>> > This may be better - APR has a function for getting the OS default
>> > encoding. This would work across all platforms that APR supports and I
>> > believe we already use APR.
>> >
>> > http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
>>
>> However, the Windows version of this is simply - return
>> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
>> "CP" + codePageId.
>>
>> And the Unix version of this method doesn't look very good for our purposes.
>> >
>> > -Nathan
>
> Yep - that's why APR was not used here initially. I guess your idea of
> GetACP() + hardcoded mapping is the most suitable approach. We already
> have similar solution for timezone detection, see
> working_vm\vm\port\src\misc\win\timezone.c (which also should be moved
> to classlib eventually, HARMONY-2053).

I'd be inclined to combine these all together into the portlib
(luni?). Perhaps in some sort of OS environment portion, which can be
used by the rest of the class library.

-Nathan

>
> --
> Alexey
>

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

Thanks Alexey! I will try this as well :-)

On Fri, Jul 17, 2009 at 7:03 PM, Alexey Varlamov <
alexey.v.varlamov@gmail.com> wrote:

> 2009/7/17, Nathan Beyer <nd...@apache.org>:
> > On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org> wrote:
> > > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> > >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> > >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com>
> wrote:
> > >>>> Hi Nathan,
> > >>>>
> > >>>> What I got is 936, the code page identifier. Is there a api for us
> to map
> > >>>> 936 to the gb2312?
> > >>>
> > >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> > >>> that into a name of some sort. I'll poke around a bit and see what I
> > >>> can find.
> > >>
> > >> We'll probably just have to put in a mapping ourselves based on the
> > >> documentation. We'd call GetACP [1] and map that to a known alias in
> > >> java.nio.charset that matches the definitions[2] of the identifiers.
> > >>
> > >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> > >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
> > >
> > > This may be better - APR has a function for getting the OS default
> > > encoding. This would work across all platforms that APR supports and I
> > > believe we already use APR.
> > >
> > >
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
> >
> > However, the Windows version of this is simply - return
> > apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
> > "CP" + codePageId.
> >
> > And the Unix version of this method doesn't look very good for our
> purposes.
> > >
> > > -Nathan
>
> Yep - that's why APR was not used here initially. I guess your idea of
> GetACP() + hardcoded mapping is the most suitable approach. We already
> have similar solution for timezone detection, see
> working_vm\vm\port\src\misc\win\timezone.c (which also should be moved
> to classlib eventually, HARMONY-2053).
>
> --
> Alexey
>



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Alexey Varlamov <al...@gmail.com>.

2009/7/17, Nathan Beyer <nd...@apache.org>:
> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org> wrote:
> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org> wrote:
> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org> wrote:
> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com> wrote:
> >>>> Hi Nathan,
> >>>>
> >>>> What I got is 936, the code page identifier. Is there a api for us to map
> >>>> 936 to the gb2312?
> >>>
> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> >>> that into a name of some sort. I'll poke around a bit and see what I
> >>> can find.
> >>
> >> We'll probably just have to put in a mapping ourselves based on the
> >> documentation. We'd call GetACP [1] and map that to a known alias in
> >> java.nio.charset that matches the definitions[2] of the identifiers.
> >>
> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
> >
> > This may be better - APR has a function for getting the OS default
> > encoding. This would work across all platforms that APR supports and I
> > believe we already use APR.
> >
> > http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
>
> However, the Windows version of this is simply - return
> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
> "CP" + codePageId.
>
> And the Unix version of this method doesn't look very good for our purposes.
> >
> > -Nathan

Yep - that's why APR was not used here initially. I guess your idea of
GetACP() + a hardcoded mapping is the most suitable approach. We already
have a similar solution for timezone detection, see
working_vm\vm\port\src\misc\win\timezone.c (which should also be moved
to classlib eventually, HARMONY-2053).
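
For reference, the "CP" + codePageId naming that APR produces on Windows can
be imitated and undone in plain C, which is roughly what a hardcoded mapping
would need as its first step. This is only a sketch: the two function names
are invented for this example, and the 65535 upper bound is an assumption
based on the documented identifiers topping out at 65001.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "CP<id>" into out, the same shape as the
 * apr_psprintf(pool, "CP%u", ...) call quoted above.
 * Returns 1 on success, 0 if the buffer is too small. */
int format_code_page_name(unsigned int code_page, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "CP%u", code_page);
    return n > 0 && (size_t)n < out_size;
}

/* Hypothetical helper: parses a name of the exact form "CP<digits>" back
 * into the numeric id. Returns 1 on success, 0 for anything else. */
int parse_code_page_name(const char *name, unsigned int *code_page)
{
    unsigned long value = 0;
    const char *p;

    if (name == NULL || code_page == NULL)
        return 0;
    if (name[0] != 'C' || name[1] != 'P' || name[2] == '\0')
        return 0;
    for (p = name + 2; *p != '\0'; p++) {
        if (*p < '0' || *p > '9')
            return 0;
        value = value * 10 + (unsigned long)(*p - '0');
        if (value > 65535UL)        /* assumed upper bound, see above */
            return 0;
    }
    *code_page = (unsigned int)value;
    return 1;
}
```

So a value of 936 becomes "CP936", and "CP936" parses back to 936; the
numeric id is then what a hardcoded table would be keyed on.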

--
Alexey

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

On Fri, Jul 17, 2009 at 11:17 AM, Nathan Beyer <nb...@gmail.com> wrote:

> On Thu, Jul 16, 2009 at 9:30 PM, Charles Lee<li...@gmail.com> wrote:
> > Thanks Nathan!
> >
> > I will try this :-)
>
> Where do we define the user's locale and system locale? It seems like
> all of this should be located there and associated with that process.
>

Sorry Nathan, I didn't catch that. Do you mean: should we get the user's
locale or the system locale?


> > On Fri, Jul 17, 2009 at 10:05 AM, Nathan Beyer <nd...@apache.org>
> wrote:
> >
> >> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> >> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> >> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org>
> >> wrote:
> >> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com>
> >> wrote:
> >> >>>> Hi Nathan,
> >> >>>>
> >> >>>> What I got is 936, the code page identifier. Is there a api for us
> to
> >> map
> >> >>>> 936 to the gb2312?
> >> >>>
> >> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> >> >>> that into a name of some sort. I'll poke around a bit and see what I
> >> >>> can find.
> >> >>
> >> >> We'll probably just have to put in a mapping ourselves based on the
> >> >> documentation. We'd call GetACP [1] and map that to a known alias in
> >> >> java.nio.charset that matches the definitions[2] of the identifiers.
> >> >>
> >> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> >> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
> >> >
> >> > This may be better - APR has a function for getting the OS default
> >> > encoding. This would work across all platforms that APR supports and I
> >> > believe we already use APR.
> >> >
> >> >
> >>
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
> >>
> >> However, the Windows version of this is simply - return
> >> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
> >> "CP" + codePageId.
> >>
> >> And the Unix version of this method doesn't look very good for our
> >> purposes.
> >> >
> >> > -Nathan
> >> >>
> >> >>>
> >> >>>> If we put 936 in the file.encoding, can we successfully get the
> >> encoder and
> >> >>>> decoder by charset?
> >> >>>>
> >> >>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <nd...@apache.org>
> >> wrote:
> >> >>>>
> >> >>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<
> littlee1032@gmail.com>
> >> wrote:
> >> >>>>> > Hi guys,
> >> >>>>> >
> >> >>>>> > I have add the locale function in the drlvm, the patch is
> attached.
> >> >>>>> Please
> >> >>>>> > try this new patch on the linux.
> >> >>>>> >
> >> >>>>> > The patch should work on the linux but fail on the windows.
> Because
> >> >>>>> windows
> >> >>>>> > returns code page not charset from the setlocale.
> >> >>>>>
> >> >>>>> Code page and character set are the same thing. We shouldn't need
> to
> >> >>>>> convert it as the Charset APIs will have to support the values
> >> anyway.
> >> >>>>>
> >> >>>>> What's the value you're getting? If it's 'Cp1252', then we're
> good,
> >> as
> >> >>>>> that's just an alias for 'Windows-1252' (or vice-versa).
> >> >>>>>
> >> >>>>> -Nathan
> >> >>>>>
> >> >>>>>
> >> >>>>> > I hv tried long time to
> >> >>>>> > get the charset name from the codepage, for example:
> >> >>>>> > CPINFOEX cpInfoEx;
> >> >>>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
> >> >>>>> > if (iReturn) {
> >> >>>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
> >> >>>>> > }
> >> >>>>> > But I only get the full name without any format.
> >> >>>>> >
> >> >>>>> > There is code page identifiers map in the msdn, detail here. I
> may
> >> hard
> >> >>>>> code
> >> >>>>> > this map in the file. But the note on the msdn says:
> >> >>>>> >      "ANSI code pages can be different on different computers,
> or
> >> can be
> >> >>>>> > changed for a single computer, leading to data corruption. For
> the
> >> most
> >> >>>>> > consistent results, applications should use Unicode, such as
> UTF-8
> >> or
> >> >>>>> > UTF-16, instead of a specific code page."
> >> >>>>> > I am afraid hard-code will fail on some machines. (By the way,
> this
> >> seems
> >> >>>>> > the UTF-8 is suggested to be the default again :-)
> >> >>>>> >
> >> >>>>> > There is also a class Encoding in the VC++, detail here. But we
> can
> >> not
> >> >>>>> use
> >> >>>>> > it here.
> >> >>>>> >
> >> >>>>> > So anyone knows some thing about locale on the windows?
> >> >>>>> > Again, shall use UTF-8 as our default?
> >> >>>>> >
> >> >>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <
> >> littlee1032@gmail.com>
> >> >>>>> wrote:
> >> >>>>> >>
> >> >>>>> >> That seems we should add it in the drlvm.
> >> >>>>> >>
> >> >>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com>
> >> wrote:
> >> >>>>> >>>
> >> >>>>> >>> Nathan Beyer wrote:
> >> >>>>> >>>>
> >> >>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need
> to
> >> fix
> >> >>>>> >>>> DRLVM?
> >> >>>>> >>>
> >> >>>>> >>> Yes, I only tested on Linux, IBM VME set the property
> correctly.
> >> >>>>> >>>
> >> >>>>> >>>>
> >> >>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com>
> >> wrote:
> >> >>>>> >>>>>
> >> >>>>> >>>>> Kevin Zhou wrote:
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> Yea, from luniglob.c, CL attempts to read the
> "file.encoding"
> >> >>>>> property
> >> >>>>> >>>>>> adown
> >> >>>>> >>>>>> VM but fails to get the correct encoding.
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> Regis, do you know any other specific ways that CL can gain
> >> the
> >> >>>>> right
> >> >>>>> >>>>>> property?
> >> >>>>> >>>>>
> >> >>>>> >>>>> We can get from OS directly. Maybe just read env variables
> on
> >> Linux?
> >> >>>>> >>>>>
> >> >>>>> >>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com>
> >> wrote:
> >> >>>>> >>>>>>
> >> >>>>> >>>>>>> Charles Lee wrote:
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>> Hi Nanthan,
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> If the file encoding derive from the OS, it should be the
> >> some
> >> >>>>> bugs
> >> >>>>> >>>>>>>> in
> >> >>>>> >>>>>>>> it
> >> >>>>> >>>>>>>> because on my LINUX machine the locale is en_US.UTF-8.
> Our
> >> default
> >> >>>>> >>>>>>>> codec
> >> >>>>> >>>>>>>> is
> >> >>>>> >>>>>>>> still ISO8859-1. Do you know where can we found such
> codes?
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>> Classlib expected vm do this and set the property, but it
> >> didn't,
> >> >>>>> so
> >> >>>>> >>>>>>> we
> >> >>>>> >>>>>>> have to do this by ourselves.
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nb...@gmail.com>.

On Thu, Jul 16, 2009 at 9:30 PM, Charles Lee<li...@gmail.com> wrote:
> Thanks Nathan!
>
> I will try this :-)

Where do we define the user's locale and system locale? It seems like
all of this should be located there and associated with that process.
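
One possible reading of this, sketched under assumptions: on POSIX systems a
locale name has the form language[_territory][.codeset][@modifier], so the
encoding can be taken from the same string that defines the locale. The names
LocaleCodeset and codesetOf are hypothetical, not existing Harmony code, and
the fallback value is left to the caller.

```java
public class LocaleCodeset {
    /**
     * Hypothetical helper: extracts the codeset part of a POSIX locale name
     * such as "en_US.UTF-8" or "de_DE.ISO-8859-15@euro".
     * Returns the fallback when the name carries no codeset (e.g. "C").
     */
    static String codesetOf(String locale, String fallback) {
        if (locale == null) {
            return fallback;
        }
        int dot = locale.indexOf('.');
        if (dot < 0) {
            return fallback;              // no ".codeset" part at all
        }
        String codeset = locale.substring(dot + 1);
        int at = codeset.indexOf('@');    // strip a modifier such as "@euro"
        if (at >= 0) {
            codeset = codeset.substring(0, at);
        }
        return codeset.isEmpty() ? fallback : codeset;
    }

    public static void main(String[] args) {
        // LANG is only one of the variables a real implementation would
        // consult; LC_ALL and LC_CTYPE normally take precedence.
        String lang = System.getenv("LANG");
        System.out.println(codesetOf(lang, "ISO-8859-1"));
    }
}
```

In native code the same information is usually obtained with
setlocale(LC_CTYPE, "") followed by nl_langinfo(CODESET), which avoids parsing
the name by hand.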

>
> On Fri, Jul 17, 2009 at 10:05 AM, Nathan Beyer <nd...@apache.org> wrote:
>
>> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org> wrote:
>> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org> wrote:
>> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org>
>> wrote:
>> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com>
>> wrote:
>> >>>> Hi Nathan,
>> >>>>
>> >>>> What I got is 936, the code page identifier. Is there a api for us to
>> map
>> >>>> 936 to the gb2312?
>> >>>
>> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
>> >>> that into a name of some sort. I'll poke around a bit and see what I
>> >>> can find.
>> >>
>> >> We'll probably just have to put in a mapping ourselves based on the
>> >> documentation. We'd call GetACP [1] and map that to a known alias in
>> >> java.nio.charset that matches the definitions[2] of the identifiers.
>> >>
>> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
>> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
>> >
>> > This may be better - APR has a function for getting the OS default
>> > encoding. This would work across all platforms that APR supports and I
>> > believe we already use APR.
>> >
>> >
>> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
>>
>> However, the Windows version of this is simply - return
>> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
>> "CP" + codePageId.
>>
>> And the Unix version of this method doesn't look very good for our
>> purposes.
>> >
>> > -Nathan

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

Thanks Nathan!

I will try this :-)

On Fri, Jul 17, 2009 at 10:05 AM, Nathan Beyer <nd...@apache.org> wrote:

> On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org> wrote:
> > On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org> wrote:
> >> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org>
> wrote:
> >>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com>
> wrote:
> >>>> Hi Nathan,
> >>>>
> >>>> What I got is 936, the code page identifier. Is there a api for us to
> map
> >>>> 936 to the gb2312?
> >>>
> >>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> >>> that into a name of some sort. I'll poke around a bit and see what I
> >>> can find.
> >>
> >> We'll probably just have to put in a mapping ourselves based on the
> >> documentation. We'd call GetACP [1] and map that to a known alias in
> >> java.nio.charset that matches the definitions[2] of the identifiers.
> >>
> >> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> >> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
> >
> > This may be better - APR has a function for getting the OS default
> > encoding. This would work across all platforms that APR supports and I
> > believe we already use APR.
> >
> >
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
>
> However, the Windows version of this is simply - return
> apr_psprintf(pool, "CP%u", (unsigned) GetACP());. Which is essentially
> "CP" + codePageId.
>
> And the Unix version of this method doesn't look very good for our
> purposes.
> >
> > -Nathan



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nd...@apache.org>.

On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<nd...@apache.org> wrote:
> On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org> wrote:
>> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org> wrote:
>>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com> wrote:
>>>> Hi Nathan,
>>>>
>>>> What I got is 936, the code page identifier. Is there a api for us to map
>>>> 936 to the gb2312?
>>>
>>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
>>> that into a name of some sort. I'll poke around a bit and see what I
>>> can find.
>>
>> We'll probably just have to put in a mapping ourselves based on the
>> documentation. We'd call GetACP [1] and map that to a known alias in
>> java.nio.charset that matches the definitions[2] of the identifiers.
>>
>> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
>> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
>
> This may be better - APR has a function for getting the OS default
> encoding. This would work across all platforms that APR supports and I
> believe we already use APR.
>
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e

However, the Windows version of this is simply:
return apr_psprintf(pool, "CP%u", (unsigned) GetACP());
which is essentially "CP" + codePageId.

And the Unix version of this method doesn't look very good for our purposes.
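
The "CP" + codePageId form can be checked against the Charset API. Below is a
minimal sketch, not Harmony code: CodePageMapper and toCharsetName are
hypothetical names, the code page id is passed in as a plain int (on Windows
it would come from GetACP), and the fallback is whatever default the caller
prefers. It relies only on Charset.isSupported and Charset.forName; which
aliases actually resolve depends on the charsets installed in the running VM.

```java
import java.nio.charset.Charset;

public class CodePageMapper {
    /**
     * Hypothetical helper: turns a Windows ANSI code page id into the
     * canonical name of a charset known to this VM, or returns the
     * fallback when no candidate alias is supported.
     */
    static String toCharsetName(int codePageId, String fallback) {
        // Candidate aliases, e.g. "Cp1252" first, then "windows-1252".
        String[] candidates = { "Cp" + codePageId, "windows-" + codePageId };
        for (String candidate : candidates) {
            try {
                if (Charset.isSupported(candidate)) {
                    // forName resolves the alias; name() is the canonical name.
                    return Charset.forName(candidate).name();
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: try the next candidate.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // 1252 and 936 are the code pages mentioned in this thread.
        System.out.println(toCharsetName(1252, "UTF-8"));
        System.out.println(toCharsetName(936, "UTF-8"));
    }
}
```

Trying the alias first and falling back only when the lookup fails keeps a
hard-coded table out of the native code; a table would still be needed for
code pages whose Java names do not follow the "Cp" + id pattern.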
>
> -Nathan
>>
>>>
>>>> If we put 936 in the file.encoding, can we successfully get the encoder and
>>>> decoder by charset?
>>>>
>>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <nd...@apache.org> wrote:
>>>>
>>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<li...@gmail.com> wrote:
>>>>> > Hi guys,
>>>>> >
>>>>> > I have add the locale function in the drlvm, the patch is attached.
>>>>> Please
>>>>> > try this new patch on the linux.
>>>>> >
>>>>> > The patch should work on the linux but fail on the windows. Because
>>>>> windows
>>>>> > returns code page not charset from the setlocale.
>>>>>
>>>>> Code page and character set are the same thing. We shouldn't need to
>>>>> convert it as the Charset APIs will have to support the values anyway.
>>>>>
>>>>> What's the value you're getting? If it's 'Cp1252', then we're good, as
>>>>> that's just an alias for 'Windows-1252' (or vice-versa).
>>>>>
>>>>> -Nathan
>>>>>
>>>>>
>>>>> > I hv tried long time to
>>>>> > get the charset name from the codepage, for example:
>>>>> > CPINFOEX cpInfoEx;
>>>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>>>>> > if (iReturn > 0) {
>>>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>>>>> > }
>>>>> > But I only get the full name without any format.
>>>>> >
>>>>> > There is code page identifiers map in the msdn, detail here. I may hard
>>>>> code
>>>>> > this map in the file. But the note on the msdn says:
>>>>> >      "ANSI code pages can be different on different computers, or can be
>>>>> > changed for a single computer, leading to data corruption. For the most
>>>>> > consistent results, applications should use Unicode, such as UTF-8 or
>>>>> > UTF-16, instead of a specific code page."
>>>>> > I am afraid hard-code will fail on some machines. (By the way, this seems
>>>>> > the UTF-8 is suggested to be the default again :-)
>>>>> >
>>>>> > There is also a class Encoding in the VC++, detail here. But we can not
>>>>> use
>>>>> > it here.
>>>>> >
>>>>> > So anyone knows some thing about locale on the windows?
>>>>> > Again, shall use UTF-8 as our default?
>>>>> >
>>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >> That seems we should add it in the drlvm.
>>>>> >>
>>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> Nathan Beyer wrote:
>>>>> >>>>
>>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>>>>> >>>> DRLVM?
>>>>> >>>
>>>>> >>> Yes, I only tested on Linux, IBM VME set the property correctly.
>>>>> >>>
>>>>> >>>>
>>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>> Kevin Zhou wrote:
>>>>> >>>>>>
>>>>> >>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding"
>>>>> property
>>>>> >>>>>> adown
>>>>> >>>>>> VM but fails to get the correct encoding.
>>>>> >>>>>>
>>>>> >>>>>> Regis, do you know any other specific ways that CL can gain the
>>>>> right
>>>>> >>>>>> property?
>>>>> >>>>>
>>>>> >>>>> We can get from OS directly. Maybe just read env variables on Linux?
>>>>> >>>>>
>>>>> >>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>>> Charles Lee wrote:
>>>>> >>>>>>>
>>>>> >>>>>>>> Hi Nanthan,
>>>>> >>>>>>>>
>>>>> >>>>>>>> If the file encoding derive from the OS, it should be the some
>>>>> bugs
>>>>> >>>>>>>> in
>>>>> >>>>>>>> it
>>>>> >>>>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default
>>>>> >>>>>>>> codec
>>>>> >>>>>>>> is
>>>>> >>>>>>>> still ISO8859-1. Do you know where can we found such codes?
>>>>> >>>>>>>>
>>>>> >>>>>>> Classlib expected vm do this and set the property, but it didn't,
>>>>> so
>>>>> >>>>>>> we
>>>>> >>>>>>> have to do this by ourselves.
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com>
>>>>> >>>>>>>> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>>  Are we talking about windows or linux?the default file encoding
>>>>> >>>>>>>> should
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> derive from the OS. I believe that's defined by the specs.
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> Sent from my iPhone
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com>
>>>>> >>>>>>>>> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
>>>>> >>>>>>>>> <fi...@gmail.com>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>> wrote:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>  Hi,
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>>  Charles, I believe UTF-8 is the default encoding for RI, and
>>>>> it
>>>>> >>>>>>>>>>> sounds
>>>>> >>>>>>>>>>> reasonable.
>>>>> >>>>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we
>>>>> need
>>>>> >>>>>>>>>>> to
>>>>> >>>>>>>>>>> run
>>>>> >>>>>>>>>>> more tests to verify?
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>  Hi guys:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>> I am doing some test cases on the ant junit test case and
>>>>> >>>>>>>>>>>> meeting
>>>>> >>>>>>>>>>>> some
>>>>> >>>>>>>>>>>> encoding problems. I find they are maybe caused by the
>>>>> different
>>>>> >>>>>>>>>>>> default
>>>>> >>>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI
>>>>> >>>>>>>>>>>> default is
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>>  UTF-8
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>  but harmony is 8859-1. And then I have encountered
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> HARMONY-3736<
>>>>> https://issues.apache.org/jira/browse/HARMONY-3736>,
>>>>> >>>>>>>>>>>> and the two diffs attached on that issue. It seems we always
>>>>> get
>>>>> >>>>>>>>>>>> 8859-1.
>>>>> >>>>>>>>>>>> Because: (correct me if wrong :-)
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> 1. we remove the set code in the vm. we will always get null
>>>>> if
>>>>> >>>>>>>>>>>> we
>>>>> >>>>>>>>>>>> call
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>>  vm
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>  method
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null
>>>>> from
>>>>> >>>>>>>>>>>> vm,
>>>>> >>>>>>>>>>>> we
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>>  set
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>  Sorry, it should be luniglob.c
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>  8859-1.
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> 3. we can not set file.encode on the run time.
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>>>>> >>>>>>>>>>>> character.
>>>>> >>>>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>>>>> >>>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says
>>>>> >>>>>>>>>>>> "In
>>>>> >>>>>>>>>>>> computing
>>>>> >>>>>>>>>>>> applications, encodings that provide full UCS support (such as
>>>>> >>>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>>>> >>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>>>> >>>>>>>>>>>> increasing
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>>  favor
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>>>>> >>>>>>>>>>> iso8859-1
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> to
>>>>> >>>>>>>>>>>> utf-8?
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> --
>>>>> >>>>>>>>>>>> Yours sincerely,
>>>>> >>>>>>>>>>>> Charles Lee
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>> --
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Best Regards!
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Jimmy, Jing Lv
>>>>> >>>>>>>>>>> China Software Development Lab, IBM
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>> --
>>>>> >>>>>>>>>> Yours sincerely,
>>>>> >>>>>>>>>> Charles Lee
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>
>>>>> >>>>>>> --
>>>>> >>>>>>> Best Regards,
>>>>> >>>>>>> Regis.
>>>>> >>>>>>>
>>>>> >>>>>
>>>>> >>>>> --
>>>>> >>>>> Best Regards,
>>>>> >>>>> Regis.
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> Best Regards,
>>>>> >>> Regis.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Yours sincerely,
>>>>> >> Charles Lee
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Yours sincerely,
>>>>> > Charles Lee
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Yours sincerely,
>>>> Charles Lee
>>>>
>>>
>>
>

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nd...@apache.org>.

On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<nd...@apache.org> wrote:
> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org> wrote:
>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com> wrote:
>>> Hi Nathan,
>>>
>>> What I got is 936, the code page identifier. Is there a api for us to map
>>> 936 to the gb2312?
>>
>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
>> that into a name of some sort. I'll poke around a bit and see what I
>> can find.
>
> We'll probably just have to put in a mapping ourselves based on the
> documentation. We'd call GetACP [1] and map that to a known alias in
> java.nio.charset that matches the definitions[2] of the identifiers.
>
> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx

This may be better - APR has a function for getting the OS default
encoding. This would work across all platforms that APR supports and I
believe we already use APR.

http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
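
For the Linux side, here is a minimal sketch of reading the encoding
straight from the OS without APR. It assumes a POSIX system that provides
nl_langinfo(); the helper name os_default_charset is hypothetical, and this
is not meant to describe what APR does internally:

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: copies the codeset name of the user's environment
 * locale into buf. Returns 0 on success, -1 on failure. */
int os_default_charset(char *buf, size_t len)
{
    char saved[256];
    const char *current;
    const char *codeset;
    int rc = -1;

    if (buf == NULL || len == 0)
        return -1;

    /* Remember the current LC_CTYPE so it can be restored afterwards. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    /* Adopt the environment's locale (LANG, LC_ALL, ...). If that fails,
     * the previous locale stays in effect and is queried instead. */
    setlocale(LC_CTYPE, "");

    codeset = nl_langinfo(CODESET);
    if (codeset != NULL && codeset[0] != '\0' && strlen(codeset) < len) {
        strcpy(buf, codeset);
        rc = 0;
    }

    /* Put the process back into the locale it had before the call. */
    setlocale(LC_CTYPE, saved);
    return rc;
}
```

The returned name could then be used as the value of file.encoding, with
ISO8859-1 kept only as the fallback when the call fails.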

-Nathan
>
>>
>>> If we put 936 in the file.encoding, can we successfully get the encoder and
>>> decoder by charset?
>>>
>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <nd...@apache.org> wrote:
>>>
>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<li...@gmail.com> wrote:
>>>> > Hi guys,
>>>> >
>>>> > I have add the locale function in the drlvm, the patch is attached.
>>>> Please
>>>> > try this new patch on the linux.
>>>> >
>>>> > The patch should work on the linux but fail on the windows. Because
>>>> windows
>>>> > returns code page not charset from the setlocale.
>>>>
>>>> Code page and character set are the same thing. We shouldn't need to
>>>> convert it as the Charset APIs will have to support the values anyway.
>>>>
>>>> What's the value you're getting? If it's 'Cp1252', then we're good, as
>>>> that's just an alias for 'Windows-1252' (or vice-versa).
>>>>
>>>> -Nathan
>>>>
>>>>
>>>> > I hv tried long time to
>>>> > get the charset name from the codepage, for example:
>>>> > CPINFOEX cpInfoEx;
>>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>>>> > if (iReturn) {
>>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>>>> > }
>>>> > But I only get the full name without any format.
>>>> >
>>>> > There is code page identifiers map in the msdn, detail here. I may hard
>>>> code
>>>> > this map in the file. But the note on the msdn says:
>>>> >      "ANSI code pages can be different on different computers, or can be
>>>> > changed for a single computer, leading to data corruption. For the most
>>>> > consistent results, applications should use Unicode, such as UTF-8 or
>>>> > UTF-16, instead of a specific code page."
>>>> > I am afraid hard-code will fail on some machines. (By the way, this seems
>>>> > the UTF-8 is suggested to be the default again :-)
>>>> >
>>>> > There is also a class Encoding in the VC++, detail here. But we can not
>>>> use
>>>> > it here.
>>>> >
>>>> > So anyone knows some thing about locale on the windows?
>>>> > Again, shall use UTF-8 as our default?
>>>> >
>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> That seems we should add it in the drlvm.
>>>> >>
>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
>>>> >>>
>>>> >>> Nathan Beyer wrote:
>>>> >>>>
>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>>>> >>>> DRLVM?
>>>> >>>
>>>> >>> Yes, I only tested on Linux, IBM VME set the property correctly.
>>>> >>>
>>>> >>>>
>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>> Kevin Zhou wrote:
>>>> >>>>>>
>>>> >>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding"
>>>> property
>>>> >>>>>> adown
>>>> >>>>>> VM but fails to get the correct encoding.
>>>> >>>>>>
>>>> >>>>>> Regis, do you know any other specific ways that CL can gain the
>>>> right
>>>> >>>>>> property?
>>>> >>>>>
>>>> >>>>> We can get from OS directly. Maybe just read env variables on Linux?
>>>> >>>>>
>>>> >>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>>> Charles Lee wrote:
>>>> >>>>>>>
>>>> >>>>>>>> Hi Nanthan,
>>>> >>>>>>>>
>>>> >>>>>>>> If the file encoding derive from the OS, it should be the some
>>>> bugs
>>>> >>>>>>>> in
>>>> >>>>>>>> it
>>>> >>>>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default
>>>> >>>>>>>> codec
>>>> >>>>>>>> is
>>>> >>>>>>>> still ISO8859-1. Do you know where can we found such codes?
>>>> >>>>>>>>
>>>> >>>>>>> Classlib expected vm do this and set the property, but it didn't,
>>>> so
>>>> >>>>>>> we
>>>> >>>>>>> have to do this by ourselves.
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com>
>>>> >>>>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>  Are we talking about windows or linux? the default file encoding
>>>> >>>>>>>> should
>>>> >>>>>>>>>
>>>> >>>>>>>>> derive from the OS. I believe that's defined by the specs.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Sent from my iPhone
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com>
>>>> >>>>>>>>> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
>>>> >>>>>>>>> <fi...@gmail.com>
>>>> >>>>>>>>>
>>>> >>>>>>>>>> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>  Hi,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>>  Charles, I believe UTF-8 is the default encoding for RI, and
>>>> it
>>>> >>>>>>>>>>> sounds
>>>> >>>>>>>>>>> reasonable.
>>>> >>>>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we
>>>> need
>>>> >>>>>>>>>>> to
>>>> >>>>>>>>>>> run
>>>> >>>>>>>>>>> more tests to verify?
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>  Hi guys:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>> I am doing some test cases on the ant junit test case and
>>>> >>>>>>>>>>>> meeting
>>>> >>>>>>>>>>>> some
>>>> >>>>>>>>>>>> encoding problems. I find they are maybe caused by the
>>>> different
>>>> >>>>>>>>>>>> default
>>>> >>>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI
>>>> >>>>>>>>>>>> default is
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>  UTF-8
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>  but harmony is 8859-1. And then I have encountered
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> HARMONY-3736<
>>>> https://issues.apache.org/jira/browse/HARMONY-3736>,
>>>> >>>>>>>>>>>> and the two diffs attached on that issue. It seems we always
>>>> get
>>>> >>>>>>>>>>>> 8859-1.
>>>> >>>>>>>>>>>> Because: (correct me if wrong :-)
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> 1. we remove the set code in the vm. we will always get null
>>>> if
>>>> >>>>>>>>>>>> we
>>>> >>>>>>>>>>>> call
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>  vm
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>  method
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null
>>>> from
>>>> >>>>>>>>>>>> vm,
>>>> >>>>>>>>>>>> we
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>  set
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>  Sorry, it should be luniglob.c
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>  8859-1.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> 3. we can not set file.encode on the run time.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>>>> >>>>>>>>>>>> character.
>>>> >>>>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>>>> >>>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says
>>>> >>>>>>>>>>>> "In
>>>> >>>>>>>>>>>> computing
>>>> >>>>>>>>>>>> applications, encodings that provide full UCS support (such as
>>>> >>>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>>> >>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>>> >>>>>>>>>>>> increasing
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>  favor
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>>>> >>>>>>>>>>> iso8859-1
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> to
>>>> >>>>>>>>>>>> utf-8?
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> --
>>>> >>>>>>>>>>>> Yours sincerely,
>>>> >>>>>>>>>>>> Charles Lee
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>> --
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Best Regards!
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Jimmy, Jing Lv
>>>> >>>>>>>>>>> China Software Development Lab, IBM
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>> --
>>>> >>>>>>>>>> Yours sincerely,
>>>> >>>>>>>>>> Charles Lee
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>> --
>>>> >>>>>>> Best Regards,
>>>> >>>>>>> Regis.
>>>> >>>>>>>
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> Best Regards,
>>>> >>>>> Regis.
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Best Regards,
>>>> >>> Regis.
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Yours sincerely,
>>>> >> Charles Lee
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Yours sincerely,
>>>> > Charles Lee
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Yours sincerely,
>>> Charles Lee
>>>
>>
>

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nd...@apache.org>.

On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<nd...@apache.org> wrote:
> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com> wrote:
>> Hi Nathan,
>>
>> What I got is 936, the code page identifier. Is there a api for us to map
>> 936 to the gb2312?
>
> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> that into a name of some sort. I'll poke around a bit and see what I
> can find.

We'll probably just have to put in a mapping ourselves based on the
documentation. We'd call GetACP [1] and map that to a known alias in
java.nio.charset that matches the definitions[2] of the identifiers.

[1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
[2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
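
A rough sketch of such a mapping is below. The function name, the table, and
the choice of entries are illustrative assumptions, not taken from any
patch; the charset names are the usual IANA/java.nio.charset names for those
code pages. On Windows the argument would come from GetACP(); the lookup
itself is plain C:

```c
#include <stddef.h>

/* One row of the (hypothetical) code page -> charset name table. */
struct codepage_entry {
    unsigned int codepage;  /* Windows code page identifier */
    const char *charset;    /* charset name to put in file.encoding */
};

/* Deliberately small; a real table would follow the MSDN identifier list. */
static const struct codepage_entry CODEPAGE_TABLE[] = {
    {   932, "windows-31j"  },
    {   936, "GBK"          },
    {  1250, "windows-1250" },
    {  1251, "windows-1251" },
    {  1252, "windows-1252" },
    { 28591, "ISO-8859-1"   },
    { 65001, "UTF-8"        },
};

/* Returns the charset name for a code page, or NULL if it is not listed. */
const char *charset_for_codepage(unsigned int codepage)
{
    size_t i;
    size_t count = sizeof CODEPAGE_TABLE / sizeof CODEPAGE_TABLE[0];

    for (i = 0; i < count; i++) {
        if (CODEPAGE_TABLE[i].codepage == codepage)
            return CODEPAGE_TABLE[i].charset;
    }
    return NULL;
}
```

A caller would fall back to a fixed default when NULL comes back. Note that
936 is mapped to GBK here rather than gb2312, since code page 936 is a
superset of GB2312.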

>
>> If we put 936 in the file.encoding, can we successfully get the encoder and
>> decoder by charset?
>>
>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <nd...@apache.org> wrote:
>>
>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<li...@gmail.com> wrote:
>>> > Hi guys,
>>> >
>>> > I have add the locale function in the drlvm, the patch is attached.
>>> Please
>>> > try this new patch on the linux.
>>> >
>>> > The patch should work on the linux but fail on the windows. Because
>>> windows
>>> > returns code page not charset from the setlocale.
>>>
>>> Code page and character set are the same thing. We shouldn't need to
>>> convert it as the Charset APIs will have to support the values anyway.
>>>
>>> What's the value you're getting? If it's 'Cp1252', then we're good, as
>>> that's just an alias for 'Windows-1252' (or vice-versa).
>>>
>>> -Nathan
>>>
>>>
>>> > I hv tried long time to
>>> > get the charset name from the codepage, for example:
>>> > CPINFOEX cpInfoEx;
>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>>> > if (iReturn) {
>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>>> > }
>>> > But I only get the full name without any format.
>>> >
>>> > There is code page identifiers map in the msdn, detail here. I may hard
>>> code
>>> > this map in the file. But the note on the msdn says:
>>> >      "ANSI code pages can be different on different computers, or can be
>>> > changed for a single computer, leading to data corruption. For the most
>>> > consistent results, applications should use Unicode, such as UTF-8 or
>>> > UTF-16, instead of a specific code page."
>>> > I am afraid hard-code will fail on some machines. (By the way, this seems
>>> > the UTF-8 is suggested to be the default again :-)
>>> >
>>> > There is also a class Encoding in the VC++, detail here. But we can not
>>> use
>>> > it here.
>>> >
>>> > So anyone knows some thing about locale on the windows?
>>> > Again, shall use UTF-8 as our default?
>>> >
>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com>
>>> wrote:
>>> >>
>>> >> That seems we should add it in the drlvm.
>>> >>
>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
>>> >>>
>>> >>> Nathan Beyer wrote:
>>> >>>>
>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>>> >>>> DRLVM?
>>> >>>
>>> >>> Yes, I only tested on Linux, IBM VME set the property correctly.
>>> >>>
>>> >>>>
>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> Kevin Zhou wrote:
>>> >>>>>>
>>> >>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding"
>>> property
>>> >>>>>> adown
>>> >>>>>> VM but fails to get the correct encoding.
>>> >>>>>>
>>> >>>>>> Regis, do you know any other specific ways that CL can gain the
>>> right
>>> >>>>>> property?
>>> >>>>>
>>> >>>>> We can get from OS directly. Maybe just read env variables on Linux?
>>> >>>>>
>>> >>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>>> Charles Lee wrote:
>>> >>>>>>>
>>> >>>>>>>> Hi Nanthan,
>>> >>>>>>>>
>>> >>>>>>>> If the file encoding derive from the OS, it should be the some
>>> bugs
>>> >>>>>>>> in
>>> >>>>>>>> it
>>> >>>>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default
>>> >>>>>>>> codec
>>> >>>>>>>> is
>>> >>>>>>>> still ISO8859-1. Do you know where can we found such codes?
>>> >>>>>>>>
>>> >>>>>>> Classlib expected vm do this and set the property, but it didn't,
>>> so
>>> >>>>>>> we
>>> >>>>>>> have to do this by ourselves.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com>
>>> >>>>>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>  Are we talking about windows or linux? the default file encoding
>>> >>>>>>>> should
>>> >>>>>>>>>
>>> >>>>>>>>> derive from the OS. I believe that's defined by the specs.
>>> >>>>>>>>>
>>> >>>>>>>>> Sent from my iPhone
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com>
>>> >>>>>>>>> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
>>> >>>>>>>>> <fi...@gmail.com>
>>> >>>>>>>>>
>>> >>>>>>>>>> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>>  Hi,
>>> >>>>>>>>>>
>>> >>>>>>>>>>>  Charles, I believe UTF-8 is the default encoding for RI, and
>>> it
>>> >>>>>>>>>>> sounds
>>> >>>>>>>>>>> reasonable.
>>> >>>>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we
>>> need
>>> >>>>>>>>>>> to
>>> >>>>>>>>>>> run
>>> >>>>>>>>>>> more tests to verify?
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>  Hi guys:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> I am doing some test cases on the ant junit test case and
>>> >>>>>>>>>>>> meeting
>>> >>>>>>>>>>>> some
>>> >>>>>>>>>>>> encoding problems. I find they are maybe caused by the
>>> different
>>> >>>>>>>>>>>> default
>>> >>>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI
>>> >>>>>>>>>>>> default is
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>  UTF-8
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>  but harmony is 8859-1. And then I have encountered
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> HARMONY-3736<
>>> https://issues.apache.org/jira/browse/HARMONY-3736>,
>>> >>>>>>>>>>>> and the two diffs attached on that issue. It seems we always
>>> get
>>> >>>>>>>>>>>> 8859-1.
>>> >>>>>>>>>>>> Because: (correct me if wrong :-)
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> 1. we remove the set code in the vm. we will always get null
>>> if
>>> >>>>>>>>>>>> we
>>> >>>>>>>>>>>> call
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>  vm
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>  method
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null
>>> from
>>> >>>>>>>>>>>> vm,
>>> >>>>>>>>>>>> we
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>  set
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>  Sorry, it should be luniglob.c
>>> >>>>>>>>>>>
>>> >>>>>>>>>>  8859-1.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> 3. we can not set file.encode on the run time.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>>> >>>>>>>>>>>> character.
>>> >>>>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>>> >>>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says
>>> >>>>>>>>>>>> "In
>>> >>>>>>>>>>>> computing
>>> >>>>>>>>>>>> applications, encodings that provide full UCS support (such as
>>> >>>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>> >>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>> >>>>>>>>>>>> increasing
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>  favor
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>>> >>>>>>>>>>> iso8859-1
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> to
>>> >>>>>>>>>>>> utf-8?
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> --
>>> >>>>>>>>>>>> Yours sincerely,
>>> >>>>>>>>>>>> Charles Lee
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>> --
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Best Regards!
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Jimmy, Jing Lv
>>> >>>>>>>>>>> China Software Development Lab, IBM
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>> --
>>> >>>>>>>>>> Yours sincerely,
>>> >>>>>>>>>> Charles Lee
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>> --
>>> >>>>>>> Best Regards,
>>> >>>>>>> Regis.
>>> >>>>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> Best Regards,
>>> >>>>> Regis.
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Best Regards,
>>> >>> Regis.
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Yours sincerely,
>>> >> Charles Lee
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Yours sincerely,
>>> > Charles Lee
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Yours sincerely,
>> Charles Lee
>>
>

Re: Shall we change our file.encoding

Posted by Alexey Varlamov <al...@gmail.com>.

2009/7/17, Nathan Beyer <nb...@gmail.com>:
> On Thu, Jul 16, 2009 at 2:27 AM, Alexey
> Varlamov<al...@gmail.com> wrote:
> > The main point of the HARMONY-3736 was: why any VM should care about
> > classlib-specific properties? Let classlib do it, not DRLVM.
>
> Can you point to some conversation that backs this up? I looked at
> that issue and I don't interpret it like you do.
Well, it probably was not stated prominently enough in that issue. But
the idea is simple: we should minimize the VM porting interface. Things
like detecting the system timezone/locale/encoding are VM-agnostic and
are never even used inside the VM.
Call it a modularity improvement ;)

>
> In any case, it looks like this work should be done on this issue,
> since it's what we're talking about -
> https://issues.apache.org/jira/browse/HARMONY-3829.
Ah - yes, thanks for pointing this out.

--
Alexey

>
> -Nathan
>

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nb...@gmail.com>.

On Thu, Jul 16, 2009 at 2:27 AM, Alexey
Varlamov<al...@gmail.com> wrote:
> The main point of the HARMONY-3736 was: why any VM should care about
> classlib-specific properties? Let classlib do it, not DRLVM.

Can you point to some conversation that backs this up? I looked at
that issue and I don't interpret it like you do.

In any case, it looks like this work should be done on this issue,
since it's what we're talking about -
https://issues.apache.org/jira/browse/HARMONY-3829.

-Nathan

>
> Regards,
> Alexey
>
> 2009/7/16, Charles Lee <li...@gmail.com>:
>> Hi guys,
>>
>> I have add the locale function in the drlvm, the patch is attached. Please
>> try this new patch on the linux.
>>
>> The patch should work on the linux but fail on the windows. Because windows
>> returns code page not charset from the setlocale. I hv tried long time to
>> get the charset name from the codepage, for example:
>> CPINFOEX cpInfoEx;
>> BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>> if (iReturn) {
>>     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>> }
>> But I only get the full name without any format.
>>
>> There is code page identifiers map in the msdn, detail here. I may hard code
>> this map in the file. But the note on the msdn says:
>>      "ANSI code pages can be different on different computers, or can be
>> changed for a single computer, leading to data corruption. For the most
>> consistent results, applications should use Unicode, such as UTF-8 or
>> UTF-16, instead of a specific code page."
>> I am afraid hard-code will fail on some machines. (By the way, this seems
>> the UTF-8 is suggested to be the default again :-)
>>
>> There is also a class Encoding in the VC++, detail here. But we can not use
>> it here.
>>
>> So anyone knows some thing about locale on the windows?
>> Again, shall use UTF-8 as our default?
>>
>>
>> On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com> wrote:
>> > That seems we should add it in the drlvm.
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
>> >
>> > >
>> > > Nathan Beyer wrote:
>> > >
>> > > > Is the IBM VME dealing with this correctly? Do we just need to fix
>> DRLVM?
>> > > >
>> > >
>> > > Yes, I only tested on Linux, IBM VME set the property correctly.
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > >
>> > > > On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>> > > >
>> > > > > Kevin Zhou wrote:
>> > > > >
>> > > > > > Yea, from luniglob.c, CL attempts to read the "file.encoding"
>> property
>> > > > > > adown
>> > > > > > VM but fails to get the correct encoding.
>> > > > > >
>> > > > > > Regis, do you know any other specific ways that CL can gain the
>> right
>> > > > > > property?
>> > > > > >
>> > > > > We can get from OS directly. Maybe just read env variables on Linux?
>> > > > >
>> > > > >
>> > > > > > Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>> > > > > >
>> > > > > >
>> > > > > > > Charles Lee wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > > > Hi Nanthan,
>> > > > > > > >
>> > > > > > > > If the file encoding derive from the OS, it should be the some
>> bugs in
>> > > > > > > > it
>> > > > > > > > because on my LINUX machine the locale is en_US.UTF-8. Our
>> default codec
>> > > > > > > > is
>> > > > > > > > still ISO8859-1. Do you know where can we found such codes?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > Classlib expected vm do this and set the property, but it
>> didn't, so we
>> > > > > > > have to do this by ourselves.
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > > On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer
>> <nb...@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > >  Are we talking about windows or linux? the default file
>> encoding should
>> > > > > > > >
>> > > > > > > > > derive from the OS. I believe that's defined by the specs.
>> > > > > > > > >
>> > > > > > > > > Sent from my iPhone
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Jul 14, 2009, at 5:51 AM, Charles Lee
>> <li...@gmail.com> wrote:
>> > > > > > > > >
>> > > > > > > > >  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
>> <fi...@gmail.com>
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > >  Hi,
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > >  Charles, I believe UTF-8 is the default encoding for
>> RI, and it
>> > > > > > > > > > > sounds
>> > > > > > > > > > > reasonable.
>> > > > > > > > > > >  BTW, it may encounter some compatibility problem, maybe
>> we need to
>> > > > > > > > > > > run
>> > > > > > > > > > > more tests to verify?
>> > > > > > > > > > >
>> > > > > > > > > > > 2009/7/14 Charles Lee <li...@gmail.com>
>> > > > > > > > > > >
>> > > > > > > > > > >  Hi guys:
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > > I am doing some test cases on the ant junit test case
>> and meeting
>> > > > > > > > > > > > some
>> > > > > > > > > > > > encoding problems. I find they are maybe caused by the
>> different
>> > > > > > > > > > > > default
>> > > > > > > > > > > > encoding from RI and harmony. My local is en_US.UTF-8,
>> RI default is
>> > > > > > > > > > > >
>> > > > > > > > > > > >  UTF-8
>> > > > > > > > > > > >
>> > > > > > > > > > >  but harmony is 8859-1. And then I have encountered
>> > > > > > > > > > >
>> > > > > > > > > > > >
>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
>> > > > > > > > > > > > and the two diffs attached on that issue. It seems we
>> always get
>> > > > > > > > > > > > 8859-1.
>> > > > > > > > > > > > Because: (correct me if wrong :-)
>> > > > > > > > > > > >
>> > > > > > > > > > > > 1. we remove the set code in the vm. we will always
>> get null if we
>> > > > > > > > > > > > call
>> > > > > > > > > > > >
>> > > > > > > > > > > >  vm
>> > > > > > > > > > > >
>> > > > > > > > > > >  method
>> > > > > > > > > > >
>> > > > > > > > > > > > 2. we set the file.encode in the libglob.c, if we got
>> null from vm,
>> > > > > > > > > > > > we
>> > > > > > > > > > > >
>> > > > > > > > > > > >  set
>> > > > > > > > > > > >
>> > > > > > > > > > >  Sorry, it should be luniglob.c
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >  8859-1.
>> > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > > 3. we can not set file.encode on the run time.
>> > > > > > > > > > > >
>> > > > > > > > > > > > ant use UTF-8 to encode filename which contains the
>> non-ascii
>> > > > > > > > > > > > character.
>> > > > > > > > > > > > So why we use iso8859-1 as our unchangeable default?
>> > > > > > > > > > > > From the wiki
>> http://en.wikipedia.org/wiki/ISO8859-1, it says "In
>> > > > > > > > > > > > computing
>> > > > > > > > > > > > applications, encodings that provide full UCS support
>> (such as
>> > > > > > > > > > > >
>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>> > > > > > > > > > > > UTF-16
>> <http://en.wikipedia.org/wiki/UTF-16>) are finding
>> increasing
>> > > > > > > > > > > >
>> > > > > > > > > > > >  favor
>> > > > > > > > > > > >
>> > > > > > > > > > >  over encodings based on ISO 8859-1." Should we simply
>> change
>> > > > > > > > > > > iso8859-1
>> > > > > > > > > > >
>> > > > > > > > > > > > to
>> > > > > > > > > > > > utf-8?
>> > > > > > > > > > > >
>> > > > > > > > > > > > --
>> > > > > > > > > > > > Yours sincerely,
>> > > > > > > > > > > > Charles Lee
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > --
>> > > > > > > > > > >
>> > > > > > > > > > > Best Regards!
>> > > > > > > > > > >
>> > > > > > > > > > > Jimmy, Jing Lv
>> > > > > > > > > > > China Software Development Lab, IBM
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > --
>> > > > > > > > > > Yours sincerely,
>> > > > > > > > > > Charles Lee
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > --
>> > > > > > > Best Regards,
>> > > > > > > Regis.
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > > --
>> > > > > Best Regards,
>> > > > > Regis.
>> > > > >
>> > > > >
>> > > >
>> > > >
>> > >
>> > >
>> > > --
>> > > Best Regards,
>> > > Regis.
>> > >
>> >
>> >
>> >
>> > --
>> > Yours sincerely,
>> > Charles Lee
>> >
>> >
>>
>>
>>
>> --
>> Yours sincerely,
>> Charles Lee
>>
>>
>>
>

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nd...@apache.org>.

On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<li...@gmail.com> wrote:
> Hi Nathan,
>
> What I got is 936, the code page identifier. Is there a api for us to map
> 936 to the gb2312?

Oh, the 'identifier' bit was missing - yeah, we'll need to translate
that into a name of some sort. I'll poke around a bit and see what I
can find.

> If we put 936 in the file.encoding, can we successfully get the encoder and
> decoder by charset?
>
> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <nd...@apache.org> wrote:
>
>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<li...@gmail.com> wrote:
>> > Hi guys,
>> >
>> > I have add the locale function in the drlvm, the patch is attached.
>> Please
>> > try this new patch on the linux.
>> >
>> > The patch should work on the linux but fail on the windows. Because
>> windows
>> > returns code page not charset from the setlocale.
>>
>> Code page and character set are the same thing. We shouldn't need to
>> convert it as the Charset APIs will have to support the values anyway.
>>
>> What's the value you're getting? If it's 'Cp1252', then we're good, as
>> that's just an alias for 'Windows-1252' (or vice-versa).
>>
>> -Nathan
>>
>>
>> > I hv tried long time to
>> > get the charset name from the codepage, for example:
>> > CPINFOEX cpInfoEx;
>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>> > if (iReturn) {
>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>> > }
>> > But I only get the full name without any format.
>> >
>> > There is code page identifiers map in the msdn, detail here. I may hard
>> code
>> > this map in the file. But the note on the msdn says:
>> >      "ANSI code pages can be different on different computers, or can be
>> > changed for a single computer, leading to data corruption. For the most
>> > consistent results, applications should use Unicode, such as UTF-8 or
>> > UTF-16, instead of a specific code page."
>> > I am afraid hard-code will fail on some machines. (By the way, this seems
>> > the UTF-8 is suggested to be the default again :-)
>> >
>> > There is also a class Encoding in the VC++, detail here. But we can not
>> use
>> > it here.
>> >
>> > So anyone knows some thing about locale on the windows?
>> > Again, shall use UTF-8 as our default?
>> >
>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com>
>> wrote:
>> >>
>> >> That seems we should add it in the drlvm.
>> >>
>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
>> >>>
>> >>> Nathan Beyer wrote:
>> >>>>
>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>> >>>> DRLVM?
>> >>>
>> >>> Yes, I only tested on Linux, IBM VME set the property correctly.
>> >>>
>> >>>>
>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>> >>>>>
>> >>>>> Kevin Zhou wrote:
>> >>>>>>
>> >>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding"
>> property
>> >>>>>> adown
>> >>>>>> VM but fails to get the correct encoding.
>> >>>>>>
>> >>>>>> Regis, do you know any other specific ways that CL can gain the
>> right
>> >>>>>> property?
>> >>>>>
>> >>>>> We can get from OS directly. Maybe just read env variables on Linux?
>> >>>>>
>> >>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>> >>>>>>
>> >>>>>>> Charles Lee wrote:
>> >>>>>>>
>> >>>>>>>> Hi Nanthan,
>> >>>>>>>>
>> >>>>>>>> If the file encoding derive from the OS, it should be the some
>> bugs
>> >>>>>>>> in
>> >>>>>>>> it
>> >>>>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default
>> >>>>>>>> codec
>> >>>>>>>> is
>> >>>>>>>> still ISO8859-1. Do you know where can we found such codes?
>> >>>>>>>>
>> >>>>>>> Classlib expected vm do this and set the property, but it didn't,
>> so
>> >>>>>>> we
>> >>>>>>> have to do this by ourselves.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>  Are we talking about windows or linux?the default file encoding
>> >>>>>>>> should
>> >>>>>>>>>
>> >>>>>>>>> derive from the OS. I believe that's defined by the specs.
>> >>>>>>>>>
>> >>>>>>>>> Sent from my iPhone
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com>
>> >>>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
>> >>>>>>>>> <fi...@gmail.com>
>> >>>>>>>>>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>  Hi,
>> >>>>>>>>>>
>> >>>>>>>>>>>  Charles, I believe UTF-8 is the default encoding for RI, and
>> it
>> >>>>>>>>>>> sounds
>> >>>>>>>>>>> reasonable.
>> >>>>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we
>> need
>> >>>>>>>>>>> to
>> >>>>>>>>>>> run
>> >>>>>>>>>>> more tests to verify?
>> >>>>>>>>>>>
>> >>>>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>> >>>>>>>>>>>
>> >>>>>>>>>>>  Hi guys:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> I am doing some test cases on the ant junit test case and
>> >>>>>>>>>>>> meeting
>> >>>>>>>>>>>> some
>> >>>>>>>>>>>> encoding problems. I find they are maybe caused by the
>> different
>> >>>>>>>>>>>> default
>> >>>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI
>> >>>>>>>>>>>> default is
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>  UTF-8
>> >>>>>>>>>>>
>> >>>>>>>>>>>  but harmony is 8859-1. And then I have encountered
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> HARMONY-3736<
>> https://issues.apache.org/jira/browse/HARMONY-3736>,
>> >>>>>>>>>>>> and the two diffs attached on that issue. It seems we always
>> get
>> >>>>>>>>>>>> 8859-1.
>> >>>>>>>>>>>> Because: (correct me if wrong :-)
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> 1. we remove the set code in the vm. we will always get null
>> if
>> >>>>>>>>>>>> we
>> >>>>>>>>>>>> call
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>  vm
>> >>>>>>>>>>>
>> >>>>>>>>>>>  method
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null
>> from
>> >>>>>>>>>>>> vm,
>> >>>>>>>>>>>> we
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>  set
>> >>>>>>>>>>>
>> >>>>>>>>>>>  Sorry, it should be luniglob.c
>> >>>>>>>>>>>
>> >>>>>>>>>>  8859-1.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> 3. we can not set file.encode on the run time.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>> >>>>>>>>>>>> character.
>> >>>>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>> >>>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says
>> >>>>>>>>>>>> "In
>> >>>>>>>>>>>> computing
>> >>>>>>>>>>>> applications, encodings that provide full UCS support (such as
>> >>>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>> >>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>> >>>>>>>>>>>> increasing
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>  favor
>> >>>>>>>>>>>
>> >>>>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>> >>>>>>>>>>> iso8859-1
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> to
>> >>>>>>>>>>>> utf-8?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> --
>> >>>>>>>>>>>> Yours sincerely,
>> >>>>>>>>>>>> Charles Lee
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best Regards!
>> >>>>>>>>>>>
>> >>>>>>>>>>> Jimmy, Jing Lv
>> >>>>>>>>>>> China Software Development Lab, IBM
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>> --
>> >>>>>>>>>> Yours sincerely,
>> >>>>>>>>>> Charles Lee
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>> --
>> >>>>>>> Best Regards,
>> >>>>>>> Regis.
>> >>>>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Best Regards,
>> >>>>> Regis.
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> Best Regards,
>> >>> Regis.
>> >>
>> >>
>> >>
>> >> --
>> >> Yours sincerely,
>> >> Charles Lee
>> >>
>> >
>> >
>> >
>> > --
>> > Yours sincerely,
>> > Charles Lee
>> >
>> >
>>
>
>
>
> --
> Yours sincerely,
> Charles Lee
>

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

Hi Nathan,

What I got is 936, the code page identifier. Is there an API for us to map
936 to gb2312?
If we put 936 in file.encoding, can we successfully get the encoder and
decoder by charset name?
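
To make the mapping question concrete, here is a minimal C sketch of one
possible approach, not code from Harmony. The helper name
code_page_to_charset_name and the three-entry table are assumptions made
for illustration; a real table would need to cover far more code pages.
Unknown IDs fall back to the "Cp<id>" spelling, on the assumption that the
Charset implementation registers such aliases.

```c
#include <stdio.h>

/* Illustrative table only: a few well-known Windows code page IDs and a
 * charset name commonly used for each. Not taken from the Harmony sources. */
struct cp_name {
    unsigned int id;
    const char *name;
};

static const struct cp_name known_code_pages[] = {
    { 936,   "GBK" },          /* Simplified Chinese, a superset of GB2312 */
    { 1252,  "windows-1252" }, /* Western European */
    { 65001, "UTF-8" },
};

/* Hypothetical helper: writes a charset name for code page `id` into `buf`.
 * IDs missing from the table fall back to the "Cp<id>" spelling.
 * Returns 0 on success, -1 if `buf` is too small. */
static int code_page_to_charset_name(unsigned int id, char *buf, size_t size)
{
    size_t i;
    int written;

    for (i = 0; i < sizeof known_code_pages / sizeof known_code_pages[0]; i++) {
        if (known_code_pages[i].id == id) {
            written = snprintf(buf, size, "%s", known_code_pages[i].name);
            return (written >= 0 && (size_t)written < size) ? 0 : -1;
        }
    }
    written = snprintf(buf, size, "Cp%u", id);
    return (written >= 0 && (size_t)written < size) ? 0 : -1;
}
```

Whether the resulting name resolves still depends on which charsets and
aliases the class library actually ships, so the result would need to be
checked against the Charset provider before it is stored in file.encoding.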

On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <nd...@apache.org> wrote:

> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<li...@gmail.com> wrote:
> > Hi guys,
> >
> > I have add the locale function in the drlvm, the patch is attached.
> Please
> > try this new patch on the linux.
> >
> > The patch should work on the linux but fail on the windows. Because
> windows
> > returns code page not charset from the setlocale.
>
> Code page and character set are the same thing. We shouldn't need to
> convert it as the Charset APIs will have to support the values anyway.
>
> What's the value you're getting? If it's 'Cp1252', then we're good, as
> that's just an alias for 'Windows-1252' (or vice-versa).
>
> -Nathan
>
>
> > I hv tried long time to
> > get the charset name from the codepage, for example:
> > CPINFOEX cpInfoEx;
> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
> > if (iReturn > 0) {
> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
> > }
> > But I only get the full name without any format.
> >
> > There is code page identifiers map in the msdn, detail here. I may hard
> code
> > this map in the file. But the note on the msdn says:
> >      "ANSI code pages can be different on different computers, or can be
> > changed for a single computer, leading to data corruption. For the most
> > consistent results, applications should use Unicode, such as UTF-8 or
> > UTF-16, instead of a specific code page."
> > I am afraid hard-code will fail on some machines. (By the way, this seems
> > the UTF-8 is suggested to be the default again :-)
> >
> > There is also a class Encoding in the VC++, detail here. But we can not
> use
> > it here.
> >
> > So anyone knows some thing about locale on the windows?
> > Again, shall use UTF-8 as our default?
> >
> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com>
> wrote:
> >>
> >> That seems we should add it in the drlvm.
> >>
> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
> >>>
> >>> Nathan Beyer wrote:
> >>>>
> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
> >>>> DRLVM?
> >>>
> >>> Yes, I only tested on Linux, IBM VME set the property correctly.
> >>>
> >>>>
> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
> >>>>>
> >>>>> Kevin Zhou wrote:
> >>>>>>
> >>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding"
> property
> >>>>>> adown
> >>>>>> VM but fails to get the correct encoding.
> >>>>>>
> >>>>>> Regis, do you know any other specific ways that CL can gain the
> right
> >>>>>> property?
> >>>>>
> >>>>> We can get from OS directly. Maybe just read env variables on Linux?
> >>>>>
> >>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Charles Lee wrote:
> >>>>>>>
> >>>>>>>> Hi Nanthan,
> >>>>>>>>
> >>>>>>>> If the file encoding derive from the OS, it should be the some
> bugs
> >>>>>>>> in
> >>>>>>>> it
> >>>>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default
> >>>>>>>> codec
> >>>>>>>> is
> >>>>>>>> still ISO8859-1. Do you know where can we found such codes?
> >>>>>>>>
> >>>>>>> Classlib expected vm do this and set the property, but it didn't,
> so
> >>>>>>> we
> >>>>>>> have to do this by ourselves.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>  Are we talking about windows or linux?the default file encoding
> >>>>>>>> should
> >>>>>>>>>
> >>>>>>>>> derive from the OS. I believe that's defined by the specs.
> >>>>>>>>>
> >>>>>>>>> Sent from my iPhone
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
> >>>>>>>>> <fi...@gmail.com>
> >>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>  Hi,
> >>>>>>>>>>
> >>>>>>>>>>>  Charles, I believe UTF-8 is the default encoding for RI, and
> it
> >>>>>>>>>>> sounds
> >>>>>>>>>>> reasonable.
> >>>>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we
> need
> >>>>>>>>>>> to
> >>>>>>>>>>> run
> >>>>>>>>>>> more tests to verify?
> >>>>>>>>>>>
> >>>>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
> >>>>>>>>>>>
> >>>>>>>>>>>  Hi guys:
> >>>>>>>>>>>
> >>>>>>>>>>>> I am doing some test cases on the ant junit test case and
> >>>>>>>>>>>> meeting
> >>>>>>>>>>>> some
> >>>>>>>>>>>> encoding problems. I find they are maybe caused by the
> different
> >>>>>>>>>>>> default
> >>>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI
> >>>>>>>>>>>> default is
> >>>>>>>>>>>>
> >>>>>>>>>>>>  UTF-8
> >>>>>>>>>>>
> >>>>>>>>>>>  but harmony is 8859-1. And then I have encountered
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> HARMONY-3736<
> https://issues.apache.org/jira/browse/HARMONY-3736>,
> >>>>>>>>>>>> and the two diffs attached on that issue. It seems we always
> get
> >>>>>>>>>>>> 8859-1.
> >>>>>>>>>>>> Because: (correct me if wrong :-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. we remove the set code in the vm. we will always get null
> if
> >>>>>>>>>>>> we
> >>>>>>>>>>>> call
> >>>>>>>>>>>>
> >>>>>>>>>>>>  vm
> >>>>>>>>>>>
> >>>>>>>>>>>  method
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null
> from
> >>>>>>>>>>>> vm,
> >>>>>>>>>>>> we
> >>>>>>>>>>>>
> >>>>>>>>>>>>  set
> >>>>>>>>>>>
> >>>>>>>>>>>  Sorry, it should be luniglob.c
> >>>>>>>>>>>
> >>>>>>>>>>  8859-1.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 3. we can not set file.encode on the run time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
> >>>>>>>>>>>> character.
> >>>>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
> >>>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says
> >>>>>>>>>>>> "In
> >>>>>>>>>>>> computing
> >>>>>>>>>>>> applications, encodings that provide full UCS support (such as
> >>>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
> >>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
> >>>>>>>>>>>> increasing
> >>>>>>>>>>>>
> >>>>>>>>>>>>  favor
> >>>>>>>>>>>
> >>>>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
> >>>>>>>>>>> iso8859-1
> >>>>>>>>>>>>
> >>>>>>>>>>>> to
> >>>>>>>>>>>> utf-8?
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Yours sincerely,
> >>>>>>>>>>>> Charles Lee
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>>
> >>>>>>>>>>> Best Regards!
> >>>>>>>>>>>
> >>>>>>>>>>> Jimmy, Jing Lv
> >>>>>>>>>>> China Software Development Lab, IBM
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Yours sincerely,
> >>>>>>>>>> Charles Lee
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>> --
> >>>>>>> Best Regards,
> >>>>>>> Regis.
> >>>>>>>
> >>>>>
> >>>>> --
> >>>>> Best Regards,
> >>>>> Regis.
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Best Regards,
> >>> Regis.
> >>
> >>
> >>
> >> --
> >> Yours sincerely,
> >> Charles Lee
> >>
> >
> >
> >
> > --
> > Yours sincerely,
> > Charles Lee
> >
> >
>



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nd...@apache.org>.

On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<li...@gmail.com> wrote:
> Hi guys,
>
> I have add the locale function in the drlvm, the patch is attached. Please
> try this new patch on the linux.
>
> The patch should work on the linux but fail on the windows. Because windows
> returns code page not charset from the setlocale.

Code page and character set are the same thing. We shouldn't need to
convert it as the Charset APIs will have to support the values anyway.

What's the value you're getting? If it's 'Cp1252', then we're good, as
that's just an alias for 'Windows-1252' (or vice-versa).

-Nathan


> I hv tried long time to
> get the charset name from the codepage, for example:
> CPINFOEX cpInfoEx;
> BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
> if (iReturn > 0) {
>     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
> }
> But I only get the full name without any format.
>
> There is code page identifiers map in the msdn, detail here. I may hard code
> this map in the file. But the note on the msdn says:
>      "ANSI code pages can be different on different computers, or can be
> changed for a single computer, leading to data corruption. For the most
> consistent results, applications should use Unicode, such as UTF-8 or
> UTF-16, instead of a specific code page."
> I am afraid hard-code will fail on some machines. (By the way, this seems
> the UTF-8 is suggested to be the default again :-)
>
> There is also a class Encoding in the VC++, detail here. But we can not use
> it here.
>
> So anyone knows some thing about locale on the windows?
> Again, shall use UTF-8 as our default?
>
> On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com> wrote:
>>
>> That seems we should add it in the drlvm.
>>
>> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
>>>
>>> Nathan Beyer wrote:
>>>>
>>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>>>> DRLVM?
>>>
>>> Yes, I only tested on Linux, IBM VME set the property correctly.
>>>
>>>>
>>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>>>>>
>>>>> Kevin Zhou wrote:
>>>>>>
>>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding" property
>>>>>> adown
>>>>>> VM but fails to get the correct encoding.
>>>>>>
>>>>>> Regis, do you know any other specific ways that CL can gain the right
>>>>>> property?
>>>>>
>>>>> We can get from OS directly. Maybe just read env variables on Linux?
>>>>>
>>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>>>>>>
>>>>>>> Charles Lee wrote:
>>>>>>>
>>>>>>>> Hi Nanthan,
>>>>>>>>
>>>>>>>> If the file encoding derive from the OS, it should be the some bugs
>>>>>>>> in
>>>>>>>> it
>>>>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default
>>>>>>>> codec
>>>>>>>> is
>>>>>>>> still ISO8859-1. Do you know where can we found such codes?
>>>>>>>>
>>>>>>> Classlib expected vm do this and set the property, but it didn't, so
>>>>>>> we
>>>>>>> have to do this by ourselves.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>  Are we talking about windows or linux?the default file encoding
>>>>>>>> should
>>>>>>>>>
>>>>>>>>> derive from the OS. I believe that's defined by the specs.
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
>>>>>>>>> <fi...@gmail.com>
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>  Hi,
>>>>>>>>>>
>>>>>>>>>>>  Charles, I believe UTF-8 is the default encoding for RI, and it
>>>>>>>>>>> sounds
>>>>>>>>>>> reasonable.
>>>>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we need
>>>>>>>>>>> to
>>>>>>>>>>> run
>>>>>>>>>>> more tests to verify?
>>>>>>>>>>>
>>>>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>>>>>>>>>>
>>>>>>>>>>>  Hi guys:
>>>>>>>>>>>
>>>>>>>>>>>> I am doing some test cases on the ant junit test case and
>>>>>>>>>>>> meeting
>>>>>>>>>>>> some
>>>>>>>>>>>> encoding problems. I find they are maybe caused by the different
>>>>>>>>>>>> default
>>>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI
>>>>>>>>>>>> default is
>>>>>>>>>>>>
>>>>>>>>>>>>  UTF-8
>>>>>>>>>>>
>>>>>>>>>>>  but harmony is 8859-1. And then I have encountered
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
>>>>>>>>>>>> and the two diffs attached on that issue. It seems we always get
>>>>>>>>>>>> 8859-1.
>>>>>>>>>>>> Because: (correct me if wrong :-)
>>>>>>>>>>>>
>>>>>>>>>>>> 1. we remove the set code in the vm. we will always get null if
>>>>>>>>>>>> we
>>>>>>>>>>>> call
>>>>>>>>>>>>
>>>>>>>>>>>>  vm
>>>>>>>>>>>
>>>>>>>>>>>  method
>>>>>>>>>>>>
>>>>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null from
>>>>>>>>>>>> vm,
>>>>>>>>>>>> we
>>>>>>>>>>>>
>>>>>>>>>>>>  set
>>>>>>>>>>>
>>>>>>>>>>>  Sorry, it should be luniglob.c
>>>>>>>>>>>
>>>>>>>>>>  8859-1.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. we can not set file.encode on the run time.
>>>>>>>>>>>>
>>>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>>>>>>>>>>>> character.
>>>>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>>>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says
>>>>>>>>>>>> "In
>>>>>>>>>>>> computing
>>>>>>>>>>>> applications, encodings that provide full UCS support (such as
>>>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>>>>>>>>>>> increasing
>>>>>>>>>>>>
>>>>>>>>>>>>  favor
>>>>>>>>>>>
>>>>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>>>>>>>>>>> iso8859-1
>>>>>>>>>>>>
>>>>>>>>>>>> to
>>>>>>>>>>>> utf-8?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>> Charles Lee
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> Best Regards!
>>>>>>>>>>>
>>>>>>>>>>> Jimmy, Jing Lv
>>>>>>>>>>> China Software Development Lab, IBM
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Yours sincerely,
>>>>>>>>>> Charles Lee
>>>>>>>>>>
>>>>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Regis.
>>>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Regis.
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Regis.
>>
>>
>>
>> --
>> Yours sincerely,
>> Charles Lee
>>
>
>
>
> --
> Yours sincerely,
> Charles Lee
>
>

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

Thanks Alexey,

I can move the code to luniglob.c; that is not a big problem for me. My main
question is how to get the charset name on Windows.
Any ideas?
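
For reference, a rough C sketch of asking the OS directly, written under
assumptions rather than taken from any patch in this thread. The helper name
get_os_default_encoding is hypothetical. On Windows it uses GetACP(), which
returns the numeric ANSI code page, and spells it as "Cp<id>"; elsewhere it
uses setlocale() plus nl_langinfo(CODESET).

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <locale.h>
#include <langinfo.h>
#endif

/* Hypothetical helper: writes a name for the platform default encoding
 * into `buf`. Returns 0 on success, -1 on failure. */
static int get_os_default_encoding(char *buf, size_t size)
{
    int written;
#ifdef _WIN32
    /* GetACP() returns the numeric ANSI code page ID, e.g. 936 or 1252. */
    written = snprintf(buf, size, "Cp%u", (unsigned int)GetACP());
#else
    const char *codeset;

    /* Adopt the locale from the environment (LANG, LC_ALL, ...). This
     * changes the process-wide LC_CTYPE setting. If the call fails, the
     * current locale stays in effect and is reported below. */
    (void)setlocale(LC_CTYPE, "");
    codeset = nl_langinfo(CODESET);
    if (codeset == NULL || codeset[0] == '\0') {
        return -1;
    }
    written = snprintf(buf, size, "%s", codeset);
#endif
    return (written >= 0 && (size_t)written < size) ? 0 : -1;
}
```

On a Linux machine with an en_US.UTF-8 locale installed and selected, the
POSIX branch is expected to report UTF-8. The "Cp<id>" spelling on the
Windows branch is only useful if the Charset implementation knows that
alias.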


On Thu, Jul 16, 2009 at 3:27 PM, Alexey Varlamov <
alexey.v.varlamov@gmail.com> wrote:

> The main point of the HARMONY-3736 was: why any VM should care about
> classlib-specific properties? Let classlib do it, not DRLVM.
>
> Regards,
> Alexey
>
> 2009/7/16, Charles Lee <li...@gmail.com>:
> > Hi guys,
> >
> > I have add the locale function in the drlvm, the patch is attached.
> Please
> > try this new patch on the linux.
> >
> > The patch should work on the linux but fail on the windows. Because
> windows
> > returns code page not charset from the setlocale. I hv tried long time to
> > get the charset name from the codepage, for example:
> > CPINFOEX cpInfoEx;
> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
> > if (iReturn > 0) {
> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
> > }
> > But I only get the full name without any format.
> >
> > There is code page identifiers map in the msdn, detail here. I may hard
> code
> > this map in the file. But the note on the msdn says:
> >      "ANSI code pages can be different on different computers, or can be
> > changed for a single computer, leading to data corruption. For the most
> > consistent results, applications should use Unicode, such as UTF-8 or
> > UTF-16, instead of a specific code page."
> > I am afraid hard-code will fail on some machines. (By the way, this seems
> > the UTF-8 is suggested to be the default again :-)
> >
> > There is also a class Encoding in the VC++, detail here. But we can not
> use
> > it here.
> >
> > So anyone knows some thing about locale on the windows?
> > Again, shall use UTF-8 as our default?
> >
> >
> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com>
> wrote:
> > > That seems we should add it in the drlvm.
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
> > >
> > > >
> > > > Nathan Beyer wrote:
> > > >
> > > > > Is the IBM VME dealing with this correctly? Do we just need to fix
> > DRLVM?
> > > > >
> > > >
> > > > Yes, I only tested on Linux, IBM VME set the property correctly.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > >
> > > > > On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
> > > > >
> > > > > > Kevin Zhou wrote:
> > > > > >
> > > > > > > Yea, from luniglob.c, CL attempts to read the "file.encoding"
> > property
> > > > > > > adown
> > > > > > > VM but fails to get the correct encoding.
> > > > > > >
> > > > > > > Regis, do you know any other specific ways that CL can gain the
> > right
> > > > > > > property?
> > > > > > >
> > > > > > We can get from OS directly. Maybe just read env variables on
> Linux?
> > > > > >
> > > > > >
> > > > > > > Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com>
> wrote:
> > > > > > >
> > > > > > >
> > > > > > > > Charles Lee wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > Hi Nanthan,
> > > > > > > > >
> > > > > > > > > If the file encoding derive from the OS, it should be the
> some
> > bugs in
> > > > > > > > > it
> > > > > > > > > because on my LINUX machine the locale is en_US.UTF-8. Our
> > default codec
> > > > > > > > > is
> > > > > > > > > still ISO8859-1. Do you know where can we found such codes?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > Classlib expected vm do this and set the property, but it
> > didn't, so we
> > > > > > > > have to do this by ourselves.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer
> > <nb...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >  Are we talking about windows or linux?the default file
> > encoding should
> > > > > > > > >
> > > > > > > > > > derive from the OS. I believe that's defined by the
> specs.
> > > > > > > > > >
> > > > > > > > > > Sent from my iPhone
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Jul 14, 2009, at 5:51 AM, Charles Lee
> > <li...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > >  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
> > <fi...@gmail.com>
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >  Hi,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >  Charles, I believe UTF-8 is the default encoding for
> > RI, and it
> > > > > > > > > > > > sounds
> > > > > > > > > > > > reasonable.
> > > > > > > > > > > >  BTW, it may encounter some compatibility problem,
> maybe
> > we need to
> > > > > > > > > > > > run
> > > > > > > > > > > > more tests to verify?
> > > > > > > > > > > >
> > > > > > > > > > > > 2009/7/14 Charles Lee <li...@gmail.com>
> > > > > > > > > > > >
> > > > > > > > > > > >  Hi guys:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > I am doing some test cases on the ant junit test
> case
> > and meeting
> > > > > > > > > > > > > some
> > > > > > > > > > > > > encoding problems. I find they are maybe caused by
> the
> > different
> > > > > > > > > > > > > default
> > > > > > > > > > > > > encoding from RI and harmony. My local is
> en_US.UTF-8,
> > RI default is
> > > > > > > > > > > > >
> > > > > > > > > > > > >  UTF-8
> > > > > > > > > > > > >
> > > > > > > > > > > >  but harmony is 8859-1. And then I have encountered
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
> > > > > > > > > > > > > and the two diffs attached on that issue. It seems
> we
> > always get
> > > > > > > > > > > > > 8859-1.
> > > > > > > > > > > > > Because: (correct me if wrong :-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. we remove the set code in the vm. we will always
> > get null if we
> > > > > > > > > > > > > call
> > > > > > > > > > > > >
> > > > > > > > > > > > >  vm
> > > > > > > > > > > > >
> > > > > > > > > > > >  method
> > > > > > > > > > > >
> > > > > > > > > > > > > 2. we set the file.encode in the libglob.c, if we
> got
> > null from vm,
> > > > > > > > > > > > > we
> > > > > > > > > > > > >
> > > > > > > > > > > > >  set
> > > > > > > > > > > > >
> > > > > > > > > > > >  Sorry, it should be luniglob.c
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >  8859-1.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 3. we can not set file.encode on the run time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ant use UTF-8 to encode filename which contains the
> > non-ascii
> > > > > > > > > > > > > character.
> > > > > > > > > > > > > So why we use iso8859-1 as our unchangeable
> default?
> > > > > > > > > > > > > From the wiki
> > http://en.wikipedia.org/wiki/ISO8859-1, it says "In
> > > > > > > > > > > > > computing
> > > > > > > > > > > > > applications, encodings that provide full UCS
> support
> > (such as
> > > > > > > > > > > > >
> > UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
> > > > > > > > > > > > > UTF-16
> > <http://en.wikipedia.org/wiki/UTF-16>) are finding
> > increasing
> > > > > > > > > > > > >
> > > > > > > > > > > > >  favor
> > > > > > > > > > > > >
> > > > > > > > > > > >  over encodings based on ISO 8859-1." Should we
> simply
> > change
> > > > > > > > > > > > iso8859-1
> > > > > > > > > > > >
> > > > > > > > > > > > > to
> > > > > > > > > > > > > utf-8?
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Yours sincerely,
> > > > > > > > > > > > > Charles Lee
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > >
> > > > > > > > > > > > Best Regards!
> > > > > > > > > > > >
> > > > > > > > > > > > Jimmy, Jing Lv
> > > > > > > > > > > > China Software Development Lab, IBM
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Yours sincerely,
> > > > > > > > > > > Charles Lee
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > --
> > > > > > > > Best Regards,
> > > > > > > > Regis.
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > > Regis.
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best Regards,
> > > > Regis.
> > > >
> > >
> > >
> > >
> > > --
> > > Yours sincerely,
> > > Charles Lee
> > >
> > >
> >
> >
> >
> > --
> > Yours sincerely,
> > Charles Lee
> >
> >
> >
>



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Alexey Varlamov <al...@gmail.com>.

The main point of HARMONY-3736 was: why should any VM care about
classlib-specific properties? Let classlib do it, not DRLVM.
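
As a sketch of what "let classlib do it" could look like, assuming the
class library first asks the VM and only then detects the value itself: the
function below is hypothetical and is not the actual luniglob.c code. The
"ISO-8859-1" spelling of the last-resort default is also an assumption.

```c
#include <stddef.h>

/* Hypothetical helper: choose the value for the file.encoding property.
 * `vm_value` is whatever the VM reported (may be NULL); `os_value` is what
 * the class library detected from the OS (may be NULL). */
static const char *choose_file_encoding(const char *vm_value,
                                        const char *os_value)
{
    if (vm_value != NULL && vm_value[0] != '\0') {
        return vm_value;    /* the VM supplied a value: keep it */
    }
    if (os_value != NULL && os_value[0] != '\0') {
        return os_value;    /* otherwise use what the OS reports */
    }
    return "ISO-8859-1";    /* last resort: the old fixed default */
}
```

With this ordering a VM that sets the property keeps control, and a VM that
returns null no longer forces every platform onto the same fixed default.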

Regards,
Alexey

2009/7/16, Charles Lee <li...@gmail.com>:
> Hi guys,
>
> I have add the locale function in the drlvm, the patch is attached. Please
> try this new patch on the linux.
>
> The patch should work on the linux but fail on the windows. Because windows
> returns code page not charset from the setlocale. I hv tried long time to
> get the charset name from the codepage, for example:
> CPINFOEX cpInfoEx;
> BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
> if (iReturn > 0) {
>     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
> }
> But I only get the full name without any format.
>
> There is code page identifiers map in the msdn, detail here. I may hard code
> this map in the file. But the note on the msdn says:
>      "ANSI code pages can be different on different computers, or can be
> changed for a single computer, leading to data corruption. For the most
> consistent results, applications should use Unicode, such as UTF-8 or
> UTF-16, instead of a specific code page."
> I am afraid hard-code will fail on some machines. (By the way, this seems
> the UTF-8 is suggested to be the default again :-)
>
> There is also a class Encoding in the VC++, detail here. But we can not use
> it here.
>
> So anyone knows some thing about locale on the windows?
> Again, shall use UTF-8 as our default?
>
>
> On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com> wrote:
> > That seems we should add it in the drlvm.
> >
> >
> >
> >
> >
> > On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
> >
> > >
> > > Nathan Beyer wrote:
> > >
> > > > Is the IBM VME dealing with this correctly? Do we just need to fix
> DRLVM?
> > > >
> > >
> > > Yes, I only tested on Linux, IBM VME set the property correctly.
> > >
> > >
> > >
> > >
> > >
> > > >
> > > > On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
> > > >
> > > > > Kevin Zhou wrote:
> > > > >
> > > > > > Yea, from luniglob.c, CL attempts to read the "file.encoding"
> property
> > > > > > adown
> > > > > > VM but fails to get the correct encoding.
> > > > > >
> > > > > > Regis, do you know any other specific ways that CL can gain the
> right
> > > > > > property?
> > > > > >
> > > > > We can get from OS directly. Maybe just read env variables on Linux?
> > > > >
> > > > >
> > > > > > Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > > Charles Lee wrote:
> > > > > > >
> > > > > > >
> > > > > > > > Hi Nanthan,
> > > > > > > >
> > > > > > > > If the file encoding derive from the OS, it should be the some
> bugs in
> > > > > > > > it
> > > > > > > > because on my LINUX machine the locale is en_US.UTF-8. Our
> default codec
> > > > > > > > is
> > > > > > > > still ISO8859-1. Do you know where can we found such codes?
> > > > > > > >
> > > > > > > >
> > > > > > > Classlib expected vm do this and set the property, but it
> didn't, so we
> > > > > > > have to do this by ourselves.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer
> <nb...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >  Are we talking about windows or linux?the default file
> encoding should
> > > > > > > >
> > > > > > > > > derive from the OS. I believe that's defined by the specs.
> > > > > > > > >
> > > > > > > > > Sent from my iPhone
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Jul 14, 2009, at 5:51 AM, Charles Lee
> <li...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
> <fi...@gmail.com>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >  Hi,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >  Charles, I believe UTF-8 is the default encoding for
> RI, and it
> > > > > > > > > > > sounds
> > > > > > > > > > > reasonable.
> > > > > > > > > > >  BTW, it may encounter some compatibility problem, maybe
> we need to
> > > > > > > > > > > run
> > > > > > > > > > > more tests to verify?
> > > > > > > > > > >
> > > > > > > > > > > 2009/7/14 Charles Lee <li...@gmail.com>
> > > > > > > > > > >
> > > > > > > > > > >  Hi guys:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > I am doing some test cases on the ant junit test case
> and meeting
> > > > > > > > > > > > some
> > > > > > > > > > > > encoding problems. I find they are maybe caused by the
> different
> > > > > > > > > > > > default
> > > > > > > > > > > > encoding from RI and harmony. My local is en_US.UTF-8,
> RI default is
> > > > > > > > > > > >
> > > > > > > > > > > >  UTF-8
> > > > > > > > > > > >
> > > > > > > > > > >  but harmony is 8859-1. And then I have encountered
> > > > > > > > > > >
> > > > > > > > > > > >
> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
> > > > > > > > > > > > and the two diffs attached on that issue. It seems we
> always get
> > > > > > > > > > > > 8859-1.
> > > > > > > > > > > > Because: (correct me if wrong :-)
> > > > > > > > > > > >
> > > > > > > > > > > > 1. we remove the set code in the vm. we will always
> get null if we
> > > > > > > > > > > > call
> > > > > > > > > > > >
> > > > > > > > > > > >  vm
> > > > > > > > > > > >
> > > > > > > > > > >  method
> > > > > > > > > > >
> > > > > > > > > > > > 2. we set the file.encode in the libglob.c, if we got
> null from vm,
> > > > > > > > > > > > we
> > > > > > > > > > > >
> > > > > > > > > > > >  set
> > > > > > > > > > > >
> > > > > > > > > > >  Sorry, it should be luniglob.c
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >  8859-1.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 3. we can not set file.encode on the run time.
> > > > > > > > > > > >
> > > > > > > > > > > > ant use UTF-8 to encode filename which contains the
> non-ascii
> > > > > > > > > > > > character.
> > > > > > > > > > > > So why we use iso8859-1 as our unchangeable default?
> > > > > > > > > > > > From the wiki
> http://en.wikipedia.org/wiki/ISO8859-1, it says "In
> > > > > > > > > > > > computing
> > > > > > > > > > > > applications, encodings that provide full UCS support
> (such as
> > > > > > > > > > > >
> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
> > > > > > > > > > > > UTF-16
> <http://en.wikipedia.org/wiki/UTF-16>) are finding
> increasing
> > > > > > > > > > > >
> > > > > > > > > > > >  favor
> > > > > > > > > > > >
> > > > > > > > > > >  over encodings based on ISO 8859-1." Should we simply
> change
> > > > > > > > > > > iso8859-1
> > > > > > > > > > >
> > > > > > > > > > > > to
> > > > > > > > > > > > utf-8?
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Yours sincerely,
> > > > > > > > > > > > Charles Lee
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > >
> > > > > > > > > > > Best Regards!
> > > > > > > > > > >
> > > > > > > > > > > Jimmy, Jing Lv
> > > > > > > > > > > China Software Development Lab, IBM
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Yours sincerely,
> > > > > > > > > > Charles Lee
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > --
> > > > > > > Best Regards,
> > > > > > > Regis.
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Best Regards,
> > > > > Regis.
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Regis.
> > >
> >
> >
> >
> > --
> > Yours sincerely,
> > Charles Lee
> >
> >
>
>
>
> --
> Yours sincerely,
> Charles Lee
>
>
>

Re: Shall we change our file.encoding

Posted by Alexey Petrenko <al...@gmail.com>.

As far as I understand, using UTF-8 as the default is not a good idea if
you run Harmony on a system where UTF-8 is not the default encoding,
because the default encoding is used, for example, when opening files.
We will definitely get decoding errors when opening text files on such
systems, so this would make Harmony unusable in those cases.
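
To illustrate the decoding problem described above, here is a minimal sketch in C of a UTF-8 well-formedness check. The function name `is_valid_utf8` is made up for this illustration and is not part of Harmony; the check is simplified (it validates lead and continuation bytes but does not catch every overlong or surrogate form).

```c
#include <stddef.h>

/* Returns 1 if buf[0..len) has well-formed UTF-8 byte structure, 0 otherwise.
   Hypothetical helper for illustration only; simplified on purpose. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;   /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80)
            extra = 0;                      /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;                      /* 2-byte sequence */
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;                      /* 3-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;                      /* 4-byte sequence */
        else
            return 0;                       /* stray continuation or invalid lead */

        if (extra > len - i - 1)
            return 0;                       /* sequence runs past the end */

        for (k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a 10xxxxxx continuation byte */
        }
        i += extra + 1;
    }
    return 1;
}
```

For example, "café" saved as ISO-8859-1 is the bytes 63 61 66 E9. The byte E9 announces a three-byte UTF-8 sequence that never arrives, so the check fails, whereas the UTF-8 form 63 61 66 C3 A9 passes.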

It would be much better to hardcode the translation table if there is
no other way.

Alexey

2009/7/16 Charles Lee <li...@gmail.com>:
> Hi guys,
>
> I have add the locale function in the drlvm, the patch is attached. Please
> try this new patch on the linux.
>
> The patch should work on the linux but fail on the windows. Because windows
> returns code page not charset from the setlocale. I hv tried long time to
> get the charset name from the codepage, for example:
> CPINFOEX cpInfoEx;
> BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
> if (iReturn) {
>     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
> }
> But I only get the full name without any format.
>
> There is code page identifiers map in the msdn, detail here. I may hard code
> this map in the file. But the note on the msdn says:
>      "ANSI code pages can be different on different computers, or can be
> changed for a single computer, leading to data corruption. For the most
> consistent results, applications should use Unicode, such as UTF-8 or
> UTF-16, instead of a specific code page."
> I am afraid hard-code will fail on some machines. (By the way, this seems
> the UTF-8 is suggested to be the default again :-)
>
> There is also a class Encoding in the VC++, detail here. But we can not use
> it here.
>
> So anyone knows some thing about locale on the windows?
> Again, shall use UTF-8 as our default?
>
> On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com> wrote:
>>
>> That seems we should add it in the drlvm.
>>
>> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
>>>
>>> Nathan Beyer wrote:
>>>>
>>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>>>> DRLVM?
>>>
>>> Yes, I only tested on Linux, IBM VME set the property correctly.
>>>
>>>>
>>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>>>>>
>>>>> Kevin Zhou wrote:
>>>>>>
>>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding" property
>>>>>> adown
>>>>>> VM but fails to get the correct encoding.
>>>>>>
>>>>>> Regis, do you know any other specific ways that CL can gain the right
>>>>>> property?
>>>>>
>>>>> We can get from OS directly. Maybe just read env variables on Linux?
>>>>>
>>>>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>>>>>>
>>>>>>> Charles Lee wrote:
>>>>>>>
>>>>>>>> Hi Nanthan,
>>>>>>>>
>>>>>>>> If the file encoding derive from the OS, it should be the some bugs
>>>>>>>> in
>>>>>>>> it
>>>>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default
>>>>>>>> codec
>>>>>>>> is
>>>>>>>> still ISO8859-1. Do you know where can we found such codes?
>>>>>>>>
>>>>>>> Classlib expected vm do this and set the property, but it didn't, so
>>>>>>> we
>>>>>>> have to do this by ourselves.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>  Are we talking about windows or linux?the default file encoding
>>>>>>>> should
>>>>>>>>>
>>>>>>>>> derive from the OS. I believe that's defined by the specs.
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
>>>>>>>>> <fi...@gmail.com>
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>  Hi,
>>>>>>>>>>
>>>>>>>>>>>  Charles, I believe UTF-8 is the default encoding for RI, and it
>>>>>>>>>>> sounds
>>>>>>>>>>> reasonable.
>>>>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we need
>>>>>>>>>>> to
>>>>>>>>>>> run
>>>>>>>>>>> more tests to verify?
>>>>>>>>>>>
>>>>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>>>>>>>>>>
>>>>>>>>>>>  Hi guys:
>>>>>>>>>>>
>>>>>>>>>>>> I am doing some test cases on the ant junit test case and
>>>>>>>>>>>> meeting
>>>>>>>>>>>> some
>>>>>>>>>>>> encoding problems. I find they are maybe caused by the different
>>>>>>>>>>>> default
>>>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI
>>>>>>>>>>>> default is
>>>>>>>>>>>>
>>>>>>>>>>>>  UTF-8
>>>>>>>>>>>
>>>>>>>>>>>  but harmony is 8859-1. And then I have encountered
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
>>>>>>>>>>>> and the two diffs attached on that issue. It seems we always get
>>>>>>>>>>>> 8859-1.
>>>>>>>>>>>> Because: (correct me if wrong :-)
>>>>>>>>>>>>
>>>>>>>>>>>> 1. we remove the set code in the vm. we will always get null if
>>>>>>>>>>>> we
>>>>>>>>>>>> call
>>>>>>>>>>>>
>>>>>>>>>>>>  vm
>>>>>>>>>>>
>>>>>>>>>>>  method
>>>>>>>>>>>>
>>>>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null from
>>>>>>>>>>>> vm,
>>>>>>>>>>>> we
>>>>>>>>>>>>
>>>>>>>>>>>>  set
>>>>>>>>>>>
>>>>>>>>>>>  Sorry, it should be luniglob.c
>>>>>>>>>>>
>>>>>>>>>>  8859-1.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. we can not set file.encode on the run time.
>>>>>>>>>>>>
>>>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>>>>>>>>>>>> character.
>>>>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>>>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says
>>>>>>>>>>>> "In
>>>>>>>>>>>> computing
>>>>>>>>>>>> applications, encodings that provide full UCS support (such as
>>>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>>>>>>>>>>> increasing
>>>>>>>>>>>>
>>>>>>>>>>>>  favor
>>>>>>>>>>>
>>>>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>>>>>>>>>>> iso8859-1
>>>>>>>>>>>>
>>>>>>>>>>>> to
>>>>>>>>>>>> utf-8?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>> Charles Lee
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> Best Regards!
>>>>>>>>>>>
>>>>>>>>>>> Jimmy, Jing Lv
>>>>>>>>>>> China Software Development Lab, IBM
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Yours sincerely,
>>>>>>>>>> Charles Lee
>>>>>>>>>>
>>>>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Regis.
>>>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Regis.
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Regis.
>>
>>
>>
>> --
>> Yours sincerely,
>> Charles Lee
>>
>
>
>
> --
> Yours sincerely,
> Charles Lee
>
>

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

Hi guys,

I have added the locale function to DRLVM; the patch is attached. Please
try the new patch on Linux.
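
The attached patch is not reproduced in this archive. As a rough sketch only, assuming a POSIX system, a locale lookup of this kind could use `setlocale` and `nl_langinfo(CODESET)`. The function name `default_charset` is invented for this example and is not taken from the patch.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Copies the character set name of the process locale into buf.
   Returns 0 on success, -1 if the name is unavailable or does not fit.
   Hypothetical helper; not the code from the DRLVM patch. */
int default_charset(char *buf, size_t size)
{
    const char *codeset;

    /* Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG).
       Note that this changes process-wide state. */
    if (setlocale(LC_CTYPE, "") == NULL)
        return -1;

    /* CODESET names the encoding of the current LC_CTYPE locale. */
    codeset = nl_langinfo(CODESET);
    if (codeset == NULL || codeset[0] == '\0' || strlen(codeset) >= size)
        return -1;

    strcpy(buf, codeset);
    return 0;
}
```

With an en_US.UTF-8 locale this would be expected to yield "UTF-8". The exact spelling of the returned name depends on the C library, so a caller would still have to map it to a Java charset name.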

The patch should work on Linux but fail on Windows, because on Windows
setlocale returns a code page rather than a charset name. I have spent a
long time trying to get the charset name from the code page, for example:
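#include <stdio.h>
#include <windows.h>

/* Ask Windows to describe the active ANSI code page (CP_ACP). */
CPINFOEX cpInfoEx;
BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
if (iReturn) {
    /* CodePageName holds the full display name of the code page. */
    printf("FULL NAME %s\n", cpInfoEx.CodePageName);
}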
But this only gives me the full display name, not a charset name in a
usable format.

There is a map of code page identifiers on MSDN, details
here<http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx>.
I could hard-code this map in the file. But the note on MSDN says:
     "ANSI code pages can be different on different computers, or can be
changed for a single computer, leading to data corruption. For the most
consistent results, applications should use Unicode, such as UTF-8 or
UTF-16, instead of a specific code page."
I am afraid hard-coding will fail on some machines. (By the way, this seems
to suggest UTF-8 as the default again :-)
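
For illustration, a hard-coded map of the kind discussed above might look like the following sketch. The function name `charset_for_code_page`, the table name and the small selection of entries are assumptions made for this example; the list is deliberately incomplete and is not taken from any Harmony source.

```c
#include <stddef.h>
#include <stdio.h>

/* A few Windows code page identifiers paired with common charset names.
   Deliberately incomplete; a real table would need many more entries. */
struct cp_entry {
    unsigned int code_page;
    const char *charset;
};

static const struct cp_entry CP_TABLE[] = {
    {   932, "Shift_JIS"  },
    {   950, "Big5"       },
    { 20127, "US-ASCII"   },
    { 28591, "ISO-8859-1" },
    { 65001, "UTF-8"      },
};

/* Writes a charset name for code page cp into buf.
   Returns 0 on success, -1 if cp is unknown or buf is too small. */
int charset_for_code_page(unsigned int cp, char *buf, size_t size)
{
    size_t i;
    int n;

    /* Code pages 1250 to 1258 are the "windows-125x" family. */
    if (cp >= 1250 && cp <= 1258) {
        n = snprintf(buf, size, "windows-%u", cp);
        return (n > 0 && (size_t)n < size) ? 0 : -1;
    }

    for (i = 0; i < sizeof CP_TABLE / sizeof CP_TABLE[0]; i++) {
        if (CP_TABLE[i].code_page == cp) {
            n = snprintf(buf, size, "%s", CP_TABLE[i].charset);
            return (n > 0 && (size_t)n < size) ? 0 : -1;
        }
    }
    return -1;  /* unknown code page: the caller has to pick a fallback */
}
```

On Windows the input would presumably come from `GetACP()`, which returns the identifier of the active ANSI code page. Any code page missing from the table falls through to -1, which is exactly the weakness of hard-coding raised above.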

There is also an Encoding class in VC++, details
here<http://msdn.microsoft.com/en-us/library/system.text.encoding%28VS.80%29.aspx>.
But we cannot use it here.

So does anyone know something about locales on Windows?
Again, shall we use UTF-8 as our default?

On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <li...@gmail.com> wrote:

> That seems we should add it in the drlvm.
>
>
> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:
>
>> Nathan Beyer wrote:
>>
>>> Is the IBM VME dealing with this correctly? Do we just need to fix DRLVM?
>>>
>>
>> Yes, I only tested on Linux, IBM VME set the property correctly.
>>
>>
>>
>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>>>
>>>> Kevin Zhou wrote:
>>>>
>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding" property
>>>>> adown
>>>>> VM but fails to get the correct encoding.
>>>>>
>>>>> Regis, do you know any other specific ways that CL can gain the right
>>>>> property?
>>>>>
>>>> We can get from OS directly. Maybe just read env variables on Linux?
>>>>
>>>>  Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>>>>>
>>>>>  Charles Lee wrote:
>>>>>>
>>>>>>  Hi Nanthan,
>>>>>>>
>>>>>>> If the file encoding derive from the OS, it should be the some bugs
>>>>>>> in
>>>>>>> it
>>>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default
>>>>>>> codec
>>>>>>> is
>>>>>>> still ISO8859-1. Do you know where can we found such codes?
>>>>>>>
>>>>>>>  Classlib expected vm do this and set the property, but it didn't, so
>>>>>> we
>>>>>> have to do this by ourselves.
>>>>>>
>>>>>>
>>>>>>
>>>>>>  On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>  Are we talking about windows or linux?the default file encoding
>>>>>>> should
>>>>>>>
>>>>>>>> derive from the OS. I believe that's defined by the specs.
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <firepure@gmail.com
>>>>>>>> >
>>>>>>>>
>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>>  Hi,
>>>>>>>>>
>>>>>>>>>   Charles, I believe UTF-8 is the default encoding for RI, and it
>>>>>>>>>> sounds
>>>>>>>>>> reasonable.
>>>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we need
>>>>>>>>>> to
>>>>>>>>>> run
>>>>>>>>>> more tests to verify?
>>>>>>>>>>
>>>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>>>>>>>>>
>>>>>>>>>>  Hi guys:
>>>>>>>>>>
>>>>>>>>>>  I am doing some test cases on the ant junit test case and meeting
>>>>>>>>>>> some
>>>>>>>>>>> encoding problems. I find they are maybe caused by the different
>>>>>>>>>>> default
>>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI default
>>>>>>>>>>> is
>>>>>>>>>>>
>>>>>>>>>>>  UTF-8
>>>>>>>>>>>
>>>>>>>>>>  but harmony is 8859-1. And then I have encountered
>>>>>>>>>>
>>>>>>>>>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736
>>>>>>>>>>> >,
>>>>>>>>>>> and the two diffs attached on that issue. It seems we always get
>>>>>>>>>>> 8859-1.
>>>>>>>>>>> Because: (correct me if wrong :-)
>>>>>>>>>>>
>>>>>>>>>>> 1. we remove the set code in the vm. we will always get null if
>>>>>>>>>>> we
>>>>>>>>>>> call
>>>>>>>>>>>
>>>>>>>>>>>  vm
>>>>>>>>>>>
>>>>>>>>>>  method
>>>>>>>>>>
>>>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null from
>>>>>>>>>>> vm,
>>>>>>>>>>> we
>>>>>>>>>>>
>>>>>>>>>>>  set
>>>>>>>>>>>
>>>>>>>>>>  Sorry, it should be luniglob.c
>>>>>>>>>>
>>>>>>>>>>   8859-1.
>>>>>>>>>
>>>>>>>>>> 3. we can not set file.encode on the run time.
>>>>>>>>>>>
>>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>>>>>>>>>>> character.
>>>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says
>>>>>>>>>>> "In
>>>>>>>>>>> computing
>>>>>>>>>>> applications, encodings that provide full UCS support (such as
>>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>>>>>>>>>> increasing
>>>>>>>>>>>
>>>>>>>>>>>  favor
>>>>>>>>>>>
>>>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>>>>>>>>>> iso8859-1
>>>>>>>>>>
>>>>>>>>>>> to
>>>>>>>>>>> utf-8?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>> Charles Lee
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  --
>>>>>>>>>>
>>>>>>>>>> Best Regards!
>>>>>>>>>>
>>>>>>>>>> Jimmy, Jing Lv
>>>>>>>>>> China Software Development Lab, IBM
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  --
>>>>>>>>> Yours sincerely,
>>>>>>>>> Charles Lee
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  --
>>>>>> Best Regards,
>>>>>> Regis.
>>>>>>
>>>>>>
>>>> --
>>>> Best Regards,
>>>> Regis.
>>>>
>>>>
>>>
>>
>> --
>> Best Regards,
>> Regis.
>>
>
>
>
> --
> Yours sincerely,
> Charles Lee
>
>


-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

It seems we should add it in DRLVM, then.

On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu...@gmail.com> wrote:

> Nathan Beyer wrote:
>
>> Is the IBM VME dealing with this correctly? Do we just need to fix DRLVM?
>>
>
> Yes, I only tested on Linux, IBM VME set the property correctly.
>
>
>
>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>>
>>> Kevin Zhou wrote:
>>>
>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding" property
>>>> adown
>>>> VM but fails to get the correct encoding.
>>>>
>>>> Regis, do you know any other specific ways that CL can gain the right
>>>> property?
>>>>
>>> We can get from OS directly. Maybe just read env variables on Linux?
>>>
>>>  Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>>>>
>>>>  Charles Lee wrote:
>>>>>
>>>>>  Hi Nanthan,
>>>>>>
>>>>>> If the file encoding derive from the OS, it should be the some bugs in
>>>>>> it
>>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default
>>>>>> codec
>>>>>> is
>>>>>> still ISO8859-1. Do you know where can we found such codes?
>>>>>>
>>>>>>  Classlib expected vm do this and set the property, but it didn't, so
>>>>> we
>>>>> have to do this by ourselves.
>>>>>
>>>>>
>>>>>
>>>>>  On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>  Are we talking about windows or linux?the default file encoding
>>>>>> should
>>>>>>
>>>>>>> derive from the OS. I believe that's defined by the specs.
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>
>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <fi...@gmail.com>
>>>>>>>
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>>  Hi,
>>>>>>>>
>>>>>>>>   Charles, I believe UTF-8 is the default encoding for RI, and it
>>>>>>>>> sounds
>>>>>>>>> reasonable.
>>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we need to
>>>>>>>>> run
>>>>>>>>> more tests to verify?
>>>>>>>>>
>>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>>>>>>>>
>>>>>>>>>  Hi guys:
>>>>>>>>>
>>>>>>>>>  I am doing some test cases on the ant junit test case and meeting
>>>>>>>>>> some
>>>>>>>>>> encoding problems. I find they are maybe caused by the different
>>>>>>>>>> default
>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI default
>>>>>>>>>> is
>>>>>>>>>>
>>>>>>>>>>  UTF-8
>>>>>>>>>>
>>>>>>>>>  but harmony is 8859-1. And then I have encountered
>>>>>>>>>
>>>>>>>>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
>>>>>>>>>> and the two diffs attached on that issue. It seems we always get
>>>>>>>>>> 8859-1.
>>>>>>>>>> Because: (correct me if wrong :-)
>>>>>>>>>>
>>>>>>>>>> 1. we remove the set code in the vm. we will always get null if we
>>>>>>>>>> call
>>>>>>>>>>
>>>>>>>>>>  vm
>>>>>>>>>>
>>>>>>>>>  method
>>>>>>>>>
>>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null from
>>>>>>>>>> vm,
>>>>>>>>>> we
>>>>>>>>>>
>>>>>>>>>>  set
>>>>>>>>>>
>>>>>>>>>  Sorry, it should be luniglob.c
>>>>>>>>>
>>>>>>>>>   8859-1.
>>>>>>>>
>>>>>>>>> 3. we can not set file.encode on the run time.
>>>>>>>>>>
>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>>>>>>>>>> character.
>>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says "In
>>>>>>>>>> computing
>>>>>>>>>> applications, encodings that provide full UCS support (such as
>>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>>>>>>>>> increasing
>>>>>>>>>>
>>>>>>>>>>  favor
>>>>>>>>>>
>>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>>>>>>>>> iso8859-1
>>>>>>>>>
>>>>>>>>>> to
>>>>>>>>>> utf-8?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Yours sincerely,
>>>>>>>>>> Charles Lee
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  --
>>>>>>>>>
>>>>>>>>> Best Regards!
>>>>>>>>>
>>>>>>>>> Jimmy, Jing Lv
>>>>>>>>> China Software Development Lab, IBM
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  --
>>>>>>>> Yours sincerely,
>>>>>>>> Charles Lee
>>>>>>>>
>>>>>>>>
>>>>>>>>  --
>>>>> Best Regards,
>>>>> Regis.
>>>>>
>>>>>
>>> --
>>> Best Regards,
>>> Regis.
>>>
>>>
>>
>
> --
> Best Regards,
> Regis.
>



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Regis <xu...@gmail.com>.

Nathan Beyer wrote:
> Is the IBM VME dealing with this correctly? Do we just need to fix DRLVM?

Yes. I only tested on Linux, where the IBM VME sets the property correctly.

> 
> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
>> Kevin Zhou wrote:
>>> Yea, from luniglob.c, CL attempts to read the "file.encoding" property
>>> adown
>>> VM but fails to get the correct encoding.
>>>
>>> Regis, do you know any other specific ways that CL can gain the right
>>> property?
>> We can get from OS directly. Maybe just read env variables on Linux?
>>
>>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>>>
>>>> Charles Lee wrote:
>>>>
>>>>> Hi Nanthan,
>>>>>
>>>>> If the file encoding derive from the OS, it should be the some bugs in
>>>>> it
>>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default codec
>>>>> is
>>>>> still ISO8859-1. Do you know where can we found such codes?
>>>>>
>>>> Classlib expected vm do this and set the property, but it didn't, so we
>>>> have to do this by ourselves.
>>>>
>>>>
>>>>
>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com> wrote:
>>>>>
>>>>>  Are we talking about windows or linux?the default file encoding should
>>>>>> derive from the OS. I believe that's defined by the specs.
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>>
>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com> wrote:
>>>>>>
>>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <fi...@gmail.com>
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>  Hi,
>>>>>>>
>>>>>>>>  Charles, I believe UTF-8 is the default encoding for RI, and it
>>>>>>>> sounds
>>>>>>>> reasonable.
>>>>>>>>  BTW, it may encounter some compatibility problem, maybe we need to
>>>>>>>> run
>>>>>>>> more tests to verify?
>>>>>>>>
>>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>>>>>>>
>>>>>>>>  Hi guys:
>>>>>>>>
>>>>>>>>> I am doing some test cases on the ant junit test case and meeting
>>>>>>>>> some
>>>>>>>>> encoding problems. I find they are maybe caused by the different
>>>>>>>>> default
>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI default is
>>>>>>>>>
>>>>>>>>>  UTF-8
>>>>>>>>  but harmony is 8859-1. And then I have encountered
>>>>>>>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
>>>>>>>>> and the two diffs attached on that issue. It seems we always get
>>>>>>>>> 8859-1.
>>>>>>>>> Because: (correct me if wrong :-)
>>>>>>>>>
>>>>>>>>> 1. we remove the set code in the vm. we will always get null if we
>>>>>>>>> call
>>>>>>>>>
>>>>>>>>>  vm
>>>>>>>>  method
>>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null from vm,
>>>>>>>>> we
>>>>>>>>>
>>>>>>>>>  set
>>>>>>>>  Sorry, it should be luniglob.c
>>>>>>>>
>>>>>>>  8859-1.
>>>>>>>>> 3. we can not set file.encode on the run time.
>>>>>>>>>
>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>>>>>>>>> character.
>>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says "In
>>>>>>>>> computing
>>>>>>>>> applications, encodings that provide full UCS support (such as
>>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing
>>>>>>>>>
>>>>>>>>>  favor
>>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>>>>>>>> iso8859-1
>>>>>>>>> to
>>>>>>>>> utf-8?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Yours sincerely,
>>>>>>>>> Charles Lee
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Best Regards!
>>>>>>>>
>>>>>>>> Jimmy, Jing Lv
>>>>>>>> China Software Development Lab, IBM
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> Yours sincerely,
>>>>>>> Charles Lee
>>>>>>>
>>>>>>>
>>>> --
>>>> Best Regards,
>>>> Regis.
>>>>
>>
>> --
>> Best Regards,
>> Regis.
>>
> 


-- 
Best Regards,
Regis.

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nd...@apache.org>.

Is the IBM VME dealing with this correctly? Do we just need to fix DRLVM?

On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu...@gmail.com> wrote:
> Kevin Zhou wrote:
>>
>> Yea, from luniglob.c, CL attempts to read the "file.encoding" property
>> adown
>> VM but fails to get the correct encoding.
>>
>> Regis, do you know any other specific ways that CL can gain the right
>> property?
>
> We can get from OS directly. Maybe just read env variables on Linux?
>
>>
>> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
>>
>>> Charles Lee wrote:
>>>
>>>> Hi Nanthan,
>>>>
>>>> If the file encoding derive from the OS, it should be the some bugs in
>>>> it
>>>> because on my LINUX machine the locale is en_US.UTF-8. Our default codec
>>>> is
>>>> still ISO8859-1. Do you know where can we found such codes?
>>>>
>>> Classlib expected vm do this and set the property, but it didn't, so we
>>> have to do this by ourselves.
>>>
>>>
>>>
>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com> wrote:
>>>>
>>>>  Are we talking about windows or linux?the default file encoding should
>>>>>
>>>>> derive from the OS. I believe that's defined by the specs.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>
>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com> wrote:
>>>>>
>>>>>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <fi...@gmail.com>
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>  Hi,
>>>>>>
>>>>>>>  Charles, I believe UTF-8 is the default encoding for RI, and it
>>>>>>> sounds
>>>>>>> reasonable.
>>>>>>>  BTW, it may encounter some compatibility problem, maybe we need to
>>>>>>> run
>>>>>>> more tests to verify?
>>>>>>>
>>>>>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>>>>>>
>>>>>>>  Hi guys:
>>>>>>>
>>>>>>>> I am doing some test cases on the ant junit test case and meeting
>>>>>>>> some
>>>>>>>> encoding problems. I find they are maybe caused by the different
>>>>>>>> default
>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI default is
>>>>>>>>
>>>>>>>>  UTF-8
>>>>>>>
>>>>>>>  but harmony is 8859-1. And then I have encountered
>>>>>>>>
>>>>>>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
>>>>>>>> and the two diffs attached on that issue. It seems we always get
>>>>>>>> 8859-1.
>>>>>>>> Because: (correct me if wrong :-)
>>>>>>>>
>>>>>>>> 1. we remove the set code in the vm. we will always get null if we
>>>>>>>> call
>>>>>>>>
>>>>>>>>  vm
>>>>>>>
>>>>>>>  method
>>>>>>>>
>>>>>>>> 2. we set the file.encode in the libglob.c, if we got null from vm,
>>>>>>>> we
>>>>>>>>
>>>>>>>>  set
>>>>>>>
>>>>>>>  Sorry, it should be luniglob.c
>>>>>>>
>>>>>>  8859-1.
>>>>>>>>
>>>>>>>> 3. we can not set file.encode on the run time.
>>>>>>>>
>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii
>>>>>>>> character.
>>>>>>>> So why we use iso8859-1 as our unchangeable default?
>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says "In
>>>>>>>> computing
>>>>>>>> applications, encodings that provide full UCS support (such as
>>>>>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing
>>>>>>>>
>>>>>>>>  favor
>>>>>>>
>>>>>>>  over encodings based on ISO 8859-1." Should we simply change
>>>>>>> iso8859-1
>>>>>>>>
>>>>>>>> to
>>>>>>>> utf-8?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Yours sincerely,
>>>>>>>> Charles Lee
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Best Regards!
>>>>>>>
>>>>>>> Jimmy, Jing Lv
>>>>>>> China Software Development Lab, IBM
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> Yours sincerely,
>>>>>> Charles Lee
>>>>>>
>>>>>>
>>>>
>>> --
>>> Best Regards,
>>> Regis.
>>>
>>
>
>
> --
> Best Regards,
> Regis.
>

Re: Shall we change our file.encoding

Posted by Regis <xu...@gmail.com>.

Kevin Zhou wrote:
> Yea, from luniglob.c, CL attempts to read the "file.encoding" property adown
> VM but fails to get the correct encoding.
> 
> Regis, do you know any other specific ways that CL can gain the right
> property?

We can get it from the OS directly. Maybe just read the environment
variables on Linux?
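
As a sketch of the environment-variable idea, assuming POSIX-style locale names such as "en_US.UTF-8", the codeset could be cut out of the first non-empty of LC_ALL, LC_CTYPE and LANG. Both function names below are hypothetical and do not come from the classlib or DRLVM sources.

```c
#include <stdlib.h>
#include <string.h>

/* Extracts the codeset part of a locale name such as "en_US.UTF-8" or
   "de_DE.ISO-8859-1@euro". Returns 0 on success, -1 if the name carries
   no codeset or buf is too small. Hypothetical helper. */
int codeset_from_locale_name(const char *name, char *buf, size_t size)
{
    const char *start;
    const char *end;
    size_t len;

    if (name == NULL)
        return -1;
    start = strchr(name, '.');
    if (start == NULL)
        return -1;              /* e.g. "C" or "en_US": no codeset given */
    start++;                    /* skip the dot */

    end = strchr(start, '@');   /* an optional modifier follows '@' */
    len = (end != NULL) ? (size_t)(end - start) : strlen(start);
    if (len == 0 || len >= size)
        return -1;

    memcpy(buf, start, len);
    buf[len] = '\0';
    return 0;
}

/* Returns the first non-empty of LC_ALL, LC_CTYPE and LANG, following the
   POSIX precedence order, or NULL if none of them is set. */
const char *locale_name_from_env(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && value[0] != '\0')
            return value;
    }
    return NULL;
}
```

A caller would chain the two, passing the result of `locale_name_from_env()` to `codeset_from_locale_name()`, and fall back to a fixed default when either step fails. Locale names without an explicit codeset (for example plain "en_US") are not resolved by this approach.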

> 
> Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:
> 
>> Charles Lee wrote:
>>
>>> Hi Nanthan,
>>>
>>> If the file encoding derive from the OS, it should be the some bugs in it
>>> because on my LINUX machine the locale is en_US.UTF-8. Our default codec
>>> is
>>> still ISO8859-1. Do you know where can we found such codes?
>>>
>> Classlib expected vm do this and set the property, but it didn't, so we
>> have to do this by ourselves.
>>
>>
>>
>> --
>> Best Regards,
>> Regis.
>>
> 


-- 
Best Regards,
Regis.

Re: Shall we change our file.encoding

Posted by Kevin Zhou <zh...@gmail.com>.

Yes. Judging from luniglob.c, CL attempts to read the "file.encoding" property
from the VM but fails to get the correct encoding.

Regis, do you know of any other way for CL to obtain the right value for this
property?

On Wed, Jul 15, 2009 at 9:59 AM, Regis <xu...@gmail.com> wrote:

> Charles Lee wrote:
>
>> Hi Nanthan,
>>
>> If the file encoding derive from the OS, it should be the some bugs in it
>> because on my LINUX machine the locale is en_US.UTF-8. Our default codec
>> is
>> still ISO8859-1. Do you know where can we found such codes?
>>
>
> Classlib expected vm do this and set the property, but it didn't, so we
> have to do this by ourselves.
>
>
>
>
> --
> Best Regards,
> Regis.
>

Re: Shall we change our file.encoding

Posted by Regis <xu...@gmail.com>.

Charles Lee wrote:
> Hi Nanthan,
> 
> If the file encoding derive from the OS, it should be the some bugs in it
> because on my LINUX machine the locale is en_US.UTF-8. Our default codec is
> still ISO8859-1. Do you know where can we found such codes?

Classlib expects the VM to do this and set the property, but it doesn't, so we
have to do it ourselves.
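
A rough sketch of what "doing it ourselves" could look like, written in Java only for illustration. It is not the code in luniglob.c. The class name LocaleEncodingSketch and the method codesetFromLocale are hypothetical. The assumption is that the encoding can be taken from the codeset part of a POSIX-style locale string such as en_US.UTF-8, with ISO-8859-1 kept as the fallback when nothing usable is found.

```java
import java.nio.charset.Charset;

final class LocaleEncodingSketch {
    // Fallback when no usable codeset is found (the default discussed in this thread).
    static final String FALLBACK = "ISO-8859-1";

    // Extracts the codeset from a locale string such as "en_US.UTF-8" or
    // "de_DE.ISO-8859-1@euro"; returns FALLBACK if there is none or it is unsupported.
    static String codesetFromLocale(String locale) {
        if (locale == null) {
            return FALLBACK;
        }
        int dot = locale.indexOf('.');
        if (dot < 0 || dot == locale.length() - 1) {
            return FALLBACK; // e.g. "C", "POSIX", or "en_US."
        }
        String codeset = locale.substring(dot + 1);
        int at = codeset.indexOf('@'); // strip a modifier such as "@euro"
        if (at >= 0) {
            codeset = codeset.substring(0, at);
        }
        try {
            if (Charset.isSupported(codeset)) {
                return Charset.forName(codeset).name(); // canonical name
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name: fall through to the fallback.
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        // POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
        String locale = null;
        for (String var : new String[] {"LC_ALL", "LC_CTYPE", "LANG"}) {
            String value = System.getenv(var);
            if (value != null && !value.isEmpty()) {
                locale = value;
                break;
            }
        }
        System.out.println("locale=" + locale
                + " -> file.encoding=" + codesetFromLocale(locale));
    }
}
```

In native code the equivalent lookup would more likely go through the C library's locale functions than through string parsing. The parsing here just keeps the sketch self-contained.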

> 
> 
> 


-- 
Best Regards,
Regis.

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

Hi Nathan,

If the file encoding is derived from the OS, then there must be a bug somewhere,
because on my Linux machine the locale is en_US.UTF-8, yet our default codec is
still ISO8859-1. Do you know where we can find the code that does this?

On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nb...@gmail.com> wrote:

> Are we talking about windows or linux?the default file encoding should
> derive from the OS. I believe that's defined by the specs.
>
> Sent from my iPhone
>
>
> On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com> wrote:
>
>  On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <fi...@gmail.com>
>> wrote:
>>
>>  Hi,
>>>
>>>   Charles, I believe UTF-8 is the default encoding for RI, and it sounds
>>> reasonable.
>>>   BTW, it may encounter some compatibility problem, maybe we need to run
>>> more tests to verify?
>>>
>>> 2009/7/14 Charles Lee <li...@gmail.com>
>>>
>>>  Hi guys:
>>>>
>>>> I am doing some test cases on the ant junit test case and meeting some
>>>> encoding problems. I find they are maybe caused by the different default
>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI default is
>>>>
>>> UTF-8
>>>
>>>> but harmony is 8859-1. And then I have encountered
>>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
>>>> and the two diffs attached on that issue. It seems we always get 8859-1.
>>>> Because: (correct me if wrong :-)
>>>>
>>>> 1. we remove the set code in the vm. we will always get null if we call
>>>>
>>> vm
>>>
>>>> method
>>>> 2. we set the file.encode in the libglob.c, if we got null from vm, we
>>>>
>>> set
>>>
>>>  Sorry, it should be luniglob.c
>>
>>
>>>  8859-1.
>>>> 3. we can not set file.encode on the run time.
>>>>
>>>> ant use UTF-8 to encode filename which contains the non-ascii character.
>>>> So why we use iso8859-1 as our unchangeable default?
>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says "In
>>>> computing
>>>> applications, encodings that provide full UCS support (such as
>>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing
>>>>
>>> favor
>>>
>>>> over encodings based on ISO 8859-1." Should we simply change iso8859-1
>>>> to
>>>> utf-8?
>>>>
>>>> --
>>>> Yours sincerely,
>>>> Charles Lee
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Best Regards!
>>>
>>> Jimmy, Jing Lv
>>> China Software Development Lab, IBM
>>>
>>>
>>
>>
>> --
>> Yours sincerely,
>> Charles Lee
>>
>


-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Nathan Beyer <nb...@gmail.com>.

Are we talking about Windows or Linux? The default file encoding should
derive from the OS. I believe that's defined by the specs.
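
A quick way to check what a given runtime actually reports, as a minimal sketch. The class name DefaultEncodingProbe is made up for this example. The output depends entirely on the JVM and environment it runs in, so no expected values are shown.

```java
import java.nio.charset.Charset;

final class DefaultEncodingProbe {
    // Collects the three values being compared in this thread into one line.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + " defaultCharset=" + Charset.defaultCharset().name()
                + " LANG=" + System.getenv("LANG");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running the same class on the RI and on Harmony under the same locale should show whether the two disagree.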

Sent from my iPhone

On Jul 14, 2009, at 5:51 AM, Charles Lee <li...@gmail.com> wrote:

> On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <fi...@gmail.com>  
> wrote:
>
>> Hi,
>>
>>    Charles, I believe UTF-8 is the default encoding for RI, and it  
>> sounds
>> reasonable.
>>    BTW, it may encounter some compatibility problem, maybe we need  
>> to run
>> more tests to verify?
>>
>> 2009/7/14 Charles Lee <li...@gmail.com>
>>
>>> Hi guys:
>>>
>>> I am doing some test cases on the ant junit test case and meeting  
>>> some
>>> encoding problems. I find they are maybe caused by the different  
>>> default
>>> encoding from RI and harmony. My local is en_US.UTF-8, RI default is
>> UTF-8
>>> but harmony is 8859-1. And then I have encountered
>>> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
>>> and the two diffs attached on that issue. It seems we always get  
>>> 8859-1.
>>> Because: (correct me if wrong :-)
>>>
>>> 1. we remove the set code in the vm. we will always get null if we  
>>> call
>> vm
>>> method
>>> 2. we set the file.encode in the libglob.c, if we got null from  
>>> vm, we
>> set
>>
> Sorry, it should be luniglob.c
>
>>
>>> 8859-1.
>>> 3. we can not set file.encode on the run time.
>>>
>>> ant use UTF-8 to encode filename which contains the non-ascii  
>>> character.
>>> So why we use iso8859-1 as our unchangeable default?
>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says "In
>>> computing
>>> applications, encodings that provide full UCS support (such as
>>> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing
>> favor
>>> over encodings based on ISO 8859-1." Should we simply change  
>>> iso8859-1 to
>>> utf-8?
>>>
>>> --
>>> Yours sincerely,
>>> Charles Lee
>>>
>>
>>
>>
>> --
>>
>> Best Regards!
>>
>> Jimmy, Jing Lv
>> China Software Development Lab, IBM
>>
>
>
>
> -- 
> Yours sincerely,
> Charles Lee

Re: Shall we change our file.encoding

Posted by Charles Lee <li...@gmail.com>.

On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <fi...@gmail.com> wrote:

> Hi,
>
>     Charles, I believe UTF-8 is the default encoding for RI, and it sounds
> reasonable.
>     BTW, it may encounter some compatibility problem, maybe we need to run
> more tests to verify?
>
> 2009/7/14 Charles Lee <li...@gmail.com>
>
> > Hi guys:
> >
> > I am doing some test cases on the ant junit test case and meeting some
> > encoding problems. I find they are maybe caused by the different default
> > encoding from RI and harmony. My local is en_US.UTF-8, RI default is
> UTF-8
> > but harmony is 8859-1. And then I have encountered
> > HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
> > and the two diffs attached on that issue. It seems we always get 8859-1.
> > Because: (correct me if wrong :-)
> >
> > 1. we remove the set code in the vm. we will always get null if we call
> vm
> > method
> > 2. we set the file.encode in the libglob.c, if we got null from vm, we
> set
>
Sorry, it should be luniglob.c

>
> > 8859-1.
> > 3. we can not set file.encode on the run time.
> >
> > ant use UTF-8 to encode filename which contains the non-ascii character.
> > So why we use iso8859-1 as our unchangeable default?
> > From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says "In
> > computing
> > applications, encodings that provide full UCS support (such as
> > UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
> > UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing
> favor
> > over encodings based on ISO 8859-1." Should we simply change iso8859-1 to
> > utf-8?
> >
> > --
> > Yours sincerely,
> > Charles Lee
> >
>
>
>
> --
>
> Best Regards!
>
> Jimmy, Jing Lv
> China Software Development Lab, IBM
>



-- 
Yours sincerely,
Charles Lee

Re: Shall we change our file.encoding

Posted by Alexei Fedotov <al...@gmail.com>.

Jimmy, could you please fill in the GSoC evaluation? There should be a red link
on the left at [1].
Thanks!

[1] http://socghop.appspot.com/

> Hello gentlemen,
> We are still missing this evaluation. Please take a look into this
> ASAP, as the extended deadline for late evaluations is today at 23:00 UTC.

On Tue, Jul 14, 2009 at 2:12 PM, Jimmy,Jing Lv <fi...@gmail.com> wrote:

> Hi,
> [...]

Re: Shall we change our file.encoding

Posted by "Jimmy,Jing Lv" <fi...@gmail.com>.

Hi,

     Charles, I believe UTF-8 is the default encoding for the RI, so the change
sounds reasonable.
     BTW, it may cause some compatibility problems, so maybe we need to run
more tests to verify?
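
One kind of check that might be worth adding, as a small sketch of the mismatch itself rather than an actual Harmony or ant test. The class FilenameRoundTrip, the method reinterpret, and the file name are all invented for illustration. It encodes a non-ASCII file name in one charset and decodes the bytes in another, which is roughly what happens when a tool writes UTF-8 and the runtime default is ISO-8859-1.

```java
import java.nio.charset.Charset;

final class FilenameRoundTrip {
    // Encodes name with writeCharset, then decodes the same bytes with readCharset.
    static String reinterpret(String name, String writeCharset, String readCharset) {
        byte[] bytes = name.getBytes(Charset.forName(writeCharset));
        return new String(bytes, Charset.forName(readCharset));
    }

    public static void main(String[] args) {
        String name = "caf\u00e9.txt"; // "café.txt"

        // Same charset on both sides: the name survives.
        System.out.println(name.equals(reinterpret(name, "UTF-8", "UTF-8")));

        // Written as UTF-8 but read as ISO-8859-1: the two UTF-8 bytes of
        // U+00E9 (0xC3 0xA9) come back as two separate characters.
        String garbled = reinterpret(name, "UTF-8", "ISO-8859-1");
        System.out.println(garbled.equals("caf\u00c3\u00a9.txt"));
    }
}
```

Both lines print true. The second comparison is the failure mode to watch for if the default encoding changes for some users but not for the files they already have.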

2009/7/14 Charles Lee <li...@gmail.com>

> Hi guys:
>
> I am doing some test cases on the ant junit test case and meeting some
> encoding problems. I find they are maybe caused by the different default
> encoding from RI and harmony. My local is en_US.UTF-8, RI default is UTF-8
> but harmony is 8859-1. And then I have encountered
> HARMONY-3736<https://issues.apache.org/jira/browse/HARMONY-3736>,
> and the two diffs attached on that issue. It seems we always get 8859-1.
> Because: (correct me if wrong :-)
>
> 1. we remove the set code in the vm. we will always get null if we call vm
> method
> 2. we set the file.encode in the libglob.c, if we got null from vm, we set
> 8859-1.
> 3. we can not set file.encode on the run time.
>
> ant use UTF-8 to encode filename which contains the non-ascii character.
> So why we use iso8859-1 as our unchangeable default?
> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says "In
> computing
> applications, encodings that provide full UCS support (such as
> UTF-8<http://en.wikipedia.org/wiki/UTF-8>and
> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing favor
> over encodings based on ISO 8859-1." Should we simply change iso8859-1 to
> utf-8?
>
> --
> Yours sincerely,
> Charles Lee
>



-- 

Best Regards!

Jimmy, Jing Lv
China Software Development Lab, IBM