Posted to j-dev@xerces.apache.org by Milind Gadre <mi...@ecplatforms.com> on 2001/01/11 00:30:19 UTC

More questions on I18N

OK, now I have an XML document with Japanese chars in SHIFT_JIS format.
This document gets parsed without any problems.

Question: In what format is the 'getNodeValue' data returned? I am
assuming Unicode considering it is returning a Java String. Is this
correct?

If so, why am I seeing this phenomenon?

    Input char(s) from XML file: "成功"

    Chars output to file: "??" (that's it: two question marks)

I am using a BufferedWriter(FileWriter) to write to file, or System.out
to print to console. Same result both cases.

Regards...

Milind Gadre
ecPlatforms, Inc
901 Mariner's Island Blvd, Suite 565
San Mateo, CA 94404
C: 510-919-0596
F: 815-352-0779
milind@ecplatforms.com

Re: One More question on I18N

Posted by Milind Gadre <mi...@ecplatforms.com>.

Thanks to Andy Clark and Jean-Frederic Clere for helping out. I think I
have a better understanding of the encoding chain now.

Regards...

----- Original Message -----
From: "jean-frederic clere" <jf...@fujitsu.siemens.es>
To: <xe...@xml.apache.org>
Sent: Friday, January 12, 2001 3:20 AM
Subject: Re: One More question on I18N


> Andy Clark wrote:
> >
> > Milind Gadre wrote:
> > >         ==[get-bytes]==> neutral byte array
> >
> > Depends on which String#getBytes method you're calling.
> >
> >   * getBytes(int,int,byte[],int) is deprecated
> >   * getBytes() returns bytes using the default character encoder --
> >                who knows what this is? may not be what you think
> That is the encoding of the machine: ASCII on ASCII machines and EBCDIC
> on EBCDIC machines.
> The following code:
> ++++

Re: One More question on I18N

Posted by jean-frederic clere <jf...@fujitsu.siemens.es>.

Andy Clark wrote:
> 
> Milind Gadre wrote:
> >         ==[get-bytes]==> neutral byte array
> 
> Depends on which String#getBytes method you're calling.
> 
>   * getBytes(int,int,byte[],int) is deprecated
>   * getBytes() returns bytes using the default character encoder --
>                who knows what this is? may not be what you think
That is the encoding of the machine: ASCII on ASCII machines and EBCDIC
on EBCDIC machines.
The following code:
++++
import java.io.*;

class TestGetbytes {
    public static void main(String[] args) {
        String str = "ABCDE";
        byte[] byt;
        byt = str.getBytes();
        int j;
        for (j = 0; j < byt.length; j++)
            System.out.println("byte: " + j + ":" + byt[j]);
    }
}
+++++
Gives the following result on my EBCDIC (BS2000) machine:
+++++
$ java TestGetbytes
byte: 0:-63
byte: 1:-62
byte: 2:-61
byte: 3:-60
byte: 4:-59
+++++

>   * getBytes(String) allows you to specify the encoding
> 
> I don't think using getBytes is as efficient as the output
> stream writer mechanism because you create a lot of byte arrays
> that are just thrown away later.
> 
> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: One More question on I18N

Posted by Andy Clark <an...@apache.org>.

Milind Gadre wrote:
>         ==[get-bytes]==> neutral byte array

Depends on which String#getBytes method you're calling.

  * getBytes(int,int,byte[],int) is deprecated
  * getBytes() returns bytes using the default character encoder --
               who knows what this is? may not be what you think
  * getBytes(String) allows you to specify the encoding

I don't think using getBytes is as efficient as the output
stream writer mechanism because you create a lot of byte arrays
that are just thrown away later.
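
The difference between the getBytes variants and the writer mechanism can be sketched as follows. This is only an illustration, not code from the thread: the class and method names (GetBytesDemo, encodeWithGetBytes, encodeWithWriter) are made up, and it assumes a modern JDK with java.nio.charset.StandardCharsets, which did not exist when this was posted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    // One-shot encoding: every call allocates a new byte array.
    static byte[] encodeWithGetBytes(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Streaming encoding: the writer converts characters as they are written.
    // (The in-memory sink is used here only so the result can be inspected.)
    static byte[] encodeWithWriter(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory sink
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "ABCDE";
        // Platform dependent: this is the call whose result varies between machines.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes() first byte: " + text.getBytes()[0]);
        // Explicit encoding: the same result on every machine.
        System.out.println("US-ASCII first byte: "
                + encodeWithGetBytes(text, StandardCharsets.US_ASCII)[0]);
    }
}
```

Both helpers should produce the same bytes for the same encoding; the writer simply avoids building an intermediate array per string when the destination is a real stream.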

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

One More question on I18N

Posted by Milind Gadre <mi...@ecplatforms.com>.

Andy, thanks for the help. I had accidentally stumbled upon the solution
and now write it out in UTF-8.

But please bear with me as I am still a little confused and had another
question.

XML file (shift-jis)
    ==[Xerces]==> Java-String (Unicode)
        ==[get-bytes]==> neutral byte array
            ==[OutputStream]==> byte file (*should* be unicode ... yes??)

After all, I am just dumping a byte array into a file, so it should
preserve whatever the byte structure was. But when I look at the dumped
file, it does not contain the same bytes as the shift-jis xml document.

I am sure I am missing some very trivial concept.

Regards...

> Because you're not writing the file in Shift-JIS format.
> Remember that in memory it is Unicode (UTF-16, actually) and
> that it has to be re-encoded back to Shift-JIS when you write
> your file. Use the following:
>
>   OutputStream stream = new FileOutputStream("document.xml");
>   Writer writer = new OutputStreamWriter(stream, "Shift_JIS");
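
The chain in the diagram above can be reproduced in a few lines, which may help show why the dumped bytes differ from the input: the byte array is not neutral, it is whatever encoding the String was converted to. This is a sketch under assumptions: EncodingChainDemo and transcode are hypothetical names, a modern JDK is assumed, and the Shift_JIS part only runs if that charset is installed in the JDK in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingChainDemo {
    // Decode bytes into a Java String, then encode the String again.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String inMemory = new String(input, from); // bytes -> String (UTF-16 in memory)
        return inMemory.getBytes(to);              // String -> bytes in the target encoding
    }

    public static void main(String[] args) {
        String text = "\u6210\u529F"; // the two characters from the original question

        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        byte[] utf8 = transcode(utf16, StandardCharsets.UTF_16BE, StandardCharsets.UTF_8);
        System.out.println("UTF-16BE length: " + utf16.length + ", UTF-8 length: " + utf8.length);

        // Shift_JIS is not guaranteed to be present in every JDK build, so check first.
        if (Charset.isSupported("Shift_JIS")) {
            Charset sjis = Charset.forName("Shift_JIS");
            byte[] original = text.getBytes(sjis);
            byte[] backToSjis = transcode(original, sjis, sjis);
            byte[] asUtf8 = transcode(original, sjis, StandardCharsets.UTF_8);
            System.out.println("re-encoded as Shift_JIS, same bytes: "
                    + Arrays.equals(original, backToSjis));
            System.out.println("re-encoded as UTF-8, same bytes: "
                    + Arrays.equals(original, asUtf8));
        }
    }
}
```

The bytes only match the input document when the output encoding is the same as the input encoding; the String in between carries no memory of where it came from.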

Re: More questions on I18N

Posted by Andy Clark <an...@apache.org>.

Milind Gadre wrote:
> Question: In what format is the 'getNodeValue' data returned? I am
> assuming Unicode considering it is returning a Java String. Is this
> correct?

Yes.

> If so, why am I seeing this phenomenon?
> 
>     Input char(s) from XML file: "成功"
> 
>     Chars output to file: "??" (that's it: two question marks)
> 
> I am using a BufferedWriter(FileWriter) to write to file, or System.out
> to print to console. Same result both cases.

Because you're not writing the file in Shift-JIS format.
Remember that in memory it is Unicode (UTF-16, actually) and
that it has to be re-encoded back to Shift-JIS when you write
your file. Use the following:

  OutputStream stream = new FileOutputStream("document.xml");
  Writer writer = new OutputStreamWriter(stream, "Shift_JIS");
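
To see where the two question marks most likely come from, the sketch below encodes the same characters through a BufferedWriter over an OutputStreamWriter, once with an encoding that cannot represent them and once with one that can. It assumes the platform default encoding in the original report was something like US-ASCII or Latin-1 (FileWriter and System.out both use the default). ShiftJisWriteDemo and encode are illustrative names, and the code targets a modern JDK rather than the one in use at the time.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ShiftJisWriteDemo {
    // Write text through a writer that uses an explicitly chosen encoding.
    // An in-memory sink stands in for the FileOutputStream so the bytes can be inspected.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new BufferedWriter(new OutputStreamWriter(sink, charset))) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory sink
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u6210\u529F"; // the two characters from the question

        // US-ASCII cannot represent these characters, so each becomes '?'.
        byte[] ascii = encode(text, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints "??"

        // An encoding that covers the characters keeps them intact.
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println("round trip ok: "
                + text.equals(new String(utf8, StandardCharsets.UTF_8)));
    }
}
```

Swapping the sink for new FileOutputStream("document.xml") and the charset for Charset.forName("Shift_JIS") gives the file-writing form suggested above, provided the JDK in use ships that charset.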

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org