Posted to j-dev@xerces.apache.org by Milind Gadre <mi...@ecplatforms.com> on 2001/01/11 00:30:19 UTC
More questions on I18N
OK, now I have an XML document with Japanese chars in SHIFT_JIS format.
This document gets parsed without any problems.
Question: In what format is the 'getNodeValue' data returned? I am
assuming Unicode considering it is returning a Java String. Is this
correct?
If so, why am I seeing this phenomenon?
Input char(s) from XML file: "成功"
Chars output to file: "??" (that's it: two question marks)
I am using a BufferedWriter(FileWriter) to write to file, or System.out
to print to console. Same result both cases.
Regards...
Milind Gadre
ecPlatforms, Inc
901 Mariner's Island Blvd, Suite 565
San Mateo, CA 94404
C: 510-919-0596
F: 815-352-0779
milind@ecplatforms.com
Re: One More question on I18N
Posted by Milind Gadre <mi...@ecplatforms.com>.
Thanks to Andy Clark and Jean-Frederic Clere for helping out. I think I
have a better understanding of the encoding chain now.
Regards...
----- Original Message -----
From: "jean-frederic clere" <jf...@fujitsu.siemens.es>
To: <xe...@xml.apache.org>
Sent: Friday, January 12, 2001 3:20 AM
Subject: Re: One More question on I18N
> Andy Clark wrote:
> >
> > Milind Gadre wrote:
> > > ==[get-bytes]==> neutral byte array
> >
> > Depends on which String#getBytes method you're calling.
> >
> > * getBytes(int,int,byte[],int) is deprecated
> > * getBytes() returns bytes using the default character encoder --
> > who knows what this is? may not be what you think
> That is the encoding of the machine: ASCII on ASCII machines and EBCDIC
> on EBCDIC machines.
> The following code:
> ++++
Re: One More question on I18N
Posted by jean-frederic clere <jf...@fujitsu.siemens.es>.
Andy Clark wrote:
>
> Milind Gadre wrote:
> > ==[get-bytes]==> neutral byte array
>
> Depends on which String#getBytes method you're calling.
>
> * getBytes(int,int,byte[],int) is deprecated
> * getBytes() returns bytes using the default character encoder --
> who knows what this is? may not be what you think
That is the encoding of the machine: ASCII on ASCII machines and EBCDIC
on EBCDIC machines.
The following code:
++++
import java.io.*;

class TestGetbytes
{
    public static void main(String[] args)
    {
        String str = "ABCDE";
        byte[] byt;
        byt = str.getBytes();
        int j;
        for (j=0;j<byt.length;j++)
            System.out.println("byte: " + j + ":" + byt[j]);
    }
}
+++++
Gives the following result on my EBCDIC (BS2000) machine:
+++++
$ java TestGetbytes
byte: 0:-63
byte: 1:-62
byte: 2:-61
byte: 3:-60
byte: 4:-59
+++++
> * getBytes(String) allows you to specify the encoding
>
> I don't think using getBytes is as efficient as the output
> stream writer mechanism because you create a lot of byte arrays
> that are just thrown away later.
>
> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org
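
For contrast with the default-encoding run above, here is a minimal sketch of naming the encoding explicitly. The class name ExplicitCharsetDemo and the helper asciiBytes are made up for illustration, and the sketch uses the getBytes(Charset) overload from later Java releases as a stand-in for the getBytes(String) call discussed in this thread. With an explicit US-ASCII charset the bytes for "ABCDE" should be 65 through 69 on any machine, whereas the plain getBytes() line depends on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original thread.
class ExplicitCharsetDemo {
    // Encodes with a named charset instead of the platform default.
    static byte[] asciiBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String str = "ABCDE";
        // The default charset is whatever the JVM picked up from the platform.
        System.out.println("default charset: " + Charset.defaultCharset());
        // Platform dependent: ASCII-compatible values on most machines,
        // EBCDIC values on an EBCDIC machine.
        System.out.println("getBytes():        " + Arrays.toString(str.getBytes()));
        // Platform independent: always the US-ASCII code points.
        System.out.println("getBytes(US_ASCII): " + Arrays.toString(asciiBytes(str)));
    }
}
```

The last line is the one to rely on when the byte values matter; the middle line is shown only to make the platform dependence visible.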
Re: One More question on I18N
Posted by Andy Clark <an...@apache.org>.
Milind Gadre wrote:
> ==[get-bytes]==> neutral byte array
Depends on which String#getBytes method you're calling.
* getBytes(int,int,byte[],int) is deprecated
* getBytes() returns bytes using the default character encoder --
who knows what this is? may not be what you think
* getBytes(String) allows you to specify the encoding
I don't think using getBytes is as efficient as the output
stream writer mechanism because you create a lot of byte arrays
that are just thrown away later.
--
Andy Clark * IBM, TRL - Japan * andyc@apache.org
One More question on I18N
Posted by Milind Gadre <mi...@ecplatforms.com>.
Andy, thanks for the help. I had accidentally stumbled upon the solution
and now write it out in UTF-8.
But please bear with me as I am still a little confused and had another
question.
XML file (shift-jis)
==[Xerces]==> Java-String (Unicode)
==[get-bytes]==> neutral byte array
==[OutputStream]==> byte file (*should* be unicode ... yes??)
After all, I am just dumping a byte array into a file, it should
maintain whatever was the byte structure. But when I look at the dumped
file, it does not maintain the bytes from the shift-jis xml document.
I am sure I am missing some very trivial concept.
Regards...
> Because you're not writing the file in Shift-JIS format.
> Remember that in memory it is Unicode (UTF-16, actually) and
> that it has to be re-encoded back to Shift-JIS when you write
> your file. Use the following:
>
> OutputStream stream = new FileOutputStream("document.xml");
> Writer writer = new OutputStreamWriter(stream, "Shift_JIS");
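
The chain above can be sketched in a few lines to show why the byte array is not "neutral": the bytes depend on which encoding is applied when leaving the String. This is only an illustration under assumptions. The class EncodingChainDemo and the helper transcode are hypothetical names, the Xerces parsing step is replaced by a plain new String(bytes, charset) decode, and the sketch assumes the JVM ships the Shift_JIS charset (Charset.forName throws if it does not). The two characters from the thread are written as Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original thread.
class EncodingChainDemo {
    // Assumes the running JVM provides the Shift_JIS charset.
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Decodes bytes into a String (UTF-16 in memory), then encodes the
    // String again with the requested target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        String original = "\u6210\u529F"; // the two characters from the thread
        byte[] sjis = original.getBytes(SHIFT_JIS);

        // Same encoding on the way out: the original bytes come back.
        byte[] roundTrip = transcode(sjis, SHIFT_JIS, SHIFT_JIS);
        // Different encoding on the way out: same text, different bytes.
        byte[] utf8 = transcode(sjis, SHIFT_JIS, StandardCharsets.UTF_8);

        System.out.println("Shift_JIS in:  " + Arrays.toString(sjis));
        System.out.println("Shift_JIS out: " + Arrays.toString(roundTrip));
        System.out.println("UTF-8 out:     " + Arrays.toString(utf8));
    }
}
```

The Shift_JIS form takes two bytes per character here and the UTF-8 form three, so a file written as UTF-8 cannot match the bytes of the Shift_JIS source even though the text is identical.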
Re: More questions on I18N
Posted by Andy Clark <an...@apache.org>.
Milind Gadre wrote:
> Question: In what format is the 'getNodeValue' data returned? I am
> assuming Unicode considering it is returning a Java String. Is this
> correct?
Yes.
> If so, why am I seeing this phenomenon?
>
> Input char(s) from XML file: "成功"
>
> Chars output to file: "??" (that's it: two question marks)
>
> I am using a BufferedWriter(FileWriter) to write to file, or System.out
> to print to console. Same result both cases.
Because you're not writing the file in Shift-JIS format.
Remember that in memory it is Unicode (UTF-16, actually) and
that it has to be re-encoded back to Shift-JIS when you write
your file. Use the following:
OutputStream stream = new FileOutputStream("document.xml");
Writer writer = new OutputStreamWriter(stream, "Shift_JIS");
--
Andy Clark * IBM, TRL - Japan * andyc@apache.org
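
A minimal sketch of the effect described above, under stated assumptions: the class WriterEncodingDemo and the helper encodeViaWriter are hypothetical names, a ByteArrayOutputStream stands in for the FileOutputStream so the result can be inspected, and US-ASCII stands in for a default encoding that cannot represent Japanese text. The sketch also assumes the JVM provides the Shift_JIS charset. When the writer's encoding has no mapping for a character, the encoder substitutes "?", which is consistent with the two question marks in the original report.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class WriterEncodingDemo {
    // Writes text through an OutputStreamWriter using the given charset
    // and returns the bytes that reached the underlying stream.
    static byte[] encodeViaWriter(String text, Charset charset) {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(stream, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) at this point.
        return stream.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u6210\u529F"; // the two characters from the thread

        // An encoding with no mapping for these characters: each becomes '?'.
        byte[] ascii = encodeViaWriter(text, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints "??"

        // The encoding named in the reply above: the characters survive.
        byte[] sjis = encodeViaWriter(text, Charset.forName("Shift_JIS"));
        System.out.println(sjis.length + " bytes in Shift_JIS");
    }
}
```

Swapping the ByteArrayOutputStream for a FileOutputStream gives the two-line form quoted in the reply; the explicit charset argument is the part that matters.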