Posted to user@xmlbeans.apache.org by Michael White <wh...@gmail.com> on 2006/08/18 23:52:22 UTC

Cannot encode my XML document output into UTF-8

I can't properly encode my XML output file and would appreciate any help you
could offer!

For example, if I do the following:

<<
    ByteArrayOutputStream bos = new ByteArrayOutputStream();

    FileOutputStream fos = new FileOutputStream("C:/test.xml");
    PrintStream xmlStream = new PrintStream(fos, false, "UTF-8");

    XmlOptions printOptions = new XmlOptions();
    printOptions.setSavePrettyPrint();
    printOptions.setSavePrettyPrintIndent(2);
    printOptions.setUseDefaultNamespace();
    printOptions.setCharacterEncoding("UTF-8");

    paymentDoc.save(bos,printOptions);
    xmlStream.print(bos);   //xmlStream.print(bos.toString("UTF-8"));
    xmlStream.close();
>>

I receive a properly formatted file, with all of the data I require.
However, according to TextPad, the encoding is set to ANSI.  I've tried
numerous combinations of writers and encodings and can't seem to get the
output into UTF-8!  I'll be dealing with Japanese and Korean characters, so
it is a necessity.
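
One detail of the snippet above may matter once non-ASCII data is involved: PrintStream.print(Object) goes through String.valueOf(), so print(bos) ends up calling bos.toString(), which decodes the buffered bytes with the platform default charset before the PrintStream re-encodes them as UTF-8. The sketch below imitates that round trip with plain JDK classes. The class and method names are hypothetical, the "assumed default" charset is passed in explicitly so the result does not depend on the machine, and it assumes a JDK with StandardCharsets (Java 7 or later), which is newer than the 1.4.2 mentioned in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PrintRoundTrip {
    // Imitates xmlStream.print(bos): decode the buffer with an assumed
    // platform default charset, then re-encode the resulting text as UTF-8.
    static byte[] viaPrint(ByteArrayOutputStream bos, Charset assumedDefault) {
        String decoded = new String(bos.toByteArray(), assumedDefault);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Copies the buffered bytes to another stream without any decoding step.
    static byte[] viaWriteTo(ByteArrayOutputStream bos) {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        try {
            bos.writeTo(target);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return target.toByteArray();
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file's
        // own encoding does not matter.
        byte[] utf8 = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(utf8, 0, utf8.length);

        System.out.println(bos.size());                                          // 9
        System.out.println(viaPrint(bos, StandardCharsets.ISO_8859_1).length);   // 18
        System.out.println(viaWriteTo(bos).length);                              // 9
    }
}
```

With an ASCII-only document both paths produce the same bytes, which may be why the problem is not visible yet. Copying the bytes with bos.writeTo(fos), or handing the FileOutputStream straight to save(), avoids the decoding step altogether.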

The crazy part is that if I perform the following:

<<
ByteArrayOutputStream bos = new ByteArrayOutputStream();

FileOutputStream fos = new FileOutputStream("C:/test.xml");
PrintStream xmlStream = new PrintStream(fos, false, "UTF-8");

bos.write("A?u$(He933u3'u(BaÌ3̇".getBytes("UTF-8"));
xmlStream.print(bos);
xmlStream.close();
>>

The resulting file is listed as properly encoded in UTF-8 format!?

I'm at my wits' end.  I'm using the latest XmlBeans release as of today and
JDK 1.4.2_12.  I set the documentProperties encoding to UTF-8 as well and it
just doesn't want to play nice.

Help!

Thanks, Mike

RE: Cannot encode my XML document output into UTF-8

Posted by Radu Preotiuc-Pietro <ra...@bea.com>.
I couldn't find the time to look at this in detail, but here's a suggestion that may help:
 
I think TextPad (like Notepad) looks at the first bytes in the file and, if it sees something like FF FE, decides that the encoding is Unicode. But your file is XML, so it relies on the encoding="UTF-8" declaration to identify its encoding rather than on any marker bytes, and TextPad doesn't pick that declaration up. In other words, I think you're fine. Try putting some non-ASCII characters in your file, open it in TextPad, then set the encoding manually to UTF-8 and check whether the characters are the same.
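
As a rough illustration of that kind of first-bytes check, here is a minimal sketch of byte order mark detection. It is not TextPad's actual logic, which is not documented in this thread; the class and method names are hypothetical, and it assumes a JDK with varargs and StandardCharsets (Java 7 or later). It also shows that Java's own UTF-8 encoder writes no marker bytes, so an XML file produced this way carries only the encoding declaration.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Returns a label for the byte order mark at the start of data, or "none".
    static String detectBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return "none";
    }

    // Compares the leading bytes of data (as unsigned values) with prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>";
        // String.getBytes with UTF-8 does not prepend a byte order mark.
        System.out.println(detectBom(xml.getBytes(StandardCharsets.UTF_8))); // none
        System.out.println(detectBom(new byte[] {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00})); // UTF-16LE
    }
}
```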
 
The main idea in this story is that there is no "standard" mechanism to decide whether a set of bytes is text in UTF-8 encoding, text in ASCII encoding, or a JPEG image (that's why XML needed an "encoding" attribute, by the way). So as long as you have rules and mechanisms to ensure that the same encoding is used throughout your system, you are OK.
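
To make that point concrete, the following sketch decodes one and the same byte array under two different charsets, and then shows that ASCII-only text is byte-for-byte identical in US-ASCII and UTF-8, so a tool looking only at the bytes cannot tell which was meant. The class and method names are hypothetical, and it assumes a JDK with StandardCharsets (Java 7 or later).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SameBytesDifferentText {
    // Interprets the given bytes as text in the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file's
        // own encoding does not matter.
        String original = "\u65E5\u672C\u8A9E";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // Reader and writer agree on the encoding: the text survives.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).equals(original));      // true
        // Reader assumes a different encoding: same bytes, different text.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).equals(original)); // false

        // ASCII-only content produces identical bytes in both encodings.
        String asciiOnly = "<a>plain</a>";
        System.out.println(Arrays.equals(
                asciiOnly.getBytes(StandardCharsets.US_ASCII),
                asciiOnly.getBytes(StandardCharsets.UTF_8)));                             // true
    }
}
```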
 
Radu

