Posted to j-dev@xerces.apache.org by Michael Mealling <mi...@bailey.dscga.com> on 2001/02/02 17:12:22 UTC

getNodeValue, createTextNode and UTF-8

I can't find this documentation anywhere (unless I'm blind _and_ stupid):

I parse some XML that has some interesting Unicode stuff in a particular
text node and I need to get it out and back into a new XML document.
This has created some questions:

Does Xerces-J convert the UTF-8 in the XML file into Java's internal
character encoding? I.e. if I do this:

String foo = node.getNodeValue();

Will foo contain the Unicode as UCS-2 or as UTF-8? If it contains UCS-2
I assume I can convert it to UTF-8 using String.getBytes("UTF-8"). Once
I do this how do I put that UTF-8 back into a new XML Document and make
sure the UTF-8 is still preserved all the way through? (note: I do
some processing in the middle on the string so I can't just 'copy' the
node over.)

-MM

P.S. I'm desperate enough at this point to be willing to pay whoever
has the answer....

-- 
--------------------------------------------------------------------------------
Michael Mealling	|      Vote Libertarian!       | www.rwhois.net/michael
Sr. Research Engineer   |   www.ga.lp.org/gwinnett     | ICQ#:         14198821
Network Solutions	|          www.lp.org          |  michaelm@netsol.com

Re: getNodeValue, createTextNode and UTF-8

Posted by Michael Mealling <mi...@bailey.dscga.com>.
On Fri, Feb 02, 2001 at 11:02:18AM -0800, Arnaud Le Hors wrote:
> Lynn Monson wrote:
> > > String foo = rset.getString(1);
> > > element.appendChild(document.createTextNode(foo));
> > 
> > Yes, that's correct.
> > 
> > > ...
> > > the problem is: I still get garbage on the client end of the servlet.
> > 
> > I suspect that the serializer is doing the right thing
> 
> Hmm... I wouldn't be so sure. But, Michael, if you serialize the
> original document back is it in good shape? The JDK is full of bugs when
> it comes to I18N support. The error could simply be in String itself or
> the Readers/Writers...

It does appear to actually be doing the correct thing. I serialized
it out to a file as well as to the ServletOutputStream and the file
looks right while the ServletOutputStream gets corrupted. I'm 
currently tracking down bug reports and version upgrades for JServ
and mod_jserv (gawd, I hate mod_jserv...). 

I'm reserving the beer shipments until I really get it working but
I figure I'll split it between the three of you. ;-)

-MM


Re: getNodeValue, createTextNode and UTF-8

Posted by Arnaud Le Hors <le...@us.ibm.com>.
Lynn Monson wrote:
> 
> > String foo = rset.getString(1);
> > element.appendChild(document.createTextNode(foo));
> 
> Yes, that's correct.
> 
> > ...
> > the problem is: I still get garbage on the client end of the servlet.
> 
> I suspect that the serializer is doing the right thing

Hmm... I wouldn't be so sure. But, Michael, if you serialize the
original document back is it in good shape? The JDK is full of bugs when
it comes to I18N support. The error could simply be in String itself or
the Readers/Writers...
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group

RE: getNodeValue, createTextNode and UTF-8

Posted by Lynn Monson <lm...@flipdog.com>.
> String foo = rset.getString(1);
> element.appendChild(document.createTextNode(foo));

Yes, that's correct.

> ...
> the problem is: I still get garbage on the client end of the servlet.

I suspect that the serializer is doing the right thing and that the
problem lies in the encoding used by the HTTP stream.  It may be, for
example, that the serializer is producing the correct UTF-8 sequence,
but those bytes are being re-interpreted according to the encoding of
the HTTP stream.
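
A minimal sketch of that failure mode, assuming a modern JDK (java.nio.charset.StandardCharsets did not exist in 2001). The class and method names are made up for illustration, and ISO-8859-1 is only an assumed example of a wrong charset on the receiving side:

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with the wrong
    // charset, as a client would if the stream were labelled ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Decode the same bytes with the charset that produced them.
    static String decodeCorrectly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        // The mis-decoded string has one char per UTF-8 byte.
        System.out.println(misdecode(original).length());
        System.out.println(decodeCorrectly(original).equals(original));
    }
}
```

The bytes are identical in both cases; only the charset the reader assumes differs, which is why the same output can look right in a file and wrong in a browser.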



Re: getNodeValue, createTextNode and UTF-8

Posted by Michael Mealling <mi...@bailey.dscga.com>.
On Fri, Feb 02, 2001 at 10:38:10AM -0700, Lynn Monson wrote:
> I might be misunderstanding the thread of the question, but in case it's
> helpful, here's my take on the situation:

There's so little good documentation out there that any knowledgeable
perspective is valuable....

> > But Unicode in that particular JVM's encoding format. In the case
> > of Sun's JDK it's UCS-2. Hence you can't just use a String someplace
> > where you need UTF-8.
> > ...
> > Correct. And this works. Except that I'm not writing it out to a file.
> > I'm taking that UTF-8 bit I pulled out and I'm putting it into another
> > TextNode in a different XML Document.
> 
> This might be where some confusion lies.  A Java Character object is always
> a 2-byte unicode quantity.  Java strings are sequences of the same.  While
> running in the JVM, a String never has any other encoding.  It's possible to
> serialize a string into any number of other encodings, but that's primarily
> a function of the serialization, not the string.  Converting the string into
> a UTF-8 byte sequence, for example, is just such a serialization.  It's a
> transformation from Java's unicode string into a UTF-8 byte sequence.

Yep. So node.getNodeValue() in Xerces will honor the XML encoding
spec so what ends up in the String is a valid Unicode string in Java's
internal format....
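
One way to check this, sketched with the JAXP parser bundled in current JDKs rather than the Xerces-J release discussed here (an assumption). The class name, the helper, and the <t> element are hypothetical. The same text is encoded two different ways on input, and getNodeValue() returns equal Strings:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ParsedTextDemo {
    // Build a tiny document declared in the given encoding, encode it to
    // bytes in that encoding, parse the bytes, and return the text of <t>.
    // The text must not contain XML markup characters such as '<' or '&'.
    static String parseText(String text, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                + "<t>" + text + "</t>";
        byte[] bytes = xml.getBytes(charset);
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getFirstChild().getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String fromUtf8 = parseText("na\u00efve", StandardCharsets.UTF_8);
        String fromLatin1 = parseText("na\u00efve", StandardCharsets.ISO_8859_1);
        // Different bytes on input, equal Strings after parsing.
        System.out.println(fromUtf8.equals(fromLatin1));
    }
}
```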

> When operating within DOM, you very rarely need to convert anything. In
> general (though not always), the encoding of the XML characters only matters
> while parsing the file or serializing it out.  When the document is
> represented as a DOM tree, you can simply pass the Java strings (in unicode)
> as API arguments between dom APIs and between DOM trees.  This is true even
> if two dom trees had different encodings when they were parsed and even if
> you are going to serialize them out with different encodings.

So let's get to some specifics. Let's say I have some Unicode strings
in an Oracle database in languages ranging from Chinese to Sanskrit.
I do a select and the JDBC implementation (supposedly) handles getting
them into Java's internal format. In that case I _should_ be able to do this:

String foo = rset.getString(1);
element.appendChild(document.createTextNode(foo));

Then, when I'm done filling out the rest of the document I do this:
        OutputFormat format = new OutputFormat(document);
        // I supply the XML declaration by hand in the OutputStream
        format.setOmitXMLDeclaration(true);
        // it makes debugging easier
        format.setIndenting(true);
        // the default serialization encoding is UTF-8 so we don't set it
        // out is a ServletOutputStream
        XMLSerializer serializer = new XMLSerializer(out, format);
        serializer.serialize(document.getDocumentElement());

the problem is: I still get garbage on the client end of the servlet.
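
To separate the serializer from the transport, one option is to serialize into memory and inspect the bytes before any servlet container touches them. The sketch below swaps in the JDK's javax.xml.transform identity Transformer for the Xerces XMLSerializer/OutputFormat pair used above, so it only illustrates the check, not the same code path. The class name and the <root> element are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class SerializeCheck {
    // Build <root>text</root> and serialize it as UTF-8 into a byte array.
    static byte[] serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            root.appendChild(doc.createTextNode(text));
            doc.appendChild(root);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize("\u4e2d\u6587");
        // Decoding with the charset we asked for should give the text back.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.contains("\u4e2d\u6587"));
    }
}
```

If the in-memory bytes decode correctly as UTF-8, the corruption is happening after serialization, for instance in how the response stream is labelled or re-encoded.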

-MM


Re: getNodeValue, createTextNode and UTF-8

Posted by Michael Mealling <mi...@bailey.dscga.com>.
On Fri, Feb 02, 2001 at 09:29:29AM -0800, Milind Gadre wrote:
> > But Unicode in that particular JVM's encoding format. In the case
> > of Sun's JDK it's UCS-2. Hence you can't just use a String someplace
> > where you need UTF-8.
> 
> Why not use the java.lang.String(byte[] data, String enc) constructor to
> create a new string in the desired encoding and then hand off to the
> other XML doc?
> 
> (a) byte[] oldBytes = oldString.getBytes("UTF8")
> (b) newString = new String(oldBytes, "UTF8")
> (c) otherXMLDocument.addNodeInUTF8(newString)

I tried something similar. Actually that shouldn't do anything.
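
A small sketch of why steps (a) and (b) cancel out, assuming a modern JDK and a well-formed string (one with no unpaired surrogates). The class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // Encode to UTF-8 bytes and decode straight back. For well-formed text
    // the result equals the input, so the String handed to the DOM is
    // no different from the one we started with.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sanskrit = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924";
        System.out.println(roundTrip(sanskrit).equals(sanskrit));
    }
}
```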

Re: getNodeValue, createTextNode and UTF-8

Posted by Milind Gadre <mi...@ecplatforms.com>.
> But Unicode in that particular JVM's encoding format. In the case
> of Sun's JDK it's UCS-2. Hence you can't just use a String someplace
> where you need UTF-8.

Why not use the java.lang.String(byte[] data, String enc) constructor to
create a new string in the desired encoding and then hand off to the
other XML doc?

(a) byte[] oldBytes = oldString.getBytes("UTF8")
(b) newString = new String(oldBytes, "UTF8")
(c) otherXMLDocument.addNodeInUTF8(newString)


Regards...

Milind Gadre
ecPlatforms, Inc
901 Mariner's Island Blvd, Suite 565
San Mateo, CA 94404
C: 510-919-0596
F: 815-352-0779
milind@ecplatforms.com



Re: getNodeValue, createTextNode and UTF-8

Posted by Michael Mealling <mi...@bailey.dscga.com>.
On Fri, Feb 02, 2001 at 09:44:20AM -0800, Arnaud Le Hors wrote:
> Michael Mealling wrote:
> > 
> > Correct. And this works. Except that I'm not writing it out to a file.
> > I'm taking that UTF-8 bit I pulled out and I'm putting it into another
> > TextNode in a different XML Document.
> 
> I'm not an I18N expert but now I'm really puzzled. If you're not writing
> to a file why do you have to bother with encoding at all? 

Because when I serialize the second document those particular nodes
aren't in UTF-8.

> I mean, the DOM exposes text in UTF-16, which is what you get with a Java 
> String.  Taking one String from a document and putting it back in another
> document should just work. What is it that doesn't work?

The serialization of the second document to a ServletOutputStream...

-MM


Re: getNodeValue, createTextNode and UTF-8

Posted by Arnaud Le Hors <le...@us.ibm.com>.
Michael Mealling wrote:
> 
> Correct. And this works. Except that I'm not writing it out to a file.
> I'm taking that UTF-8 bit I pulled out and I'm putting it into another
> TextNode in a different XML Document.

I'm not an I18N expert but now I'm really puzzled. If you're not writing
to a file why do you have to bother with encoding at all? I mean, the
DOM exposes text in UTF-16, which is what you get with a Java String.
Taking one String from a document and putting it back in another
document should just work. What is it that doesn't work?
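
For the UTF-16 point, a short sketch that counts UTF-16 code units against Unicode code points. It assumes a JDK with String.codePointCount (Java 5 or later, so not the JDKs of this thread); the class and method names are hypothetical:

```java
final class CodeUnitDemo {
    // Number of 16-bit chars (UTF-16 code units) in the String.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the BMP and is stored
        // as a surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println(codeUnits(clef) + " code units, "
                + codePoints(clef) + " code point");
    }
}
```

Characters in the Basic Multilingual Plane take one char each, which is why UCS-2 and UTF-16 look the same for most text.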
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group

Re: getNodeValue, createTextNode and UTF-8

Posted by Milind Gadre <mi...@ecplatforms.com>.
I agree with Lynn. Michael, you may be trying to over-engineer the
situation. Within Java, keep everything as String, and use String to
exchange node data between XML documents. When writing the document out,
use a specific encoding.

Maybe you want different encodings for different nodes. In that case,
you need to keep track of which node has which encoding, and use that to
determine how to write the nodes out.

Regards...

Milind Gadre
ecPlatforms, Inc
901 Mariner's Island Blvd, Suite 565
San Mateo, CA 94404
C: 510-919-0596
F: 815-352-0779
milind@ecplatforms.com


RE: getNodeValue, createTextNode and UTF-8

Posted by Lynn Monson <lm...@flipdog.com>.
I might be misunderstanding the thread of the question, but in case it's
helpful, here's my take on the situation:

> But Unicode in that particular JVM's encoding format. In the case
> of Sun's JDK it's UCS-2. Hence you can't just use a String someplace
> where you need UTF-8.
> ...
> Correct. And this works. Except that I'm not writing it out to a file.
> I'm taking that UTF-8 bit I pulled out and I'm putting it into another
> TextNode in a different XML Document.

This might be where some confusion lies.  A Java Character object is always
a 2-byte unicode quantity.  Java strings are sequences of the same.  While
running in the JVM, a String never has any other encoding.  It's possible to
serialize a string into any number of other encodings, but that's primarily
a function of the serialization, not the string.  Converting the string into
a UTF-8 byte sequence, for example, is just such a serialization.  It's a
transformation from Java's unicode string into a UTF-8 byte sequence.
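
To illustrate that the encoding belongs to the serialization step and not to the String, here is a sketch that encodes one String three ways and compares the byte counts. It assumes a modern JDK; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // UTF_16BE is used so that no byte-order mark is added to the count.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // the same four chars in every case
        System.out.println(utf8Length(s) + " / " + latin1Length(s)
                + " / " + utf16Length(s) + " bytes");
    }
}
```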

When operating within DOM, you very rarely need to convert anything. In
general (though not always), the encoding of the XML characters only matters
while parsing the file or serializing it out.  When the document is
represented as a DOM tree, you can simply pass the Java strings (in unicode)
as API arguments between dom APIs and between DOM trees.  This is true even
if two dom trees had different encodings when they were parsed and even if
you are going to serialize them out with different encodings.
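
A sketch of passing text between two DOM trees with some processing in between and no byte conversion at any point. It uses the JAXP parser bundled with current JDKs (an assumption), and trim() stands in for whatever processing is needed. The class name and the <copy> element are hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class CopyTextDemo {
    // Read the text of the source root element, process it as a plain
    // String, and place it in a text node of a second document.
    static String copyProcessed(String sourceXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document source =
                    builder.parse(new InputSource(new StringReader(sourceXml)));
            String text = source.getDocumentElement().getFirstChild().getNodeValue();

            String processed = text.trim(); // placeholder for real processing

            Document target = builder.newDocument();
            Element root = target.createElement("copy");
            root.appendChild(target.createTextNode(processed));
            target.appendChild(root);
            return root.getFirstChild().getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String result = copyProcessed("<a> \u65e5\u672c\u8a9e </a>");
        System.out.println(result.equals("\u65e5\u672c\u8a9e"));
    }
}
```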

Hope that helps.


Re: getNodeValue, createTextNode and UTF-8

Posted by Michael Mealling <mi...@bailey.dscga.com>.
On Fri, Feb 02, 2001 at 09:14:00AM -0800, Milind Gadre wrote:
> I had to wade through this issue just a couple of weeks ago. I was using
> XML for storing localized text messages in different languages such as
> Japanese, German etc. My understanding of the process is as follows:
> 
> [1] Input XML file has certain encoding specified in the 'encoding'
> attribute.

Yep. Specified in the <?xml> banner and defaults to UTF-8

> [2] Parser reads the XML file and converts everything to
> java.lang.String which is Unicode. Hence, getNodeValue will return
> Unicode.

But Unicode in that particular JVM's encoding format. In the case
of Sun's JDK it's UCS-2. Hence you can't just use a String someplace
where you need UTF-8.

> [3] When you want to write the parsed text out to a file, you have to
> decide what encoding the output will be in - eg for the OutputStream or
> Writer objects. Depending on the OS, the JVM has a default encoding -
> which is usually quite useless for multibyte text. So use the
> java.lang.String#getBytes(String encoding) method to get your Unicode
> String data in a byte[] format, and then write the byte[] out to the
> file.

Correct. And this works. Except that I'm not writing it out to a file.
I'm taking that UTF-8 bit I pulled out and I'm putting it into another
TextNode in a different XML Document.

> OK - now where do I send my bill :-)

Ah... But the problem isn't fixed yet! ;-) Seriously, whoever
gets this working for me will get a case of their favorite beer shipped
to 'em. 


Re: getNodeValue, createTextNode and UTF-8

Posted by Milind Gadre <mi...@ecplatforms.com>.
I had to wade through this issue just a couple of weeks ago. I was using
XML for storing localized text messages in different languages such as
Japanese, German etc. My understanding of the process is as follows:

[1] Input XML file has certain encoding specified in the 'encoding'
attribute.

[2] Parser reads the XML file and converts everything to
java.lang.String which is Unicode. Hence, getNodeValue will return
Unicode.

[3] When you want to write the parsed text out to a file, you have to
decide what encoding the output will be in - eg for the OutputStream or
Writer objects. Depending on the OS, the JVM has a default encoding -
which is usually quite useless for multibyte text. So use the
java.lang.String#getBytes(String encoding) method to get your Unicode
String data in a byte[] format, and then write the byte[] out to the
file.
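
Step [3] can be sketched as follows, using a Writer with an explicit charset in place of a manual getBytes call, and assuming a modern JDK with java.nio.file. The class name, method name, and temp-file prefix are hypothetical:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitEncodingWrite {
    // Write the text to a temporary file as UTF-8, never relying on the
    // platform default encoding, then read the bytes back and decode them.
    static String writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try (Writer w = new OutputStreamWriter(
                    Files.newOutputStream(file), StandardCharsets.UTF_8)) {
                w.write(text);
            }
            byte[] bytes = Files.readAllBytes(file);
            Files.delete(file);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e";
        System.out.println(writeAndReadBack(japanese).equals(japanese));
    }
}
```

Naming the charset on both the write and the read side is the point; leaving either to the platform default is where multibyte text usually goes wrong.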

Supported encoding strings are available at
http://java.sun.com/products/jdk/1.1/docs/guide/intl/intlTOC.doc.html

OK - now where do I send my bill :-)

Regards...

Milind Gadre
ecPlatforms, Inc
901 Mariner's Island Blvd, Suite 565
San Mateo, CA 94404
C: 510-919-0596
F: 815-352-0779
milind@ecplatforms.com
