Posted to j-users@xerces.apache.org by "Colosi, John" <jc...@verisign.com> on 2001/11/08 21:45:48 UTC
RE: Utf8 question

Hi Dimitry,

I'm still a little confused.  I understand the reasoning behind the
conversion to Unicode for the Java string.  But I'm not seeing the same
conversion when I input utf-8 using method (2) from below.  Are you saying
that my application should not support input using method (1) from below?
Should I not allow users to input raw utf-8 into the XML doc?

thanks again,
-- John

-----Original Message-----
From: Voytenko, Dimitry [mailto:DVoytenko@SECTORBASE.COM]
Sent: Thursday, November 08, 2001 3:28 PM
To: 'xerces-j-user@xml.apache.org'
Subject: RE: Utf8 question


Hi John,

> <abc>$#@%$#@^$#</abc>
>    (here the element value is raw utf8)

According to the DOM interfaces, the values of text nodes are represented as
String (org.w3c.dom.Text.getNodeValue() returns a String). Since a String is a
sequence of char, and a Java char is always a Unicode (UTF-16) code unit, any
characters will be converted to Unicode while building the DOM.
In the SAX interfaces, DocumentHandler.characters,
ContentHandler.characters, etc. take an array of chars (char[]) as their first
parameter, so your characters will be converted to Unicode there as well.
So I'm afraid you won't be able to keep the text as UTF-8 or any other byte
encoding, because in that case you'd need to work with byte[] arrays, which
are not supported by any XML interface.
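
The point above can be checked with a small sketch. It assumes a JAXP
parser is available (Xerces or the JDK's built-in one), and the class and
method names (Utf8RoundTrip, rootText, hex) are made up for illustration.
It parses a document containing the raw UTF-8 bytes E5 9E BE, shows that
the DOM hands back a single Java char, and then re-encodes that String to
get the UTF-8 bytes back when the application needs them:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Utf8RoundTrip {

    // Parse an XML document supplied as bytes and return the root element's text.
    // With no XML declaration and no BOM, the parser assumes UTF-8.
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Format bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+57BE is the character whose UTF-8 encoding is E5 9E BE.
        byte[] xml = "<abc>\u57be</abc>".getBytes(StandardCharsets.UTF_8);

        String text = rootText(xml);
        System.out.println(text.length());                              // prints 1
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8))); // prints e5 9e be
    }
}
```

So the application never needs to ask the parser for bytes. It can take the
String it is given and call getBytes with an explicit charset at the point
where a UTF-8 value is actually required.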

Thanks,
Dmitry

-----Original Message-----
From: Colosi, John [mailto:jcolosi@verisign.com]
Sent: Thursday, November 08, 2001 05:50
To: 'xerces-j-user@xml.apache.org'
Subject: RE: Utf8 question


Thanks for the response Andy.
I'm writing an application which requires a utf8 value.  I think this value
can be input in two ways:

1)

<abc>$#@%$#@^$#</abc>
   (here the element value is raw utf8)



or



2)

<abc>&#xe5;&#x9e;&#xbe;</abc>
   (here the element value is utf8 written using the hex notation)


In the first example, the parser is modifying the utf-8 and returning to me
a Java string containing utf-16.  In the second example, the Java string I
get is just the exact binary that I entered (because the parser makes no
assumption about the binary data).

So how can my application know whether it's looking at utf-8 or utf-16,
given that it can't really know how the parser handled the input?
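
One way to see what the parser does with form (2) is to print the code
points it returns. In XML a character reference such as &#xe5; names a
Unicode code point, not a byte, so the three references give three separate
characters rather than the one character whose UTF-8 bytes are E5 9E BE.
The sketch below assumes a JAXP parser on the classpath; the names
CharRefDemo, rootText, codePoints and reinterpretAsUtf8 are hypothetical.
The last helper is only a workaround for input that is already written one
byte per reference, not something the parser does for you:

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class CharRefDemo {

    // Parse an XML document held in a String and return the root element's text.
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // List the code points of a string as "U+XXXX" separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Treat each char (all must be <= U+00FF) as one byte, then decode as UTF-8.
    static String reinterpretAsUtf8(String oneCharPerByte) {
        byte[] bytes = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String perByte = rootText("<abc>&#xe5;&#x9e;&#xbe;</abc>");
        System.out.println(codePoints(perByte));   // prints U+00E5 U+009E U+00BE

        String perChar = rootText("<abc>&#x57BE;</abc>");
        System.out.println(codePoints(perChar));   // prints U+57BE

        System.out.println(codePoints(reinterpretAsUtf8(perByte))); // prints U+57BE
    }
}
```

In both forms the application receives a Java String. The difference is
only in which characters it contains, so writing the reference as the code
point (&#x57BE; here) makes form (2) equivalent to the raw UTF-8 of form (1).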

Any help is appreciated.

thanks,
-- John

-----Original Message-----
From: Andy Clark [mailto:andyc@apache.org]
Sent: Thursday, November 08, 2001 12:32 AM
To: xerces-j-user@xml.apache.org
Subject: Re: Utf8 question


"Colosi, John" wrote:
>         It looks like the Xerces parser is converting incoming UTF-8 to
> UTF-16 automatically during the parse.

Since Java uses UTF-16 internally, wouldn't this be what
it's supposed to do? Or maybe I'm not understanding what
you mean. Please provide some more detailed information.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org
