Posted to j-users@xerces.apache.org by "Colosi, John" <jc...@verisign.com> on 2001/11/09 01:39:01 UTC

HELP ME! - Utf question

All,

	Let me refine my explanation.  Dimitry, your feedback has been very
helpful so far.  I would really appreciate any other feedback as well.

	Given:
		  UTF-8      UTF-16
		e5 9e be  =   57be
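
	The equivalence above can be checked outside the parser.  What follows is
only a minimal sketch using the standard Java library; the class name
Utf8Given and the method decodeUtf8 are made up for illustration and are
not part of Xerces.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Given {
    // Decode raw bytes as UTF-8 into a Java String (a sequence of UTF-16 code units).
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three bytes from the "Given" table.
        byte[] raw = { (byte) 0xe5, (byte) 0x9e, (byte) 0xbe };
        String s = decodeUtf8(raw);

        System.out.println(s.length());                        // prints 1
        System.out.println(Integer.toHexString(s.charAt(0)));  // prints 57be
    }
}
```

	The three UTF-8 bytes carry 4 + 6 + 6 payload bits (0101 011110 111110),
which regroup to 0101 0111 1011 1110 = 0x57be.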



	Now, consider the following:


	#1
		<abc>åz¾</abc>
		The <abc> element contains the bytes e5, 9e, and be as its
content, in raw binary form.  The Xerces parser assumes these bytes are
UTF-8 and converts them to UTF-16 (Unicode).  A Java String of length 1
(one) is constructed whose value is 0x57be (see "Given" above).
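
	Case #1 can be reproduced roughly as follows.  This is a sketch under
assumptions: it uses the JAXP parser bundled with the JDK rather than a
specific Xerces release, the document has no XML declaration or byte order
mark (so the parser falls back to UTF-8), and the class name RawBytesCase
and its members are hypothetical.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class RawBytesCase {
    // "<abc>" + raw bytes e5 9e be + "</abc>", with no XML declaration.
    static final byte[] RAW_DOC = {
        '<', 'a', 'b', 'c', '>',
        (byte) 0xe5, (byte) 0x9e, (byte) 0xbe,
        '<', '/', 'a', 'b', 'c', '>'
    };

    // Parse the bytes and return the text content of the root element.
    static String parseText(byte[] xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String s = parseText(RAW_DOC);
        // Print the length and each UTF-16 code unit in hex.
        System.out.println("length = " + s.length());
        for (int i = 0; i < s.length(); i++) {
            System.out.println(Integer.toHexString(s.charAt(i)));
        }
    }
}
```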


	#2
		<abc>&#xe5;&#x9e;&#xbe;</abc>
		Now the <abc> element contains the hex values e5, 9e, and
be, but they are written as hexadecimal character references, so they are
not interpreted as UTF-8.  A Java String of length 3 (three) is constructed
whose value is 0x00e5, 0x009e, 0x00be.
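
	Case #2 can be reproduced in the same way.  Again this is only a sketch:
it uses the JAXP parser bundled with the JDK rather than a specific Xerces
release, and the class name CharRefCase and its members are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class CharRefCase {
    // The same three values, written as hexadecimal character references.
    static final String XML = "<abc>&#xe5;&#x9e;&#xbe;</abc>";

    // Parse the markup and return the text content of the root element.
    static String parseText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String s = parseText(XML);
        // Print the length and each UTF-16 code unit in hex.
        System.out.println("length = " + s.length());
        for (int i = 0; i < s.length(); i++) {
            System.out.println(Integer.toHexString(s.charAt(i)));
        }
    }
}
```

	Each reference names one character by its number, so every reference
becomes its own char in the resulting String.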


	The intended input data is identical in the two cases.  In both, the
user wishes to specify the hex data "e5 9e be".  The parser handles the data
differently depending on the method of input, which results in different
output.  Is there any way to rectify this?

thanks,
-- John
