Posted to j-users@xerces.apache.org by Ashish Rahurkar <As...@crossworlds.com> on 2001/12/17 19:24:03 UTC

UTF Encoded XML docs and Xerces SAX (1.4.4)

I have an XML document (UTF-8 encoded) containing a German character (e.g. ä).
 
In the example below, the umlaut character 'ä' is UTF-8 encoded.
The XML file looks like this:
 
<?xml version="1.0" encoding="UTF-8"?>
<!-- This came from sample poll servlet -->
<!DOCTYPE X >
<X Attrib1="Attrib1Info" Attrib2="Attrib2Data" Attrib3="Attrib3Info" >
<CD>
<C attrib1="testattrib1" attrib2="testattrib2" >UTF Encoded Umlaut character
Ã¤</C>
</CD>
</X>
 
 
When I parse the document with Xerces (SAX), the parser does not
return the character ä in the characters(char ch[], int start, int length)
callback method.
What I expect to receive in the character array is "UTF Encoded Umlaut
character ä", either in several chunks or in one long chunk. Instead I get the
characters exactly as they appear in the XML document:
"UTF Encoded Umlaut character Ã¤".
 
Why does the parser not return the correct Unicode characters, when
all parsers are supposed to support UTF-8 encoding?
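
One way to separate a parser problem from an input problem is to feed the same UTF-8 bytes to the parser twice: once as a raw byte stream, and once through a Reader that decodes them as ISO-8859-1. The sketch below is only an illustration under assumptions: it uses the JAXP SAX parser bundled with a current JDK rather than Xerces 1.4.4, and the class and method names (Utf8SaxDemo, collectText, parseAsBytes, parseViaLatin1Reader) are made up for this example. Whether a wrongly configured Reader is what happens in my setup is a guess, not something confirmed here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Xerces.
public class Utf8SaxDemo {

    // Concatenates every characters() chunk; SAX may deliver text in several calls.
    static String collectText(InputSource source) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    // The parser sees raw bytes and decodes them itself, using the XML declaration.
    static String parseAsBytes(byte[] xml) throws Exception {
        return collectText(new InputSource(new ByteArrayInputStream(xml)));
    }

    // The bytes are decoded as ISO-8859-1 before the parser sees them, so the
    // parser receives characters and the encoding declaration has no effect.
    static String parseViaLatin1Reader(byte[] xml) throws Exception {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xml), StandardCharsets.ISO_8859_1);
        return collectText(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<C>UTF Encoded Umlaut character \u00e4</C>")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(parseAsBytes(xml));
        System.out.println(parseViaLatin1Reader(xml));
    }
}
```

If the second call reproduces the "Ã¤" output while the first returns "ä", the bytes are probably being decoded with the wrong charset before they reach the parser.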
 
 
If instead of the UTF-8 bytes for ä I use &#228; (escaping it with a character
reference), the parser recognizes it and returns the correct
string in two chunks:
char array chunk 1: UTF Encoded Umlaut character 
char array chunk 2: ä
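
The chunking above can be observed with a handler that records each characters() call separately. This is a minimal sketch, assuming the JAXP SAX parser from a current JDK instead of Xerces 1.4.4; ChunkDemo and chunks are hypothetical names used only here. How the text is split into chunks is up to the parser, so only the concatenation of the chunks should be relied on.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Xerces.
public class ChunkDemo {

    // Returns one list entry per characters() callback.
    static List<String> chunks(String xml) throws Exception {
        final List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // &#228; is the character reference for the umlaut (U+00E4).
        List<String> parts = chunks("<C>UTF Encoded Umlaut character &#228;</C>");
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("chunk " + (i + 1) + ": " + parts.get(i));
        }
        System.out.println("joined: " + String.join("", parts));
    }
}
```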
 
When I used IE or other XML viewers to view the file, they correctly
interpreted the UTF-8 encoding and displayed the XML with the German characters.
 
Is there a bug in Xerces SAX or am I missing something?
 
Thanks
Ashish
 
