Posted to j-users@xerces.apache.org by ne...@ca.ibm.com on 2001/12/17 22:39:42 UTC
Re: UTF Encoded XML docs and Xerces SAX (1.4.4)

Hi Ashish,

Unfortunately, it's not clear to me what you're after here.  Shouldn't
&#228; map to a single Unicode character--i.e., not be split between two
chars?

It might be useful to make sure you're viewing the result of the
characters(...) callback in the same way that the browser is displaying it.
Unicode needs to be rendered before it's displayed, so, depending on what
you're doing, your rendering procedure could be giving you false results.
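
One way to take rendering out of the picture is to print the numeric code
points of whatever characters(...) delivers, rather than the characters
themselves. The following is only a sketch: the class name CodePointDump and
its dump method are made up for illustration and are not part of Xerces or SAX.

```java
public class CodePointDump {
    // Formats each code point in ch[start .. start+length) as U+XXXX,
    // so the result does not depend on console or font rendering.
    public static String dump(char[] ch, int start, int length) {
        StringBuilder sb = new StringBuilder();
        int end = start + length;
        int i = start;
        while (i < end) {
            int cp = Character.codePointAt(ch, i, end);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] oneChar = {'\u00E4'};            // a single a-umlaut
        char[] twoChars = {'\u00C3', '\u00A4'}; // what "Ã¤" consists of
        System.out.println(dump(oneChar, 0, oneChar.length));   // U+00E4
        System.out.println(dump(twoChars, 0, twoChars.length)); // U+00C3 U+00A4
    }
}
```

Calling dump(ch, start, length) from inside a characters(...) implementation
would show whether the parser delivered one char (U+00E4) or two.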

Perhaps you could send your xml file with the output you're expecting and
the output Xerces gives you--the sax.SAXWriter sample could help here.
Please send this in a zip file so that the text files avoid the vagaries
of e-mail clients, notorious for making messes of non-ASCII text.  Finally,
I'd recommend trying Xerces2; if there really is a bug here, the chances of
it getting fixed in that codebase are dramatically better than they are if
you stay with Xerces1.

Cheers,
Neil

Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  neilg@ca.ibm.com



Ashish Rahurkar <As...@crossworlds.com> on 12/17/2001 01:24:03 PM

Please respond to xerces-j-user@xml.apache.org

To:   "'xerces-j-user@xml.apache.org'" <xe...@xml.apache.org>
cc:
Subject:  UTF Encoded XML docs and  Xerces SAX (1.4.4)


I have an XML document (UTF encoded) containing a German character (e.g. ä).

In the example below, the umlaut character 'ä' is UTF encoded.
The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- This came from sample poll servlet -->
<!DOCTYPE X >
<X Attrib1="Attrib1Info" Attrib2="Attrib2Data" Attrib3="Attrib3Info" >
<CD>
<C attrib1="testattrib1" attrib2="testattrib2" >UTF Encoded Umlaut
character
Ã¤</C>
</CD>
</X>


When I parse the document with Xerces (SAX), I see that the parser does not
return the character ä in the characters(char ch[], int start, int length)
callback method.
What I expect to receive in the characters array is "UTF Encoded Umlaut
character ä", either in several chunks or in one long chunk. Instead I get
the chars exactly as they appear in the XML doc:
"UTF Encoded Umlaut character Ã¤".

Why is the parser not able to return me the correct unicode characters when
all parsers are supposed to support UTF-8 encoding?
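
For reference, the pair "Ã¤" is what the two UTF-8 bytes of ä (0xC3 0xA4)
look like when they are decoded as ISO-8859-1 instead of UTF-8. The sketch
below only illustrates that relationship with the standard Java charset API.
It is not a diagnosis of this particular file, and the class name DecodeDemo
is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // The UTF-8 encoding of U+00E4 (a-umlaut) is the two bytes C3 A4.
    static final byte[] UTF8_A_UMLAUT = {(byte) 0xC3, (byte) 0xA4};

    // Decoding the bytes as UTF-8 yields a single char, U+00E4.
    public static String asUtf8() {
        return new String(UTF8_A_UMLAUT, StandardCharsets.UTF_8);
    }

    // Decoding the same bytes as ISO-8859-1 yields two chars, U+00C3 U+00A4.
    public static String asLatin1() {
        return new String(UTF8_A_UMLAUT, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(asUtf8().length());   // 1
        System.out.println(asLatin1().length()); // 2
    }
}
```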


If instead of the UTF code for ä I use &#228; (escaping it with a character
reference), then the parser recognizes it and returns the correct string
in two chunks:
char array chunk 1: UTF Encoded Umlaut character
char array chunk 2: ä
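
SAX allows a parser to deliver character data in any number of chunks, so a
handler normally appends each chunk to a buffer rather than relying on chunk
boundaries. A minimal sketch follows, assuming the JAXP SAX parser bundled
with the JDK rather than Xerces 1.4.4; the class name ChunkCollector and its
parse helper are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // May be called several times for one run of text; append every chunk.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Parses the given XML string and returns all character data joined.
    public static String parse(String xml) {
        ChunkCollector handler = new ChunkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        String xml = "<C>UTF Encoded Umlaut character &#228;</C>";
        String result = parse(xml);
        // true: the reference becomes the single char U+00E4 at the end
        System.out.println(result.equals("UTF Encoded Umlaut character \u00E4"));
    }
}
```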

When I used IE or other XML viewers to view the XML, they correctly
interpreted the UTF encoding and displayed the XML with German characters.

Is there a bug in Xerces SAX or am I missing something?

Thanks
Ashish


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org
