Posted to j-users@xerces.apache.org by "F. Andy Seidl" <fa...@myst-technology.com> on 2004/05/07 20:40:33 UTC

How does document encoding affect DOMString character values in a resulting DOM? (possible Xerces bug)

I am uncertain whether the behavior I am seeing in the Xerces DOM parser
(2.6.2) is correct.  Specifically, I am unclear as to what character values
should appear in a DOMString after parsing a document that uses a character
encoding such as ISO-8859-1 or Windows-1252.

Here is a specific example to illustrate the question:

Suppose a document that specifies encoding="ISO-8859-1" contains a byte
value 0x93 as part of the text content of an element.  This is a left double
quotation mark (a "smart quote" in Windows terminology).  This is a legal
character for the encoding.  However, the Unicode code point for LEFT DOUBLE
QUOTATION MARK is 0x201C.

So, once this document is parsed into a DOM, should the DOM contain the
character value 0x93 or the Unicode value 0x201C?

Based on the DOM Level 2 Core specification, it seems the DOM should contain
0x201C because the spec says, "Applications must encode DOMString using
UTF-16 (defined in [Unicode] and Amendment 1 of [ISO/IEC 10646])."
See http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578

However, after parsing the document with Xerces, the DOM contains the
character value 0x93 from the original source document (which, in Unicode,
is the "set transmit state" control character and not a left double quote).

Is this a Xerces bug?  If so, can anyone offer advice as to where to look in
the Xerces source to start debugging?

Thanks,
  -- fas
F. Andy Seidl, Co-founder
MyST Technology Partners
Creators of MySmartChannels




---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


How to get namespace prefix mapping in schema?

Posted by Bob Foster <bo...@objfac.com>.
While processing an XSSimpleTypeDefinition that has enumeration values 
which are QNames, I need to know the namespace URI that corresponds to 
the QName prefix used to define the values in the schema. How do I get 
access to that?

Thanks.

Bob Foster
http://xmlbuddy.com/




Re: How does document encoding affect DOMString character values in a resulting DOM? (possible Xerces bug)

Posted by Joseph Kesselman <ke...@us.ibm.com>.



The DOMString's internal encoding is always Unicode, specifically UTF-16.
The parser should convert from the XML text's declared encoding to that form;
the serializer should convert to whatever output text encoding is being used.
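
One way to check that conversion is to decode the byte in question under each
candidate encoding and compare the result with what the parser puts in the
DOM.  The following is only a minimal sketch, not anything taken from the
Xerces source: the class name EncodingProbe and its two helper methods are
made up for illustration, it uses whatever JAXP parser the JDK supplies
(which may not be Xerces 2.6.2), and it assumes the runtime knows the
"windows-1252" charset name.  As far as I know, Java's ISO-8859-1 charset
maps byte 0x93 to the control character U+0093, while windows-1252 maps it to
U+201C, so a DOM value of 0x93 may simply mean the document was declared as
ISO-8859-1 but written as Windows-1252.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class for illustration only; not part of Xerces.
public class EncodingProbe {

    // Decode a single raw byte with the given charset and return the
    // Unicode code point it maps to.
    static int decodeByte(int rawByte, Charset charset) {
        String s = new String(new byte[] {(byte) rawByte}, charset);
        return s.codePointAt(0);
    }

    // Build a one-element document that declares the given encoding and
    // carries the raw byte as its text content, parse it into a DOM, and
    // return the first code point of that text.
    static int parseByte(int rawByte, String declaredEncoding) {
        byte[] head = ("<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?><q>")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "</q>".getBytes(StandardCharsets.US_ASCII);
        byte[] doc = new byte[head.length + 1 + tail.length];
        System.arraycopy(head, 0, doc, 0, head.length);
        doc[head.length] = (byte) rawByte;
        System.arraycopy(tail, 0, doc, head.length + 1, tail.length);
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(doc));
            return dom.getDocumentElement().getTextContent().codePointAt(0);
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers stay simple.
            throw new IllegalStateException("parse failed for " + declaredEncoding, e);
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.printf("decode 0x93 as ISO-8859-1:   U+%04X%n",
                decodeByte(0x93, StandardCharsets.ISO_8859_1));
        System.out.printf("decode 0x93 as windows-1252: U+%04X%n",
                decodeByte(0x93, cp1252));
        System.out.printf("DOM text, ISO-8859-1 decl:   U+%04X%n",
                parseByte(0x93, "ISO-8859-1"));
        System.out.printf("DOM text, windows-1252 decl: U+%04X%n",
                parseByte(0x93, "windows-1252"));
    }
}
```

If the parser output matches the plain charset decode for the declared
encoding, the parser is doing the conversion described above, and the thing
to look at is the encoding declaration in the source document rather than
the parser.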

______________________________________
Joe Kesselman, IBM Next-Generation Web Technologies: XML, XSL and more.
"The world changed profoundly and unpredictably the day Tim Berners Lee
got bitten by a radioactive spider." -- Rafe Culpin, in r.m.filk

