Posted to j-users@xalan.apache.org by Artur Tomusiak <ar...@hannonhill.com> on 2009/04/14 00:20:17 UTC

How to preserve numeric entities when converting xml String to a org.w3c.dom.Document ?

Hello,

I am trying to convert a String with XML content in it into the 
org.w3c.dom.Document object to do some modifications and then to convert 
it back to the String. However, even if I do not do any modifications to 
the object, I am still getting back a different String than what I have 
provided as an input. The problem is with the numeric XML entities. For 
example, if my input String is:

<?xml version="1.0" encoding="UTF-8"?>
<xml>
    &#169;
    &#38;   
</xml>

Once I convert this to an org.w3c.dom.Document object and then back to 
String, I am getting this as a result:

<?xml version="1.0" encoding="UTF-8"?>
<xml>
    ©
    &amp;   
</xml>

After looking more closely, I realized that the org.w3c.dom.Document 
object already contains the converted text, which means the problem lies 
in conversion from the String to Document, and not when converting back 
from Document to String.
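
The observation above can be reproduced with a minimal JAXP sketch. It
assumes the JDK's default DocumentBuilderFactory; the class name ParseDemo
and the helper parse() are made up for illustration. It only shows that the
references are already expanded in the DOM after parsing, it does not
preserve them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Xalan or the original post.
final class ParseDemo {

    // Parses an XML string into a DOM Document with the default JAXP parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<xml>\n    &#169;\n    &#38;\n</xml>";
        String text = parse(xml).getDocumentElement().getTextContent();

        // The parser has replaced each reference with the character it names,
        // so the DOM holds U+00A9 and a literal '&', not "&#169;" or "&#38;".
        System.out.println(text.contains("\u00A9")); // true
        System.out.println(text.contains("&#"));     // false
    }
}
```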

Please let me know how I can do the described conversions while
preserving the numeric entities in the XML (example code would be very
much appreciated).

Thanks,
Artur



Re: How to preserve numeric entities when converting xml String to a org.w3c.dom.Document ?

Posted by ke...@us.ibm.com.
The simple answer is "Sorry, but those two forms are absolutely identical 
in meaning as far as XML is concerned. If you're going through XML-based 
processing, either output is correct. Standard tools aren't going to 
maintain this distinction."

The longer answer is that you could postprocess the XML-syntax text, or 
write your own serializer, to force the output into this form.
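
A rough sketch of the postprocessing route, in plain Java. The class and
method names are made up for illustration. It assumes the serialized text
contains no CDATA sections, comments, processing instructions, or non-ASCII
element/attribute names, since a blind textual replacement would alter
those.

```java
// Hypothetical helper, not part of Xalan or any standard serializer.
final class NumericRefPostProcessor {

    // Rewrites "&amp;" as "&#38;" and every non-ASCII character as a
    // decimal numeric character reference.
    static String toNumericRefs(String serialized) {
        String s = serialized.replace("&amp;", "&#38;");
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // handles surrogate pairs as one code point
            if (cp > 0x7F) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints <xml>&#169; &#38;</xml>
        System.out.println(toNumericRefs("<xml>\u00A9 &amp;</xml>"));
    }
}
```

Both forms mean the same thing to any XML parser, so this only matters to
consumers that compare the text byte for byte.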


Re: How to preserve numeric entities when converting xml String to a org.w3c.dom.Document ?

Posted by Michael Ludwig <ml...@as-guides.com>.
Artur Tomusiak wrote:
>
> I am trying to convert a String with XML content in it into the
> org.w3c.dom.Document object to do some modifications and then to
> convert it back to the String. However, even if I do not do any
> modifications to the object, I am still getting back a different
> String than what I have provided as an input. The problem is with
> the numeric XML entities. For example, if my input String is:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xml>
>    &#169;
>    &#38;
> </xml>

Hi Artur,

In fact, and to be pedantic, these are neither entities nor entity
references; they're numerical character references, which just happen
to use the same syntax as general entity references. (See the XML spec
if interested.)

As keshlam said, the two forms are 100% identical as far as XML is
concerned.

It's not clear to me whether you use XSLT at all or only the DOM.
I'm assuming you're using XSLT.

When transforming to a DOM target, XSLT serialization instructions
such as <xsl:output encoding="US-ASCII"/> are disregarded.

If all you want is a string, there is no point in transforming to the
DOM. In that case, simply specify <xsl:output encoding="US-ASCII"/> in
your stylesheet. That would force numerical character references for
non-ASCII characters.
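
As a sketch of the same idea from the Java side: the JAXP output property
OutputKeys.ENCODING plays the role of the encoding attribute on xsl:output.
The example uses an identity Transformer instead of a stylesheet (my
substitution, to keep it self-contained), and the class name is made up.
The exact form of the emitted reference (decimal or hexadecimal) is up to
the serializer.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of Xalan.
final class AsciiSerializeDemo {

    // Runs an identity transform and serializes the result as US-ASCII, so
    // characters outside ASCII must be written as character references.
    static String toAscii(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        // The copyright sign comes back as a character reference; the
        // ampersand comes back as &amp;, not &#38;.
        System.out.println(toAscii("<xml>\n    &#169;\n    &#38;\n</xml>"));
    }
}
```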

But only the © in your example is a non-ASCII character. The ampersand
is ASCII, so it will still be serialized as &amp;, and I do not know of
a way to have it serialized as a numerical character reference in
XSLT 1.0. Use Perl or AWK or some other general text processing tool
to postprocess your output.

Michael Ludwig