Posted to j-users@xalan.apache.org by Aeris <ae...@imirhil.fr> on 2011/10/09 23:59:26 UTC

Disable escaping on transformer

Hi,

I have a small problem with Xalan.

I use a Transformer to create an HTML file from a Document.
But in the generated HTML, every « & » in the document that is part of
an already-escaped HTML entity such as « &nbsp; » is re-escaped by Xalan.

See this sample: http://pastebin.com/LfGpWMai
Instead of the expected
	<div>&mdash;</div>
I get
	<div>&amp;mdash;</div>
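
The pastebin sample is not reproduced in this thread. As a rough sketch only, and assuming the DOM is built by hand with a text node holding the literal characters "&mdash;", code along these lines shows the same behavior. The class and method names (EscapingRepro, render) are made up for illustration and are not taken from the original sample.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical reproduction; names are illustrative, not from the pastebin.
final class EscapingRepro {

    // Serializes a <div> whose text is the seven characters "&mdash;".
    static String render() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            // The text node holds '&', 'm', 'd', ... and not a dash character.
            div.appendChild(doc.createTextNode("&mdash;"));
            doc.appendChild(div);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "html");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The '&' of the text node is written out as &amp;
        System.out.println(render());
    }
}
```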

I searched the documentation and Google, but found nothing about how to
disable escaping. How can I do this?

Thanks
-- 
Nicolas VINOT


Re: Disable escaping on transformer

Posted by Aeris <ae...@imirhil.fr>.
On 12/10/2011 21:29, Nathan Nadeau wrote:
> Nicolas,
> 
> It seems you are not using anything specific to Xalan in your code at
> http://pastebin.com/LfGpWMai, though I may be missing something.

Hi,

In my case, the actual implementation of javax.xml.transform.Transformer
is org.apache.xalan.transformer.TransformerImpl, and debugging indicates
that this is the class that escapes the characters.

So I tried my luck on this mailing list, but the same question was asked
on the JDK mailing list too.

> This behavior, according to your code, is actually expected.

I agree, this output is what is expected in most cases.
But it is not what I expected =(

And because I use « transformer.setOutputProperty(OutputKeys.METHOD,
"html"); », I expected all the more that the transformer would handle
HTML entities and not only XML ones.

> To disable entity resolving when reading in the source XML document, see
> DocumentBuilderFactory.setExpandEntityReferences().

Thanks a lot for this clue, I will investigate in that direction.

> Entities and entity references can be quite tricky to work with, and you
> must understand what is happening at each level of the XML processing,
> from reading in the source XML, to running a transform on the XML, to
> outputting the final result.

Yes, escaping is just a pain in the real world.
In my case, the source text is outside my control…

-- 
Nicolas VINOT


Re: Disable escaping on transformer

Posted by ke...@us.ibm.com.
If you want the HTML serializer to write out <foo>&mdash;</foo>, put a 
genuine Unicode mdash character into the text and let the serializer deal 
with converting that to the correct format -- just as the parser converts 
it the other way, yielding a text node or character event containing that 
Unicode character.

Let the tool do what it was designed to do. Don't try to second-guess it.

(Of course the serializer may decide to output the character as a numeric 
character escape instead of the human-readable entity name. But that's OK; 
it's still a correct representation of your document, and any software 
which cares about the distinction between those two renderings is, to put 
it simply, broken.)
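
A minimal sketch of that approach, assuming the text can be supplied with the real U+2014 character instead of the string "&mdash;". The class and method names (UnicodeDash, render) are hypothetical, and the US-ASCII output encoding is my own choice here, picked only so that the serializer cannot write the dash as a raw character. Whether it then emits the entity name or a numeric character reference depends on the serializer.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example; names are illustrative only.
final class UnicodeDash {

    // Serializes a <div> containing a real em dash (U+2014) as ASCII-only HTML.
    static String render() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            // One genuine Unicode character, not an entity-like string.
            div.appendChild(doc.createTextNode("\u2014"));
            doc.appendChild(div);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "html");
            // Assumption: an encoding that cannot represent U+2014 makes the
            // serializer fall back to an entity or character reference.
            t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```

Either rendering parses back to the same single character, which is the point made above.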


______________________________________
"You build world of steel and stone
I build worlds of words alone
Skilled tradespeople, long years taught:
You shape matter; I shape thought."
(http://www.songworm.com/lyrics/songworm-parody/ShapesofShadow.html)





Re: Disable escaping on transformer

Posted by Nathan Nadeau <nd...@gleim.com>.
Nicolas,

It seems you are not using anything specific to Xalan in your code at 
http://pastebin.com/LfGpWMai, though I may be missing something.

This behavior, according to your code, is actually expected. You are 
creating a text node with the value "&mdash;" and wanting to output that 
in an XML file. In order to do this, the '&' must be escaped as "&amp;" 
in the output XML file. So the output is correct, though it is probably 
not what you want. When read in by other XML parsers, your created XML 
would contain an element called "div" with a text value of "&mdash;" 
(which is what you told it to have).
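
To illustrate the round trip described above, here is a small sketch (the class and method names RoundTrip and roundTrip are made up for this example): it serializes a text node containing "&mdash;" and parses the result back, so the parser undoes the "&amp;" escaping and hands back the original text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example; names are illustrative only.
final class RoundTrip {

    // Writes the DOM as XML, parses that XML again, returns the div's text.
    static String roundTrip() {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element div = doc.createElement("div");
            div.appendChild(doc.createTextNode("&mdash;"));
            doc.appendChild(div);

            // Serialization escapes '&' so the file stays well-formed XML.
            StringWriter xml = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(xml));

            // Parsing reverses the escaping.
            Document parsed = builder.parse(
                    new InputSource(new StringReader(xml.toString())));
            return parsed.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip()); // prints &mdash;
    }
}
```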

You can tell the class responsible for writing out the document to no 
longer escape special characters such as '&', though generally this is 
not preferred unless you have no other choice, at least according to 
best practices that I'm aware of. If you are reading in XML documents 
(instead of building the DOM from scratch as in your example) you should 
also be able to tell the XML parser not to resolve entities in the source 
document.

-----------

// This outputs <div>&mdash;</div> by telling StreamResult to disable
// output escaping via a processing instruction in the source DOM.
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.*;

final DocumentBuilder builder =
    DocumentBuilderFactory.newInstance().newDocumentBuilder();
final Document document = builder.newDocument();
// The PI is added before the root element, so it applies to all text after it.
final Node pi = document.createProcessingInstruction(
    StreamResult.PI_DISABLE_OUTPUT_ESCAPING, "");
final Node div = document.createElement("div");
document.appendChild(pi);
document.appendChild(div);
div.appendChild(document.createTextNode("&mdash;"));
final Transformer transformer =
    TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
final Writer out = new StringWriter();
final StreamResult sr = new StreamResult(out);
transformer.transform(new DOMSource(document), sr);
System.out.println(out);

-----------
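
If escaping should be switched off for only part of the text, a possible variation is to bracket that text with the matching enable instruction, Result.PI_ENABLE_OUTPUT_ESCAPING. This is a sketch under assumptions: the class and method names (ScopedRawText, render) are invented for illustration, and whether a serializer honors these processing instructions when they come from a DOMSource is implementation dependent, so test it against the transformer you actually use.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example; names are illustrative only.
final class ScopedRawText {

    // Only the text between the two PIs is meant to be written unescaped.
    static String render() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            doc.appendChild(div);

            div.appendChild(doc.createProcessingInstruction(
                    Result.PI_DISABLE_OUTPUT_ESCAPING, ""));
            div.appendChild(doc.createTextNode("&mdash;"));
            div.appendChild(doc.createProcessingInstruction(
                    Result.PI_ENABLE_OUTPUT_ESCAPING, ""));

            // Text after the enable PI should be escaped as usual.
            Element span = doc.createElement("span");
            span.appendChild(doc.createTextNode("a & b"));
            div.appendChild(span);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "html");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```

Keeping the unescaped region as small as possible limits the risk of writing malformed markup when the raw text comes from outside your control.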

To disable entity resolving when reading in the source XML document, see 
DocumentBuilderFactory.setExpandEntityReferences().
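
As a rough sketch of that setting (the class and method names KeepEntityRefs and firstChildType are hypothetical, and the inline DOCTYPE declaring mdash is made up so the example is self-contained), the following parses a document with entity expansion turned off and reports the type of the first child of the root element. With the parser bundled in the JDK I would expect an EntityReference node rather than a plain text node, but other parsers may behave differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example; names are illustrative only.
final class KeepEntityRefs {

    // Parses a tiny document and returns the node type of the div's first child.
    static short firstChildType() {
        try {
            // Assumption: the entity is declared in an internal DTD subset.
            String xml = "<!DOCTYPE div [<!ENTITY mdash \"&#8212;\">]>"
                    + "<div>&mdash;</div>";
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Ask the parser to keep entity references as nodes in the DOM.
            factory.setExpandEntityReferences(false);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getFirstChild().getNodeType();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean kept = firstChildType() == Node.ENTITY_REFERENCE_NODE;
        System.out.println("entity reference kept: " + kept);
    }
}
```

How such a node is later written out by a Transformer is a separate question and is not covered by this sketch.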

Entities and entity references can be quite tricky to work with, and you 
must understand what is happening at each level of the XML processing, 
from reading in the source XML, to running a transform on the XML, to 
outputting the final result.


-- 
Nathan Nadeau
ndn@gleim.com
Software Development
Gleim Publications, Inc.
http://www.gleim.com