Posted to j-users@xerces.apache.org by Murray Altheim <m....@open.ac.uk> on 2004/09/03 23:44:14 UTC

CDATA section behaviour in XHTML serializer

Hi,

I searched the archives but was unable to locate any questions
and answers on this subject, so my apologies if this has been
covered before.

In one of the recent releases of Xerces-J, the XHTML serializer's
behaviour was changed to always escape <script> and <style>
elements' content using CDATA sections. While this is certainly
permissible in XHTML (and recommended if the character content of
the element contains problematic characters, see [1]), this
behaviour is not always warranted or welcome. It seems that it
should be an optional serialization behaviour, rather than
hardwired in, or perhaps made sensitive to the specific character
data content of the element to be serialized.

In revision $Revision: 1.26 $ $Date: 2004/02/16 05:24:55 $ of
org.apache.xml.serialize.HTMLSerializer, on lines 395-404, and
again on lines 623-632, there's an if statement that automatically
tags <style> and <script> elements to be escaped as CDATA sections.

      if ( tagName.equalsIgnoreCase( "SCRIPT" ) ||
           tagName.equalsIgnoreCase( "STYLE" ) ) {
          if ( _xhtml ) {
              // XHTML: Print contents as CDATA section
              state.doCData = true;
          } else {
              // HTML: Print contents unescaped
              state.unescaped = true;
          }
      }

If, just prior to *actually* performing the CDATA escaping in
characters(String) and characters(char[],int,int), the character
data content were checked for the presence of '<' and '&'
characters and neither were found, then state.doCData could safely
be set to false.
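
As a rough illustration of the check described above, the following
sketch scans character data for '<' and '&' and reports whether CDATA
protection is needed at all. The class and method names (CDataCheck,
needsCData) are hypothetical and are not part of Xerces; the assumption
is that the serializer would run such a test inside its characters()
methods before honouring state.doCData.

```java
// Hypothetical helper, not part of Xerces: decides whether character
// data contains anything that would need CDATA protection.
final class CDataCheck {

    private CDataCheck() {}

    /** Returns true if the text contains '<' or '&'. */
    static boolean needsCData(String text) {
        return text.indexOf('<') >= 0 || text.indexOf('&') >= 0;
    }

    /** Same test for the characters(char[],int,int) form. */
    static boolean needsCData(char[] chars, int start, int length) {
        for (int i = start; i < start + length; i++) {
            if (chars[i] == '<' || chars[i] == '&') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Plain style content: no CDATA section required.
        System.out.println(needsCData("body { color: red }"));    // false
        // Script content with '<' and '&': CDATA (or escaping) required.
        System.out.println(needsCData("if (a < b && c) run();")); // true
    }
}
```

With a check like this, the CDATA wrapper would only appear when the
content actually contains one of the two problematic characters.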

Alternatively, a boolean option could be provided to turn off the
default behaviour, though this latter suggestion does have the
downside of potentially creating invalid serializations.
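
A minimal sketch of what such an option might look like follows. It is
not Xerces code: the class ScriptStylePolicy, its setter and the
useCData method are all invented for illustration, and the flag simply
gates the same SCRIPT/STYLE test shown in the excerpt above.

```java
// Hypothetical sketch, not Xerces code: a flag that lets callers opt
// out of automatic CDATA wrapping for <script> and <style> content.
final class ScriptStylePolicy {

    // Defaults to true, mirroring the serializer's current behaviour.
    private boolean cdataForScriptAndStyle = true;

    void setCDataForScriptAndStyle(boolean enabled) {
        this.cdataForScriptAndStyle = enabled;
    }

    /** Decides whether an element's content is written as a CDATA section. */
    boolean useCData(String tagName, boolean xhtml) {
        boolean scriptOrStyle = tagName.equalsIgnoreCase("SCRIPT")
                || tagName.equalsIgnoreCase("STYLE");
        return xhtml && scriptOrStyle && cdataForScriptAndStyle;
    }

    public static void main(String[] args) {
        ScriptStylePolicy policy = new ScriptStylePolicy();
        System.out.println(policy.useCData("script", true)); // true
        policy.setCDataForScriptAndStyle(false);
        System.out.println(policy.useCData("script", true)); // false
    }
}
```

With the flag switched off, nothing protects content that contains '<'
or '&', which is the invalid-serialization risk mentioned above.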

Because these CDATA sections are actually causing problems for
my own project, I've had to disable the setting of state.doCData
and just avoid invalid characters in my script and style elements.
Obviously this is a less than adequate solution, and I'm just
wondering if others have run across this.

Thanks,

Murray

[1] http://www.w3.org/TR/xhtml1/#h-4.8
......................................................................
Murray Altheim                    http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK               .



---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: CDATA section behaviour in XHTML serializer

Posted by Murray Altheim <m....@open.ac.uk>.
Michael Glavassevich wrote:
> Hello Murray,
> 
> The code you cited appears to be over four years old [1], so this doesn't 
> sound like new behaviour. Regardless of when this was introduced, it isn't 
> likely to change now. Both the HTMLSerializer and XHTMLSerializer were 
> deprecated in Xerces 2.6.2. We've been encouraging users to migrate their 
> code to use the standard JAXP Transformation API. If you're interested 
> in the future of Xerces serializers, see this post [2] to xalan-dev from 
> February of this year.
> 
> [1] 
> http://cvs.apache.org/viewcvs.cgi/xml-xerces/java/src/org/apache/xml/serialize/HTMLSerializer.java?rev=1.10&view=markup
> [2] http://marc.theaimsgroup.com/?l=xalan-dev&m=107593381313807&w=2

Michael,

Thanks very much for this information, I wasn't aware of this switch
at all. I'd originally written my own serializer, and converted to
using the Xerces serializer last fall. I'll look into converting
over to Xalan's code. Hopefully it provides the necessary controls
that are there in the Xerces serializer. Xalan is already part of my
code base since I'm also using Xindice, so at least I won't be
increasing my jar library by 3MB just for the serializer. That extra
size would be unfortunate for anyone who only needed serialization.

Thanks very much for your help,

Murray



Re: CDATA section behaviour in XHTML serializer

Posted by Murray Altheim <m....@open.ac.uk>.
Michael,

Well, having investigated via trial implementation for the past few
days, I frankly can't say I'm very pleased with the changes. I went
from using my own code for serialization (a single class) to using
the original Xerces serializer (not too complicated), then to the
latest Xerces serializer (which now requires the whole factory
thing whilst not really providing any additional functionality,
just more complexity). I then switched over to the Xalan code, where
I've lost some features and found that it doesn't actually work:
I'm getting some NoSuchMethodErrors, even though I think all my jars
are completely up to date. And now, converting over to the JAXP
Transformation API, I must download a 48MB shell script from Sun
just to obtain the jaxp-api.jar file (27K). On top of that, I can no
longer freely distribute my entire application, since I'm not allowed
to redistribute the 27K JAXP file (an API!). My target users are
non-technical, so I can barely expect them to get the latest version
of Java, much less download and install the JWSDP.

Having been a happy employee of Sun, I don't have any anti-Sun bias,
but we've gone from the simple idea of serializing a DOM tree to
an amazingly baroque solution that is no longer open source and
requires a *huge* download. We may have gained some interoperability
in the process, but I can't call that progress. I much prefer the
Xerces solution, even if I have to hack it to get it to do what I
want for CDATA sections. Too bad it was dumped in favour of Xalan.
And even if Xalan comes to fully support all of the Xerces
features, I still think moving to JAXP is a mistake. Even if I
have to rely on "proprietary" Apache code (I'm doing it already
just about everywhere), I prefer that over requiring the JWSDP.

Not to sound *too* negative, I'm interested in hearing more about
why this was considered a good idea. I read over the thread you
provided, but it didn't deal with the issues of complexity or
of the JWSDP requirement. Perhaps that's considered a given. If
it is, I hope that the Apache implementations stop deprecating
themselves unless the alternative is clearly better.

Murray



Re: CDATA section behaviour in XHTML serializer

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hello Murray,

The code you cited appears to be over four years old [1], so this doesn't 
sound like new behaviour. Regardless of when this was introduced, it isn't 
likely to change now. Both the HTMLSerializer and XHTMLSerializer were 
deprecated in Xerces 2.6.2. We've been encouraging users to migrate their 
code to use the standard JAXP Transformation API. If you're interested 
in the future of Xerces serializers, see this post [2] to xalan-dev from 
February of this year.

[1] 
http://cvs.apache.org/viewcvs.cgi/xml-xerces/java/src/org/apache/xml/serialize/HTMLSerializer.java?rev=1.10&view=markup
[2] http://marc.theaimsgroup.com/?l=xalan-dev&m=107593381313807&w=2
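
For readers following the migration advice, here is a minimal sketch of
serializing a DOM tree through the JAXP Transformation API with an
identity transform. The class JaxpSerializeSketch and its two methods
are hypothetical; the javax.xml.transform calls and the
OutputKeys.CDATA_SECTION_ELEMENTS property are standard JAXP. The
example assumes elements in no namespace; for namespaced XHTML elements
the property is expected to take names in the "{namespace-uri}local"
form, which is not shown here. It is not claimed to reproduce the
XHTMLSerializer output in other respects.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; only the JAXP and DOM calls are standard.
final class JaxpSerializeSketch {

    private JaxpSerializeSketch() {}

    /** Parses a small XML string into a DOM Document (demo input only). */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /**
     * Serializes a DOM tree. cdataElements is a whitespace-separated list
     * of element names whose text is written as CDATA sections, or null
     * to let the serializer escape '<' and '&' instead.
     */
    static String serialize(Document doc, String cdataElements) throws Exception {
        // newTransformer() without a stylesheet yields an identity transform.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        if (cdataElements != null) {
            transformer.setOutputProperty(
                    OutputKeys.CDATA_SECTION_ELEMENTS, cdataElements);
        }
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><head><script>if (a &lt; b) run();</script></head></html>");
        System.out.println(serialize(doc, null));           // escaped text
        System.out.println(serialize(doc, "script style")); // CDATA sections
    }
}
```

Here the CDATA behaviour is opt-in per element name rather than
hardwired, which is the kind of control asked for in the original
message.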


Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

