Posted to dev@cocoon.apache.org by Andreas Hartmann <an...@apache.org> on 2009/01/21 13:21:35 UTC

Entity escaping in o.a.c.c.serializers.XHTMLSerializer

Hi Cocoon devs,

This issue has already been discussed several times, e.g. [1], but AFAIK
has not been resolved yet.

The XHTMLSerializer, or, more specifically, the XHTMLEncoder, from the
serializers block in Cocoon 2.1.x replaces every character that has a
corresponding HTML 4.0 character entity reference with that entity
reference. This causes issues with inline JavaScript, since e.g. double
quotes are transformed to &quot;, which causes a JavaScript parsing
error. Another minor negative effect is the increased document size.
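
To illustrate the effect described above, here is a minimal Java sketch
of a table-driven encoder. It is not the Cocoon code: the class
EntityEscapingDemo, its ENTITIES table and the encode method are made up
for this example, and the table holds only a handful of the HTML 4.0
entities.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a table-driven encoder; NOT the Cocoon class.
class EntityEscapingDemo {

    // Small excerpt of the HTML 4.0 entity set, for illustration only.
    private static final Map<Character, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put('"', "&quot;");
        ENTITIES.put('&', "&amp;");
        ENTITIES.put('<', "&lt;");
        ENTITIES.put('>', "&gt;");
        ENTITIES.put('\u00A0', "&nbsp;");
    }

    // Replaces every character that has an entry in the table
    // and copies all other characters unchanged.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            String entity = ENTITIES.get(c);
            out.append(entity != null ? entity : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Inline JavaScript as it might appear in the text of a script element.
        String script = "alert(\"hello\");";
        System.out.println(encode(script)); // prints alert(&quot;hello&quot;);
    }
}
```

As far as I know, a browser that parses the page as HTML does not decode
character references inside a script element, so the JavaScript engine
receives the entity text instead of the quotes.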

If I understand the W3C correctly, see e.g. [2], the recommended 
approach is to use the character set of the encoding as far as possible,
and use escapes only in exceptional circumstances. I didn't find a 
reason why the XHTMLSerializer uses escapes, but I suspect that it is 
related to browser compatibility issues.

Do you think it would make sense to make this behaviour configurable,
e.g. as follows?

   <use-entities>true|false</use-entities>
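
A rough Java sketch of how such a switch could behave, purely as an
assumption: the class ConfigurableEncoderSketch and its useEntities flag
are hypothetical and do not exist in Cocoon. The sketch handles text
content only; inside attribute values the quote character would still
have to be escaped, and unescaped non-ASCII output assumes the output
encoding can represent the character.

```java
// Hypothetical sketch; the class name and the useEntities flag are
// assumptions made for this example, not part of Cocoon.
class ConfigurableEncoderSketch {

    private final boolean useEntities;

    ConfigurableEncoderSketch(boolean useEntities) {
        this.useEntities = useEntities;
    }

    // Encodes text content (not attribute values).
    String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                // Markup-significant characters are escaped in both modes.
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                // Optional escapes: only emitted when useEntities is true.
                case '"': out.append(useEntities ? "&quot;" : "\""); break;
                case '\u00E9': out.append(useEntities ? "&eacute;" : "\u00E9"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

With useEntities set to false, only the characters that are significant
for the markup itself are escaped, which matches the W3C advice quoted
above as I understand it.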

Does the XHTMLSerializer in Cocoon 2.2 show a different behaviour?

TIA for any comments!

-- Andreas


[1] http://www.nabble.com/Problem-with-XHTMLSerializers-to1311360.html#a1311360
[2] http://www.w3.org/International/tutorials/tutorial-char-enc/


-- 
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01


Re: Entity escaping in o.a.c.c.serializers.XHTMLSerializer

Posted by Andreas Hartmann <an...@apache.org>.
Andreas Hartmann wrote:
> Andreas Hartmann wrote:
>> The XHTMLSerializer, or, more specifically, the XHTMLEncoder, from the
>> serializers block in Cocoon 2.1.x escapes all characters with a 
>> corresponding HTML 4.0 character entity reference into this entity 
>> reference. This causes issues with inline JavaScript, since e.g. the 
>> double quotes are transformed to &quot; which causes a JavaScript 
>> parsing error. Another minor negative effect is the increased document 
>> size.
> 
> I just tried to use the XMLEncoder instead, but it also inserts the 
> entities &quot; and &apos;. Is there a particular reason for this 
> behaviour?

I filed a bug and attached a patch:

   https://issues.apache.org/jira/browse/COCOON-2249

Do you think that it makes sense to apply the patch? If not, is there a 
better solution (apart from using the HTMLSerializer instead, which 
means that the XHTMLSerializer is considered unusable – IMO this would 
send a rather negative signal)?

-- Andreas






Re: Entity escaping in o.a.c.c.serializers.XHTMLSerializer

Posted by Andreas Hartmann <an...@apache.org>.
Andreas Hartmann wrote:
> The XHTMLSerializer, or, more specifically, the XHTMLEncoder, from the
> serializers block in Cocoon 2.1.x escapes all characters with a 
> corresponding HTML 4.0 character entity reference into this entity 
> reference. This causes issues with inline JavaScript, since e.g. the 
> double quotes are transformed to &quot; which causes a JavaScript 
> parsing error. Another minor negative effect is the increased document 
> size.

I just tried to use the XMLEncoder instead, but it also inserts the 
entities &quot; and &apos;. Is there a particular reason for this behaviour?

-- Andreas




Re: Entity escaping in o.a.c.c.serializers.XHTMLSerializer

Posted by Antonio Gallardo <ag...@agssa.net>.
Andreas Hartmann wrote:
> Unfortunately it doesn't work for me. The XHTML source contains the
> NCR for the ' character which also causes a JavaScript error.
>
> To make it work, it would have to look like this:
>
>     private static final char ENCODINGS[][][] = {
>         { { 34 } , "\"".toCharArray() },
>         { { 39 } , "'".toCharArray() },
>
> But this contradicts the very purpose of the XHTMLEncoder, doesn't it?

You are right, it contradicts the very purpose of the encoding. However,
it looks like a valid pragmatic solution.

On second review, it looks like there is a better way to handle this
issue. Please check how the HTMLEncoder avoids encoding ' (&apos;) and
avoids passing the encoding control to its superclass XHTMLEncoder.

We should move this code directly into XHTMLEncoder and, in a similar
way, add an exception for " (&quot;).
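
The override-and-delegate pattern referred to above could look roughly
like the following Java sketch. All names here (TableEncoderSketch,
QuoteExceptionEncoderSketch, encode, encodeAll) are hypothetical
stand-ins, and the real HTMLEncoder and XHTMLEncoder signatures may
differ; the sketch only shows the idea of special-casing two characters
before falling back to the table.

```java
// Hypothetical stand-in for a table-based encoder; names and
// signatures are assumptions, not the Cocoon API.
class TableEncoderSketch {

    private static final char[] QUOT = "&quot;".toCharArray();
    private static final char[] APOS = "&#39;".toCharArray();
    private static final char[] NBSP = "&nbsp;".toCharArray();

    // Returns the replacement for c, or null if c needs no encoding.
    protected char[] encode(char c) {
        switch (c) {
            case '"': return QUOT;
            case '\'': return APOS;
            case '\u00A0': return NBSP;
            default: return null;
        }
    }

    // Applies encode(char) to every character of the text.
    String encodeAll(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            char[] replacement = encode(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}

// Special-cases the two quote characters and delegates everything
// else to the table lookup in the superclass.
class QuoteExceptionEncoderSketch extends TableEncoderSketch {

    private static final char[] QUOTE = { '"' };
    private static final char[] APOSTROPHE = { '\'' };

    @Override
    protected char[] encode(char c) {
        if (c == '"') return QUOTE;
        if (c == '\'') return APOSTROPHE;
        return super.encode(c);
    }
}
```

Moving the two exceptions into the base encoder itself, as proposed
here, would amount to putting the same checks at the top of its own
encode method.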

WDYT?

Best Regards,

Antonio Gallardo.


Re: Entity escaping in o.a.c.c.serializers.XHTMLSerializer

Posted by Andreas Hartmann <an...@apache.org>.
Hi Antonio,

thanks for your reply!

Antonio Gallardo wrote:
> We hit the same issue some years ago and we found a more pragmatic solution:
> 
> In org.apache.cocoon.components.serializers.encoding.XHTMLEncoder add
> the line marked with a + sign:
> 
> 
>     private static final char ENCODINGS[][][] = {
> +    { { 39 } , "&#39;".toCharArray() },
>        { { 160 } , "&nbsp;".toCharArray() },

Actually this patch is already in the 2.1 branch :)

Unfortunately it doesn't work for me. The XHTML source contains the NCR 
for the ' character which also causes a JavaScript error.

To make it work, it would have to look like this:

     private static final char ENCODINGS[][][] = {
         { { 34 } , "\"".toCharArray() },
         { { 39 } , "'".toCharArray() },

But this contradicts the very purpose of the XHTMLEncoder, doesn't it?

-- Andreas






Re: Entity escaping in o.a.c.c.serializers.XHTMLSerializer

Posted by Antonio Gallardo <ag...@agssa.net>.
Hi Andreas,

We hit the same issue some years ago and we found a more pragmatic solution:

In org.apache.cocoon.components.serializers.encoding.XHTMLEncoder add
the line marked with a + sign:


    private static final char ENCODINGS[][][] = {
+    { { 39 } , "&#39;".toCharArray() },
       { { 160 } , "&nbsp;".toCharArray() },


See:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Entities_representing_special_characters_in_XHTML

Please let me know if this fixes the issue; I will gladly commit the fix.

Best Regards,

Antonio Gallardo.

