Posted to j-dev@xerces.apache.org by "Kevin J. Mitchell" <ke...@xmls.com> on 2000/06/26 16:08:06 UTC

CR replacement in CDATA section?

Xerces 1.0.3 appears to be changing CR and CR/LF to LF characters WITHIN a
CDATA section. Should it be doing this? I thought all content within a CDATA
section should be unaltered; in this case, unaffected by the processing laid
out in Section 2.11 of the XML spec. I tried using xml:space="preserve" and
got the same effect. Is there a way to instruct Xerces to leave CDATA
content completely untouched??


[PATCH] BaseMarkupSerializer interpreting newlines (white space) with preserveSpace on

Posted by md...@home.com, md...@home.com.
*** BaseMarkupSerializer.java.orig	Mon Jun 26 19:46:36 2000
--- BaseMarkupSerializer.java	Mon Jun 26 19:47:58 2000
***************
*** 1170,1185 ****
          char ch;
          
          if ( preserveSpace ) {
!             // Preserving spaces: the text must print exactly as it is,
!             // without breaking when spaces appear in the text and without
!             // consolidating spaces. If a line terminator is used, a line
!             // break will occur.
              while ( length-- > 0 ) {
                  ch = chars[ start ];
                  ++start;
!                 if ( ch == '\n' || ch == '\r' )
!                     _printer.breakLine( true );
!                 else if ( unescaped )
                      _printer.printText( ch );
                  else
                      printEscaped( ch );
--- 1170,1182 ----
          char ch;
          
          if ( preserveSpace ) {
!             // Preserving white space: the text must print exactly as it is,
!             // without breaking when white space appears in the text and
!             // without consolidating white space. This includes linebreaks.
              while ( length-- > 0 ) {
                  ch = chars[ start ];
                  ++start;
!                 if ( unescaped )
                      _printer.printText( ch );
                  else
                      printEscaped( ch );
***************
*** 1210,1224 ****
          char ch;
          
          if ( preserveSpace ) {
!             // Preserving spaces: the text must print exactly as it is,
!             // without breaking when spaces appear in the text and without
!             // consolidating spaces. If a line terminator is used, a line
!             // break will occur.
              for ( index = 0 ; index < text.length() ; ++index ) {
                  ch = text.charAt( index );
!                 if ( ch == '\n' || ch == '\r' )
!                     _printer.breakLine( true );
!                 else if ( unescaped )
                      _printer.printText( ch );
                  else
                      printEscaped( ch );
--- 1207,1218 ----
          char ch;
          
          if ( preserveSpace ) {
!             // Preserving white space: the text must print exactly as it is,
!             // without breaking when white space appears in the text and
!             // without consolidating white space. This includes line breaks.
              for ( index = 0 ; index < text.length() ; ++index ) {
                  ch = text.charAt( index );
!                 if ( unescaped )
                      _printer.printText( ch );
                  else
                      printEscaped( ch );
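
The effect of the patch can be illustrated outside Xerces with a small
stand-alone model. This is only a sketch: the class and method names are
hypothetical, escaping is ignored, and it assumes that
_printer.breakLine( true ) emits one line separator per call and that
_printer.printText( ch ) appends the character unchanged.

```java
// Simplified stand-alone model of the preserveSpace loop touched by the patch.
// Not the Xerces code: all names are hypothetical and escaping is left out.
class PreserveSpaceSketch {

    // Pre-patch logic: every '\n' or '\r' is turned into a line break.
    // Assumption: a line break writes exactly one line separator.
    static String beforePatch(String text, String lineSeparator) {
        StringBuilder out = new StringBuilder();
        for (int index = 0; index < text.length(); ++index) {
            char ch = text.charAt(index);
            if (ch == '\n' || ch == '\r') {
                out.append(lineSeparator);
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    // Post-patch logic: every character, line terminators included, is copied as is.
    static String afterPatch(String text) {
        StringBuilder out = new StringBuilder();
        for (int index = 0; index < text.length(); ++index) {
            out.append(text.charAt(index));
        }
        return out.toString();
    }

    // Makes CR and LF visible when printing results.
    static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String text = "a\r\nb";
        System.out.println(visible(beforePatch(text, "\n"))); // prints a\n\nb
        System.out.println(visible(afterPatch(text)));        // prints a\r\nb
    }
}
```

Under these assumptions a CR/LF pair in preserved text becomes two line
breaks before the patch and passes through untouched after it.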

Re: CR replacement in CDATA section?

Posted by Kevin Regan <ke...@valicert.com>.

On Tue, 27 Jun 2000, Andy Clark wrote:

> "Kevin J. Mitchell" wrote:
> > The behavior I got was actually during parsing, not serialization. 
> > The XML document had a CDATA section with lines delimited by CR/LF 
> > (0xD0xA). After parsing thru DOMParser, the resultant CDATA text
> 
> The XML specification mentions in section 2.11 what needs to be
> done to end-of-line characters #xD#xA and #xD. It does not make
> any special case for the contents of CDATA sections.
> 
> > node had just LF.  Adding xml:space="preserve" to the element 
> > containing the CDATA section made no difference. Thanks for the 
> 
> The "xml:space" attribute is only an indication to the application
> about how it should handle the text contained in the element. It
> does not alter how the parser processes whitespace.
> 
> The XML specification makes it clear that all character data must
> be passed to the application. This includes leading and trailing
> whitespace, even when xml:space="default". If you want to strip
> leading/trailing whitespace, then that must be done in the 
> application.
> 

Yup, this was my interpretation of the spec as well.  I don't think
that CDATA sections need to be treated specially in this respect...

Sincerely,
Kevin Regan
kevinr@valicert.com



Re: CR replacement in CDATA section?

Posted by Andy Clark <an...@apache.org>.
"Kevin J. Mitchell" wrote:
> The behavior I got was actually during parsing, not serialization. 
> The XML document had a CDATA section with lines delimited by CR/LF 
> (0xD0xA). After parsing thru DOMParser, the resultant CDATA text

The XML specification mentions in section 2.11 what needs to be
done to end-of-line characters #xD#xA and #xD. It does not make
any special case for the contents of CDATA sections.

> node had just LF.  Adding xml:space="preserve" to the element 
> containing the CDATA section made no difference. Thanks for the 

The "xml:space" attribute is only an indication to the application
about how it should handle the text contained in the element. It
does not alter how the parser processes whitespace.

The XML specification makes it clear that all character data must
be passed to the application. This includes leading and trailing
whitespace, even when xml:space="default". If you want to strip
leading/trailing whitespace, then that must be done in the 
application.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org
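
The end-of-line handling of section 2.11 referred to above can be sketched
as a pre-pass over the input text. This is an illustration only, not the
Xerces implementation; the class name EolNormalizer and its method are made
up for the example. It maps each #xD#xA pair and each standalone #xD to a
single #xA, with no special case for CDATA sections.

```java
// Illustrative sketch of XML 1.0 section 2.11 line-end normalization.
// EolNormalizer is a hypothetical name, not part of Xerces.
class EolNormalizer {

    // Replaces every CR LF pair and every standalone CR with a single LF.
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == '\r') {
                out.append('\n');
                // Skip the LF of a CR LF pair so the pair yields one LF.
                if (i + 1 < input.length() && input.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "<![CDATA[one\r\ntwo\rthree\nfour]]>";
        String normalized = normalize(raw);
        // Show the line terminators explicitly.
        System.out.println(normalized.replace("\r", "\\r").replace("\n", "\\n"));
        // prints <![CDATA[one\ntwo\nthree\nfour]]>
    }
}
```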

RE: CR replacement in CDATA section?

Posted by "Kevin J. Mitchell" <ke...@xmls.com>.
The behavior I got was actually during parsing, not serialization. The XML
document had a CDATA section with lines delimited by CR/LF (0xD0xA). After
parsing thru DOMParser, the resultant CDATA text node had just LF.  Adding
xml:space="preserve" to the element containing the CDATA section made no
difference. Thanks for the patch though, as I am certain we will have a
similar issue during serialization.

I agree with your interpretation of CR & LF as whitespace, and that the XML
processor (i.e. parser) should leave them untouched when whitespace
preservation is enabled. On the other hand, Section 2.11 says that the
processor must pass LF to the application when it sees CR or CR/LF, and that
this CAN be done by replacement before parsing. I think the ability to
override this via xml:space="preserve" would be a good thing. Can something
like this be done in Xerces?
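
A possible way to check the behavior, and one workaround, is sketched below.
It uses the standard JAXP API (DocumentBuilderFactory) rather than the
Xerces 1.0.3 DOMParser class mentioned in this thread, and the class name
CrPreservationDemo is made up for the example. It relies on the XML 1.0
rule that line-end normalization applies to literal characters in the
input, not to character references, so a CR written as &#13; should reach
the application. Character references are not recognized inside a CDATA
section, so the section is closed and reopened around them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses JAXP, not the Xerces 1.0.3 DOMParser API.
class CrPreservationDemo {

    // Parses the XML string and returns the text content of the root element.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    // Makes CR and LF visible when printing results.
    static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // Literal CR LF inside CDATA: normalized to LF by the parser.
        String literal = "<doc><![CDATA[one\r\ntwo]]></doc>";
        // CR and LF as character references between two CDATA sections.
        String referenced = "<doc><![CDATA[one]]>&#13;&#10;<![CDATA[two]]></doc>";

        System.out.println(visible(textOf(literal)));    // prints one\ntwo
        System.out.println(visible(textOf(referenced))); // prints one\r\ntwo
    }
}
```

The cost of this approach is that the document producer has to write the
references; xml:space="preserve" on its own does not change what the parser
delivers.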

-----Original Message-----
From: mdusseault@home.com [mailto:mdusseault@home.com]
Sent: Monday, June 26, 2000 11:24 PM
To: xerces-j-dev@xml.apache.org
Subject: Re: CR replacement in CDATA section?


On Mon, 26 Jun 2000, you wrote:
> Xerces 1.0.3 appears to be changing CR and CR/LF to LF characters WITHIN a
> CDATA section. Should it be doing this? I thought all content within a CDATA
> section should be unaltered; in this case, unaffected by the processing laid
> out in Section 2.11 of the XML spec. I tried using xml:space="preserve" and
> got the same effect. Is there a way to instruct Xerces to leave CDATA
> content completely untouched??

That's probably the bug I just found in BaseMarkupSerializer.java.
I haven't gotten a response yet from anybody, but I assume
someone will soon.  Which is fine since I've now had time to look
into it a little more and provide a patch file, which you'll find in my
next message.

You can find my last post about that problem (I think!) with the subject line
of "Newline bug in BaseMarkupSerializer + a fix"
It was marked June 22nd.  The fix there should work, if all else fails.

However, I had a little time to look at the spec and think about it.

Re: CR replacement in CDATA section?

Posted by md...@home.com, md...@home.com.
On Mon, 26 Jun 2000, you wrote:
> Xerces 1.0.3 appears to be changing CR and CR/LF to LF characters WITHIN a
> CDATA section. Should it be doing this? I thought all content within a CDATA
> section should be unaltered; in this case, unaffected by the processing laid
> out in Section 2.11 of the XML spec. I tried using xml:space="preserve" and
> got the same effect. Is there a way to instruct Xerces to leave CDATA
> content completely untouched??

That's probably the bug I just found in BaseMarkupSerializer.java.
I haven't gotten a response yet from anybody, but I assume
someone will soon.  Which is fine since I've now had time to look
into it a little more and provide a patch file, which you'll find in my
next message.

You can find my last post about that problem (I think!) with the subject line
of "Newline bug in BaseMarkupSerializer + a fix"
It was marked June 22nd.  The fix there should work, if all else fails.

However, I had a little time to look at the spec and think about it.