Posted to c-dev@xerces.apache.org by Saw <sa...@hot.ee> on 2004/04/14 21:09:03 UTC

XERCES - SAXParser: newlines in CDATA sections

hi folks.
 
i'm using xerces 2.1 (sax parser) for XML data parsing.
 
one of the tags in my source XML contains a CDATA section, with a chunk of
text inside that has CRLF newlines.
 
the problem is simple: the parser returns the content of CDATA with all CRLF
newlines converted to LF.
is this behaviour by design ? i thought the content of CDATA should be
untouched by an XML parser.
 
is it a bug, or are there some switches to control this behaviour of the
parser ?
 
this behaviour of the parser ruins newlines, i am unable to restore the
original content of CDATA - it is unknown to me whether there was a CRLF or
simply LF initially - they both are LF after parsing.
 
thx.

RE: XERCES - SAXParser: newlines in CDATA sections

Posted by Saw <sa...@hot.ee>.

Thanks everyone.

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


RE: XERCES - SAXParser: newlines in CDATA sections

Posted by Dean Roddey <dr...@charmedquark.com>.

He is correct. The normalization of new lines happens at a low level within
the parser (by design and by the standard), so the stream of characters
that comes into the parser is already normalized. That means even CDATA
sections are affected.

The reasoning is that new lines have different forms on different platforms,
and the normalization allows them to be treated consistently by XML parsers
and by the programs that pull data out of the parser after parsing.

If you want these things to come through the parser, you must encode them in
some way that will be ignored by the parser, before feeding the data to the
parser, and then you can expand them (or contract them according to how you
look at it) on the other side when you pull the data out of the parser for
your own use.
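
A minimal sketch of the encode-then-expand idea described above, assuming a
made-up backslash escape scheme. The names encodeCarriageReturns and
decodeCarriageReturns are hypothetical and are not part of Xerces or of XML;
any reversible encoding that leaves no literal CR in the input would serve
the same purpose.

```cpp
#include <cstddef>
#include <string>

// Hypothetical escape scheme (an assumption, not a Xerces or XML feature):
// a backslash introduces an escape, so no literal CR reaches the parser.
std::string encodeCarriageReturns(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (c == '\\') {
            out += "\\\\";   // literal backslash -> two backslashes
        } else if (c == '\r') {
            out += "\\r";    // CR -> backslash followed by 'r'
        } else {
            out += c;        // everything else, including LF, is unchanged
        }
    }
    return out;
}

// Reverses encodeCarriageReturns on the text handed back by the parser.
std::string decodeCarriageReturns(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            const char next = in[++i];
            // "\r" restores a CR; any other escaped char (the backslash)
            // is emitted as itself.
            out += (next == 'r') ? '\r' : next;
        } else {
            out += in[i];
        }
    }
    return out;
}
```

The producer would call encodeCarriageReturns before writing the text into
the CDATA section, and the consumer would call decodeCarriageReturns on the
characters delivered by the SAX handler.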

Anyway, blame the creators of XML if you have a problem with it, since they
defined how and when normalization is done.

-------------------------------------
Dean Roddey
The Charmed Quark Controller
droddey@charmedquark.com
www.charmedquark.com
 



RE: XERCES - SAXParser: newlines in CDATA sections

Posted by da...@us.ibm.com.

Well, then you didn't read carefully enough:

http://www.w3.org/TR/REC-xml/#sec-line-ends

   "To simplify the tasks of applications, the XML processor MUST behave as
   if it normalized all line breaks in external parsed entities (including
   the document entity) on input, before parsing, by translating both the
   two-character sequence #xD #xA and any #xD that is not followed by #xA
   to a single #xA character."

Note particularly the two words "before parsing".  That means the parser
must behave as if the original CR characters were never there.
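
As an illustration only, the quoted rule can be sketched as a small
standalone function. The name normalizeLineEnds is hypothetical; this is not
how Xerces implements it, just the observable effect the recommendation
describes, applied to the raw input before any markup is recognized.

```cpp
#include <cstddef>
#include <string>

// Sketch of the quoted line-end rule: the pair CR LF and any CR not
// followed by LF each become a single LF. It runs over the raw text,
// so it cannot tell whether a CR sits inside a CDATA section.
std::string normalizeLineEnds(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r') {
            // Skip the LF of a CR LF pair so the pair yields one LF.
            if (i + 1 < in.size() && in[i + 1] == '\n') {
                ++i;
            }
            out += '\n';
        } else {
            out += in[i];
        }
    }
    return out;
}
```

Because the function sees only characters, a CR inside "<![CDATA[" ... "]]>"
is translated exactly like one anywhere else in the document.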

http://www.w3.org/TR/REC-xml/#sec-cdata-sect

   "[Definition: CDATA sections MAY occur anywhere character data may
   occur; they are used to escape blocks of text containing characters
   which would otherwise be recognized as markup. CDATA sections begin with
   the string "<![CDATA[" and end with the string "]]>":]"

Since the first part makes clear the characters you want to preserve are
normalized _before_ the parser sees them, it's impossible for a CDATA
section to preserve them.  In addition, isn't the definition of what a
CDATA section is pretty clear?  CR and NL are not markup characters, such
as '&' and '<', etc., so why did you think a CDATA section would preserve
end-of-line characters?  If you want to understand what markup is, then
read:

   http://www.w3.org/TR/REC-xml/#dt-markup

Dave




RE: XERCES - SAXParser: newlines in CDATA sections

Posted by Saw <sa...@hot.ee>.

"CDATA sections don't have any effect on normalization:
http://www.w3.org/TR/REC-xml/#sec-cdata-sect"

- I don't see how this follows from the section u've given. That was the
question in the first place, and 2.7 does not answer it, neither did u.




RE: XERCES - SAXParser: newlines in CDATA sections

Posted by da...@us.ibm.com.

> hmm, can someone try this out ? the issue kind of frustrates my
> progress...

If you're so frustrated, there are lots of XML resources out on the web.
How about just reading the XML recommendation?

   http://www.w3.org/TR/REC-xml

> one of the tags in my source XML contains a CDATA section, with a chunk
> of text inside that has CRLF newlines.
>
> the problem is simple: the parser returns the content of CDATA with all
> CRLF newlines converted to LF.
> is this behaviour by design ? i thought the content of CDATA should be
> untouched by an XML parser.

Yes, it's by design:

   http://www.w3.org/TR/REC-xml/#sec-line-ends

CDATA sections don't have any effect on normalization:

   http://www.w3.org/TR/REC-xml/#sec-cdata-sect

If you really want CR/LF pairs in content, use numeric character
references:

   &#xD;&#xA;

The parser will not normalize numeric character references.
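
A rough sketch of how a producer might emit such references, assuming the
text is written as ordinary element content rather than inside a CDATA
section (character references are not recognized inside CDATA). The function
name escapeElementText is hypothetical. Only the CR is written as a
reference here; a bare LF is left literal, since line-end normalization does
not change it.

```cpp
#include <string>

// Escapes text for use as plain element content (NOT for CDATA sections).
// Markup characters become entity references and each CR becomes &#xD;,
// which the parser expands after line-end normalization has already run.
std::string escapeElementText(const std::string& in) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;"; break;
            case '<':  out += "&lt;";  break;
            case '>':  out += "&gt;";  break;
            case '\r': out += "&#xD;"; break;  // survives as a real CR
            default:   out += c;       break;
        }
    }
    return out;
}
```

With this approach the CDATA wrapper is dropped entirely, which is why the
markup characters have to be escaped as well.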

Dave



RE: XERCES - SAXParser: newlines in CDATA sections

Posted by Saw <sa...@hot.ee>.

hmm, can someone try this out ? the issue kind of frustrates my progress...
thx.
