Posted to j-users@xerces.apache.org by Radhakrishnan J <ra...@tavant.com> on 2004/11/22 17:49:01 UTC

SAX Parse error while parsing CDATA element

Hi,

We are using XmlBeans and in some places need to generate documents that are
DTD-conformant. Sometimes another XML document needs to be embedded within
these documents. I tried to use a SAX parser (Xerces v2.2.0) to process a
test XML string generated by XmlBeans. The following is the XML (as seen in
Java code):

  "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
  "<some-content some-attribute=\"some-attribute-value\">" +
  "<!CDATA[some-content-value is: Tom &Jerry is my favorite cartoon show !]]>" +
  "</some-content>";

XmlBeans generates the following character stream for this (set as the
content of the *GenericString* element):

<xb:GenericString xmlns:xb="http://tavant/test/xmlbeans">
&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;some-content
some-attribute="some-attribute-value">&lt;!CDATA[some-content-value is: Tom
&amp;Jerry ]]>
&lt;/some-content>
</xb:GenericString>

I wrote a simple SAX handler that just prints what it sees (I'm new to this
parsing business). The following is the output:

class org.apache.xerces.jaxp.SAXParserFactoryImpl
class org.apache.xerces.jaxp.SAXParserImpl
setDocumentLocator(org.apache.xerces.parsers.AbstractSAXParser$LocatorProxy@14c1103)
startDocument()
startElement(,,xb:GenericString,org.apache.xerces.parsers.AbstractSAXParser$AttributesProxy@1592174)
characters(char[],int,int): <
characters(char[],int,int): <?xm
characters(char[],int,int): <?xml version="1.0" encoding="UTF-8"?>
characters(char[],int,int): <?xml version="1.0" encoding="UTF-8"?><
characters(char[],int,int): <?xml version="1.0"
encoding="UTF-8"?><some-content some-attribute="some-attribute-value">
characters(char[],int,int): <?xml version="1.0"
encoding="UTF-8"?><some-content some-attribute="some-attribute-value"><
characters(char[],int,int): <?xml version="1.0"
encoding="UTF-8"?><some-content
some-attribute="some-attribute-value"><!CDATA[some-content-value is: Tom 
characters(char[],int,int): <?xml version="1.0"
encoding="UTF-8"?><some-content
some-attribute="some-attribute-value"><!CDATA[some-content-value is: Tom &
characters(char[],int,int): <?xml version="1.0"
encoding="UTF-8"?><some-content
some-attribute="some-attribute-value"><!CDATA[some-content-value is: Tom
&Jerry is my favorite cartoo show !
org.xml.sax.SAXParseException: The character sequence "]]>" must not appear in content unless used to mark the end of a CDATA section.
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
	at tavant.xmlbeans.test.TestMarkupCharactersAsContent.testGenerateXml(TestMarkupCharactersAsContent.java:70)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:324)
	at junit.framework.TestCase.runTest(TestCase.java:154)
	at junit.framework.TestCase.runBare(TestCase.java:127)
	at junit.framework.TestResult$1.protect(TestResult.java:106)
	at junit.framework.TestResult.runProtected(TestResult.java:124)
	at junit.framework.TestResult.run(TestResult.java:109)
	at junit.framework.TestCase.run(TestCase.java:118)
	at junit.framework.TestSuite.runTest(TestSuite.java:208)
	at junit.framework.TestSuite.run(TestSuite.java:203)
	at junit.textui.TestRunner.doRun(TestRunner.java:116)
	at junit.textui.TestRunner.start(TestRunner.java:172)
	at com.intellij.rt.execution.junit.TextTestRunner.main(TextTestRunner.java:12)

I don't understand what the problem is here. I tried to search the archives,
but the search was too slow to respond. Apologies if this is an oft-repeated
query.
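
For reference, a minimal sketch of a handler that "just prints what it sees"
might look like the following. The handler used in the post is not shown, so
this is not a reconstruction of it; the class name PrintingHandler and its
parse helper are made up for illustration. The point it tries to show is that
each characters() call only owns the slice ch[start .. start+length), and that
a parser is free to deliver one run of text in several calls. The growing
lines in the output above would be consistent with printing more than that
slice, though that is only a guess.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not taken from the original post.
class PrintingHandler extends DefaultHandler {
    // Event descriptions are collected so they can be inspected as well as printed.
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement(" + qName + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement(" + qName + ")");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) belongs to this event;
        // the rest of the array is the parser's internal buffer.
        events.add("characters(" + new String(ch, start, length) + ")");
    }

    // Parses an XML string and returns the recorded events.
    static List<String> parse(String xml) throws Exception {
        PrintingHandler handler = new PrintingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : parse("<a>Tom &amp; Jerry</a>")) {
            System.out.println(event);
        }
    }
}
```

The number of characters() events for a given text run is up to the parser,
so code that needs the whole text should append the slices to a buffer.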

Thanks,
Radhakrishnan J
Infrastructure Team
phone: +91-80-51190367
e-mail: radhakrishnan.j@tavant.com

http://kwiki-infra.tavant.com/kwiki/



---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: SAX Parse error while parsing CDATA element

Posted by Andy Clark <an...@cyberneko.net>.
Bob Foster wrote:
> A CDATA section can be (mostly) included as CDATA by breaking up the
> ]]> delimiter. For example,
> 
> <![CDATA[text & markup]]>
> 
> Can be wrapped as,
> 
> <![CDATA[<![CDATA[text & markup]]]>]>
> 
> with characters ]> outside any CDATA section as plain text. This 
> technique will work for any sequence of mixed text and markup by 
> escaping out of and back into CDATA sections as many times as
> necessary.

Interesting trick. For those who are interested, here are
the SAX callbacks that Xerces generates for the above example:

   startCDATA()
    characters(text="<![CDATA[text & markup")
    characters(text="]")
   endCDATA()
   characters(text="]")
   characters(text=">")

Simply ignoring the CDATA section boundaries produces a set
of characters that represents the CDATA section as text. Nice.
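
A minimal sketch of how those callbacks can be observed through JAXP,
assuming the parser supports the standard SAX lexical-handler property (the
class name CdataEvents and the <doc> wrapper element are made up for
illustration). It records the CDATA boundaries separately and appends every
characters() slice to one buffer, which is the "ignore the boundaries" reading
described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical example class; DefaultHandler2 implements both
// ContentHandler and LexicalHandler with empty methods.
class CdataEvents extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();
    final StringBuilder text = new StringBuilder();

    @Override
    public void startCDATA() {
        events.add("startCDATA");
    }

    @Override
    public void endCDATA() {
        events.add("endCDATA");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is appended regardless of whether it came from a CDATA section.
        text.append(ch, start, length);
    }

    static CdataEvents parse(String xml) throws Exception {
        CdataEvents handler = new CdataEvents();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Without this property the parser does not report CDATA boundaries.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        CdataEvents h = parse("<doc><![CDATA[<![CDATA[text & markup]]]>]></doc>");
        System.out.println(h.events);
        System.out.println(h.text);
    }
}
```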

However, care still needs to be taken when encoding the embedded
content. Using SAX or DOM, you can detect CDATA section boundaries
and properly encode them. Dumping the contents of an XML file
directly into a CDATA section, however, can produce an ill-formed
document.

And simply searching for the text "]]>" within the stream that
is being embedded is not easy due to character encoding issues,
which means that you should use SAX or DOM to parse the embedded
document in order to properly encode its contents.

All of this work can be avoided, of course, if the document to
be embedded is generated and you can guarantee that it does not
contain the "]]>" sequence.
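
A different route, sketched here under assumptions, is to skip CDATA sections
altogether: put the embedded document into a DOM text node as an already
decoded String and let the serializer escape the markup characters. This is
not the SAX/DOM re-encoding described above, just a simpler alternative when
the embedded document is available as a String. The class name EmbedAsText and
the element name "wrapper" are made up for illustration, and the exact
escaping in the serialized output depends on the JAXP implementation in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class.
class EmbedAsText {
    // Builds <wrapperName>...</wrapperName> with the embedded XML as plain text.
    static String embed(String wrapperName, String embeddedXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement(wrapperName);
        // The text node holds the raw string; escaping happens at serialization time.
        root.appendChild(doc.createTextNode(embeddedXml));
        doc.appendChild(root);
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Parses the outer document and returns the embedded string unchanged.
    static String extract(String outerXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(outerXml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String inner = "<a><![CDATA[x & y]]></a>";
        String outer = embed("wrapper", inner);
        System.out.println(outer);
        System.out.println(extract(outer).equals(inner));
    }
}
```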

-- 
Andy Clark * andyc@apache.org



Re: SAX Parse error while parsing CDATA element

Posted by Bob Foster <bo...@objfac.com>.
A CDATA section can be (mostly) included as CDATA by breaking up the ]]> 
delimiter. For example,

<![CDATA[text & markup]]>

can be wrapped as

<![CDATA[<![CDATA[text & markup]]]>]>

with characters ]> outside any CDATA section as plain text. This 
technique will work for any sequence of mixed text and markup by 
escaping out of and back into CDATA sections as many times as necessary.
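
A minimal sketch of this idea in Java, assuming the text to wrap is already a
decoded String (so the encoding concerns around searching a byte stream do not
apply). It uses a common variant of the trick: every "]]>" is split so that
"]]" ends one CDATA section and ">" starts the next, instead of leaving "]>"
outside as plain text. The class and method names are made up for
illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class.
class CdataWrap {
    // Wraps arbitrary text in CDATA, splitting every "]]>" across two sections.
    // Characters that are not legal in XML at all are not handled here.
    static String wrap(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // Parses a document and returns the text content of its root element.
    static String unwrap(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String original = "<![CDATA[text & markup]]>";
        String xml = "<doc>" + wrap(original) + "</doc>";
        System.out.println(xml);
        System.out.println(unwrap(xml).equals(original));
    }
}
```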

Bob Foster





Re: SAX Parse error while parsing CDATA element

Posted by Andy Clark <an...@cyberneko.net>.
Michael Glavassevich wrote:
> The sequence "]]>" [1] cannot appear in character data. You should 
> escape the ">" with &gt;.

Actually, inside a CDATA section escaping it wouldn't do any good,
because entity references are not expanded there and the literal
string "&gt;" would be passed to the application, which is probably
not intended. In short, CDATA sections shouldn't be used if you need
to include the string "]]>" in your XML document.
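
The difference can be checked with a small sketch (the class name
CdataEntityDemo and the <r> element are made up for illustration): the same
"&gt;" reference is expanded in ordinary character data but kept literally
inside a CDATA section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class.
class CdataEntityDemo {
    // Returns the text content of the root element of the given document.
    static String textOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Ordinary character data: the reference is expanded to ">".
        System.out.println(textOf("<r>]]&gt;</r>"));              // prints ]]>
        // CDATA section: the five characters "&gt;" are passed through as-is.
        System.out.println(textOf("<r><![CDATA[]]&gt;]]></r>"));  // prints ]]&gt;
    }
}
```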

But Jeff Greif has the keen eye on this one. He correctly
noticed that the start delimiter of the CDATA section is wrong.

However, Radhakrishnan J should take special care if he is
going to use CDATA sections to embed documents inside of the
main XML document. If the embedded document contains a CDATA
section (or even just the text "]]>"), then it will produce
a document that is not well-formed.

> [1] http://www.w3.org/TR/2004/REC-xml-20040204/#NT-CharData
-- 
Andy Clark * andyc@apache.org



Re: SAX Parse error while parsing CDATA element

Posted by Jeff Greif <jg...@alumni.princeton.edu>.
I think there's a left square bracket missing in the CDATA start delimiter.  It should be
<![CDATA[ .... ]]>
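
A quick way to check this is a well-formedness test on a shortened version of
the test string, sketched below. The class name CdataDelimiterCheck and its
helper are made up for illustration, and the strings are abbreviated from the
ones in the original post.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class.
class CdataDelimiterCheck {
    // Returns true if the parser accepts the document, false on a parse error.
    static boolean isWellFormed(String xml) throws Exception {
        try {
            // DefaultHandler rethrows fatal errors, so ill-formed input ends up here.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Start delimiter as posted: "<!CDATA[" (missing the first "[").
        String asPosted = "<some-content><!CDATA[Tom &Jerry]]></some-content>";
        // Corrected start delimiter: "<![CDATA[".
        String corrected = "<some-content><![CDATA[Tom &Jerry]]></some-content>";
        System.out.println(isWellFormed(asPosted));   // prints false
        System.out.println(isWellFormed(corrected));  // prints true
    }
}
```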

Jeff



Re: SAX Parse error while parsing CDATA element

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
The sequence "]]>" [1] cannot appear in character data. You should escape 
the ">" with &gt;.

[1] http://www.w3.org/TR/2004/REC-xml-20040204/#NT-CharData
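
As a minimal sketch of that advice, a text escaper that always replaces ">"
(along with "&" and "<") can never emit "]]>" in character data. The class
name TextEscaper and its methods are made up for illustration; this is not the
escaping code used by XmlBeans or Xerces.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class.
class TextEscaper {
    // Escapes the markup characters for use in element content (not attributes).
    static String escapeText(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break; // keeps "]]>" out of the output
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Returns true if the parser accepts the document, false on a parse error.
    static boolean isWellFormed(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "a]]>b";
        System.out.println(isWellFormed("<r>" + text + "</r>"));              // prints false
        System.out.println(isWellFormed("<r>" + escapeText(text) + "</r>"));  // prints true
    }
}
```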


Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org