Posted to dev@xmlbeans.apache.org by Martin Hamel <ma...@komunide.com> on 2005/03/29 23:01:16 UTC

problem with CDATA handling

Hi,

I have a case of bad XML. It is an envelope document that includes another 
document. The parser expects the enclosed document to be in CDATA. The problem 
is that the second document now includes a third document, which is also 
expected to be in CDATA. 


I create document A with an XMLBean. After converting document A to a string 
with xmlText(), I put it in document B as the text of an element. I then do the 
same with document B by putting it in document C. Everything works well and 
automatically, and it creates CDATA every time it needs to.

        // fragment
        XmlOptions options = new XmlOptions();
        options.setSavePrettyPrint();
        Field field = getAssessmentFields().addNewField();
        field.setFieldName("AssessmentContent");
        field.setFieldValue(answersDocument.xmlText(options));
        ..


The problem is that on the second escaping, the CDATA end marker (]]>) of the 
inner section is escaped to "]]&gt;". The SAX parser that reads all this (Xalan) 
just can't handle it. Also, the specification says that CDATA sections cannot nest.
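
A minimal sketch of the failure described above, using only the JDK's built-in DOM parser rather than XMLBeans or Xalan. The class and method names (CdataLiteralDemo, rootText, isWellFormed) are made up for illustration. It shows that text inside a CDATA section is taken literally, so an inner "]]&gt;" is never turned back into "]]>" and the extracted inner document has an unterminated CDATA section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of XMLBeans.
class CdataLiteralDemo {

    /** Parses xml and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + e.getMessage(), e);
        }
    }

    /** Returns true if xml parses without error. */
    static boolean isWellFormed(String xml) {
        try {
            rootText(xml);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Outer document whose CDATA holds an inner document with an escaped CDATA end.
        String outer = "<c><![CDATA[<b><![CDATA[<a>blabla</a>]]&gt;</b>]]></c>";

        // Entities are not expanded inside CDATA, so "&gt;" comes back as literal text.
        String inner = rootText(outer);
        System.out.println(inner);

        // The extracted inner document never closes its CDATA section.
        System.out.println("inner well-formed: " + isWellFormed(inner));
    }
}
```

The parser may also print its own fatal-error message to standard error when it reaches the unterminated section.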

I would like to make a fix in Saver.java, in entitizeContent(), that would check 
whether the string already contains a CDATA. The method already checks every 
character. If there is a CDATA in there, it would escape the string with &lt; 
etc. instead of wrapping it in CDATA.
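
The idea can be sketched outside of XMLBeans as a small standalone helper. This is only an illustration of the policy described above, not the Saver.java code: the names EscapeChoice, entitize and encodeText are hypothetical, and the size and entity-count heuristic that Saver uses before choosing CDATA is left out.

```java
// Hypothetical helper, not part of XMLBeans.
class EscapeChoice {

    /** Escapes markup characters with entities; '>' is included so "]]>" never survives. */
    static String entitize(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(ch);
            }
        }
        return out.toString();
    }

    /** Wraps text in CDATA unless it already contains CDATA markers; then uses entities. */
    static String encodeText(String text) {
        boolean hasCdata = text.contains("<![CDATA[") || text.contains("]]>");
        return hasCdata ? entitize(text) : "<![CDATA[" + text + "]]>";
    }

    public static void main(String[] args) {
        System.out.println(encodeText("<a>blabla</a>"));
        System.out.println(encodeText("<b><![CDATA[<a>blabla</a>]]></b>"));
    }
}
```

With this policy the first level still gets a CDATA section, while any level that would otherwise nest one falls back to entity escaping.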

Would you be interested in such a patch? How do I submit it?

Thanks

-- 
Martin Hamel
telephone: (418)261-2222
PGP key: 0xA6D61023

Re: problem with CDATA handling

Posted by Martin Hamel <ma...@komunide.com>.

On 31 March 2005 13:49, Tatu Saloranta wrote:
> Martin; I'm not sure if I follow this part: why would
> an XML-parser care if something is enclosed as CDATA,
> vs. it being quoted as normal text segment? After all,
> these are equivalent from XML infoset point of view?
> Usually CDATA is used as a convenience feature when
> humans modify XML -- they are not very useful when
> generating documents by code (due to the problem of
> having to detect embedded ']]>' in there, splitting up
> section into 2 pieces; at least unless one just
> ignores the potential problem).
>
> Or perhaps the downstream code is something simpler than
> an XML parser (perl script)? If so, I can understand
> the requirement.
>

Hi Tatu,

my problem is that my XML is generated by XMLBeans. XMLBeans decides to 
enclose my inner XML in CDATA. If I do that once again, I have a CDATA inside 
a CDATA. 

document A
<a>
blabla
</a>

I put document A in document B
<b><![CDATA[<a>
blabla
</a>
]]>
</b>

I put document B in document C
<c><![CDATA[<b><![CDATA[<a>
blabla
</a>
]]&gt;
</b>
]]>
</c>

Now I take a SAX parser (Xerces) and extract document B from document C. I get:
<b><![CDATA[<a>
blabla
</a>
]]&gt;
</b>

The closing of the CDATA is not good (&gt;). We could say that it is a bug in 
the parser. But since the specification says that CDATA sections cannot nest, 
the parser is not to blame. I would have preferred that XMLBeans not 
encode my document B as CDATA. If things had just been escaped, everything 
would have been fine. That is what the code I submitted does. Nothing 
much :-)
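
To illustrate why plain escaping would have been fine, here is a small round-trip sketch using the JDK's DOM parser. It is not the submitted patch; NestedEscapeRoundTrip, entitize, nest and rootText are hypothetical names. Each level is entity-escaped before being embedded, and each level can be recovered intact.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of XMLBeans.
class NestedEscapeRoundTrip {

    /** Entity-escapes text; '&' must be replaced first so it is not escaped twice. */
    static String entitize(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Embeds innerXml as escaped text inside an element named tag. */
    static String nest(String tag, String innerXml) {
        return "<" + tag + ">" + entitize(innerXml) + "</" + tag + ">";
    }

    /** Parses xml and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String a = "<a>blabla</a>";
        String b = nest("b", a);   // document A embedded in B
        String c = nest("c", b);   // document B embedded in C

        String bOut = rootText(c);     // extract B from C
        String aOut = rootText(bOut);  // extract A from B

        System.out.println(bOut.equals(b) && aOut.equals(a));
    }
}
```

Because every level is escaped the same way, the nesting depth does not matter; the cost is that the embedded document is harder for a human to read than a CDATA section.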

-- 
Martin Hamel
telephone: (418)261-2222
PGP key: 0xA6D61023

Re: problem with CDATA handling

Posted by Tatu Saloranta <co...@yahoo.com>.

--- Martin Hamel <ma...@komunide.com> wrote:

> > I have a case of bad xml. It is an envelope document that includes
> > another document. The parser expect the enclosed document to be in
> > CDATA. The problem is that the second document now include a third
> > document which is also expected to be a CDATA.

Martin; I'm not sure if I follow this part: why would
an XML-parser care if something is enclosed as CDATA,
vs. it being quoted as normal text segment? After all,
these are equivalent from XML infoset point of view?
Usually CDATA is used as a convenience feature when
humans modify XML -- they are not very useful when
generating documents by code (due to the problem of
having to detect embedded ']]>' in there, splitting up
section into 2 pieces; at least unless one just
ignores the potential problem).

Or perhaps the downstream code is something simpler than
an XML parser (perl script)? If so, I can understand
the requirement.
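
For reference, the "splitting up section into 2 pieces" approach mentioned above can be sketched as follows. This is a generic illustration and not what XMLBeans does; CdataSplit, wrapInCdata and rootText are hypothetical names. Every embedded "]]>" is split so that "]]" ends one CDATA section and ">" starts the next.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of XMLBeans.
class CdataSplit {

    /** Wraps text in CDATA, splitting the section wherever the text contains "]]>". */
    static String wrapInCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Parses xml and returns the concatenated text and CDATA content of its root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String a = "<a>blabla</a>";
        String b = "<b>" + wrapInCdata(a) + "</b>";
        String c = "<c>" + wrapInCdata(b) + "</c>";   // b already contains "]]>"

        System.out.println(c);
        System.out.println(rootText(c).equals(b));
    }
}
```

A consumer that reads the text of the element sees the original string again, but one that expects exactly one CDATA section per element would not, which is the kind of downstream restriction the question above is asking about.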

-+ Tatu +-


Re: problem with CDATA handling

Posted by Martin Hamel <ma...@komunide.com>.

Hi,

Here is the modification I made for embedded CDATA. Do you think it would be 
worth including?

Here is the entitizeContent method in Saver.java:

        // Requires: import java.nio.CharBuffer; import java.util.regex.Pattern;
        Pattern cdataPattern = Pattern.compile("CDATA");


        private void entitizeContent ( )
        {
            if (_lastEmitCch == 0)
                return;

            int i = _lastEmitIn;
            final int n = _buf.length;

            boolean hasOutOfRange = false;
            
            int count = 0;
            for ( int cch = _lastEmitCch ; cch > 0 ; cch-- )
            {                
                char ch = _buf[ i ];

                if (ch == '<' || ch == '&')
                    count++;
                else if (isBadChar( ch ))
                    hasOutOfRange = true;

                if (++i == n)
                    i = 0;
            }

            if (count == 0 && !hasOutOfRange)
                return;

            i = _lastEmitIn;

            //
            // Heuristic for knowing when to save out stuff as a CDATA.
            //
            
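            // We'll check whether we already have a CDATA in the buffer.
            // If we do, we won't nest another one.
            // Note: this wraps the whole of _buf, not only the _lastEmitCch
            // characters starting at _lastEmitIn, and matches any "CDATA" text.
            CharBuffer charBuffer = CharBuffer.wrap(_buf);
            boolean hasCDATA = cdataPattern.matcher(charBuffer).find();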

            if (_lastEmitCch > 32 && count > 5 &&
                    count * 100 / _lastEmitCch > 1 && !hasCDATA)
            {
                boolean lastWasBracket = _buf[ i ] == ']';

                i = replace( i, "<![CDATA[" + _buf[ i ] );

                boolean secondToLastWasBracket = lastWasBracket;

                lastWasBracket = _buf[ i ] == ']';

                if (++i == _buf.length)
                    i = 0;

                for ( int cch = _lastEmitCch ; cch > 0 ; cch-- )
                {
                    char ch = _buf[ i ];

                    if (ch == '>' && secondToLastWasBracket && lastWasBracket)
                        i = replace( i, "&gt;" );
                    else if (isBadChar( ch ))
                        i = replace( i, "?" );
                    else
                        i++;

                    secondToLastWasBracket = lastWasBracket;
                    lastWasBracket = ch == ']';

                    if (i == _buf.length)
                        i = 0;
                }

                emit( "]]>" );
            }
            else
            {
                for ( int cch = _lastEmitCch ; cch > 0 ; cch-- )
                {
                    char ch = _buf[ i ];

                    if (ch == '<')
                        i = replace( i, "&lt;" );
                    else if (hasCDATA && ch == '>')
                        i = replace(i, "&gt;");
                    else if (ch == '&')
                        i = replace( i, "&amp;" );
                    else if (isBadChar( ch ))
                        i = replace( i, "?" );
                    else
                        i++;

                    if (i == _buf.length)
                        i = 0;
                }
            }
        }


-- 
Martin Hamel
telephone: (418)261-2222
PGP key: 0xA6D61023