Posted to j-dev@xerces.apache.org by Sripathy Subramania <SS...@viquity.com> on 2001/01/09 01:33:09 UTC
Bug in BaseMarkupSerializer

Hi,

In xerces-1_1_3,
org.apache.xml.serialize.BaseMarkupSerializer.characters(char[], int, int)
inserts the escape sequence (']]><![CDATA[') for the end of an embedded
CDATA section (']]>') in the wrong location.
This bug resulted in incorrect XML messages with embedded CDATA sections.
I propose a fix later in this mail. I would appreciate it if someone
familiar with the code could verify the changes and check them in.

Xerces version : 1.1.3
JDK version : 1.3

I needed to replay SAX events (imitating a SAX parser) on an XMLSerializer
instance to generate XML output. The generated SAX events correspond to an
XML file conforming to the following DTD.

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Sample (Id, Messages+)>
<!ELEMENT Id (#PCDATA)>
<!ELEMENT Messages (MsgId, MsgDesc?, Msg)>
<!ELEMENT MsgId (#PCDATA)>
<!ELEMENT MsgDesc (#PCDATA)>
<!ELEMENT Msg (#PCDATA)>

In the DTD above, the value of the 'Msg' element may contain nested CDATA,
for example
'ABC...XYZ...<![CDATA[....<![CDATA[......]]>......<![CDATA[.....]]>.......]]>....'.
The series of SAX events generated for the 'Msg' element would be:
 1. startCDATA()
 2. startElement()
 3. characters()
 4. endElement()
 5. endCDATA()
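
The event sequence above can be sketched with the standard SAX interfaces.
This is only an illustration: MsgEventReplay, replayMsg, Recorder and record
are hypothetical names, and Recorder is a stand-in that logs event names
where the real target would be the XMLSerializer instance (assuming it
accepts both ContentHandler and LexicalHandler events).

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.AttributesImpl;

final class MsgEventReplay {

    // Fires the five events in the order listed above at any handler
    // that accepts both content and lexical events.
    static <H extends ContentHandler & LexicalHandler> void replayMsg(H handler, String text)
            throws SAXException {
        char[] chars = text.toCharArray();
        handler.startCDATA();
        handler.startElement("", "Msg", "Msg", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "Msg", "Msg");
        handler.endCDATA();
    }

    // Hypothetical stand-in for the serializer: records event names only.
    static final class Recorder extends DefaultHandler2 {
        final List<String> events = new ArrayList<>();

        @Override public void startCDATA() { events.add("startCDATA"); }
        @Override public void endCDATA() { events.add("endCDATA"); }
        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            events.add("startElement:" + qName);
        }
        @Override public void endElement(String uri, String local, String qName) {
            events.add("endElement:" + qName);
        }
        @Override public void characters(char[] ch, int start, int length) {
            events.add("characters:" + new String(ch, start, length));
        }
    }

    // Replays the events into a Recorder and returns what it saw.
    static List<String> record(String text) {
        Recorder recorder = new Recorder();
        try {
            replayMsg(recorder, text);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.events;
    }
}
```

Calling MsgEventReplay.record("ab]]>cd") returns the five event names in the
order shown in the list, with the text passed through unchanged in the
characters() entry.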

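There seems to be a bug in BaseMarkupSerializer.java. The logic for
escaping the ']]>' pattern in an embedded CDATA section doesn't look right:
the loop tests chars[ index ] with index counted from 0 rather than from
start, and after a match it resets index to 0 while moving start forward.
Original source from
Xerces-1_1_3\src\org\apache\xml\serialize\BaseMarkupSerializer (lines
457~491):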
*********************************************************
    public void characters( char[] chars, int start, int length )
    {
        ElementState state;
        
        state = content();
        // Check if text should be print as CDATA section or unescaped
        // based on elements listed in the output format (the element
        // state) or whether we are inside a CDATA section or entity.
        
        if ( state.inCData || state.doCData ) {
            int          saveIndent;
            
            // Print a CDATA section. The text is not escaped, but ']]>'
            // appearing in the code must be identified and dealt with.
            // The contents of a text node is considered space preserving.
            if ( ! state.inCData ) {
                _printer.printText( "<![CDATA[" );
                state.inCData = true;
            }
            saveIndent = _printer.getNextIndent();
            _printer.setNextIndent( 0 );
            for ( int index = 0 ; index < length ; ++index ) {
                if ( index + 2 < length && chars[ index ] == ']' &&
                     chars[ index + 1 ] == ']' && chars[ index + 2 ] == '>' ) {
                    
                    printText( chars, start, index + 2, true, true );
                    _printer.printText( "]]><![CDATA[" );
                    start += index + 2;
                    length -= index + 2;
                    index = 0;
                }
            }
            if ( length > 0 )
                printText( chars, start, length, true, true );
            _printer.setNextIndent( saveIndent );
*************************************************************
Proposed changes for the above block

    public void characters( char[] chars, int start, int length )
    {
        ElementState state;
        
        state = content();
        // Check if text should be print as CDATA section or unescaped
        // based on elements listed in the output format (the element
        // state) or whether we are inside a CDATA section or entity.
        
        if ( state.inCData || state.doCData ) {
            int          saveIndent;
            int          index = 0;
            int          endIndex = 0;
            
            // Print a CDATA section. The text is not escaped, but ']]>'
            // appearing in the code must be identified and dealt with.
            // The contents of a text node is considered space preserving.
            if ( ! state.inCData ) {
                _printer.printText( "<![CDATA[" );
                state.inCData = true;
            }
            saveIndent = _printer.getNextIndent();
            _printer.setNextIndent( 0 );
            endIndex = start + length;
            for ( index = start ; index < endIndex ; ++index ) {
                if ( index + 2 < endIndex && chars[ index ] == ']' &&
                     chars[ index + 1 ] == ']' && chars[ index + 2 ] == '>' ) {

                    printText( chars, start, index + 2 - start, true, true );
                    _printer.printText( "]]><![CDATA[" );
                    start = index + 2;
                    index = start;
                }
            }
            if ( index > start )
                printText( chars, start, index-start, true, true );
            _printer.setNextIndent( saveIndent );
********************************************************************
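
To compare the two loops outside the serializer, here is a minimal
standalone sketch. It copies only the loop logic from both listings; the
class CDataLoopComparison and its method names are made up for this
example, a StringBuilder stands in for _printer, and the opening
'<![CDATA[' and closing ']]>' are appended here only so that the output is
complete.

```java
final class CDataLoopComparison {

    static final String SPLIT = "]]><![CDATA[";

    // Loop logic as quoted from Xerces 1.1.3: index is relative to 0,
    // not to start, and is reset to 0 after every match.
    static String original(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder("<![CDATA[");
        for (int index = 0; index < length; ++index) {
            if (index + 2 < length && chars[index] == ']'
                    && chars[index + 1] == ']' && chars[index + 2] == '>') {
                out.append(chars, start, index + 2);
                out.append(SPLIT);
                start += index + 2;
                length -= index + 2;
                index = 0;
            }
        }
        if (length > 0) {
            out.append(chars, start, length);
        }
        return out.append("]]>").toString();
    }

    // Loop logic with the proposed changes: index is an absolute
    // position between start and endIndex.
    static String proposed(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder("<![CDATA[");
        int endIndex = start + length;
        int index;
        for (index = start; index < endIndex; ++index) {
            if (index + 2 < endIndex && chars[index] == ']'
                    && chars[index + 1] == ']' && chars[index + 2] == '>') {
                out.append(chars, start, index + 2 - start);
                out.append(SPLIT);
                start = index + 2;
                index = start;
            }
        }
        if (index > start) {
            out.append(chars, start, index - start);
        }
        return out.append("]]>").toString();
    }

    public static void main(String[] args) {
        // The text "ab]]>cd" sits at offset 3 of a larger buffer.
        char[] buffer = "<a>ab]]>cd".toCharArray();
        System.out.println(original(buffer, 3, 7)); // <![CDATA[ab]]>cd]]>
        System.out.println(proposed(buffer, 3, 7)); // <![CDATA[ab]]]]><![CDATA[>cd]]>
    }
}
```

With a non-zero start the original loop never looks at the ']]>' in the
text, so the section is closed early and 'cd]]>' is left outside it. The
proposed loop splits the section between ']]' and '>'.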
However, this fix does not handle the case where the end marker (']]>') is
split across a buffer boundary, i.e. across two characters() calls.
That might require more changes.
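
One possible way to cover the boundary case is to remember how many ']'
characters ended the previous call. The sketch below is not taken from
Xerces; StreamingCDataWriter and its members are hypothetical, it writes to
a StringBuilder instead of _printer, and it appends one character at a time
for simplicity rather than printing runs of text.

```java
final class StreamingCDataWriter {

    private final StringBuilder out = new StringBuilder();

    // Number of consecutive ']' most recently written in the current
    // section, capped at 2. Survives across characters() calls.
    private int trailingBrackets = 0;

    void startCDATA() {
        out.append("<![CDATA[");
        trailingBrackets = 0;
    }

    void characters(char[] chars, int start, int length) {
        int end = start + length;
        for (int i = start; i < end; i++) {
            char c = chars[i];
            if (c == '>' && trailingBrackets >= 2) {
                // "]]" is already written; close the section and reopen
                // it so that the '>' lands in a new section.
                out.append("]]><![CDATA[");
            }
            trailingBrackets = (c == ']') ? Math.min(trailingBrackets + 1, 2) : 0;
            out.append(c);
        }
    }

    void endCDATA() {
        out.append("]]>");
        trailingBrackets = 0;
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        StreamingCDataWriter writer = new StreamingCDataWriter();
        writer.startCDATA();
        // The marker "]]>" is split across two calls.
        writer.characters("ab]]".toCharArray(), 0, 4);
        writer.characters(">cd".toCharArray(), 0, 3);
        writer.endCDATA();
        System.out.println(writer.result()); // <![CDATA[ab]]]]><![CDATA[>cd]]>
    }
}
```

Because the counter is kept in a field, the output is the same whether
'ab]]>cd' arrives in one call or is split anywhere inside the marker.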

I checked the source for Xerces-1_2_3 and found that the logic remains the
same, and I couldn't find any mail discussing this problem or a fix on the
'xerces-j-dev' mailing list.
I don't know whether this bug has already been fixed.

Please feel free to contact me if you need more information.
Thanks,
-sripathy