Posted to j-users@xerces.apache.org by Sripathy Subramania <SS...@viquity.com> on 2001/04/03 22:50:02 UTC
BaseMarkupSerializer bug

Hi,

In xerces-1_1_3, BaseMarkupSerializer.characters(char[], int, int)
inserts the escape sequence "]]><![CDATA[" for an embedded string
pattern "]]>" at the wrong location.
This results in incorrect XML being serialized from the DOM.

I have proposed a fix in this mail.

Xerces version : 1.1.3
JDK version : 1.3

I needed to serialize a DOM conforming to the
following DTD:

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Sample (Id, Messages+)>
<!ELEMENT Id (#PCDATA)>
<!ELEMENT Messages (MsgId, MsgDesc?, Msg)>
<!ELEMENT MsgId (#PCDATA)>
<!ELEMENT MsgDesc (#PCDATA)>
<!ELEMENT Msg (#PCDATA)>

An XML file conforming to this DTD might be:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Sample SYSTEM "Sample.dtd">
<Sample>
  <Id>Doc 1</Id>
  <Messages>
    <MsgId>Msg 1</MsgId>
    <MsgDesc>Testing document</MsgDesc>
    <Msg><![CDATA[This is a test message having patterns ]]>. This message
may contain multiple occurrences of patterns ]]>. The End]]></Msg>
  </Messages>
</Sample>
  
In the DTD above, the value of the 'Msg' element will be a
CDATA section. This value may contain the string "]]>"
embedded in it (as shown in the sample XML document above).
BaseMarkupSerializer identifies this pattern and escapes it by
inserting the string "]]><![CDATA[" between "]]" and ">", which
closes the current CDATA section and opens a new one. But the
escaping logic seems to have a bug.
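
To make the intended escaping concrete, here is a minimal standalone
sketch. It is not the Xerces code: it assumes the whole text is
available as a single String, and the names CDataSplitSketch and wrap
are made up for illustration. Each "]]>" is split so that "]]" ends
one CDATA section and ">" begins the next.

```java
// Hypothetical helper for illustration only; not part of Xerces.
final class CDataSplitSketch {

    // Wraps text in CDATA section(s), splitting every "]]>" across two sections.
    static String wrap(String text) {
        StringBuilder out = new StringBuilder("<![CDATA[");
        int from = 0;
        int hit;
        while ((hit = text.indexOf("]]>", from)) >= 0) {
            // Emit everything up to and including "]]", close the section,
            // and reopen a new one so that ">" starts the next section.
            out.append(text, from, hit + 2);
            out.append("]]><![CDATA[");
            from = hit + 2;
        }
        // Emit the remainder and close the last section.
        out.append(text, from, text.length());
        out.append("]]>");
        return out.toString();
    }
}
```

For example, CDataSplitSketch.wrap("a]]>b") returns
<![CDATA[a]]]]><![CDATA[>b]]>, which a parser reads back as "a]]>b".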

Original source from
Xerces-1_1_3\src\org\apache\xml\serialize\BaseMarkupSerializer
(Lines 457~491)
*********************************************************
    public void characters( char[] chars, int start, int length )
    {
        ElementState state;
        
        state = content();
        // Check if text should be print as CDATA section or unescaped
        // based on elements listed in the output format (the element
        // state) or whether we are inside a CDATA section or entity.
        
        if ( state.inCData || state.doCData ) {
            int          saveIndent;
            
            // Print a CDATA section. The text is not escaped, but ']]>'
            // appearing in the code must be identified and dealt with.
            // The contents of a text node is considered space
            // preserving.
            if ( ! state.inCData ) {
                _printer.printText( "<![CDATA[" );
                state.inCData = true;
            }
            saveIndent = _printer.getNextIndent();
            _printer.setNextIndent( 0 );
            for ( int index = 0 ; index < length ; ++index ) {
                if ( index + 2 < length && chars[ index ] == ']' && 
                     chars[ index + 1 ] == ']' &&
                     chars[ index + 2 ] == '>') {
                    
                    printText( chars, start, index + 2, true, true );
                    _printer.printText( "]]><![CDATA[" );
                    start += index + 2;
                    length -= index + 2;
                    index = 0;
                }
            }
            if ( length > 0 )
                printText( chars, start, length, true, true );
            _printer.setNextIndent( saveIndent );
*************************************************************
Proposed changes for the above block

    public void characters( char[] chars, int start, int length )
    {
        ElementState state;
        
        state = content();
        // Check if text should be print as CDATA section or unescaped
        // based on elements listed in the output format (the element
        // state) or whether we are inside a CDATA section or entity.
        
        if ( state.inCData || state.doCData ) {
            int          saveIndent;
            int          index = 0;
            int          endIndex = 0;
            
            // Print a CDATA section. The text is not escaped, but ']]>'
            // appearing in the code must be identified and dealt with.
            // The contents of a text node is considered space
            // preserving.
            if ( ! state.inCData ) {
                _printer.printText( "<![CDATA[" );
                state.inCData = true;
            }
            saveIndent = _printer.getNextIndent();
            _printer.setNextIndent( 0 );
            endIndex = start + length;
            for ( index = start ; index < endIndex ; ++index ) {
                if ( index + 2 < endIndex && chars[ index ] == ']' && 
                     chars[ index + 1 ] == ']' &&
                     chars[ index + 2 ] == '>') {
                    
                    printText( chars, start, index + 2 - start,
                               true, true);
                    _printer.printText( "]]><![CDATA[" );
                    start = index + 2;
                    index = start;
                }
            }
            if ( index > start )
                printText( chars, start, index-start, true, true );
            _printer.setNextIndent( saveIndent );
********************************************************************
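
To compare the two loops outside of Xerces, they can be re-created in
a small standalone class like the one below. This is only a sketch:
the class name CDataLoopComparison is made up, and it assumes that
printText( chars, start, length, true, true ) simply emits
chars[start .. start+length) unchanged, which is modeled here with
StringBuilder.append.

```java
// Standalone re-creation of both loops for comparison; not Xerces code.
// Assumption: printText emits chars[start .. start+length) unchanged.
final class CDataLoopComparison {
    static final String SPLIT = "]]><![CDATA[";

    // Loop as quoted above from xerces-1_1_3.
    static String original(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder();
        for (int index = 0; index < length; ++index) {
            // Note: chars[index] is read without adding the start offset.
            if (index + 2 < length && chars[index] == ']' &&
                chars[index + 1] == ']' && chars[index + 2] == '>') {
                out.append(chars, start, index + 2);
                out.append(SPLIT);
                start += index + 2;
                length -= index + 2;
                index = 0;
            }
        }
        if (length > 0)
            out.append(chars, start, length);
        return out.toString();
    }

    // Loop as proposed in this mail: index is an absolute buffer position.
    static String proposed(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder();
        int index;
        int endIndex = start + length;
        for (index = start; index < endIndex; ++index) {
            if (index + 2 < endIndex && chars[index] == ']' &&
                chars[index + 1] == ']' && chars[index + 2] == '>') {
                out.append(chars, start, index + 2 - start);
                out.append(SPLIT);
                start = index + 2;
                index = start;
            }
        }
        if (index > start)
            out.append(chars, start, index - start);
        return out.toString();
    }
}
```

With the buffer "xxa]]>b", start = 2 and length = 5, the original loop
scans from chars[0] instead of chars[start] and never sees the
pattern, so "]]>" is written unsplit. The proposed loop splits it
between "]]" and ">".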

NOTE : However, this fix does not handle the case where the string
       pattern "]]>" straddles a buffer boundary, i.e. is split
       across two calls to characters().
       This might require more changes.
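
One possible way to cover the buffer-boundary case is to remember how
many ']' characters were written last, so the state survives from one
call to the next. The sketch below is hypothetical and not Xerces
code: the class StreamingCDataSketch and its members are made up, it
writes into a StringBuilder rather than the serializer's printer, and
it assumes all calls belong to the same open CDATA section.

```java
// Hypothetical sketch, not Xerces code: carries state between calls so that
// a "]]>" split across two buffers is still detected.
final class StreamingCDataSketch {
    private final StringBuilder out = new StringBuilder();
    // Number of consecutive ']' characters most recently written (capped at 2).
    private int trailingBrackets = 0;

    void characters(char[] chars, int start, int length) {
        int end = start + length;
        for (int i = start; i < end; i++) {
            char c = chars[i];
            if (c == '>' && trailingBrackets == 2) {
                // "]]" is already written; close the section here and
                // reopen it so that '>' begins the next section.
                out.append("]]><![CDATA[");
            }
            trailingBrackets = (c == ']') ? Math.min(trailingBrackets + 1, 2) : 0;
            out.append(c);
        }
    }

    // Text written so far, without the outer "<![CDATA[" and "]]>".
    String result() {
        return out.toString();
    }
}
```

Because the counter lives in the object rather than in the loop,
feeding "a]" and then "]>b" in two separate calls produces the same
output as feeding "a]]>b" in one call. A real fix would also have to
reset the counter whenever the CDATA section is closed.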

I checked the source for Xerces-1_2_3 and observed that this bug is
not fixed there yet. Moreover, I couldn't find any mails discussing this
problem or a fix in the 'xerces-j-dev'/'xerces-j-user' mailing lists.
I don't know whether this bug has already been identified by the
development team.

I would appreciate it if someone familiar with the code could verify
the bug and baseline the changes. I would be glad to provide more
information in this regard.

Thanks,
-nikhil

