Posted to dev@xalan.apache.org by bu...@apache.org on 2003/10/31 04:12:29 UTC
DO NOT REPLY [Bug 24278] New: - Incorrect SAXException when serializing Œ with UTF-8 encoding

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=24278

Incorrect SAXException when serializing &#338; with UTF-8 encoding

           Summary: Incorrect SAXException when serializing &#338; with UTF-8 encoding
           Product: XalanJ2
           Version: CurrentCVS
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: org.apache.xalan.serialize
        AssignedTo: xalan-dev@xml.apache.org
        ReportedBy: minchau@ca.ibm.com


I got this exception with Xalan-J interpretive (note that the same test works
just fine with XSLTC):

org.xml.sax.SAXException: 
Attempt to output character of integral value 338 
that is not represented in specified output encoding of .
	at org.apache.xml.serializer.ToTextStream.writeNormalizedChars(ToTextStream.java:393)
	at org.apache.xml.serializer.ToTextStream.characters(ToTextStream.java:237)
	at org.apache.xml.utils.FastStringBuffer.sendSAXcharacters(FastStringBuffer.java:1024)
	at org.apache.xml.dtm.ref.sax2dtm.SAX2DTM.dispatchCharactersEvents(SAX2DTM.java:599)
. . .

There are two problems. First, I shouldn't get this exception at all. Second, if it
is thrown, the message should name the encoding (UTF-8), and it doesn't.

I'm going to attach a simple XML/XSL pair as a testcase. The problem is in
ToTextStream and was introduced by the fix for bug 795. The else {...}
clause of the code that writes out a character in ToTextStream:
            if (S_LINEFEED == c && useLineSep)
            {
                writer.write(m_lineSep, 0, m_lineSepLen);
            }
            else if (c <= M_MAXCHARACTER)
            {
                writer.write(c);
            }
            else if (isUTF16Surrogate(c))
            {
                writeUTF16Surrogate(c, ch, i, end);
                i++; // two input characters processed
            }
            else
            {
                String encoding = getEncoding();
                String integralValue = Integer.toString(c);
                throw new SAXException(XMLMessages.createXMLMessage(
                    XMLErrorResources.ER_ILLEGAL_CHARACTER,
                    new Object[]{ integralValue, encoding}));                 
            }
now throws a SAXException, where it used to just write out the character anyway.
The problem is that M_MAXCHARACTER is 127 and no encoding is set for
the ToTextStream serializer at all, so character 338 is neither within the
limit nor a surrogate and falls through to the else clause. Should the encoding
be set? I'm not sure, because this is an intermediate, internal use of a
serializer to create a value. It is not the final serializer, which would be
a ToXMLStream.
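
To make the fall-through concrete, here is a minimal standalone sketch of the
branch selection quoted above. It is not the Xalan source: the class name, the
branchFor helper and the 0xFFFF limit are made up for illustration, the linefeed
test ignores useLineSep, and Character.isSurrogate stands in for
isUTF16Surrogate. Only the value 127 comes from this report.

```java
public class MaxCharBranchSketch {
    // From the report: the limit in effect when no encoding has been set.
    static final int MAX_CHARACTER_NO_ENCODING = 127;

    /** Hypothetical helper: names the branch the quoted if/else chain would take. */
    static String branchFor(char c, int maxCharacter) {
        if (c == '\n') {
            return "linefeed";
        } else if (c <= maxCharacter) {
            return "write";
        } else if (Character.isSurrogate(c)) {
            return "surrogate";
        }
        return "exception";
    }

    public static void main(String[] args) {
        char oe = '\u0152'; // the character from the report, decimal 338
        // With the limit at 127 the character reaches the else clause.
        System.out.println((int) oe + " -> " + branchFor(oe, MAX_CHARACTER_NO_ENCODING)); // 338 -> exception
        // Assumed limit for illustration only: a higher limit would just write it.
        System.out.println((int) oe + " -> " + branchFor(oe, 0xFFFF)); // 338 -> write
    }
}
```

The sketch only shows why 338 lands in the else clause when the limit is 127;
it says nothing about which fix is the right one.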

Perhaps we need a way to officially signal to a serializer that it doesn't have
to do any escaping or worry about character encoding.  We've had trouble like
this before, where '&' turned into &amp; and then into &amp;amp; because of double
processing by an intermediate and then a final serializer.  It would be cleaner
to let a serializer know that it is just an intermediate utility one.  I've
discussed this with Morris Kwan, but he doesn't think that this is needed in
general, probably just for ToTextStream.
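
As a rough illustration of the double-processing problem and of the proposed
signal, the sketch below uses a made-up serialize method with a boolean
"intermediate" flag. Neither the method nor the flag exists in Xalan, and the
escaping handles only '&'; this is just the idea in miniature.

```java
public class DoubleEscapeSketch {
    /** Hypothetical stand-in for one serializer pass; not a Xalan API. */
    static String serialize(String text, boolean intermediate) {
        if (intermediate) {
            // An intermediate pass hands the value through untouched.
            return text;
        }
        // A final pass escapes '&' (other characters are ignored in this sketch).
        return text.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String value = "a & b";
        // Two escaping passes: the '&' is escaped twice.
        System.out.println(serialize(serialize(value, false), false)); // a &amp;amp; b
        // Intermediate pass flagged as such: escaped once, by the final pass only.
        System.out.println(serialize(serialize(value, true), false)); // a &amp; b
    }
}
```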

Still, we've managed to make the serializer independent of Xalan-J interpretive
and of XSLTC. I'd like to make the reverse more true as well, and have them use
the serializer by its interface only... but I'm digressing.