Posted to j-dev@xerces.apache.org by "Michael Glavassevich (JIRA)" <xe...@xml.apache.org> on 2005/12/15 17:27:46 UTC

[jira] Resolved: (XERCESJ-1122) org.xml.sax.SAXParseException: Content is not allowed in prolog. is thrown out when parser a UTF-8 bom formatted XML file

     [ http://issues.apache.org/jira/browse/XERCESJ-1122?page=all ]
     
Michael Glavassevich resolved XERCESJ-1122:
-------------------------------------------

    Resolution: Invalid

... and it shouldn't be executed.  When you pass a character stream (java.io.Reader) as input to the parser, it is your application, not the parser (which never sees the byte stream), that has taken responsibility for character decoding.

At the beginning of a byte stream, U+FEFF is interpreted as a byte order mark (BOM), which is a signature defining the byte order and isn't part of the byte stream's content.  Anywhere else, U+FEFF is a regular character called ZERO WIDTH NON-BREAKING SPACE (ZWNBSP) and is part of the byte stream's content.  If there is a BOM at the beginning of the byte stream, a java.io.Reader should never return it.  If U+FEFF is ever returned from a reader, it must be a ZWNBSP.  You either need to choose a reader which is capable of handling a UTF-8 BOM (the default one in Java apparently can't handle it) or let the parser handle the decoding by passing it a byte stream.
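
The two options above could look roughly like the following. This is only a minimal sketch, not code from Xerces or from the reporter's application: the class name BomExample and the helpers withBom, utf8ReaderSkippingBom and parse are hypothetical names made up for illustration, and the parser is obtained through the standard JAXP SAXParserFactory rather than by instantiating a Xerces class directly. The sample document is built in memory so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BomExample {

    /** Hypothetical helper: encodes the text as UTF-8 and prepends the UTF-8 BOM (EF BB BF). */
    static byte[] withBom(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xEF);
        out.write(0xBB);
        out.write(0xBF);
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    /**
     * Hypothetical helper: decodes the stream as UTF-8 and drops a leading
     * U+FEFF if one is present. Any other first character is pushed back,
     * so the caller sees the content unchanged.
     */
    static Reader utf8ReaderSkippingBom(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first);
        }
        return reader;
    }

    /** Parses the source with a do-nothing handler; DefaultHandler rethrows fatal errors. */
    static void parse(InputSource source) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler());
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = withBom("<root/>");

        // Option 1: pass the byte stream and let the parser detect the BOM and decode.
        parse(new InputSource(new ByteArrayInputStream(xml)));

        // Option 2: keep using a Reader, but make sure it never returns the BOM.
        parse(new InputSource(utf8ReaderSkippingBom(new ByteArrayInputStream(xml))));
    }
}
```

Option 1 is usually the simpler choice, because the parser can then also honour the encoding named in the XML declaration. Option 2 only makes sense when the application really has to do the decoding itself, and the helper shown here assumes the input is UTF-8.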

> org.xml.sax.SAXParseException: Content is not allowed in prolog. is thrown out when parser a UTF-8 bom formatted XML file
> -------------------------------------------------------------------------------------------------------------------------
>
>          Key: XERCESJ-1122
>          URL: http://issues.apache.org/jira/browse/XERCESJ-1122
>      Project: Xerces2-J
>         Type: Bug
>   Components: SAX
>     Versions: 2.0.0
>  Environment: Windows XP.
>     Reporter: lin zhu 

>
> The following information is printed when I try to parse an XML file saved in UTF-8 BOM format:
> org.xml.sax.SAXParseException: Content is not allowed in prolog.
> 	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
> 	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
> 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> 	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
> 	at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
> 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> 	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> 	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> If I save the same XML file in standard UTF-8 format, everything works smoothly.
> I've tried to debug the Xerces source code. I found that Xerces actually has code that handles UTF-8 BOM formatted input streams, in class org.apache.xerces.impl.XMLEntityManager.java (@version $Id: XMLEntityManager.java,v 1.94 2005/04/19 03:18:18 mrglavas Exp $). But that code is never executed in my case. The details are as follows:
> In the public method setupCurrentEntity() of class XMLEntityManager, there are several lines of code that deal with UTF-8 BOM formatted input streams, and all of them are located in the code block starting at line 929: if ( reader == null ) { .... }. The "reader" in the if clause is defined on line 923. In my case, reader is never null, so the code that deals with the UTF-8 BOM is never executed.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

