
Posted to j-users@xerces.apache.org by Sasa Bojanic <s....@together.co.yu> on 2003/08/20 10:21:46 UTC

Possible encoding related bug

Hi,

I think there is an encoding-related bug in Xerces 2.5.
When using the DOM parser to parse a document that contains characters outside the character set of its declared encoding (e.g. the character ä in a document whose encoding is declared as "us-ascii"), the parser crashes.
 
Here is the code snippet:
 
      try {
         DOMParser parser = new DOMParser();
         parser.parse(toParse);
      }catch (Exception ex) {
         ex.printStackTrace();
      }

* "toParse" is the path to the following document:

<?xml version="1.0" encoding="us-ascii"?>
<Package Id="pkg1">
  <!-- ä -->
    <PackageHeader>
        <XPDLVersion>1.0</XPDLVersion>
        <Vendor>Together</Vendor>
        <Created>2003-08-20 10:00:49</Created>
    </PackageHeader>
</Package>

The parser crashes because of the ä character, and I get the following stack trace:
java.io.IOException: Byte "228" is not a member of the (7-bit) ASCII character set.
        at org.apache.xerces.impl.io.ASCIIReader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XML11EntityScanner.skipSpaces(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at XML.main(XML.java:25)
 
When I use Xerces 2.4, everything works fine!
 
Regards,
Sasa.

Re: Possible encoding related bug

Posted by Michael Glavassevich <mr...@apache.org>.

Hi Sasa,

The parser is working as expected. A bug fix in the ASCIIReader now
rejects any bytes that are not valid US-ASCII. This is in accordance with
the XML rec (http://www.w3.org/TR/REC-xml#charencoding): "It is a fatal
error if an XML entity is determined (via default, encoding declaration,
or higher-level protocol) to be in a certain encoding but contains octet
sequences that are not legal in that encoding."
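
As a rough illustration of that rule outside the parser, the sketch below uses the JDK's own strict charset decoders to check whether a byte sequence is legal in a given encoding. The class and method names (StrictDecodeDemo, isLegal) are made up for this example and are not part of Xerces; it only mirrors the idea, not the ASCIIReader implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Xerces.
class StrictDecodeDemo {

    // Returns true if every byte sequence in data is legal in charset.
    // REPORT makes the decoder throw instead of substituting U+FFFD.
    static boolean isLegal(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The comment from the sample document, stored as Latin-1 (ä = byte 228).
        byte[] comment = "<!-- \u00e4 -->".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("legal as US-ASCII:   "
                + isLegal(comment, StandardCharsets.US_ASCII));
        System.out.println("legal as ISO-8859-1: "
                + isLegal(comment, StandardCharsets.ISO_8859_1));
    }
}
```

The same bytes are rejected under US-ASCII and accepted under ISO-8859-1, which is the distinction the parser now enforces.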

Since your document is labeled US-ASCII and contains non-ASCII bytes, the
document isn't well-formed. The ASCII range is Unicode 0-127. Any bytes
outside that range are not members of US-ASCII. The character you intended
to include in your document is part of Latin-1. If you want to correct
your document, one way is to change the encoding declaration to ISO-8859-1.
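
To find out which byte in a file trips the check before deciding how to relabel it, a small scan along these lines may help. It is only a sketch: the class and method names (NonAsciiScan, firstNonAscii) are hypothetical, and it inspects raw bytes rather than using any Xerces API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of Xerces.
class NonAsciiScan {

    // Returns the offset of the first byte above 127, or -1 if all bytes are ASCII.
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            // Java bytes are signed, so mask to get the unsigned value 0-255.
            if ((data[i] & 0xFF) > 127) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Scan the file named on the command line, or fall back to a
        // Latin-1 encoded copy of the comment from the sample document.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "<!-- \u00e4 -->".getBytes(StandardCharsets.ISO_8859_1);

        int offset = firstNonAscii(data);
        if (offset < 0) {
            System.out.println("All bytes are within US-ASCII (0-127).");
        } else {
            System.out.println("Non-ASCII byte " + (data[offset] & 0xFF)
                    + " at offset " + offset);
        }
    }
}
```

Byte 228 is 0xE4, which ISO-8859-1 maps to ä, so relabeling the document as ISO-8859-1 matches how it was actually saved.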

On Wed, 20 Aug 2003, Sasa Bojanic wrote:

> Hi,
>
> I think there is an encoding-related bug in Xerces 2.5.
> When using the DOM parser to parse a document that contains characters outside the character set of its declared encoding (e.g. the character ä in a document whose encoding is declared as "us-ascii"), the parser crashes.
>
> Here is the code snippet:
>
>       try {
>          DOMParser parser = new DOMParser();
>          parser.parse(toParse);
>       }catch (Exception ex) {
>          ex.printStackTrace();
>       }
>
> * "toParse" is the path to the following document:
>
> <?xml version="1.0" encoding="us-ascii"?>
> <Package Id="pkg1">
>   <!-- ä -->
>     <PackageHeader>
>         <XPDLVersion>1.0</XPDLVersion>
>         <Vendor>Together</Vendor>
>         <Created>2003-08-20 10:00:49</Created>
>     </PackageHeader>
> </Package>
>
> The parser crashes because of the ä character, and I get the following stack trace:
> java.io.IOException: Byte "228" is not a member of the (7-bit) ASCII character set.
>         at org.apache.xerces.impl.io.ASCIIReader.read(Unknown Source)
>         at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
>         at org.apache.xerces.impl.XML11EntityScanner.skipSpaces(Unknown Source)
>         at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
>         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at XML.main(XML.java:25)
>
> When I use Xerces 2.4, everything works fine!
>
> Regards,
> Sasa.
>

-- 
--------------------
Michael Glavassevich
mrglavas@apache.org
