Posted to j-users@xerces.apache.org by Peter Qi <Pe...@hummingbird.com> on 2001/03/16 22:37:15 UTC

About international encodings with Xerces-J-1.3.0

Hi there,

In the FAQ, it says that UTF-16 Big Endian is supported. But when I
tried to parse a document with UTF-16BE encoding, the parser said that
UTF-16BE is not supported.  Does anyone have ideas?  Thanks.

Peter Qi

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: About international encodings with Xerces-J-1.3.0

Posted by Peter Qi <Pe...@hummingbird.com>.
Hi Andy,

Thanks for the reply.  I checked the files in the
org/apache/xerces/readers directory and changed the MIME2Java.java
file, but the error is still there.  I noticed that there are some other
files related to UTF-8, such as UTF8Recognizer.java, UTF8Reader.java, and
UTF8CharReader.java, so I believe that some similar files probably have
to be added in order to process UTF-16 encoding.  In addition, in
UTF8Recognizer.java I found the following code:

	if ("UTF-16".equals(enc)) throw new UnsupportedEncodingException(encname);

Therefore, I think that UTF-16 encoding is simply not supported in the
present version, and that this is not a bug.
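
To tell a missing decoder in the Java runtime apart from a missing name
in the parser's own tables, one can ask the runtime directly. This is
only a minimal sketch: the class and method names are made up for
illustration and are not part of Xerces. It relies on the standard
String(byte[], String) constructor, which throws
UnsupportedEncodingException for names the runtime does not know.

```java
import java.io.UnsupportedEncodingException;

// Illustrative helper (hypothetical name, not part of Xerces).
final class EncodingCheck {
    /** Returns true if the Java runtime itself can decode the named encoding. */
    static boolean isDecodable(String name) {
        try {
            // Decoding an empty byte array is enough to force the name lookup.
            new String(new byte[0], name);
            return true;
        } catch (UnsupportedEncodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE"};
        for (String name : names) {
            System.out.println(name + " -> " + isDecodable(name));
        }
    }
}
```

If a name is reported as decodable here but the parser still rejects it,
the rejection most likely comes from the parser's own name handling
rather than from the JVM.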

Peter Qi 

Andy Clark wrote:
> 
> Peter Qi wrote:
> > After I passed the two attached files to the parser, the following error
> > messages were generated:
> 
> Okay, now I can reproduce your error. I think that this is just
> a missing mapping in the MIME2Java table used by the parser to
> translate IANA encoding names into Java encoding names. Please
> open a bug to this effect using Bugzilla at:
> 
>   http://nagoya.apache.org/bugzilla/
> 
> In fact, you should put in the bug report that all of the
> defined IANA mappings should be added to the mapping table --
> at least the ones where decoders are present in Java. The URL
> to the list of encodings (and their aliases) is at:
> 
>   http://ww.isi.edu/in-notes/iana/assignments/character-sets
> 
> Please realize that bugs get fixed faster when *you* fix the
> bug and post the patch to the mailing list (as a file
> attachment and not inline).
> 
> Incidentally, your attached XML document wasn't even encoded
> in UTF-16. It was just straight ASCII which would produce an
> error separate from the one that you saw. Please make sure
> that you generate a truly UTF-16 file if you're going to set
> the encoding in the XMLDecl line.
> 
> Some other points:
> 
> 1) Instead of being so specific about the endian-ness of your
>    document (because the parser will determine that by either
>    the BOM or the first few bytes in the file), just use
>    "UTF-16" as your encoding name. (Although, I made this
>    change and still get the same error. Strange...)
> 2) Never put in a link to your DTD like that. Always use either
>    a relative or absolute URI and use an EntityResolver, if
>    needed, to locate the DTD. Otherwise your documents are not
>    portable.
> 
> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: About international encodings with Xerces-J-1.3.0

Posted by Andy Clark <an...@apache.org>.
Peter Qi wrote:
> After I passed the two attached files to the parser, the following error
> messages were generated:

Okay, now I can reproduce your error. I think that this is just
a missing mapping in the MIME2Java table used by the parser to
translate IANA encoding names into Java encoding names. Please
open a bug to this effect using Bugzilla at:

  http://nagoya.apache.org/bugzilla/

In fact, you should put in the bug report that all of the
defined IANA mappings should be added to the mapping table --
at least the ones where decoders are present in Java. The URL
to the list of encodings (and their aliases) is at:

  http://ww.isi.edu/in-notes/iana/assignments/character-sets
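
To make the idea of the mapping table concrete, here is a rough sketch
of an IANA-to-Java name lookup. It is a hypothetical stand-in and NOT the
actual MIME2Java source: the class name, method name, and the handful of
entries are assumptions for illustration. The Java-side names are the
legacy java.io encoding names as I understand them from the JDK's list
of supported encodings.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a name-mapping table; not the real MIME2Java class.
final class EncodingNames {
    private static final Map<String, String> IANA_TO_JAVA = new HashMap<>();
    static {
        // IANA name -> legacy java.io encoding name (assumed entries).
        IANA_TO_JAVA.put("UTF-8", "UTF8");
        IANA_TO_JAVA.put("UTF-16", "UTF-16");
        IANA_TO_JAVA.put("UTF-16BE", "UnicodeBigUnmarked");
        IANA_TO_JAVA.put("UTF-16LE", "UnicodeLittleUnmarked");
        IANA_TO_JAVA.put("ISO-8859-1", "ISO8859_1");
    }

    /** Returns the Java name for an IANA name, or null if the table has no entry. */
    static String toJavaName(String ianaName) {
        // Encoding names are compared case-insensitively.
        return IANA_TO_JAVA.get(ianaName.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(toJavaName("utf-16be"));
        System.out.println(toJavaName("x-not-in-table"));
    }
}
```

A name with no entry comes back as null, which is the situation that
would surface to the user as an "encoding not supported" error.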

Please realize that bugs get fixed faster when *you* fix the
bug and post the patch to the mailing list (as a file
attachment and not inline).

Incidentally, your attached XML document wasn't even encoded
in UTF-16. It was just straight ASCII which would produce an
error separate from the one that you saw. Please make sure
that you generate a truly UTF-16 file if you're going to set
the encoding in the XMLDecl line.
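
As a quick illustration of the difference, the sketch below encodes the
same text as UTF-16 and as ASCII and prints the first few bytes of each.
The class and method names are invented for this example. It uses
java.nio.charset.StandardCharsets, so it assumes a much newer JDK than
1.2. Java's "UTF-16" charset writes a big-endian byte order mark (FE FF)
when encoding.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name) for comparing encoded bytes.
final class Utf16Sample {
    /** Encodes text with Java's "UTF-16" charset (BOM first, big-endian). */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    /** Formats the first n bytes as space-separated hex pairs. */
    static String hexPrefix(byte[] bytes, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n && i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><doc/>";
        // UTF-16: byte order mark, then two bytes per character.
        System.out.println(hexPrefix(encode(xml), 4));  // FE FF 00 3C
        // ASCII: one byte per character, no byte order mark.
        System.out.println(hexPrefix(xml.getBytes(StandardCharsets.US_ASCII), 4));  // 3C 3F 78 6D
    }
}
```

A file whose first bytes look like the second line is not UTF-16, no
matter what its XMLDecl claims.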

Some other points:

1) Instead of being so specific about the endian-ness of your
   document (because the parser will determine that by either
   the BOM or the first few bytes in the file), just use
   "UTF-16" as your encoding name. (Although, I made this
   change and still get the same error. Strange...)
2) Never put in a link to your DTD like that. Always use either
   a relative or absolute URI and use an EntityResolver, if
   needed, to locate the DTD. Otherwise your documents are not
   portable.
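
For point 2, a resolver along the following lines is one possible
approach. It is a sketch under assumptions: the class name, the DTD file
name, and the choice to serve the DTD from an in-memory string are all
made up for illustration. Only the org.xml.sax.EntityResolver and
InputSource types are the standard SAX API.

```java
import java.io.StringReader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: serves one known DTD from a local copy.
final class LocalDtdResolver implements EntityResolver {
    private final String dtdName;  // e.g. "doc.dtd" (illustrative)
    private final String dtdText;  // local copy of the DTD contents

    LocalDtdResolver(String dtdName, String dtdText) {
        this.dtdName = dtdName;
        this.dtdText = dtdText;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null && systemId.endsWith(dtdName)) {
            InputSource source = new InputSource(new StringReader(dtdText));
            source.setSystemId(systemId);
            return source;
        }
        // Returning null tells the parser to resolve the entity itself.
        return null;
    }
}
```

It would be registered on a SAX XMLReader with setEntityResolver(...)
before calling parse(...), so the document can keep a plain URI in its
DOCTYPE while the application decides where the DTD really lives.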

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: About international encodings with Xerces-J-1.3.0

Posted by Peter Qi <Pe...@hummingbird.com>.
Hi Andy,

After I passed the two attached files to the parser, the following error
messages were generated:

Symantec Java! JustInTime Compiler Version 4.00.006(x) for JDK 1.2
(Symantec GC)
Copyright (C) 1996-99 Symantec Corporation

org.xml.sax.SAXParseException: The encoding "UTF-16BE" is not supported.
        at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1067)
        at org.apache.xerces.readers.DefaultEntityHandler.startReadingFromExternalEntity(DefaultEntityHandler.java:817)
        at org.apache.xerces.readers.DefaultEntityHandler.startReadingFromExternalSubset(DefaultEntityHandler.java:566)
        at org.apache.xerces.framework.XMLDTDScanner.scanDoctypeDecl(XMLDTDScanner.java:1139)
        at org.apache.xerces.framework.XMLDocumentScanner.scanDoctypeDecl(XMLDocumentScanner.java:2201)
        at org.apache.xerces.framework.XMLDocumentScanner.access$000(XMLDocumentScanner.java:86, Compiled Code)
        at org.apache.xerces.framework.XMLDocumentScanner$PrologDispatcher.dispatch(XMLDocumentScanner.java:887, Compiled Code)
        at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381, Compiled Code)
        at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
        at ParseXTest.parseXML(ParseXTest.java:136)
        at ParseXTest.main(ParseXTest.java:191, Compiled Code)
press any key to exit...


Could you please try to parse the files to see if you can reproduce the
same errors?
Thank you.

Peter Qi


Andy Clark wrote:
> 
> Peter Qi wrote:
> > UTF-16BE was used.
> 
> There are some missing files with 1.3.0 so I can't even run the parser
> from the downloadable binary distribution. However, I ran a check
> with Xerces 1.3.1 and could not reproduce your problem. Please check
> your file with the latest version of Xerces and if you still see the
> problem, then post a *minimal* test file.
> 
> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org

Re: About international encodings with Xerces-J-1.3.0

Posted by Andy Clark <an...@apache.org>.
Peter Qi wrote:
> UTF-16BE was used.

There are some missing files with 1.3.0 so I can't even run the parser
from the downloadable binary distribution. However, I ran a check
with Xerces 1.3.1 and could not reproduce your problem. Please check
your file with the latest version of Xerces and if you still see the
problem, then post a *minimal* test file.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: About international encodings with Xerces-J-1.3.0

Posted by Peter Qi <Pe...@hummingbird.com>.
Hi Andy,

UTF-16BE was used.

Peter

Andy Clark wrote:
> 
> Peter Qi wrote:
> > In the FAQ, it says that UTF-16 Big Endian is supported. But when I
> > tried to parse a document with UTF-16BE encoding, the parser said that
> > UTF-16BE is not supported.  Does anyone have ideas?  Thanks.
> 
> What is the encoding specified in your instance document?
> 
> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: About international encodings with Xerces-J-1.3.0

Posted by Andy Clark <an...@apache.org>.
Peter Qi wrote:
> In the FAQ, it says that UTF-16 Big Endian is supported. But when I
> tried to parse a document with UTF-16BE encoding, the parser said that
> UTF-16BE is not supported.  Does anyone have ideas?  Thanks.

What is the encoding specified in your instance document?

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org
