Posted to j-users@xerces.apache.org by mm...@lycos-inc.com on 2001/04/05 16:41:06 UTC

java strings and encoding question

First off, I'll have to admit I'm using a really old version of Xerces, but
I'm noticing something a little peculiar in how it handles data
encodings...

The program someone here wrote basically reads files in various encodings
into Java Strings and then runs them through Xerces using a StringReader
wrapped in an InputSource.  Reading the bytes in from the file converts
them from whatever encoding they were in to UCS-2, but the conversion
assumes the platform default encoding, which here is Latin-1.

Now, if the input is *actually* UTF-8, this results in each multi-byte
sequence being broken up and its bytes treated as individual characters,
which is bad.
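
To make the failure concrete, here is a minimal sketch of what happens when
UTF-8 bytes are decoded as Latin-1. The class and method names are made up
for illustration, and it assumes a JDK recent enough to have
java.nio.charset.StandardCharsets (Java 7 or later), which the original
program would not have had:

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Decode bytes as Latin-1: every byte becomes exactly one char.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decode bytes as UTF-8: multi-byte sequences collapse into one char.
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";                         // 4 chars
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // 5 bytes: e-acute is C3 A9

        String wrong = asLatin1(utf8);
        String right = asUtf8(utf8);

        System.out.println(wrong.length());         // 5: C3 and A9 became two separate chars
        System.out.println(right.length());         // 4
        System.out.println(right.equals(original)); // true
    }
}
```

The misdecoded String is still a perfectly valid sequence of characters, so
nothing downstream can tell that it is wrong.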

My questions are :
1) how is xerces working with String input at all?  Most of these documents
contain the <?xml encoding="iso-8859-1"?> line at the top, which should be
gating how it looks at them, but by the time it's in a String, all of the
document including the declaration line are *actually* in ucs2.  Does
Xerces try to be flexible internally when differentiating between a byte
and a char?  Does it try to equate them, essentially?

2) if #1 is yes, would i get around the problem by adding <?xml encoding
="utf-8"?> explicitly to the documents, engaging this flexibility?

3) without an explicit encoding declaration, does xerces default over to
ucs2 being the default interpretation, rather than utf-8?

thanks
-mark



---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: java strings and encoding question

Posted by Andy Clark <an...@apache.org>.
mmodrall@lycos-inc.com wrote:
> 1) how is xerces working with String input at all?  Most of these documents

It has more to do with how Xerces reads from an InputStream vs.
a Reader. In the first case, the input stream is made up of
bytes, so Xerces auto-detects the encoding and then switches
to the declared one once the XMLDecl line has been read.
However, in the case where you hand Xerces a Reader object,
absolutely NO decoding takes place -- we already have the
Unicode characters, so the parser will not use the encoding
declared in the document.
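
The difference can be sketched as follows. This uses the JAXP
DocumentBuilder bundled with current JDKs rather than the old Xerces API
discussed in this thread, and the class and helper names are hypothetical;
treat it as an illustration of the behaviour described above, not as the
parser's internals:

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class StreamVsReader {
    // Parse the source and return the text content of the root element.
    static String rootText(InputSource src) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(src).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc>caf\u00e9</doc>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        // Byte stream: the parser decodes the bytes itself and honours the declaration.
        String fromBytes = rootText(new InputSource(new ByteArrayInputStream(bytes)));

        // Character stream built from a wrongly decoded String: the declaration
        // still says UTF-8, but the parser takes the chars as they are.
        String misdecoded = new String(bytes, StandardCharsets.ISO_8859_1);
        String fromReader = rootText(new InputSource(new StringReader(misdecoded)));

        System.out.println(fromBytes.length());  // 4
        System.out.println(fromReader.length()); // 5
    }
}
```

In the second case the encoding declaration is present and correct, yet it
has no effect, because the damage was done before the parser saw the data.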

> 2) if #1 is yes, would i get around the problem by adding <?xml encoding
> ="utf-8"?> explicitly to the documents, engaging this flexibility?

Not if you're handing it a StringReader (or any kind of Reader
for that matter).
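
If the program has to keep building Strings, one possible way out is to
decode the bytes with an explicitly named charset instead of the platform
default. The sketch below assumes the real encoding of each file is known
up front; the class and method names are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

class ExplicitDecode {
    // Read the whole stream into a String using the given charset,
    // never the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        Reader reader = new InputStreamReader(in, charset);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

When the encoding is not known up front, handing the parser the raw
InputStream and letting it work the encoding out is the simpler option.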

> 3) without an explicit encoding declaration, does xerces default over to
> ucs2 being the default interpretation, rather than utf-8?

When reading byte streams (not character streams) the default
encoding as given by the XML Spec is UTF-8. Unless, of course,
something like UTF-16 or EBCDIC was auto-detected...

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org
