Posted to j-users@xerces.apache.org by "Rathi, Pradeep" <pr...@documentum.com> on 2002/09/16 07:10:06 UTC

RE: JAVA: trouble with UTF8 encoding and org.w3c.dom.CharacterData.getData()

If you use a Reader, you are converting the bytes to characters yourself,
before the parser ever sees them. If you do not specify an encoding for the
Reader, the platform default encoding is used, which might not be UTF-8;
that would explain the behavior you are seeing. You are probably better off
passing a FileInputStream to the InputSource and not worrying about
encodings at all. The parser will auto-detect the encoding from the bytes
(including the XML declaration) and convert them to the right characters.
Alternatively, you can just say

parser.parse(fileURI)

where 'fileURI' is the URI representation of the file path.
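
To make the difference concrete, here is a minimal sketch. It is not
Xerces-specific: it assumes the JAXP DocumentBuilder that ships with the JDK
instead of the DOMParser class, and the class and method names (EncodingDemo,
sampleXml, textViaStream, textViaReader) are made up for illustration.
ISO-8859-1 stands in for a platform default that is not UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative only.
final class EncodingDemo {

    // Builds a small UTF-8 encoded document, loosely modelled on the one in the question.
    static byte[] sampleXml(String text) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<myxml><text>" + text + "</text></myxml>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    // Byte stream: the parser receives raw bytes and decodes them itself.
    static String textViaStream(byte[] xml) {
        return parse(new InputSource(new ByteArrayInputStream(xml)));
    }

    // Character stream: the Reader decodes the bytes with the given charset
    // before the parser sees anything.
    static String textViaReader(byte[] xml, Charset readerCharset) {
        return parse(new InputSource(
                new InputStreamReader(new ByteArrayInputStream(xml), readerCharset)));
    }

    // Parses the source and returns the text content of the root element.
    private static String parse(InputSource source) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

With three characters that each take three bytes in UTF-8, textViaStream
should give back a 3-character string, while textViaReader with ISO-8859-1
should give 9 characters, one per raw byte. For a file on disk, the URI
variant would be builder.parse(file.toURI().toString()).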

Pradeep

-----Original Message-----
From: SAXESS - Hussayn Dabbous
To: xerces-j-user@xml.apache.org
Sent: 9/15/2002 1:42 PM
Subject: JAVA: trouble with UTF8 encoding and
org.w3c.dom.CharacterData.getData()

Hi, Java programmers

I want to read UTF-8 characters from an XML file using a DOMParser, but
all I get is a set of single bytes. This is probably a beginner's mistake,
but I don't see where it is. Maybe someone can help me?

I did the following:

1.) I have written a simple XML file containing UTF-8 encoded
characters:

    +++ begin of file +++++++++++++++++++++++++++++++++++++++++++
    <?xml version="1.0" encoding="UTF-8"?>
    <myxml w="150" h="200" color="FFCCDDEE">
      <text font="Cyberbit Cyberspace" size="13">???</text>
    </myxml>
    +++ end of file +++++++++++++++++++++++++++++++++++++++++++++

    The three characters enclosed in the <text> tag are in fact three
    UTF-8 characters. When looking at the file with XML Spy, I can see
    the three characters. When looking at the file with a Unix text
    editor, I see 9 bytes in total there, which I have verified to be
    the correct UTF-8 encoding. This mail possibly contains only three
    question marks ("???").

2.) I read the file using a DOMParser as follows:

    * I create a DOMParser() instance
    * I create an InputSource(FileReader) instance
    * I create a Document with DOMParser.parse(InputSource)
    * Then I step through the resulting Document instance,
      retrieve the Elements, detect the Text nodes, and finally
      call Text.getData() to retrieve the text string.

3.) Now I expect the text string to contain 3 characters, each of
    them a Unicode character.
    But all I get is 9 characters, each containing one byte of the raw
    UTF-8 string.
  
I tried encoding="UTF8" but that didn't help.
What's going wrong?

Maybe I should use an InputStream(filename,"UTF-8") instead of a
FileReader instance? (That doesn't sound correct to me ...)


Any hint would help.
Regards, Hussayn

-- 
Dr. Hussayn Dabbous
SAXESS Software Design GmbH
Neuenhöfer Allee 125
50935 Köln
Telefon: +49-221-56011-0
Fax:     +49-221-56011-20
E-Mail:  dabbous@saxess.com

