Posted to j-users@xerces.apache.org by SAXESS - Hussayn Dabbous <da...@saxess.com> on 2002/09/15 22:42:55 UTC

JAVA: trouble with UTF8 encoding and org.w3c.dom.CharacterData.getData()

Hi, Java programmers,

I want to read UTF-8 characters from an XML file using a DOMParser, but
all I get is a sequence of single bytes. This is probably a beginner's mistake,
but I can't see what is wrong. Maybe someone can help me?

I did the following:

1.) I have written a simple XML file containing UTF-8 encoded characters:

    +++ begin of file +++++++++++++++++++++++++++++++++++++++++++
    <?xml version="1.0" encoding="UTF-8"?>
    <myxml w="150" h="200" color="FFCCDDEE">
      <text font="Cyberbit Cyberspace" size="13">???</text>
    </myxml>
    +++ end of file +++++++++++++++++++++++++++++++++++++++++++++

    The three characters enclosed in the <text> tag are in fact three UTF-8 characters.
    When looking at the file with XML Spy, I can see the three characters.
    When looking at the file with a Unix text editor, I see 9 bytes in total there, which
    I have verified to be the correct UTF-8 encoding. This mail possibly contains
    only three question marks ... ("???")

2.) I read the file using a DOMParser as follows:

    * I create a DOMParser() instance
    * I create an InputSource(FileReader) instance
    * I create a Document with DOMParser.parse(InputSource)
    * Then I step through the resulting Document instance,
      retrieve the Elements, detect the Text nodes, and finally
      call Text.getData() to retrieve the text string.
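
A minimal sketch of the steps above, under several assumptions: it uses the
JAXP DocumentBuilder as a stand-in for the Xerces DOMParser class, an in-memory
byte array instead of the real file, and three hypothetical sample characters
(the original ones did not survive this mail). FileReader decodes bytes with
the platform default charset, so the sketch imitates a platform whose default
is ISO-8859-1 by using an explicit InputStreamReader. The class and method
names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class ReaderParseSketch {
    // Hypothetical sample text: U+4E2D U+4EBA U+4EAC, three bytes each in UTF-8.
    static final String SAMPLE_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<myxml w=\"150\" h=\"200\" color=\"FFCCDDEE\">"
        + "<text font=\"Cyberbit Cyberspace\" size=\"13\">\u4E2D\u4EBA\u4EAC</text>"
        + "</myxml>";

    // The bytes as they would sit on disk in a correctly encoded UTF-8 file.
    static byte[] sampleFileBytes() {
        return SAMPLE_XML.getBytes(StandardCharsets.UTF_8);
    }

    // Walk the tree and append the data of every Text node.
    static void collectText(Node node, StringBuilder out) {
        if (node instanceof Text) {
            out.append(((Text) node).getData());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Assumption: imitates new FileReader(file) on a platform whose default
    // charset is ISO-8859-1, so every byte becomes exactly one char.
    static String parseLikeFileReader(byte[] fileBytes) {
        try {
            Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.ISO_8859_1);
            // With a character stream, the parser does not consult encoding="UTF-8".
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(reader));
            StringBuilder text = new StringBuilder();
            collectText(doc.getDocumentElement(), text);
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, parseLikeFileReader(sampleFileBytes()) yields a string
of length 9 rather than 3, which matches the symptom described in step 3.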

3.) Now I expect the text string to contain 3 characters, each of them
    a Unicode character.
    But all I get is 9 characters, each containing one byte of the raw UTF-8 string.
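
The difference between 3 and 9 characters can be reproduced without any XML
parser. The sketch below decodes the same 9 bytes once as UTF-8 and once as
ISO-8859-1. The byte values are a hypothetical example (the UTF-8 encoding of
U+4E2D U+4EBA U+4EAC), not the bytes from my file, and the class name is made
up.

```java
import java.nio.charset.StandardCharsets;

final class DecodeSketch {
    // Hypothetical raw bytes: three characters, three UTF-8 bytes each.
    static final byte[] RAW = {
        (byte) 0xE4, (byte) 0xB8, (byte) 0xAD,   // U+4E2D
        (byte) 0xE4, (byte) 0xBA, (byte) 0xBA,   // U+4EBA
        (byte) 0xE4, (byte) 0xBA, (byte) 0xAC    // U+4EAC
    };

    // Correct decoding: each 3-byte sequence becomes one char.
    static String asUtf8() {
        return new String(RAW, StandardCharsets.UTF_8);
    }

    // Wrong decoding: each single byte becomes one char.
    static String asLatin1() {
        return new String(RAW, StandardCharsets.ISO_8859_1);
    }
}
```

Here asUtf8() has length 3 and asLatin1() has length 9, with each char of the
latter holding one raw byte value.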
  
I tried encoding="UTF8", but that didn't help.
What is going wrong?

Maybe I should use an InputStreamReader(new FileInputStream(filename), "UTF-8")
instead of a FileReader instance? (That doesn't sound right to me...)
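
On the question of FileReader versus a stream, here is a hedged sketch of two
alternatives. It again assumes the JAXP DocumentBuilder as a stand-in for the
Xerces DOMParser, and an in-memory byte array in place of the file (for a real
file, a FileInputStream would take the place of the ByteArrayInputStream). The
first variant hands the parser a byte stream, so it can honour the
encoding="UTF-8" declaration itself. The second keeps a Reader but names the
charset explicitly. All class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ByteStreamParseSketch {
    // Parse the source and return the concatenated text of the root element.
    static String textOf(InputSource source) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Variant 1: byte stream; the parser detects the encoding on its own.
    static String viaByteStream(byte[] fileBytes) {
        return textOf(new InputSource(new ByteArrayInputStream(fileBytes)));
    }

    // Variant 2: character stream with the charset stated explicitly.
    static String viaUtf8Reader(byte[] fileBytes) {
        return textOf(new InputSource(new InputStreamReader(
            new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8)));
    }

    // Hypothetical sample document, encoded as UTF-8 bytes.
    static byte[] sampleFileBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<myxml><text>\u4E2D\u4EBA\u4EAC</text></myxml>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }
}
```

With the sample bytes, both variants return a string of length 3. Variant 1
has the advantage that the charset is not hard-coded: the parser follows
whatever the XML declaration says.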


Any hint would help.
Regards, Hussayn

-- 
Dr. Hussayn Dabbous
SAXESS Software Design GmbH
Neuenhöfer Allee 125
50935 Köln
Telefon: +49-221-56011-0
Fax:     +49-221-56011-20
E-Mail:  dabbous@saxess.com

