Posted to j-users@xerces.apache.org by "Jesus (John) Salvo Jr." <jo...@softgame.com.au> on 2001/11/26 08:05:26 UTC

Unicode characters written as ? in XML file

Xerces J 1.4.4.

I am trying to convert an existing Unicode text file ( sampleutf16.txt )
into XML.

Attached is my sample program, UnicodeTest.java. The first parameter is the
name of the input text file; the second parameter is the name of the output
XML file.

The output ( sample.xml ) that I get is:

<?xml version="1.0" encoding="UTF-8"?>
<trivia-questions>
    <question ask="??2001??????????????????????????"/>
</trivia-questions>

What I was expecting was something like ( for sampleutf16.txt ):

<?xml version="1.0" encoding="UTF-8"?>
<trivia-questions>
    <question ask="&#x622A;&#x6B62;2001&#x5E74.........."/>
</trivia-questions>

( See section 1.1 of http://www.unicode.org/unicode/reports/tr20/ )
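
To make the expected form concrete, here is a rough sketch of the conversion I
have in mind. It is separate from UnicodeTest.java and does not use Xerces; the
class name NcrSketch and the method escapeNonAscii are made up for this
illustration. It only turns non-ASCII characters into "&#x...;" references and
does not escape XML special characters.

```java
final class NcrSketch {

    // Replace every non-ASCII code point with a hexadecimal numeric
    // character reference; ASCII characters are copied unchanged.
    // Note: this sketch does NOT escape &, < or " for use in attributes.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(codePoint).toUpperCase())
                   .append(';');
            }
            // Advance by one or two chars, so surrogate pairs stay together.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The first characters of the sample line, written as Java escapes.
        String sample = "\u622A\u6B62" + "2001" + "\u5E74";
        System.out.println(escapeNonAscii(sample));
        // prints: &#x622A;&#x6B62;2001&#x5E74;
    }
}
```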

I got the "&#x" values using "watch expression [ and show as Hex ]" in
JBuilder while debugging, and checked them against a hex editor. The sample
program reads the Unicode text file into the variable "line" without any
problem.


I have also tried reading in a UTF-8 file ( sampleutf8.txt ) by changing
the following line in UnicodeTest.java from:

    InputStreamReader isr = new InputStreamReader(
        new FileInputStream( inputFile ), "UTF-16" ); // You can't use this with sampleutf8.txt

to:

    InputStreamReader isr = new InputStreamReader(
        new FileInputStream( inputFile ), "UTF-8" ); // You can't use this with sampleutf16.txt

...with the same results.
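
For comparison, here is a stripped-down sketch of the read/write path with the
encoding named explicitly on both sides. It is not UnicodeTest.java and does
not use Xerces; the class EncodingRoundTripSketch and its methods are
hypothetical names for this illustration. It only checks that text decoded by
an InputStreamReader survives being written back through an OutputStreamWriter
that is told to use UTF-8 rather than the platform default encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class EncodingRoundTripSketch {

    // Read the first line of a file, decoding its bytes with the named encoding.
    static String readFirstLine(Path file, String encoding) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), encoding))) {
            return reader.readLine();
        }
    }

    // Write text through an explicit UTF-8 encoder, not the platform default.
    static void writeUtf8(Path file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Demo helper: store text in inputEncoding, read it back, save it as
    // UTF-8, and return what the UTF-8 file decodes to.
    static String roundTrip(String text, String inputEncoding) {
        try {
            Path in = Files.createTempFile("sample", ".txt");
            Path out = Files.createTempFile("sample", ".xml");
            try {
                Files.write(in, text.getBytes(inputEncoding));
                String line = readFirstLine(in, inputEncoding);
                writeUtf8(out, line);
                return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(in);
                Files.deleteIfExists(out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u622A\u6B62" + "2001" + "\u5E74";
        // Reports whether the characters survived both encodings intact.
        System.out.println(text.equals(roundTrip(text, "UTF-16")));
    }
}
```

If this round trip preserves the characters on a given machine, the reader
side is presumably fine, and the loss would have to happen later, at the point
where the XML is written out.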


What am I doing wrong?


John