Posted to j-users@xerces.apache.org by Yannis Haralambous <ya...@telecom-bretagne.eu> on 2009/10/05 14:04:27 UTC
Encoding problem
Hi,
I have the following problem:
I wrote a minimal class implementing SAX (I attach it to this
message). In this class I do something very simple:
public void characters(char[] ch, int start, int length) throws
SAXException
{
String s = new String(ch, start, length);
System.out.print(s);
}
I apply this class to the following minimal document:
<?xml version="1.0" encoding="utf-8"?>
<a>dégénéré</a>
where the "é" characters are coded in UTF-8 (bytes 0xC3 0xA9). When I
compile the class with the latest version of Xerces-J and run it on
Mac OS X 10.6 I get a very surprising result: the string dégénéré,
where each "é" character is represented by the single byte 0x8E. This
was the position of the letter "é" in the old (Mac OS 9) MacRoman encoding.
What I don't understand is (a) why does Xerces change the encoding?
(b) why does it choose a completely obsolete Mac encoding?
I have tried the same class under Windows XP: when I run it under
Eclipse I get correct UTF-8 output, and when I run it in a Windows
terminal, I get the output in Windows Latin-1 (é is represented by
byte 0xE9), which is again a 1-byte encoding.
Could you please tell me what to add to my code so that I will always
obtain UTF-8, regardless of the platform? (I used this code a few
years ago, and I never had this problem…)
thanks in advance!
Re: Encoding problem
Posted by ke...@us.ibm.com.
Yeah, that will do it. If you want to fix it at this level, you need to
set the output stream to use UTF-8 encoding rather than the JVM's default
for that platform.
______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
-- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (
http://www.ovff.org/pegasus/songs/threes-rev-11.html)
Re: Encoding problem
Posted by Yannis Haralambous <ya...@telecom-bretagne.eu>.
It works! Thank you so much Michael!!
Long live Toronto!
--
+-----------------------------------------------------------------------+
| Yannis Haralambous, Ph.D. yannis.haralambous@telecom-bretagne.eu |
| Directeur d'Études http://omega.enstb.org/yannis |
| twitter : y_haralambous |
| Tel. +33 (0)2.29.00.14.27 |
| Fax +33 (0)2.29.00.12.82 |
| Département Informatique |
| Télécom Bretagne |
| Technopôle de Brest Iroise, CS 83818, 29238 Brest Cedex 3, France |
| Coordonnées Google-Earth : 48°21'31.57"N 4°34'16.76"W |
+-----------------------------------------------------------------------+
...pour distinguer l'extérieur d'un aquarium,
mieux vaut n'être pas poisson

...the ball I threw while playing in the park
has not yet reached the ground

Es gab eine Zeit, wo ich nur ungern über Schubert sprechen,
nur Nächtens den Bäumen und Sternen von ihm vorerzählen mögen.
Re: Encoding problem
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Try calling flush() on the Writer.
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
Yannis Haralambous <ya...@telecom-bretagne.eu> wrote on
10/05/2009 02:13:00 PM:
> On 5 Oct 2009, at 17:23, Michael Glavassevich wrote:
>
> Wrap System.out in an OutputStreamWriter:
>
> Writer writer = new OutputStreamWriter(System.out, "UTF-8");
>
> and call the write() methods on this Writer. Always do this when you
> want a specific encoding. Relying on the default encoding is a bug
> waiting to happen even if you think you can control it.
>
> Thanks.
>
> Thank you for the advice
> After several attempts, when writing
>
> public void characters(char[] ch, int start, int length) throws
> SAXException
> {
> try
> {
> OutputStream out=System.out;
> OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
> String s = new String(ch, start, length);
> writer.write(s);
> }
> catch (IOException e) {}
> }
>
> the file compiles, but I don't get any output… What is wrong???
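For readers landing on this thread later, here is a minimal sketch of where the exchange above ends up: create one UTF-8 Writer for the whole parse instead of a new one in every characters() call, and flush it once the document ends. The class name Utf8EchoHandler and the helper parseToBytes are illustrative only and do not come from the thread. The sketch assumes Java 7 or later for StandardCharsets (on Java 6, pass the charset name "UTF-8" as in the advice quoted above) and uses whatever JAXP SAX parser the JDK provides.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: echoes character data as UTF-8 bytes, whatever the JVM default is.
public class Utf8EchoHandler extends DefaultHandler {
    // One Writer for the whole document, with the encoding stated explicitly.
    private final Writer writer;

    public Utf8EchoHandler(OutputStream out) {
        this.writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            writer.write(ch, start, length);
        } catch (IOException e) {
            // Report the failure instead of swallowing it.
            throw new SAXException(e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            // Without this flush the encoded bytes can stay in the Writer's buffer.
            writer.flush();
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    // Illustrative helper: parse an XML string and return the bytes the handler wrote.
    public static byte[] parseToBytes(String xml) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new Utf8EchoHandler(sink));
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // \u00e9 is "é"; the escape keeps this source file independent of the compiler's encoding.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><a>d\u00e9g\u00e9n\u00e9r\u00e9</a>";
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new Utf8EchoHandler(System.out));
    }
}
```

Flushing in endDocument() rather than after every write is a design choice for this sketch; flushing inside characters() would also work, at the cost of more frequent writes to the underlying stream.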
Re: Encoding problem
Posted by Yannis Haralambous <ya...@telecom-bretagne.eu>.
On 5 Oct 2009, at 17:23, Michael Glavassevich wrote:
> Wrap System.out in an OutputStreamWriter:
>
> Writer writer = new OutputStreamWriter(System.out, "UTF-8");
>
> and call the write() methods on this Writer. Always do this when you
> want a specific encoding. Relying on the default encoding is a bug
> waiting to happen even if you think you can control it.
>
> Thanks.
>
>
Thank you for the advice.
After several attempts, when I write
public void characters(char[] ch, int start, int length) throws
SAXException
{
try
{
OutputStream out=System.out;
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
String s = new String(ch, start, length);
writer.write(s);
}
catch (IOException e) {}
}
the file compiles, but I don't get any output… What is wrong???
Re: Encoding problem
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Wrap System.out in an OutputStreamWriter:
Writer writer = new OutputStreamWriter(System.out, "UTF-8");
and call the write() methods on this Writer. Always do this when you want a
specific encoding. Relying on the default encoding is a bug waiting to
happen even if you think you can control it.
Thanks.
Re: Encoding problem
Posted by Yannis Haralambous <ya...@telecom-bretagne.eu>.
Thank you for your answer,
but there are still some things I don't understand:
1) the locale of my system is:
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=
so, clearly, UTF-8 is the default encoding of my platform. If the
default encoding is not the one given by the locale, then what is it?
2) using MacRoman is an anachronism: that encoding was used in Mac OS
9, and the current Mac OS X 10.6 has absolutely no way of running Mac OS 9
applications. MacRoman is dead and buried; what I'm seeing is a ghost…
How can I change the behavior of PrintStream.print()?
thanks in advance
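One way to approach both questions above, sketched under assumptions: the JVM can be asked which charset it treats as its default (this need not match the shell locale), and System.out can be replaced with a PrintStream whose encoding is given explicitly. The class name Utf8Stdout and the helper utf8Stream are hypothetical names for this illustration. The file.encoding property is printed only as a diagnostic; how it relates to the default charset can differ between JVMs, so treat that line as something to check on your own system.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical helper class: inspects the JVM default and forces UTF-8 on standard output.
public class Utf8Stdout {

    // Wraps any byte stream in a PrintStream that encodes characters as UTF-8.
    static PrintStream utf8Stream(OutputStream out) throws UnsupportedEncodingException {
        // autoFlush = true; the third argument names the encoding explicitly.
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Diagnostics go to stderr so they do not mix with the real output.
        System.err.println("JVM default charset: " + Charset.defaultCharset());
        System.err.println("file.encoding property: " + System.getProperty("file.encoding"));

        // From here on, System.out.print() encodes as UTF-8 instead of the default.
        System.setOut(utf8Stream(System.out));
        System.out.print("d\u00e9g\u00e9n\u00e9r\u00e9"); // "dégénéré" written with escapes
        System.out.flush();
    }
}
```

Whether the terminal then displays the bytes correctly is a separate matter: the terminal must itself be configured to interpret its input as UTF-8.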
Re: Encoding problem
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
FYI: I meant PrintStream.print() [1] though the PrintWriter variant also
uses the default encoding.
[1] http://java.sun.com/javase/6/docs/api/java/io/PrintStream.html#print(java.lang.String)
Re: Encoding problem
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
keshlam@us.ibm.com wrote on 10/05/2009 09:19:32 AM:
> > There is no stylesheet, I'm not using any XSLT file. It is simply
> > SAX reading the XML file and writing to standard output.
>
> Sorry; I'm used to thinking in terms of Xalan rather than Xerces and
> gave the wrong answer.
>
> Can you confirm whether the problem is occurring in the parser or on
> the writing-to-standard-output side?
>
> How are you setting up the SAX serializer?
He's not. The code is writing to System.out.println() [1] which always uses
the platform's default encoding. One of those Java I/O gotchas folks keep
tripping over. Has nothing to do with SAX or Xerces.
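The byte values reported at the start of the thread (0xC3 0xA9, 0x8E, 0xE9) are what one would expect if the parser delivered the correct character and only the output encoding differed. A small sketch that encodes "é" under several charsets makes this visible. The class name EncodingBytes and the method hex are made up for this example; ISO-8859-1 stands in for "Windows Latin-1" because every Java platform is required to support it, and x-MacRoman is queried only if the running JVM happens to provide it.

```java
import java.nio.charset.Charset;

// Hypothetical demo: the same character, encoded with different charsets.
public class EncodingBytes {

    // Returns the bytes of text in the named charset as space-separated hex pairs.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // "é", escaped so the source encoding does not matter

        System.out.println("UTF-8:      " + hex(eAcute, "UTF-8"));
        System.out.println("ISO-8859-1: " + hex(eAcute, "ISO-8859-1"));

        // x-MacRoman is an optional charset, so check before using it.
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println("x-MacRoman: " + hex(eAcute, "x-MacRoman"));
        } else {
            System.out.println("x-MacRoman is not available in this JVM");
        }
    }
}
```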
> Or, if you aren't using our serializer, how are you writing to
> standard output?
Thanks.
[1] http://java.sun.com/javase/6/docs/api/java/io/PrintWriter.html#print(java.lang.String)
Re: Encoding problem
Posted by ke...@us.ibm.com.
> There is no stylesheet, I'm not using any XSLT file. It is simply SAX
> reading the XML file and writing to standard output.
Sorry; I'm used to thinking in terms of Xalan rather than Xerces and gave
the wrong answer.
Can you confirm whether the problem is occurring in the parser or on the
writing-to-standard-output side?
How are you setting up the SAX serializer? Or, if you aren't using our
serializer, how are you writing to standard output?
Re: Encoding problem
Posted by Yannis Haralambous <ya...@telecom-bretagne.eu>.
There is no stylesheet, I'm not using any XSLT file. It is simply SAX
reading the XML file and writing to standard output.
Maybe I'm missing some crucial information?
On 5 Oct 2009, at 14:41, keshlam@us.ibm.com wrote:
> To guarantee UTF-8 output (assuming the processor is writing
> directly out to the file rather than producing a SAX or DOM output
> which other code then writes out), specify the encoding in the
> stylesheet's <xsl:output> directive.
>
> Though I'd be sorta surprised if UTF-8 isn't the default...
Re: Encoding problem
Posted by ke...@us.ibm.com.
To guarantee UTF-8 output (assuming the processor is writing directly out
to the file rather than producing a SAX or DOM output which other code
then writes out), specify the encoding in the stylesheet's <xsl:output>
directive.
Though I'd be sorta surprised if UTF-8 isn't the default...