Posted to j-users@xerces.apache.org by Yannis Haralambous <ya...@telecom-bretagne.eu> on 2009/10/05 14:04:27 UTC

Encoding problem

Hi,

I have the following problem:

I wrote a minimal class implementing a SAX handler (I attach it to this
message). In this class I do something very simple:

     public void characters(char[] ch, int start, int length)
         throws SAXException
     {
         String s = new String(ch, start, length);
         System.out.print(s);
     }

I apply this class to the following minimal document:

<?xml version="1.0" encoding="utf-8"?>
<a>dégénéré</a>

where the "é" characters are coded in UTF-8 (bytes 0xC3 0xA9). When I
compile the class with the latest version of Xerces-J and run it on
MacOS X 10.6 I get a very surprising result: the string dégénéré,
where the "é" characters are represented by the single byte 0x8E. This
was the position of letter "é" in the old (MacOS 9) encoding MacRoman.

What I don't understand is (a) why does Xerces change the encoding, and
(b) why does it choose a completely obsolete Mac encoding?

I have tried the same class under Windows XP: when I run it under
Eclipse I get correct UTF-8 output, and when I run it in a Windows
terminal I get the output in Windows Latin-1 (é is represented by
byte 0xE9), which is again a 1-byte encoding.

Could you please tell me what to add to my code so that I will always
obtain UTF-8, regardless of the platform? (I used this code a few
years ago, and I never had this problem…)

thanks in advance!

Re: Encoding problem

Posted by ke...@us.ibm.com.
Yeah, that will do it. If you want to fix it at this level, you need to
set the output stream to use UTF-8 encoding rather than the JVM's default
for that platform.
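
A minimal sketch of that approach, assuming the standard
PrintStream(OutputStream, boolean, String) constructor. The class name
Utf8Stdout and the helper utf8() are made up for illustration; they are not
part of Xerces or of the original poster's class.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Stdout {
    // Hypothetical helper: wraps a stream in a PrintStream that always
    // encodes characters as UTF-8, ignoring the JVM's default charset.
    static PrintStream utf8(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8"); // true = autoflush on println
    }

    public static void main(String[] args) throws Exception {
        PrintStream out = utf8(System.out);
        out.print("d\u00e9g\u00e9n\u00e9r\u00e9"); // "dégénéré" via escapes
        out.println();
    }
}
```

Whether the terminal then displays the bytes correctly still depends on the
terminal's own encoding setting; the sketch only controls what bytes Java
writes.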



______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (http://www.ovff.org/pegasus/songs/threes-rev-11.html)



Michael Glavassevich <mr...@ca.ibm.com> wrote on 10/05/2009 09:33 AM:

FYI: I meant PrintStream.print() [1] though the PrintWriter variant also 
uses the default encoding.

[1] http://java.sun.com/javase/6/docs/api/java/io/PrintStream.html#print(java.lang.String)


Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org


Re: Encoding problem

Posted by Yannis Haralambous <ya...@telecom-bretagne.eu>.
It works! Thank you so much Michael!!
Long live Toronto!

On 5 Oct 2009 at 21:14, Michael Glavassevich wrote:

> Try calling flush() on the Writer.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org

--
+-----------------------------------------------------------------------+
| Yannis Haralambous, Ph.D.      yannis.haralambous@telecom-bretagne.eu |
| Directeur d'Études                      http://omega.enstb.org/yannis |
|                                               twitter : y_haralambous |
|                                             Tel. +33 (0)2.29.00.14.27 |
|                                             Fax  +33 (0)2.29.00.12.82 |
| Département Informatique                                              |
| Télécom Bretagne                                                      |
| Technopôle de Brest Iroise, CS 83818, 29238 Brest Cedex 3, France     |
| Coordonnées Google-Earth : 48°21'31.57"N 4°34'16.76"W                 |
+-----------------------------------------------------------------------+
                            ...pour distinguer l'extérieur d'un aquarium,
                                           mieux vaut n'être pas poisson

                           ...the ball I threw while playing in the park
                                          has not yet reached the ground

              Es gab eine Zeit, wo ich nur ungern über Schubert sprechen,
           nur Nächtens den Bäumen und Sternen von ihm vorerzählen mögen.




Re: Encoding problem

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Try calling flush() on the Writer.
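
The reason flush() matters: OutputStreamWriter buffers the encoded bytes,
and the characters() method quoted below creates a fresh writer on every
call and drops it without flushing, so nothing ever reaches System.out.
A sketch of one way to arrange it, assuming a handler that extends
DefaultHandler and the JAXP parser bundled with the JDK (the class name
Utf8TextHandler is hypothetical, not the poster's actual class): create the
Writer once, write in characters(), and flush in endDocument().

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Utf8TextHandler extends DefaultHandler {
    // One writer for the whole parse, with an explicit encoding.
    private final Writer writer;

    public Utf8TextHandler(OutputStream out) throws UnsupportedEncodingException {
        this.writer = new OutputStreamWriter(out, "UTF-8");
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            writer.write(ch, start, length); // buffered; not yet on the stream
        } catch (IOException e) {
            throw new SAXException(e); // report instead of swallowing
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            writer.flush(); // push the buffered bytes to the underlying stream
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the XML file to read.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File(args[0]), new Utf8TextHandler(System.out));
    }
}
```

Flushing inside characters() after every write would also work; flushing
once at the end of the document simply avoids repeated small writes.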

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Yannis Haralambous <ya...@telecom-bretagne.eu> wrote on
10/05/2009 02:13:00 PM:

> On 5 Oct 2009 at 17:23, Michael Glavassevich wrote:
>
> Wrap System.out in an OutputStreamWriter:
>
> Writer writer = new OutputStreamWriter(System.out, "UTF-8");
>
> and call the write() methods on this Writer. Always do this when you
> want a specific encoding. Relying on the default encoding is a bug
> waiting to happen even if you think you can control it.
>
> Thanks.

>
> Thank you for the advice
> After several attempts, when writing
>
>     public void characters(char[] ch, int start, int length) throws
> SAXException
>     {
> try
> {
> OutputStream out=System.out;
> OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
> String s = new String(ch, start, length);
>        writer.write(s);
>        }
> catch (IOException e) {}
>     }
>
> the file compiles, but I don't get any output… What is wrong???

Re: Encoding problem

Posted by Yannis Haralambous <ya...@telecom-bretagne.eu>.
On 5 Oct 2009 at 17:23, Michael Glavassevich wrote:

> Wrap System.out in an OutputStreamWriter:
>
> Writer writer = new OutputStreamWriter(System.out, "UTF-8");
>
> and call the write() methods on this Writer. Always do this when you  
> want a specific encoding. Relying on the default encoding is a bug  
> waiting to happen even if you think you can control it.
>
> Thanks.
>
>


Thank you for the advice.
After several attempts, when writing

     public void characters(char[] ch, int start, int length)
         throws SAXException
     {
         try
         {
             OutputStream out = System.out;
             OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
             String s = new String(ch, start, length);
             writer.write(s);
         }
         catch (IOException e) {}
     }

the file compiles, but I don't get any output… What is wrong???







Re: Encoding problem

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Wrap System.out in an OutputStreamWriter:

Writer writer = new OutputStreamWriter(System.out, "UTF-8");

and call the write() methods on this Writer. Always do this when you want a
specific encoding. Relying on the default encoding is a bug waiting to
happen even if you think you can control it.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Yannis Haralambous <ya...@telecom-bretagne.eu> wrote on
10/05/2009 10:34:00 AM:

> Thank you for your answer
>
> but still there are some things I don't understand:
>
> 1) the locale of my system is:
>
> LANG="fr_FR.UTF-8"
> LC_COLLATE="fr_FR.UTF-8"
> LC_CTYPE="fr_FR.UTF-8"
> LC_MESSAGES="fr_FR.UTF-8"
> LC_MONETARY="fr_FR.UTF-8"
> LC_NUMERIC="fr_FR.UTF-8"
> LC_TIME="fr_FR.UTF-8"
> LC_ALL=
>
> so, clearly, UTF-8 is the default encoding of my platform. If the
> default encoding is not the one of the locale, then what is it?
>
> 2) using MacRoman is an anachronism: that encoding was used in MacOS
> 9, and current MacOS 10.6 has absolutely no way of running MacOS 9
> applications. MacRoman is dead and buried, what I'm seeing is a ghost…
>
> How can I change the behavior of PrintStream.print() ??
>
> thanks in advance
>
> --
> +-----------------------------------------------------------------------+
> | Yannis Haralambous, Ph.D.      yannis.haralambous@telecom-bretagne.eu |
> | Directeur d'Études                      http://omega.enstb.org/yannis |
> |                                               twitter : y_haralambous |
> |                                             Tel. +33 (0)2.29.00.14.27 |
> |                                             Fax  +33 (0)2.29.00.12.82 |
> | Département Informatique                                              |
> | Télécom Bretagne                                                      |
> | Technopôle de Brest Iroise, CS 83818, 29238 Brest Cedex 3, France     |
> | Coordonnées Google-Earth : 48°21'31.57"N 4°34'16.76"W                 |
> +-----------------------------------------------------------------------+
>                             ...pour distinguer l'extérieur d'un aquarium,
>                                            mieux vaut n'être pas poisson,
>
>                            ...the ball I threw while playing in the park
>                                           has not yet reached the ground
>
>               Es gab eine Zeit, wo ich nur ungern über Schubert sprechen,
>            nur Nächtens den Bäumen und Sternen von ihm vorerzählen mögen.
> [attachment "Yannis Haralambous.vcf" deleted by Michael
> Glavassevich/Toronto/IBM]

Re: Encoding problem

Posted by Yannis Haralambous <ya...@telecom-bretagne.eu>.
Thank you for your answer

but still there are some things I don't understand:

1) the locale of my system is:

LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=

so, clearly, UTF-8 is the default encoding of my platform. If the  
default encoding is not the one of the locale, then what is it?

2) using MacRoman is an anachronism: that encoding was used in MacOS  
9, and current MacOS 10.6 has absolutely no way of running MacOS 9  
applications. MacRoman is dead and buried, what I'm seeing is a ghost…

How can I change the behavior of PrintStream.print() ??

thanks in advance

On 5 Oct 2009 at 15:33, Michael Glavassevich wrote:

> FYI: I meant PrintStream.print() [1] though the PrintWriter variant  
> also uses the default encoding.
>
> [1] http://java.sun.com/javase/6/docs/api/java/io/PrintStream.html#print(java.lang.String)
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org





Re: Encoding problem

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
FYI: I meant PrintStream.print() [1] though the PrintWriter variant also
uses the default encoding.

[1] http://java.sun.com/javase/6/docs/api/java/io/PrintStream.html#print(java.lang.String)

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Michael Glavassevich/Toronto/IBM@IBMCA wrote on 10/05/2009 09:24:39 AM:

> keshlam@us.ibm.com wrote on 10/05/2009 09:19:32 AM:
>
> > > There is no stylesheet, I'm not using any XSLT file. It is simply
> > SAX reading the XML file and writing to standard output.
> >
> > Sorry; I'm used to thinking in terms of Xalan rather than Xerces and
> > gave the wrong answer.
> >
> > Can you confirm whether the problem is occurring in the parser or on
> > the writing-to-standard-output side?
> >
> > How are you setting up the SAX serializer?
>
> He's not. The code is writing to System.out.println() [1] which
> always uses the platform's default encoding. One of those Java I/O
> gotchas folks keep tripping over. Has nothing to do with SAX or Xerces.
>
> > Or, if you aren't using our serializer, how are you writing to
> > standard output?
> >
> > ______________________________________
> > "... Three things see no end: A loop with exit code done wrong,
> > A semaphore untested, And the change that comes along. ..."
> >  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (http://www.ovff.
> > org/pegasus/songs/threes-rev-11.html)
>
> Thanks.
>
> [1] http://java.sun.com/javase/6/docs/api/java/io/PrintWriter.
> html#print(java.lang.String)
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org

Re: Encoding problem

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
keshlam@us.ibm.com wrote on 10/05/2009 09:19:32 AM:

> > There is no stylesheet, I'm not using any XSLT file. It is simply
> SAX reading the XML file and writing to standard output.
>
> Sorry; I'm used to thinking in terms of Xalan rather than Xerces and
> gave the wrong answer.
>
> Can you confirm whether the problem is occurring in the parser or on
> the writing-to-standard-output side?
>
> How are you setting up the SAX serializer?

He's not. The code is writing to System.out.println() [1] which always uses
the platform's default encoding. One of those Java I/O gotchas folks keep
tripping over. Has nothing to do with SAX or Xerces.
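
To see what the JVM considers its default encoding, and how "é" comes out
under a few charsets, something like the following can be run. It is only a
sketch: the class name DefaultCharsetProbe and the helper hex() are made up
for illustration, and x-MacRoman is an optional charset that may not be
present in every JRE, so it is checked with Charset.isSupported first.

```java
import java.nio.charset.Charset;

public class DefaultCharsetProbe {
    // Hypothetical helper: encodes s in the named charset and returns
    // the resulting bytes as space-separated uppercase hex.
    static String hex(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // What the JVM reports as its default (not necessarily the shell locale).
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Bytes produced for U+00E9 under each charset that is available.
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "x-MacRoman"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + hex("\u00e9", name));
            }
        }
    }
}
```

For UTF-8 the helper yields C3 A9 and for ISO-8859-1 it yields E9, matching
the byte values reported in the original question.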

> Or, if you aren't using our serializer, how are you writing to
> standard output?
>
> ______________________________________
> "... Three things see no end: A loop with exit code done wrong,
> A semaphore untested, And the change that comes along. ..."
>  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (http://www.ovff.
> org/pegasus/songs/threes-rev-11.html)

Thanks.

[1] http://java.sun.com/javase/6/docs/api/java/io/PrintWriter.html#print(java.lang.String)

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: Encoding problem

Posted by ke...@us.ibm.com.
> There is no stylesheet, I'm not using any XSLT file. It is simply SAX
> reading the XML file and writing to standard output.

Sorry; I'm used to thinking in terms of Xalan rather than Xerces and gave 
the wrong answer. 

Can you confirm whether the problem is occurring in the parser or on the
writing-to-standard-output side?

How are you setting up the SAX serializer? Or, if you aren't using our 
serializer, how are you writing to standard output?

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (http://www.ovff.org/pegasus/songs/threes-rev-11.html)

Re: Encoding problem

Posted by Yannis Haralambous <ya...@telecom-bretagne.eu>.
There is no stylesheet, I'm not using any XSLT file. It is simply SAX  
reading the XML file and writing to standard output.

Maybe I'm missing some crucial information?

On 5 Oct 2009 at 14:41, keshlam@us.ibm.com wrote:

> To guarantee UTF-8 output (assuming the processor is writing  
> directly out to the file rather than producing a SAX or DOM output  
> which other code then writes out), specify the encoding in the  
> stylesheet's <xsl:output> directive.
>
> Though I'd be sorta surprised if UTF-8 isn't the default...
>
> ______________________________________
> "... Three things see no end: A loop with exit code done wrong,
> A semaphore untested, And the change that comes along. ..."
>  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (http://www.ovff.org/pegasus/songs/threes-rev-11.html)





Re: Encoding problem

Posted by ke...@us.ibm.com.
To guarantee UTF-8 output (assuming the processor is writing directly out 
to the file rather than producing a SAX or DOM output which other code 
then writes out), specify the encoding in the stylesheet's <xsl:output> 
directive.

Though I'd be sorta surprised if UTF-8 isn't the default...

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (http://www.ovff.org/pegasus/songs/threes-rev-11.html)