Posted to dev@directory.apache.org by Tino Schwarze <ap...@tisc.de> on 2006/12/29 17:27:52 UTC

UTF-8 woes

Hi there,

we had a UTF-8 discussion this week, didn't we?

I'm serving content from my open-EIS partition and UTF-8 characters get
garbled. I see that it happens because I build an RDN for the
SearchResult and the RdnParser escapes non-ASCII characters in the
values (like '\C3\A4' for the German umlaut ä).

JXplorer shows this as %5cC3%5cA4 and sends it back this way when I try
to expand the appropriate node.

Who is to blame and what can I do about that?

Thanks,

Tino.

-- 
www.quantenfeuerwerk.de
www.spiritualdesign-chemnitz.de
www.lebensraum11.de

Re: UTF-8 woes

Posted by Emmanuel Lecharny <el...@gmail.com>.
Ersin Er a écrit :

> On 12/29/06, Emmanuel Lecharny <el...@gmail.com> wrote:
>
>> Ersin Er a écrit :
>>
>> > On 12/29/06, Emmanuel Lecharny <el...@gmail.com> wrote:
>> >
>> >> AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)
>> >>
>> >> You will have to be a little bit more explicit... How do you build
>> >> your RDN?
>> >> FYI, it is supposed to be a UTF-8 encoded String, so if you are to
>> >> code an
>> >> ä, you will have to :
>> >> - create a byte array containing it's counterpart (0xC3 0xa4) and do
>> >> a new
>> >> String( byteArray, "UTF-8" ) before passing it to the RDN constructor
>> >> - OR do a new RDN( "\u00e4" );
>> >>
>> >> never do a new RDN( "ä" ), because then the String will be 
>> considered as
>> >> ISO-8859-1 encoded  string (at least in Germany or in France, not in
>> >> Turkey
>> >> :)
>> >
>> >
>> > What is the difference between creating an RDN with new RDN( "ä" ) and
>> > with new String( new byte[] { 0xC3, 0xa4 }, "UTF-8" ) ?
>>
>> There is a _big_ difference, because your java file might have been
>> saved using a ISO-8859-1 encoding. new RDN( "ä" ) just use the default
>> encoding of your computer to store the file, and inside this file you
>> have this "ä". There is no guarantee at all that it will be correct when
>> you transform the string to UTF-8 bytes on another computer, using a
>> different encoding. Using new String( new byte[] { 0xC3, 0xa4 }, "UTF-8"
>> ) tells the compiler that the bytes are UTF-8 encoded (and UTF-8 =
>> unicode encoded using bytes), and then it helps to translate the String
>> to UTF-16. Of course, using \u00e4 should be the prefered way if you are
>> to use internal Strings like "This is an umlaut : \u00e4" in your 
>> java file.
>
>
> If your source code file contains "special characters" encoded in X
> encoding, and if you compile that code with javac using the encoding X
> (-encoding X), then there can be no problem. The so called special
> character is safely translated to Java internal encoding. There is no
> UTF-8 related stuff here. The X can be UTF-8 or not, that's all.

Yes, but then you will have to inform all the users about the encoding
used when you saved the java file. And trust me, people in Korea
are not using ISO-8859-1 encoding, and have no idea what an "ä" is...
The reverse is also true :)
Using -encoding X is overkill, IMHO. It's much preferable to
declare those special chars using the '\uxxxx' notation or, for
international strings, to use external property files, one per
language if needed (_FR, _DE, ... property files).
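A minimal sketch of the property-file approach, assuming nothing beyond the JDK: java.util.Properties decodes '\uxxxx' escapes itself, so the file can stay pure ASCII whatever editor encoding is in use. The class name, the key "greeting" and the German sample value are made up for illustration, and a StringReader stands in for a real _DE property file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertyEscapeDemo {
    // Loads a one-line properties text whose value is written in pure ASCII.
    // The doubled backslashes keep the escapes as plain text in the Java
    // literal, so it is Properties.load() that decodes them, not javac.
    static String loadGreeting() {
        Properties props = new Properties();
        try {
            props.load(new StringReader("greeting=Gr\\u00fc\\u00dfe"));
        } catch (IOException e) {
            // Cannot really happen with an in-memory reader.
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        // The value now holds the real u-umlaut and sharp-s characters.
        System.out.println(loadGreeting().length()); // 5
    }
}
```

The same text could be saved under ISO-8859-1, ISO-8859-9 or UTF-8 and would load identically, since only ASCII bytes are involved.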

>
> You can create your source code with ISO-8859-1, and safely compile it
> without the encoding option while your platform encoding is
> ISO-8859-1. The special characters will be converted to safe Java
> UTF-16 forms. But if you send it to me, and if my platform encoding is
> ISO-8859-9 (Turkish), and if I compile it with just javac (no encoding
> option), the strings will be malformed (but will still compile). If I
> give the option -encoding ISO-8859-1 to the compiler, there will be no
> problem. There is still no problem related to UTF-8 here.

I didn't say that there was a problem with UTF-8. UTF-8 is just a way to
encode Unicode using bytes. But, yes, you are right: given that you
_know_ that I have used ISO-8859-1 encoding to write my file, you
just have to pass the -encoding ISO-8859-1 flag to compile it on your
platform. But I hope you know which encoding Trustin is using, or any
other people in the world not living in Western Europe or the USA :) A
little bit cumbersome, isn't it ?

Whatever, this should not be a problem for us. Again, if you have to use
special chars in your code, use the '\uxxxx' notation, for the good of all
other people. If it's for messages, then I18n is your friend. And
whatever the encoding of your file (ISO-8859-1 or -2 or -xxx), it will be
OK, as you will only use US-ASCII chars, so the -encoding flag will be
useless ;)

oh, btw, Unicode is really a mess, did I already say that ? :)

Emmanuel

Re: UTF-8 woes

Posted by Ersin Er <er...@gmail.com>.
On 12/29/06, Emmanuel Lecharny <el...@gmail.com> wrote:
> Ersin Er a écrit :
>
> > On 12/29/06, Emmanuel Lecharny <el...@gmail.com> wrote:
> >
> >> AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)
> >>
> >> You will have to be a little bit more explicit... How do you build
> >> your RDN?
> >> FYI, it is supposed to be a UTF-8 encoded String, so if you are to
> >> code an
> >> ä, you will have to :
> >> - create a byte array containing it's counterpart (0xC3 0xa4) and do
> >> a new
> >> String( byteArray, "UTF-8" ) before passing it to the RDN constructor
> >> - OR do a new RDN( "\u00e4" );
> >>
> >> never do a new RDN( "ä" ), because then the String will be considered as
> >> ISO-8859-1 encoded  string (at least in Germany or in France, not in
> >> Turkey
> >> :)
> >
> >
> > What is the difference between creating an RDN with new RDN( "ä" ) and
> > with new String( new byte[] { 0xC3, 0xa4 }, "UTF-8" ) ?
>
> There is a _big_ difference, because your java file might have been
> saved using a ISO-8859-1 encoding. new RDN( "ä" ) just use the default
> encoding of your computer to store the file, and inside this file you
> have this "ä". There is no guarantee at all that it will be correct when
> you transform the string to UTF-8 bytes on another computer, using a
> different encoding. Using new String( new byte[] { 0xC3, 0xa4 }, "UTF-8"
> ) tells the compiler that the bytes are UTF-8 encoded (and UTF-8 =
> unicode encoded using bytes), and then it helps to translate the String
> to UTF-16. Of course, using \u00e4 should be the prefered way if you are
> to use internal Strings like "This is an umlaut : \u00e4" in your java file.

If your source code file contains "special characters" encoded in X
encoding, and if you compile that code with javac using the encoding X
(-encoding X), then there can be no problem. The so called special
character is safely translated to Java internal encoding. There is no
UTF-8 related stuff here. The X can be UTF-8 or not, that's all.

You can create your source code with ISO-8859-1, and safely compile it
without the encoding option while your platform encoding is
ISO-8859-1. The special characters will be converted to safe Java
UTF-16 forms. But if you send it to me, and if my platform encoding is
ISO-8859-9 (Turkish), and if I compile it with just javac (no encoding
option), the strings will be malformed (but will still compile). If I
give the option -encoding ISO-8859-1 to the compiler, there will be no
problem. There is still no problem related to UTF-8 here.
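The scenario above can be simulated in a few lines of Java. This is only a sketch: the class name SourceEncodingDemo and the sample bytes are made up for illustration, and UTF-8 stands in for the "other" platform encoding (instead of ISO-8859-9) because every JVM is guaranteed to support it. It decodes the bytes of a string literal saved as ISO-8859-1 once with the matching encoding and once with a mismatched one, which is roughly what javac does when it reads a source file.

```java
import java.nio.charset.StandardCharsets;

class SourceEncodingDemo {
    // Bytes of the literal "ä" as an editor would store it in an
    // ISO-8859-1 file: quote (0x22), a-umlaut (0xE4), quote (0x22).
    static final byte[] SOURCE_BYTES = { 0x22, (byte) 0xE4, 0x22 };

    // Reads the bytes with the encoding the file was really saved in.
    static String readAsLatin1() {
        return new String(SOURCE_BYTES, StandardCharsets.ISO_8859_1);
    }

    // Reads the same bytes under a mismatched encoding (UTF-8 here).
    static String readAsUtf8() {
        return new String(SOURCE_BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String expected = "\"\u00e4\"";
        System.out.println(readAsLatin1().equals(expected)); // true
        System.out.println(readAsUtf8().equals(expected));   // false
    }
}
```

In the mismatched case the text still decodes without an error, which matches the "malformed but still compiles" behaviour described above.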

A mini reference: http://www.jorendorff.com/articles/unicode/java.html

> > There is
> > nothing as "UTF-8" String in Java.
>
> When you write new String( <some bytes>, "UTF-8" ), you just tell the
> JVM that the byte array is supposed to be a UTF-8 encoded String. It
> will trasnlate those bytes to UTF-16 chars, using one or two char if
> needed (Unicode can use up to 2^32 bits). For instance, the é in my name
> as a value of 0xE9 in Unicode, and is encoded 0xC3, 0xA9 in UTF-8. If
> you don't tell String() that the bytes array is UTF-8 encoded, then it
> will just consider that the byte array is using the default platform
> encoding. And if it's ISO8859-1, 0xC3 = 'Ã', and 0xA9 = '©', so you have
> now a Java String with is 2 chars long instead of one char long...



> > All strings are UTF-16. You can get
> > their representations in other encodings as byte arrays. So when you
> > do a new RDN( "ä" ), it should be converted to UTF-16 internally. What
> > am I missing here?
>
> It is transformed to UTF-16 accordingling to the encoding used on your
> platform. But then, if your local encoding is ISO-8859-1, when doing a
> String.getBytes( "UTF-8" ), you might have something very different to
> that you were expecting.
>
> Ok, this is not simple. A simple rule then :
> *always use \uxxxx when encoding non ASCII characters in a java file*
>
> >
> > (Not being able to display the character in source code in other
> > platforms is a different matter. It's about the text editor encoding.)
>
> yes, but you always use an editor to write your java file...
>
> At this point, I may also miss something, but I would then like to have
> more informations like a test case which expose the problem.
>
> Emmanuel.
>


-- 
Ersin

Re: UTF-8 woes

Posted by Emmanuel Lecharny <el...@gmail.com>.
Tino Schwarze a écrit :

>On Fri, Dec 29, 2006 at 08:37:07PM +0100, Emmanuel Lecharny wrote:
>
>  
>
>>There is a _big_ difference, because your java file might have been 
>>saved using a ISO-8859-1 encoding. new RDN( "ä" ) just use the default 
>>encoding of your computer to store the file, and inside this file you 
>>    
>>
>
>I was not talking about characters from a Java source, the data came
>directly from a database. The test case is pretty simple:
>
>    byte[] _bytes = new byte[] { (byte)0xc3, (byte)0xa4, (byte)0xc3, (byte)0xb6, (byte)0xc3, (byte)0xbc };
>      
>    String _s = new String (_bytes, "utf-8");
>    
>    // this should print äöü
>    System.out.println (_s);
>    LdapDN _dnTest = new LdapDN();
>    _dnTest.add ("cn=Uml"+_s);
>    System.out.println (_dnTest.toString());
>  
>-> this will get mixed up instantly by the RdnParser - it converts every
>non-ASCII byte to \xy encoding (I'm not sure who's doing the "\" to "%5C"
>conversion later).
>  
>
Great, and thanks for the test case. I'm going to inject it into the junit
tests and give you feedback in a few moments. If it's a bug in RDN,
then let's fix it ! :)

Emmanuel


Re: LDAP Studio (was: Re: UTF-8 woes)

Posted by Emmanuel Lecharny <el...@gmail.com>.
This is maybe because you give an empty base DN. Doing so will get you the
rootDSE. Try to enter 'ou=system' or 'o=community4you.de' as a base DN.

I don't remember if there is a way to get the list of possible DIT as in
LdapBrowser (Fetch DN stuff).

If not, then this is something missing.




-- 
Cordialement,
Emmanuel Lécharny
www.iktek.com

LDAP Studio (was: Re: UTF-8 woes)

Posted by Tino Schwarze <ap...@tisc.de>.
On Tue, Jan 02, 2007 at 01:40:07PM +0100, Emmanuel Lecharny wrote:

> If you want to test bleeding-edge software, just try LdapStudio :
> http://cwiki.apache.org/DIRxSTUDIO/
> 
> It can be found in trunks (ldapstudio sub-project) and it works either as an
> eclipse plugin or as an RCP application.

The RCP application is working fine, but I can't access my test server.
I can access an existing OpenLDAP server correctly, but if I access my
ApacheDS test server, it doesn't show me the
  ou=system
  o=community4you.de
root entries, but just a "RootDSE" which cannot be expanded any further.
Neither JXplorer nor GQ had any difficulties with this. I'm in the
Browser perspective btw.

BTW: I can expand my tree and see "Aufträge" in GQ correctly, but fail
to expand any sub-entry below (not only below Aufträge). My partition
doesn't seem to be called at all.

Thanks for hints,

Tino.

-- 
www.quantenfeuerwerk.de
www.spiritualdesign-chemnitz.de
www.lebensraum11.de

Re: LDAP Studio (was: Re: UTF-8 woes)

Posted by Emmanuel Lecharny <el...@gmail.com>.
oops, I didn't notice that the link was broken... Fixed !

Thanks for the heads up !




-- 
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: LDAP Studio (was: Re: UTF-8 woes)

Posted by Emmanuel Lecharny <el...@gmail.com>.
Yes, the Ivy repo is sometimes down or overloaded. But as Ivy is becoming an
Apache project (being incubated atm), this will soon not be a problem
anymore :)




-- 
Cordialement,
Emmanuel Lécharny
www.iktek.com

LDAP Studio (was: Re: UTF-8 woes)

Posted by Tino Schwarze <ap...@tisc.de>.
On Tue, Jan 02, 2007 at 01:40:07PM +0100, Emmanuel Lecharny wrote:
> Yeah, JExplorer might be buggy...
> 
> If you want to test bleeding-edge software, just try LdapStudio :
> http://cwiki.apache.org/DIRxSTUDIO/

Thanks, I'll have a look. BTW: Ivy seems to be a bit difficult to get,
currently. The download-link in the Wiki is broken. This one works:
http://www.jaya.free.fr/ivy/download.html

HTH,

Tino.

-- 
www.quantenfeuerwerk.de
www.spiritualdesign-chemnitz.de
www.lebensraum11.de

Re: UTF-8 woes

Posted by Norval Hope <nr...@gmail.com>.
> Now I have correct SearchResults with correct names in them. But
> JXplorer still shows "Auftr%c3%a4ge" instead of Aufträge, so I still
> need a decodeValue method before searching. Looking at the wire protocol
> with Ethereal, I see that everything is correctly transmitted in UTF-8,
> so I suppose, it's a JXplorer issue.
>
> Are there other issues known with UTF-8 characters and LDAP clients?
>

It sounds possible that you are running into a problem due to a DN being
expressed as part of an LDAP URL. The Sun JNDI client uses URLs in
search results if you do a search under a base and then get search
results with DNs which don't fall under this base (I have seen this
problem when I had bugs writing search results from a custom
partition). In particular, one of these things needs to be true:
    1. search results need to start with the base DN used for the search or
    2. search results need to be marked as relative (isRelative() returns true).

When neither of these two things is true, the Sun JNDI client seems to
assume that the result must have come from a referral, so a DN
like "name=fred, dc=acme" becomes "ldap://host:port//name=fred,
dc=acme" and URL encoding is applied to the RDN value.

JX then tries to represent the LDAP URL in a sensible format and you
get extraneous % chars from the URL encoding.

Maybe this is what you are seeing.
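For reference, a small sketch of the "relative" flag on the standard javax.naming.directory.SearchResult class. It only shows how the flag is set and read. It does not reproduce the Sun client's referral handling, and the class name, entry names, host and port are made up for illustration.

```java
import javax.naming.directory.BasicAttributes;
import javax.naming.directory.SearchResult;

class RelativeResultDemo {
    // A result whose name is relative to the search base. The three-argument
    // constructor marks the result as relative by default.
    static SearchResult relativeResult() {
        return new SearchResult("name=fred", null, new BasicAttributes(true));
    }

    // A result explicitly flagged as NOT relative: its name is then expected
    // to be a full URL rather than a name under the search base.
    static SearchResult urlResult() {
        return new SearchResult("ldap://host:389/name=fred,dc=acme", null,
                new BasicAttributes(true), false);
    }

    public static void main(String[] args) {
        System.out.println(relativeResult().isRelative()); // true
        System.out.println(urlResult().isRelative());      // false
    }
}
```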

Re: UTF-8 woes

Posted by Emmanuel Lecharny <el...@gmail.com>.
Yeah, JXplorer might be buggy...

If you want to test bleeding-edge software, just try LdapStudio :
http://cwiki.apache.org/DIRxSTUDIO/

It can be found in trunks (ldapstudio sub-project) and it works either as an
eclipse plugin or as an RCP application.

It has not been released yet, but will be soon.

Happy new year !




-- 
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: UTF-8 woes

Posted by Tino Schwarze <ap...@tisc.de>.
On Sat, Dec 30, 2006 at 02:25:21PM +0100, Emmanuel Lecharny wrote:

> >>In the meantime, ust use getUpName().

> >I'll try that and see what I get (next year).

> FYI, I have tested you test case using getUpName() and it works.

Now I have correct SearchResults with correct names in them. But
JXplorer still shows "Auftr%c3%a4ge" instead of Aufträge, so I still
need a decodeValue method before searching. Looking at the wire protocol
with Ethereal, I see that everything is correctly transmitted in UTF-8,
so I suppose, it's a JXplorer issue.
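A decodeValue method for the percent-encoded form could look roughly like the sketch below. It is an assumption about what such a helper might do, not the code actually used here: it leans on the JDK's java.net.URLDecoder, which also turns '+' into a space, so it is only safe if the values never contain a literal '+'. The class and method names are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class PercentDecodeDemo {
    // Decodes %XX sequences as UTF-8 bytes, e.g. "%c3%a4" becomes a-umlaut.
    static String decodeValue(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is supported by every JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both strings are taken from this thread.
        System.out.println(decodeValue("Auftr%c3%a4ge").equals("Auftr\u00e4ge")); // true
        // The doubly escaped form only yields the backslash escapes again.
        System.out.println(decodeValue("%5cC3%5cA4")); // \C3\A4
    }
}
```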

Are there other issues known with UTF-8 characters and LDAP clients?

Bye,

Tino.

-- 
www.quantenfeuerwerk.de
www.spiritualdesign-chemnitz.de
www.lebensraum11.de

Re: UTF-8 woes

Posted by Emmanuel Lecharny <el...@gmail.com>.
Tino Schwarze a écrit :

>Hi Emmanuel,
>
>On Sat, Dec 30, 2006 at 01:12:48AM +0100, Emmanuel Lecharny wrote:
>  
>
>>I can' believe as was _so wrong_ !!! This method is just used aroung 900 
>>times in the server. Modifying it would break it completly, and would 
>>take days to fix it back...
>>    
>>
>
>Thank your for your evaluation. BTW: Are you sure that
>LdapDN.toString() is called (I've got no dev setup here at home, can't
>check) - e.g. Eclipse shows all calls to Object.toString() if you're
>displaying the Callers of LdapDN.toString().
>  
>
The problem is that LdapDN implements Name, so if you search for
LdapDN.toString(), you will find nothing. However, as soon as you enter
the server, we use the interface instead of LdapDN, and we call the
Name.toString() method, which internally _is_ the LdapDN.toString() method.

This is why I initially thought that LdapDN.toString() was never called...
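The effect is ordinary dynamic dispatch, sketched below with hypothetical stand-ins (DemoName for the Name interface, DemoDn for LdapDN); none of this is ApacheDS code. The call site only mentions the interface type, so a search for callers of the concrete class's toString() does not find it, yet the concrete method is what runs.

```java
class DispatchDemo {
    // Hypothetical stand-in for the Name interface.
    interface DemoName { }

    // Hypothetical stand-in for LdapDN, with its own toString().
    static final class DemoDn implements DemoName {
        @Override
        public String toString() {
            return "DemoDn.toString()";
        }
    }

    // The static type here is the interface, but the method that executes
    // is chosen at run time from the object's actual class.
    static String describe(DemoName name) {
        return name.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(new DemoDn())); // DemoDn.toString()
    }
}
```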

>  
>
>>In the meantime, ust use getUpName().
>>    
>>
>
>I'll try that and see what I get (next year).
>  
>
FYI, I have tested your test case using getUpName() and it works.

>  
>
>>Again, sorry. As a defense, I will argue that LdapDN is _not_ intended 
>>to be used from the outside of the server (yes, I know, this is a 
>>pathetic excuse :-/ )
>>    
>>
>
>I'm not outside the server, I'm inside. ;-)
>  
>
OK, then getUpName() is the method to use. I have created an issue about
changing the toString() semantics, but this is definitely not an easy
task. As I said, there are around 900 places in the code where we call
toString()...

>Bye,
>
>Tino.
>  
>
Happy new year :)

Emmanuel


Re: UTF-8 woes

Posted by Tino Schwarze <ap...@tisc.de>.
Hi Emmanuel,

On Sat, Dec 30, 2006 at 01:12:48AM +0100, Emmanuel Lecharny wrote:

> >Ok, I get it now. First, apologize to have so badly name this method 
> >... toString() is such a pathetic name when it does not give you back 
> >what you are expecting... Just try this :
> >
> >System.out.println ( _dnTest.getUpName() );
> >
> >you will have your expected result. The toString() method is supposed 
> >(well, whip me, it's terribly bad), to return a String for debug 
> >usage. The worst thing is that this method is _never_ used anywhere in 
> >the server, except in unit tests ;(
> 
> I can' believe as was _so wrong_ !!! This method is just used aroung 900 
> times in the server. Modifying it would break it completly, and would 
> take days to fix it back...

Thank you for your evaluation. BTW: Are you sure that
LdapDN.toString() is called (I've got no dev setup here at home, can't
check) - e.g. Eclipse shows all calls to Object.toString() if you're
displaying the Callers of LdapDN.toString().

> In the meantime, ust use getUpName().

I'll try that and see what I get (next year).

> Again, sorry. As a defense, I will argue that LdapDN is _not_ intended 
> to be used from the outside of the server (yes, I know, this is a 
> pathetic excuse :-/ )

I'm not outside the server, I'm inside. ;-)

Bye,

Tino.

-- 
www.quantenfeuerwerk.de
www.spiritualdesign-chemnitz.de
www.lebensraum11.de

Re: UTF-8 woes

Posted by Emmanuel Lecharny <el...@gmail.com>.
Emmanuel Lecharny a écrit :

> Tino Schwarze a écrit :
>
>> <snip/>
>
> Ok, I get it now. First, apologize to have so badly name this method 
> ... toString() is such a pathetic name when it does not give you back 
> what you are expecting... Just try this :
>
> System.out.println ( _dnTest.getUpName() );
>
> you will have your expected result. The toString() method is supposed 
> (well, whip me, it's terribly bad), to return a String for debug 
> usage. The worst thing is that this method is _never_ used anywhere in 
> the server, except in unit tests ;(

I can't believe I was _so wrong_ !!! This method is used around 900
times in the server. Modifying it would break the server completely, and
it would take days to fix it...

So far, I have created a new JIRA issue about this problem, but it's not
likely to be fixed soon.

In the meantime, just use getUpName().

Again, sorry. In my defense, I will argue that LdapDN is _not_ intended
to be used from outside the server (yes, I know, this is a
pathetic excuse :-/ )

>
> Ok, I promise I will fix that : switching the semantic of toString() 
> to it's correct value.
>
> Emmanuel
>
>> Bye,
>>
>> Tino.
>>
>>  
>>
>
>


Re: UTF-8 woes

Posted by Emmanuel Lecharny <el...@gmail.com>.
Tino Schwarze a écrit :

>On Fri, Dec 29, 2006 at 08:37:07PM +0100, Emmanuel Lecharny wrote:
>
>  
>
>>There is a _big_ difference, because your java file might have been 
>>saved using a ISO-8859-1 encoding. new RDN( "ä" ) just use the default 
>>encoding of your computer to store the file, and inside this file you 
>>    
>>
>
>I was not talking about characters from a Java source, the data came
>directly from a database. The test case is pretty simple:
>
>    byte[] _bytes = new byte[] { (byte)0xc3, (byte)0xa4, (byte)0xc3, (byte)0xb6, (byte)0xc3, (byte)0xbc };
>      
>    String _s = new String (_bytes, "utf-8");
>    
>    // this should print äöü
>    System.out.println (_s);
>    LdapDN _dnTest = new LdapDN();
>    _dnTest.add ("cn=Uml"+_s);
>    System.out.println (_dnTest.toString());
>  
>-> this will get mixed up instantly by the RdnParser - it converts every
>non-ASCII byte to \xy encoding (I'm not sure who's doing the "\" to "%5C"
>conversion later).
>  
>
Ok, I get it now. First, apologies for having named this method so badly...
toString() is such a pathetic name when it does not give you back what
you are expecting... Just try this :

 System.out.println ( _dnTest.getUpName() );

you will get your expected result. The toString() method is supposed
(well, whip me, it's terribly bad) to return a String for debug usage.
The worst thing is that this method is _never_ used anywhere in the
server, except in unit tests ;(

Ok, I promise I will fix that : switching the semantics of toString() to
its correct value.

Emmanuel

>Bye,
>
>Tino.
>
>  
>


Re: UTF-8 woes

Posted by Tino Schwarze <ap...@tisc.de>.
On Fri, Dec 29, 2006 at 08:37:07PM +0100, Emmanuel Lecharny wrote:

> There is a _big_ difference, because your java file might have been 
> saved using a ISO-8859-1 encoding. new RDN( "ä" ) just use the default 
> encoding of your computer to store the file, and inside this file you 

I was not talking about characters from a Java source, the data came
directly from a database. The test case is pretty simple:

    byte[] _bytes = new byte[] { (byte)0xc3, (byte)0xa4, (byte)0xc3, (byte)0xb6, (byte)0xc3, (byte)0xbc };
      
    String _s = new String (_bytes, "utf-8");
    
    // this should print äöü
    System.out.println (_s);
    LdapDN _dnTest = new LdapDN();
    _dnTest.add ("cn=Uml"+_s);
    System.out.println (_dnTest.toString());
  
-> this will get mixed up instantly by the RdnParser - it converts every
non-ASCII byte to \xy encoding (I'm not sure who's doing the "\" to "%5C"
conversion later).

Bye,

Tino.

-- 
www.quantenfeuerwerk.de
www.spiritualdesign-chemnitz.de
www.lebensraum11.de

Re: UTF-8 woes

Posted by Emmanuel Lecharny <el...@gmail.com>.
Ersin Er a écrit :

> On 12/29/06, Emmanuel Lecharny <el...@gmail.com> wrote:
>
>> AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)
>>
>> You will have to be a little bit more explicit... How do you build 
>> your RDN?
>> FYI, it is supposed to be a UTF-8 encoded String, so if you are to 
>> code an
>> ä, you will have to :
>> - create a byte array containing it's counterpart (0xC3 0xa4) and do 
>> a new
>> String( byteArray, "UTF-8" ) before passing it to the RDN constructor
>> - OR do a new RDN( "\u00e4" );
>>
>> never do a new RDN( "ä" ), because then the String will be considered as
>> ISO-8859-1 encoded  string (at least in Germany or in France, not in 
>> Turkey
>> :)
>
>
> What is the difference between creating an RDN with new RDN( "ä" ) and
> with new String( new byte[] { 0xC3, 0xa4 }, "UTF-8" ) ? 

There is a _big_ difference, because your java file might have been
saved using an ISO-8859-1 encoding. With new RDN( "ä" ), your editor just
uses the default encoding of your computer to store the file, and inside
this file you have this "ä". There is no guarantee at all that it will be
correct when the file is compiled on another computer using a different
encoding and the string is then transformed to UTF-8 bytes. Using
new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) tells the
JVM that the bytes are UTF-8 encoded (and UTF-8 = Unicode encoded using
bytes), so it can translate them correctly to a UTF-16 String. Of course,
using \u00e4 should be the preferred way if you are to use internal
Strings like "This is an umlaut : \u00e4" in your java file.

> There is
> nothing as "UTF-8" String in Java. 

When you write new String( <some bytes>, "UTF-8" ), you just tell the
JVM that the byte array is supposed to be a UTF-8 encoded String. It
will translate those bytes to UTF-16 chars, using one or two chars per
character as needed (Unicode code points go up to U+10FFFF, and those
above U+FFFF need two chars). For instance, the é in my name
has a value of 0xE9 in Unicode, and is encoded 0xC3, 0xA9 in UTF-8. If
you don't tell String() that the byte array is UTF-8 encoded, then it
will just consider that the byte array is using the default platform
encoding. And if it's ISO-8859-1, 0xC3 = 'Ã', and 0xA9 = '©', so you
now have a Java String which is 2 chars long instead of one char long...
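The é example can be checked with a short sketch. The class and method names are made up, and the charsets are passed explicitly instead of relying on the platform default, so the result is the same on any machine. The last helper shows the "one or two chars" point for a code point above U+FFFF.

```java
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // UTF-8 encoding of e-acute (U+00E9).
    static final byte[] E_ACUTE_UTF8 = { (byte) 0xC3, (byte) 0xA9 };

    // Correct: the bytes are interpreted as UTF-8 and give one char.
    static String asUtf8() {
        return new String(E_ACUTE_UTF8, StandardCharsets.UTF_8);
    }

    // Wrong charset: each byte becomes its own ISO-8859-1 character.
    static String asLatin1() {
        return new String(E_ACUTE_UTF8, StandardCharsets.ISO_8859_1);
    }

    // Number of UTF-16 chars Java needs for a given Unicode code point.
    static int utf16Units(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        System.out.println(asUtf8().length());   // 1
        System.out.println(asLatin1().length()); // 2
        System.out.println(utf16Units(0xE9));    // 1
        System.out.println(utf16Units(0x1D11E)); // 2 (a surrogate pair)
    }
}
```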

> All strings are UTF-16. You can get
> their representations in other encodings as byte arrays. So when you
> do a new RDN( "ä" ), it should be converted to UTF-16 internally. What
> am I missing here?

It is transformed to UTF-16 according to the encoding used on your
platform. But then, if your local encoding is ISO-8859-1, when doing a
String.getBytes( "UTF-8" ), you might get something very different from
what you were expecting.

Ok, this is not simple. A simple rule then :
*always use \uxxxx when encoding non ASCII characters in a java file*

>
> (Not being able to display the character in source code in other
> platforms is a different matter. It's about the text editor encoding.)

yes, but you always use an editor to write your java file...

At this point, I may also be missing something, but I would then like to
have more information, like a test case which exposes the problem.

Emmanuel.

Re: UTF-8 woes

Posted by Ersin Er <er...@gmail.com>.
On 12/29/06, Emmanuel Lecharny <el...@gmail.com> wrote:
> AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)
>
> You will have to be a little bit more explicit... How do you build your RDN?
> FYI, it is supposed to be a UTF-8 encoded String, so if you are to code an
> ä, you will have to :
> - create a byte array containing it's counterpart (0xC3 0xa4) and do a new
> String( byteArray, "UTF-8" ) before passing it to the RDN constructor
> - OR do a new RDN( "\u00e4" );
>
> never do a new RDN( "ä" ), because then the String will be considered as
> ISO-8859-1 encoded  string (at least in Germany or in France, not in Turkey
> :)

What is the difference between creating an RDN with new RDN( "ä" ) and
with new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) ?
There is no such thing as a "UTF-8" String in Java. All strings are
UTF-16. You can get their representations in other encodings as byte
arrays. So when you do a new RDN( "ä" ), it should be converted to UTF-16
internally. What am I missing here?

(Not being able to display the character in source code in other
platforms is a different matter. It's about the text editor encoding.)



-- 
Ersin

Re: UTF-8 woes

Posted by Tino Schwarze <ap...@tisc.de>.
On Fri, Dec 29, 2006 at 05:43:13PM +0100, Emmanuel Lecharny wrote:

> You will have to be a little bit more explicit... How do you build your RDN?

To be honest, I'm not yet familiar with LdapDN, Rdn etc....
I built a DN for SearchResult like this:

      LdapDN _dn = new LdapDN();

      // iterate over primary key attributes, then
      _dn.add(_att.getName()+"="+_the_att.get());

That was actually the easiest way I could figure out.

> FYI, it is supposed to be a UTF-8 encoded String, so if you are to code an
> ä, you will have to :

My application is already UTF-8 safe.

> - create a byte array containing it's counterpart (0xC3 0xa4) and do a new
> String( byteArray, "UTF-8" ) before passing it to the RDN constructor
> - OR do a new RDN( "\u00e4" );
> 
> never do a new RDN( "ä" ), because then the String will be considered as
> ISO-8859-1 encoded  string (at least in Germany or in France, not in Turkey
> :)

That would be equivalent to new RDN (new String (byteArray, "UTF-8")).
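The equivalence can be checked without any RDN class at all, since it is purely a property of the two Strings. A minimal sketch with made-up class and method names:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RdnValueDemo {
    // a-umlaut written with a Unicode escape in the source.
    static String fromEscape() {
        return "\u00e4";
    }

    // a-umlaut decoded at run time from its two UTF-8 bytes.
    static String fromUtf8Bytes() {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA4 };
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // True if encoding the escaped form gives back exactly those two bytes.
    static boolean roundTrips() {
        byte[] encoded = fromEscape().getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(encoded, new byte[] { (byte) 0xC3, (byte) 0xA4 });
    }

    public static void main(String[] args) {
        System.out.println(fromEscape().equals(fromUtf8Bytes())); // true
        System.out.println(roundTrips());                         // true
    }
}
```

Once both values are Strings there is no trace left of how they were produced, so whatever receives them sees identical input.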

> Does it help you to figure out what can be the probem ?

Not yet...

Thanks,

Tino.

-- 
www.quantenfeuerwerk.de
www.spiritualdesign-chemnitz.de
www.lebensraum11.de

Re: UTF-8 woes

Posted by Emmanuel Lecharny <el...@gmail.com>.
AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)

You will have to be a little bit more explicit... How do you build your RDN?
FYI, it is supposed to be a UTF-8 encoded String, so if you are to code an
ä, you will have to :
- create a byte array containing its UTF-8 encoding (0xC3 0xA4) and do a new
String( byteArray, "UTF-8" ) before passing it to the RDN constructor
- OR do a new RDN( "\u00e4" );

Never do a new RDN( "ä" ), because then the String will be considered as an
ISO-8859-1 encoded string (at least in Germany or in France, not in Turkey
:)

Does it help you to figure out what the problem can be ?

Emmanuel L\u00e9charny :)




-- 
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: UTF-8 woes

Posted by Tino Schwarze <ap...@tisc.de>.
Hi,

On Fri, Dec 29, 2006 at 05:27:52PM +0100, Tino Schwarze wrote:

> I'm serving content from my open-EIS partition and UTF-8 characters get
> garbled. I see that it happends because I build a RDN for the
> SearchResult and dhe RdnParser will escape non-ASCII characters in the
> values. (Like '\C3\A4' for German Umlaut ä).
> 
> JXplorer shows this as %5cC3%5cA4 and sends it back this way when I try
> to expand the appropiate node.

I've added a simple decoder-function for my purposes which is called
before values from Rdn are used. It works for me(tm), but I'm not
satisfied and consider this a hotfix...
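A decoder of this kind might look like the sketch below. It is a guess at the idea, not the function actually used: the class and method names are made up, it only handles backslash-plus-two-hex-digits escapes such as \C3\A4, and other DN escapes (an escaped comma, for instance) are passed through untouched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class EscapedValueDecoder {
    // Turns backslash-hex pairs such as \C3\A4 back into the characters
    // they encode, treating the collected bytes as UTF-8.
    static String decode(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            if (escaped.charAt(i) == '\\' && i + 2 < escaped.length()
                    && Character.digit(escaped.charAt(i + 1), 16) >= 0
                    && Character.digit(escaped.charAt(i + 2), 16) >= 0) {
                // An escaped byte: store its numeric value.
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                // Anything else: copy the character as its UTF-8 bytes.
                int cp = escaped.codePointAt(i);
                byte[] raw = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                out.write(raw, 0, raw.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("Auftr\\C3\\A4ge").equals("Auftr\u00e4ge")); // true
    }
}
```

Collecting bytes first and decoding once at the end matters here, because a single character is spread over several escaped bytes.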

Bye,

Tino.

-- 
www.quantenfeuerwerk.de
www.spiritualdesign-chemnitz.de
www.lebensraum11.de