
Posted to dev@directory.apache.org by Emmanuel Lecharny <el...@gmail.com> on 2006/12/25 22:15:33 UTC

String preparations

Hi guys !

I'm currently working on the implementation of RFC 4518, which says that 
to be able to apply MatchingRules on String values, we should first 
transform them ('Prepare').

This transformation is a six-step process, pretty boring, and somewhere in 
the middle there is a Normalization step, where characters may be 
transformed to multiple characters: "Schön" will be transformed to 
"Scho\u0308n" (the ö is transformed to a plain 'o' plus a combining 
diaeresis). Note that this is *not* a good example, because the 
transformation we must implement is different: it's the NFKC 
transformation. (For those who have _nothing_ else to do, or who had an 
argument with their boyfriend/girlfriend and have a lot of time to waste 
while waiting for him/her to cool down, here is the doc: 
http://www.unicode.org/unicode/reports/tr15/tr15-22.html#Specification)
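
To make the difference between the two forms concrete, here is a minimal 
sketch using java.text.Normalizer, the Java 6 API mentioned just below. It 
is only an illustration: the class and method names are made up, and it 
covers only the normalization step, not the other five steps of RFC 4518.

```java
import java.text.Normalizer;

// Illustrative only: class and method names are hypothetical, not ADS code.
final class NormalizationForms {

    // Canonical decomposition: a precomposed letter is split into
    // its base letter plus a combining mark.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compatibility decomposition followed by canonical composition,
    // which is the form RFC 4518 asks for.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String precomposed = "Sch\u00F6n";     // o-diaeresis as one code point, U+00F6
        String decomposed = nfd(precomposed);  // 'o' followed by U+0308

        System.out.println(decomposed.length());                   // 6
        System.out.println(nfkc(decomposed).equals(precomposed));  // true
        System.out.println(nfkc("\uFB01n"));                       // fin (ligature U+FB01 expanded)
    }
}
```

So the "Scho\u0308n" form is what NFD produces, while NFKC goes the other 
way and recomposes the letter, and additionally replaces compatibility 
characters such as ligatures.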

Ok, now, the point is: in Java 5, there is nothing in the API to do 
this normalization (Java 6 has it!), but as we won't switch to Java 6, it 
leaves us with a few options:
1) Why the hell do we need to take care of those bloody countries with 
bloody letters - hieroglyphs, or whatever I can't read - that exceed the 
Beauty of US-ASCII ???
2) Damn, I'm French/German/Turkish/... (ISO-3166, pick your country) and my 
name does not make it with US-ASCII (like Szörner, or Lécharny :). I 
have to do some normalization...
2-a) Let's wait for Java 6... We are not in a hurry, the current code 
covers 99,9999999% of all the cases.
2-b) Let's use the Apache Abdera Unicode impl, it seems pretty complete.
2-c) I feel like implementing this Normalizer myself, because I LOVE 
Unicode! (I know all of the 1 156 345 characters, and I can draw them 
knowing only their values... Actually, I also do crack, and I am a 
speaker at each Unicoke conference...)

Ok, ok, I think that 2-b does the trick, from my point of view. WDYT?

Emmanuel L\u00e9charny

Oh, a great idea if you forgot to send a gift to your mother-in-law: the 
latest version of the Unicode spec, only 1450 pages!
http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=9780321480910&displayonly=TOC&z=y#TOC

Re: String preparations

Posted by Emmanuel Lecharny <el...@gmail.com>.

On 12/28/06, Alex Karasulu <ak...@apache.org> wrote:
>
> Emmanuel Lecharny wrote:
> > Hi guys !
> >
> > I'm currently working on the implementation of RFC 4518, which says that
> > to be able to apply MatchingRules on String values, we should first
> > transform them ('Prepare').
>
> :) thanks for picking up this task E.
>
> Just a note of caution.  Remember that you must *ONLY* apply string prep
> if you are comparing with non-normalized values.


String prep is for assertion values (incoming values) and for non-indexed
values. The only kinds of strings that should be 'prepared' are those whose
type is DirectoryString (PrintableString is a subset, and so are
TelephoneNumber, TeletexString...). String values which will be stored in an
index have to be prepared too. The normalization process is just the
application of this preparation. In a way, Normalizers in ADS = string
preparation minus a lot of tricky Unicode manipulations (like mapping, bidi
handling, etc.). Of course, some cases must still be handled, like whether
to lowercase or not, etc.

> Normalization must apply string prep to values to produce the canonical
> representation.  Also as you know values within indices are normalized
> and hence already have string prep (pseudo string prep as I implemented
> it) applied.  So you need not apply string prep on indexed attribute
> values.  String prep must be applied when comparing values directly from
> the entry pulled out of the master table when indices are not available.


StringPrep must be applied to:
- assertion values
- attribute values which are not indexed.
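
As a rough illustration of that rule, here is a minimal sketch of a
comparison that always prepares the assertion value, but prepares the
stored value only when it does not come from an index. Everything here is
an assumption made for the example: the names are hypothetical, prepare()
is a crude stand-in (NFKC, lowercasing, space handling) for the real six
steps, and it relies on the Java 6 java.text.Normalizer.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative only: names are hypothetical, not the ADS API.
final class PrepareOnCompare {

    // Crude stand-in for string preparation: NFKC, lowercase,
    // trim, and collapse runs of spaces.
    static String prepare(String value) {
        String s = Normalizer.normalize(value, Normalizer.Form.NFKC);
        return s.toLowerCase(Locale.ROOT).trim().replaceAll(" +", " ");
    }

    // When 'indexed' is true, storedValue is assumed to come from an index
    // and to be prepared already; otherwise it comes straight from the entry.
    static boolean matches(String assertionValue, String storedValue, boolean indexed) {
        String left = prepare(assertionValue);  // assertion values are always prepared
        String right = indexed ? storedValue : prepare(storedValue);
        return left.equals(right);
    }
}
```

With this sketch, matches("  LE\u0301charny ", "L\u00E9charny", false)
returns true, because both sides end up in the same prepared form.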


> > This transformation is a six-step process, pretty boring, and somewhere in
> > the middle there is a Normalization step, where characters may be
> > transformed to multiple characters: "Schön" will be transformed to
> > "Scho\u0308n" (the ö is transformed to a plain 'o' plus a combining
> > diaeresis)
>
> Yes this is an additional way to normalize IMO.


This is RFC 4518, which is a little bit too advanced for current LDAP
servers :)


> OK.
>
> > 2-a) Let's wait for Java 6... We are not in a hurry, the current code
> > covers 99,9999999% of all the cases.
>
> NP I'm fine with that.


Yes, I think this is reasonable...

> > Ok, ok, I think that 2-b does the trick, from my point of view. WDYT?
>
> 2-b seems nice but I'm fine with 2-a too.  Right now we have bigger fish
> to fry than making ADS work with UNICODE based languages.  Sorry but
> other LDAP servers have taken the same approach.


Don't be sorry. Transforming ADS to comply fully with RFC 4518 would be
overkill. Let's do that in 2.0 (or maybe in 3.0 :)

Emmanuel

Re: String preparations

Posted by Alex Karasulu <ak...@apache.org>.

Emmanuel Lecharny wrote:
> Hi guys !
> 
> I'm currently working on the implementation of RFC 4518, which says that
> to be able to apply MatchingRules on String values, we should first
> transform them ('Prepare').

:) thanks for picking up this task E.

Just a note of caution.  Remember that you must *ONLY* apply string prep
if you are comparing with non-normalized values.

Normalization must apply string prep to values to produce the canonical
representation.  Also as you know values within indices are normalized
and hence already have string prep (pseudo string prep as I implemented
it) applied.  So you need not apply string prep on indexed attribute
values.  String prep must be applied when comparing values directly from
the entry pulled out of the master table when indices are not available.
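
A minimal sketch of why indexed values need no further preparation: if the
index keys are normalized when they are written, a lookup only has to
normalize the incoming assertion value. The class below is hypothetical
(it is not how the ADS index is implemented), maps each key to a single
entry id to stay short, and uses the Java 6 java.text.Normalizer with a
simplified normalize().

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a toy index, not the ADS implementation.
final class NormalizedIndex {

    // Normalized attribute value -> entry id.
    private final Map<String, Long> index = new HashMap<String, Long>();

    // Simplified normalization: NFKC, lowercase, trim.
    static String normalize(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFKC)
                .toLowerCase(Locale.ROOT)
                .trim();
    }

    // Values are normalized once, when they enter the index.
    void add(String attributeValue, long entryId) {
        index.put(normalize(attributeValue), entryId);
    }

    // Only the assertion value is normalized at search time;
    // returns null when nothing matches.
    Long lookup(String assertionValue) {
        return index.get(normalize(assertionValue));
    }
}
```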

> This transformation is a six-step process, pretty boring, and somewhere in
> the middle there is a Normalization step, where characters may be
> transformed to multiple characters: "Schön" will be transformed to
> "Scho\u0308n" (the ö is transformed to a plain 'o' plus a combining
> diaeresis)

Yes this is an additional way to normalize IMO.

> Note that this is *not* a good example, because the transformation we must
> implement is different: it's the NFKC transformation. (For those who have
> _nothing_ else to do, or who had an argument with their boyfriend/girlfriend
> and have a lot of time to waste while waiting for him/her to cool down,
> here is the doc:
> http://www.unicode.org/unicode/reports/tr15/tr15-22.html#Specification)
> 
> Ok, now, the point is: in Java 5, there is nothing in the API to do
> this normalization (Java 6 has it!), but as we won't switch to Java 6, it
> leaves us with a few options:
> 1) Why the hell do we need to take care of those bloody countries with
> bloody letters - hieroglyphs, or whatever I can't read - that exceed the
> Beauty of US-ASCII ???

I don't think we do.  It will be nice when we do, though, once we switch to J6.

> 2) Damn, I'm French/German/Turkish/... (ISO-3166, pick your country) and my
> name does not make it with US-ASCII (like Szörner, or Lécharny :). I
> have to do some normalization...

OK.

> 2-a) Let's wait for Java 6... We are not in a hurry, the current code
> covers 99,9999999% of all the cases.

NP I'm fine with that.

> 2-b) Let's use the Apache Abdera Unicode impl, it seems pretty complete.

That's an option.
...

> Ok, ok, I think that 2-b does the trick, from my point of view. WDYT?

2-b seems nice but I'm fine with 2-a too.  Right now we have bigger fish
to fry than making ADS work with UNICODE based languages.  Sorry but
other LDAP servers have taken the same approach.

Alex

Re: String preparations

Posted by Alex Karasulu <ak...@apache.org>.

Ole Ersoy wrote:
> I would wait for Java six.  

+1

Alex

Re: String preparations

Posted by Ole Ersoy <ol...@yahoo.com>.

I would wait for Java six.  See you at the unicoke
conference :-)

Happy Holidays!

- Ole
