Posted to dev@directory.apache.org by Stefan Zoerner <st...@labeo.de> on 2005/09/04 15:09:18 UTC

Normalizer vs. Comparator

Hi all!

I have a question regarding Normalizers (first of all) and Comparators.

Here is the whole story:
I faced the problem that the compare operation does not adhere to the
matching rules. Therefore I successfully modified the CompareHandler
class in org.apache.ldap.server.protocol to do this (whether this is the
best place to fix this problem is not the question here).
It worked better, but not all matching rules satisfied my needs (some
are missing). One of these is telephoneNumberMatch, and I changed
SystemComparatorProducer to replace ComparableComparator with something
that implements the missing matching rule.

There are two options for implementing this Comparator:
1. Just implement the Comparator interface and call it
TelephoneNumberComparator
2. Create a Normalizer for telephone numbers (removing white space and
hyphens, transforming to e.g. lower case), and instantiate a
NormalizingComparator in SystemComparatorProducer which uses it

This leads me (finally) to the question of where normalizers are
intended to be used. I do not want my telephone number to get
"normalized" before it is stored, because that would delete the
formatting, which people might like to preserve.
Example: attribute value "0251-123-3333 0" is stored as is, but adding 
an attribute value to the entry that matches according to the matching 
rule (e.g. "025112333330") is rejected (attribute value in use).

Thanks in advance, Stefan

Btw.: If you advise me on the right thing to do, I'll contribute the
matching rule implementations which are missing. I need them to make my
compare ops work.

Re: Normalizer vs. Comparator

Posted by Stefan Zoerner <st...@labeo.de>.

Alex Karasulu wrote:
> Stefan Zoerner wrote:
> 
>> Hi all!
> 
> 
> Hey sorry for taking so long to respond.
> 
No problem Alex. There are currently so many things for me to do ...
> 
> Hope this helps,
> Alex
> 
Thank you very much for taking the time to describe the relation of 
these components and their deeper meaning. And yes, it was very helpful.
I will add some test cases for matching rules in the near future (for
the compare op I already have some). It looks to me like I will be able
to make some of the missing rules work (the non-critical ones), at
least as implemented in the current SystemComparatorProducer.

Stefan.

Re: Normalizer vs. Comparator

Posted by Alex Karasulu <ao...@bellsouth.net>.

Stefan Zoerner wrote:

> Hi all!

Hey sorry for taking so long to respond.

> Here is the whole story:
> I faced the problem that the compare operation does not adhere to the
> matching rules. Therefore I successfully modified the CompareHandler 
> class in org.apache.ldap.server.protocol to do this (whether this is 
> the best place to fix this problem is not the question here).

Ok, some theory behind these constructs might shed some light on the
role they serve in the server.

Most LDAP servers have a means to extend the schema; however, this means
is extremely limited when it comes to defining new Syntaxes or new
MatchingRules.  In practice these constructs are often built into the
server and cannot be changed without code changes.

When I started designing the schema subsystem of ApacheDS (still not
finished) I wanted it to be extensible with new Syntaxes and
new MatchingRules.  To do this I had to understand the fundamental
components needed to represent new matchingRules and syntaxes.  For
syntaxes I created an interface called SyntaxChecker.  Every syntax must
have a SyntaxChecker in order for the schema subsystem to check for
proper attribute value syntax.  This SyntaxChecker can be a simple regex
or an entire parser.  As long as the interface is adhered to, the schema
subsystem can use it to determine if correct values are being used for
attributeTypes based on a schema.
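
To illustrate the "simple regex" case, here is a minimal sketch in Java. The SimpleSyntaxChecker interface, its isValidSyntax method, and the digits-and-spaces pattern are all assumptions made up for this example; the real SyntaxChecker interface in ApacheDS may declare different methods.

```java
import java.util.regex.Pattern;

final class SyntaxCheckerDemo {
    public static void main(String[] args) {
        // Illustrative pattern only: accept values made of digits and spaces.
        SimpleSyntaxChecker checker = new RegexSyntaxChecker("[0-9 ]+");
        System.out.println(checker.isValidSyntax("0251 123"));
        System.out.println(checker.isValidSyntax("0251-123"));
    }
}

// Hypothetical stand-in for the SyntaxChecker interface described above.
interface SimpleSyntaxChecker {
    boolean isValidSyntax(String value);
}

// Regex-backed checker: a value is valid when the whole string matches.
final class RegexSyntaxChecker implements SimpleSyntaxChecker {
    private final Pattern pattern;

    RegexSyntaxChecker(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    @Override
    public boolean isValidSyntax(String value) {
        // matches() requires the entire input to match, not just a prefix.
        return value != null && pattern.matcher(value).matches();
    }
}
```

A syntax that a regex cannot express would implement the same interface with a full parser; callers would not need to change.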

The other half, dealing with Comparators and Normalizers, is much more
complex, and for this you must really understand what a matchingRule
does.  The server uses matching rules to determine equality and
ordering.  Before it can do this, string prep must be run on some values
(normalization) to remove the chance for variance to enter the picture.
Hence matchingRules can be broken down into Comparators and
Normalizers.  Some may think a Normalizer is syntax specific; however, how
you want to match affects normalization, not the syntax.  For example, if
I have an attribute that is a simple string and I want to perform a case
insensitive match, then the normalization differs from that of a case
sensitive match.  This shows how normalization is specific to matching and
not just to a syntax.
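
A rough sketch of that point in Java: the same string syntax gets two different matching behaviours purely by swapping the normalizer. The rule helper and the class name are hypothetical, invented for this illustration, and are not ApacheDS API.

```java
import java.util.Comparator;
import java.util.Locale;
import java.util.function.UnaryOperator;

final class MatchingRuleDemo {
    // Models a matching rule as "normalize both values, then compare".
    static Comparator<String> rule(UnaryOperator<String> normalizer) {
        return (a, b) -> normalizer.apply(a).compareTo(normalizer.apply(b));
    }

    public static void main(String[] args) {
        // Case sensitive match: values are compared exactly as given.
        Comparator<String> caseExact = rule(UnaryOperator.identity());
        // Case insensitive match: values are lowercased before comparing.
        Comparator<String> caseIgnore = rule(s -> s.toLowerCase(Locale.ROOT));

        System.out.println(caseExact.compare("Stefan", "STEFAN") == 0);
        System.out.println(caseIgnore.compare("Stefan", "STEFAN") == 0);
    }
}
```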

Anyway, Normalizers and Comparators are the basis of matchingRules.  A
new matchingRule must have these defined for its OID, as you probably saw.

> It worked better, but not all matching rules satisfied my needs (some 
> are missing). 

Yep, we have not really filled in any of these, just some very critical
ones so the directory can operate.  We need help filling these in.

> One of these is telephoneNumberMatch, and I changed 
> SystemComparatorProducer to replace ComparableComparator with 
> something that implements the missing matching rule.
>
Cool.  This is exactly what we need to do.

> There are two options for implementing this Comparator:
> 1. Just implement the Comparator interface and call it
> TelephoneNumberComparator
> 2. Create a Normalizer for telephone numbers (removing white space and
> hyphens, transforming to e.g. lower case), and instantiate a
> NormalizingComparator in SystemComparatorProducer which uses it
>
Right, these would be the two steps to follow: one for the Comparator
and another for the Normalizer.
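
A minimal sketch of the normalizer-plus-comparator combination in Java, using the telephone number example from the original question. SketchNormalizer and SketchNormalizingComparator are hypothetical stand-ins written for this illustration; the actual Normalizer interface and NormalizingComparator constructor in ApacheDS may look different.

```java
import java.util.Comparator;
import java.util.Locale;

final class TelephoneMatchDemo {
    public static void main(String[] args) {
        Comparator<String> cmp =
            new SketchNormalizingComparator(new TelephoneNumberNormalizer());
        // The stored value keeps its formatting; only the comparison
        // works on the canonical form.
        System.out.println(cmp.compare("0251-123-3333 0", "025112333330") == 0);
    }
}

// Hypothetical normalizer contract: produce a canonical form of a value.
interface SketchNormalizer {
    String normalize(String value);
}

// Removes white space and hyphens and lowercases the rest.
final class TelephoneNumberNormalizer implements SketchNormalizer {
    @Override
    public String normalize(String value) {
        return value.replaceAll("[\\s\\-]+", "").toLowerCase(Locale.ROOT);
    }
}

// Compares two values by their normalized forms.
final class SketchNormalizingComparator implements Comparator<String> {
    private final SketchNormalizer normalizer;

    SketchNormalizingComparator(SketchNormalizer normalizer) {
        this.normalizer = normalizer;
    }

    @Override
    public int compare(String a, String b) {
        return normalizer.normalize(a).compareTo(normalizer.normalize(b));
    }
}
```

Because the normalizer is applied only inside compare, nothing here changes the value as stored.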

> This leads me (finally) to the question of where normalizers are
> intended to be used. I do not want my telephone number to get
> "normalized" before it is stored, because that would delete the
> formatting, which people might like to preserve.

Good question.  Let me try to answer this ...

Normalization is critical when attempting to match two values.
Sometimes there is extra white space, and it can be removed to
better enable correct comparisons.  Sometimes normalization is not even
needed, if the syntax is very rigid without any room for case or space
variance.  Consider matching for cn=Stefan Zoerner, which is in the
directory (this is what the user who added the entry put as the cn
attribute value).  Now another user who is searching for these entries
may ask for cn=STEFAN   ZOERNER with 3 spaces between STEFAN and
ZOERNER.  The two users may be the same or different users.  The second
user should be able to pull the same entries regardless of which
filter he uses below:

(cn=STEFAN   ZOERNER)
(cn= Stefan ZOerner)
(cn=stefan                    zoerner)

So a normalizer would come into play here by generating a canonical
representation of these inputs.  By default, ApacheDS normalizes case by
reducing values to lowercase and then comparing the filter string with the
normalized attribute value stored within the directory; this is only
done for matching rules that ignore case.  For whitespace normalization
ApacheDS tries to follow the string prep operation defined in various
IETF documents.  However, I'm sure we fall short.  The general rule of
thumb for ApacheDS is to whitespace normalize while retaining string
tokenization order, meaning we do a deep trim of values, replacing each
run of whitespace with a single space character.  Whitespace on the ends is
discarded.  This, btw, is only done when space and whitespace in general
is not escaped.
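
The deep trim plus lowercasing described above can be sketched in Java roughly as follows. The class and method names are made up for this example, and the sketch deliberately ignores escaped whitespace, which the real normalization would have to leave alone.

```java
import java.util.Locale;

final class DeepTrimDemo {
    // Collapses each run of whitespace to one space, drops whitespace on
    // the ends, and lowercases the result. Token order is preserved.
    // Assumption: the value contains no escaped whitespace.
    static String deepTrimToLower(String value) {
        String collapsed = value.replaceAll("\\s+", " ").trim();
        return collapsed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String stored = deepTrimToLower("Stefan Zoerner");
        String[] filterValues = {
            "STEFAN   ZOERNER",
            " Stefan ZOerner",
            "stefan                    zoerner"
        };
        for (String filterValue : filterValues) {
            // Each filter value reduces to the same canonical form as the
            // stored value.
            System.out.println(deepTrimToLower(filterValue).equals(stored));
        }
    }
}
```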

Hope this helps,
Alex