Posted to dev@directory.apache.org by Alex Karasulu <ak...@apache.org> on 2007/04/06 08:56:59 UTC

[ApacheDS][Partition] Using surrogate keys for attributeType aliases and objectClass aliases (was Re: [SCHEMA] Can two different LDAP AttributeType's have the same name?)

Ole,

Changing the topic ...

On 4/6/07, Ole Ersoy <ol...@gmail.com> wrote:
>
>
> Each Entry has a set of ObjectClasses associated with it.
> Those object classes determine the set of AttributeTypes that
> the entry can have.
>
> What I'm wondering about is when I look up a value of an entry,
> does ApacheDS pass say:
>
> org.apache.tuscany.DASConfig.baseDN as the key to look up the value
> of this attribute?
>
> Or does ApacheDS create a proxy key?
>
> Like this org.apache.tuscany.DASConfig.baseDN = 1 for example.
>
> And then when it stores the attribute it knows that the
> AttributeType org.apache.tuscany.DASConfig.baseDN corresponds to 1,
> so rather than storing
>
> org.apache.tuscany.DASConfig.baseDN
>
> 200M times, it stores the 1 instead.  So the 1 is the proxy
> for
>
> org.apache.tuscany.DASConfig.baseDN
>
> and ApacheDS keeps a list of proxies like this for all the
> AttributeTypes that it has.
>
> Does that make sense?


Ok I understand now.  The server's default partition implementation based on
B+Trees (using jdbm)
does something similar to this but not exactly.  Let me explain:

First, and somewhat related: the partition assigns a surrogate key to all
entries instead of using the DN as the PK.  This prevents certain issues that
can result when using a BTree to index on the DN, like dealing with long key
prefixes.  I think Emmanuel at some point had the idea of storing the DN in
reverse within the index to avoid these problems.
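
To make the surrogate-key idea for entries concrete, here is a minimal sketch
in Java.  It is not the JdbmPartition code: the class SurrogateKeyMasterTable,
its methods, and the in-memory maps are hypothetical stand-ins for the real
JDBM tables, and the entry is reduced to a plain String.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: entries are keyed by a surrogate long ID, not by DN. */
class SurrogateKeyMasterTable {
    private long nextId = 1;                                   // simple in-memory sequence
    private final Map<Long, String> master = new TreeMap<>();  // entry ID -> entry
    private final Map<String, Long> dnToId = new HashMap<>();  // normalized DN -> entry ID

    /** Stores the entry under a fresh ID and returns that ID. */
    long add(String normalizedDn, String entry) {
        long id = nextId++;
        master.put(id, entry);
        dnToId.put(normalizedDn, id);
        return id;
    }

    /** Resolves a DN to its entry through the short surrogate key. */
    String lookup(String normalizedDn) {
        Long id = dnToId.get(normalizedDn);
        return id == null ? null : master.get(id);
    }
}
```

Every other table can then refer to an entry by its short numeric ID, so the
long DN string only has to appear in the one DN lookup structure.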

Second, for attributeTypes the JdbmPartition uses the OID as the PK for the
attribute rather than using the alias of the attribute.  When you supply the
server a filter like (cn=   Ole     Ersoy) (*NOTE* the case and the extra
spaces) the server parses the filter into an AST (which here is one node) and
normalizes the attributeType and the value in this node's attribute value
assertion: cn=   Ole     Ersoy becomes 2.5.4.3=ole ersoy.

cn                    => 2.5.4.3
'   Ole    Ersoy'  => 'ole ersoy'

This is done by the normalization service (an interceptor) before the request
reaches the partition.  When the partition receives this normalized filter,
its search engine checks whether an index exists for 2.5.4.3.  If one does not
exist then all the entries are pulled from a master table and a full scan
occurs, where each entry's 2.5.4.3 attribute is looked up, the value is
normalized and compared against the value in the filter using the comparator
associated with the EQUALITY matchingRule of the attributeType 2.5.4.3.
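
The normalization step and the un-indexed fallback can be sketched roughly as
follows.  This is an illustration under stated assumptions, not ApacheDS code:
the alias table, the class name FilterNormalizationSketch and the "trim,
collapse spaces, lower-case" normalizer are hypothetical simplifications of
what the schema registries and matchingRules do.  For brevity the sketch keys
each entry's attributes by OID, although the real master table keeps entries
as the user supplied them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FilterNormalizationSketch {
    // Hypothetical alias table; a real server resolves aliases through its schema.
    static final Map<String, String> ALIAS_TO_OID = new HashMap<>();
    static {
        ALIAS_TO_OID.put("cn", "2.5.4.3");
        ALIAS_TO_OID.put("commonname", "2.5.4.3");
    }

    /** Maps an alias (any case) to its OID; unknown names are returned lower-cased. */
    static String normalizeAttribute(String nameOrOid) {
        String key = nameOrOid.trim().toLowerCase(Locale.ROOT);
        String oid = ALIAS_TO_OID.get(key);
        return oid != null ? oid : key;
    }

    /** Stand-in for a case-ignore normalizer: trim, collapse whitespace, lower-case. */
    static String normalizeValue(String value) {
        return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Full scan: normalize every stored value and compare it with the filter value. */
    static List<Long> fullScan(Map<Long, Map<String, String>> master,
                               String oid, String normalizedValue) {
        List<Long> hits = new ArrayList<>();
        for (Map.Entry<Long, Map<String, String>> e : master.entrySet()) {
            String stored = e.getValue().get(oid);
            if (stored != null && normalizeValue(stored).equals(normalizedValue)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

Note that the scan touches every entry and re-normalizes every stored value,
which is why the absence of an index is so costly.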

If an index does exist you're a very lucky user :).  The search engine then
looks into the 2.5.4.3 index using the key 'ole ersoy'.  Indices, by the way,
store normalized values as keys (using the normalizer of the EQUALITY
matchingRule associated with the attributeType the index is built on) and the
entry ID as values.  With a BTree this lookup is O(log n), which in practice
is almost constant.  Oh, and the keys into the index as well as the values are
sorted using the comparator of the ORDERING matchingRule for the attributeType
the index is built on.  So once the value is found the partition recovers the
entry ID and pulls out the entry from the master table.  Then it advances to
the next value that equals 'ole ersoy' and does the same until all 'ole ersoy'
index records have been returned.
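
The indexed path can be pictured with a sorted map standing in for the B+Tree.
This is only a sketch: EqualityIndexSketch and its methods are hypothetical,
java.util.TreeMap replaces the JDBM BTree, and natural String ordering
replaces the matchingRule comparator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class EqualityIndexSketch {
    // normalized value -> sorted entry IDs (TreeMap stands in for the B+Tree)
    private final TreeMap<String, TreeSet<Long>> index = new TreeMap<>();
    // entry ID -> value exactly as the user supplied it
    private final Map<Long, String> master = new HashMap<>();

    void add(long id, String userProvidedValue, String normalizedValue) {
        master.put(id, userProvidedValue);
        index.computeIfAbsent(normalizedValue, k -> new TreeSet<>()).add(id);
    }

    /** O(log n) descent to the key, then one master-table fetch per matching ID. */
    List<String> search(String normalizedValue) {
        TreeSet<Long> ids = index.get(normalizedValue);
        if (ids == null) {
            return Collections.emptyList();
        }
        List<String> result = new ArrayList<>();
        for (Long id : ids) {
            result.add(master.get(id));
        }
        return result;
    }
}
```

The values come back in their user-provided form because only the index key
is normalized; the master table is never rewritten.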

So the long answer this time shows that the server sort of uses a surrogate
key for attributeTypes, and that is the OID of the attributeType.  Or rather,
"the server uses the OID as the PK" would be the correct statement, since the
OID is not a surrogate (derived/artificial) key.

Now for the objectClass attribute's values in entries:  By default the
JdbmPartition comes with an index
on this very special attributeType.  If it did not we'd be hosed.  When
adding new entries the ORDERING
matchingRule for the objectClass attributeType is used and this matchingRule
will transform the values for
the objectClass attribute into the OID of the objectClasses for that entry.


Emmanuel correct me if I am wrong here.

The entry's objectClass values are not touched when stored in the master
table.  They are stored as is so they can be returned as the user added them,
with case variance.  If we did not do this then the server would return
normalized values for the objectClass.  In fact no attributes of an entry in
the master table are normalized: not objectClass, not cn, not anything.  So
what is stored in the master table is the entry as it was supplied or
modified.  The objectClass index, however, will contain as the index record
key the normalized objectClass values, which are the OIDs of the
objectClasses.
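
A small sketch of that split between master table and objectClass index, in
Java.  The class ObjectClassIndexSketch and its name-to-OID table are
hypothetical; the two OIDs used (top = 2.5.6.0, person = 2.5.6.6) are the
standard ones from the LDAP user schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class ObjectClassIndexSketch {
    // Hypothetical lookup table standing in for the objectClass registry.
    static final Map<String, String> NAME_TO_OID = new HashMap<>();
    static {
        NAME_TO_OID.put("top", "2.5.6.0");
        NAME_TO_OID.put("person", "2.5.6.6");
    }

    final Map<Long, List<String>> master = new HashMap<>();          // values as supplied
    final TreeMap<String, TreeSet<Long>> ocIndex = new TreeMap<>();  // OID -> entry IDs

    void add(long id, List<String> objectClassValues) {
        master.put(id, new ArrayList<>(objectClassValues));          // untouched copy
        for (String value : objectClassValues) {
            String oid = NAME_TO_OID.get(value.trim().toLowerCase(Locale.ROOT));
            if (oid == null) {
                throw new IllegalArgumentException("unknown objectClass: " + value);
            }
            ocIndex.computeIfAbsent(oid, k -> new TreeSet<>()).add(id);
        }
    }
}
```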

Here too the server uses the OID as the PK (again this is really not a
surrogate key after all).

Now the big question is what impact there would be if we used a real
surrogate key instead.  First off, we would not use String based comparators
but Integer comparators, for both objectClasses and attributeTypes, with the
integer keys assigned through some scheme.  Let's consider each scenario
separately.

Using Surrogate Keys for AttributeTypes
----------------------------------------------------------

If this is done then the server must manage a persistent mechanism to
translate alias names and OIDs to the surrogate key consistently across
restarts.  This table can be loaded into memory immediately on start up and
written to disk on change, like a write-through cache.  We would still have to
store entries in their 'user provided' form, where none of the values of
attributes, or the names of the attributes for that matter, are normalized.
This is to ensure users get back entries as they put them into the server.
When preparing a filter for normalization, the attributes in the attribute
value assertion (AVA) would need to be normalized into the surrogate key
instead of the OID.  The value would be processed in the same manner as
before.

When finding indices for attributes in an AVA we would look them up via the
surrogate key for the attributeType.  Then all operations would proceed as
usual.  There is no space conservation advantage here, and we incur an extra
in-memory lookup to transform the OID/alias into a surrogate key.  Plus there
is the memory overhead of maintaining this OID/alias to surrogate key mapping.

Conclusion: not worth doing for any reason at all.
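
For reference, the persistent mapping described in this section could look
roughly like the sketch below.  Everything in it is an assumption made for
illustration: the class SurrogateKeyRegistry, the use of a java.util.Properties
file as the on-disk form, and the "rewrite the whole file on every new key"
write-through policy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

class SurrogateKeyRegistry {
    private final Path file;
    private final Map<String, Integer> oidToKey = new HashMap<>();
    private int next = 1;

    /** Loads the whole table into memory on start up, if the file has content. */
    SurrogateKeyRegistry(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            Properties p = new Properties();
            try (InputStream in = Files.newInputStream(file)) {
                p.load(in);
            }
            for (String oid : p.stringPropertyNames()) {
                int key = Integer.parseInt(p.getProperty(oid));
                oidToKey.put(oid, key);
                next = Math.max(next, key + 1);
            }
        }
    }

    /** Returns the existing key, or assigns one and writes the table through to disk. */
    synchronized int keyFor(String oid) throws IOException {
        Integer key = oidToKey.get(oid);
        if (key != null) {
            return key;
        }
        int assigned = next++;
        oidToKey.put(oid, assigned);
        Properties p = new Properties();
        for (Map.Entry<String, Integer> e : oidToKey.entrySet()) {
            p.setProperty(e.getKey(), e.getValue().toString());
        }
        try (OutputStream out = Files.newOutputStream(file)) {
            p.store(out, "OID to surrogate key");
        }
        return assigned;
    }
}
```

Even in this small form the extra moving parts are visible: a file to keep
consistent, a map to hold in memory, and one more lookup on every filter.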

Using Surrogate Keys for ObjectClasses
-----------------------------------------------------------

Again the entry is stored as-is in the master table with all the original
values for the objectClass attribute in the entry.  The objectClass index,
instead of storing tuples like:

 OID : Entry ID
 (2.5.6.4, 96)

will now use the surrogate key for the OID and look like:

 SKey : Entry ID
 ( 564, 96 )

if 564 is the surrogate key assigned to the objectClass organization (with
OID 2.5.6.4).

BTW there is no need to normalize filters here so that their attributes use
the SK.  The objectClass attribute is just a very special case.

This will save some space overhead.  It might even result in a slightly
faster lookup within the btree because there are fewer bytes to compare in
most cases with the integer based surrogate key than with the OID string's
bytes.  Actually, now that I think about it, this might be much faster since
the OID String needs a byte[]->String transform to be properly compared.  The
BTree uses a fast StringComparator for this but it will still cost more than
using an IntegerComparator in all cases.
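
The difference between the two kinds of key comparison can be sketched like
this.  The comparators below are illustrative only; KeyComparisonSketch is a
hypothetical class and these are not the comparators JDBM ships with.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

class KeyComparisonSketch {
    /** String-keyed index: both keys are decoded from bytes before comparing. */
    static final Comparator<byte[]> OID_BYTES = (a, b) ->
            new String(a, StandardCharsets.UTF_8)
                    .compareTo(new String(b, StandardCharsets.UTF_8));

    /** Integer-keyed index: a single primitive comparison, no decoding step. */
    static final Comparator<Integer> SURROGATE = Integer::compare;

    /** Raw size of an OID key when stored as UTF-8 text. */
    static int oidKeyBytes(String oid) {
        return oid.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

An OID such as 2.5.6.6 takes 7 bytes as text against 4 bytes for an int, and
longer private-enterprise OIDs widen that gap.  Whether the decode step is
measurable in practice is exactly what the experiment would have to show.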

The space conservation on this index is high but the overall conservation in
the partition is not going to be much.  Search expressions based on the
objectClass attribute would be much faster.  Keep in mind that the gain only
applies to this one attribute, yet it is an attribute commonly used in search
operations.  So it might be worthwhile.  An experiment might be in order here.


High Level Impact
--------------------------

For attributeTypes it's a very bad idea to use SKs for the attributeType
instead of the OID.  First of all
the normalization of the filter AVA's attribute to the SK will make the
filter unintelligible to anything but
the JdbmPartition implementation which knows how to handle these search
operations.  Every partition
implementation would then be tasked with doing this same thing to interpret
the filter's attributes.

For objectClasses the impact could be significant.  However the partition
implementation would need
to map objectClasses to assigned SKs and do it consistently across
restarts.  As it stands now the
partition is designed to not have to know anything other than about
attributeTypes, their syntaxes and
matchingRules to properly conduct CRUD operations on entries.  Having to do
this means changing the
jdbm implementation a tiny bit.  Technically the Partition interface need
not change since the
ObjectClassRegistry can be accessed from the init() stage when starting up
the partition via the
server configuration object.

Conclusion
----------------

The use of SKs for ATs is a bad idea in all aspects, without much benefit.
The use of SKs for OCs might conserve some space, but not a major amount.  The
performance advantage is questionable and requires solid performance metrics
to determine the true value of such a 'one off' being added to the server.
Whether the complexity is worth the performance boost is the question to
answer.

Alex


Posted by Ole Ersoy <ol...@gmail.com>.

> 
> 
> The sequence diagram will not fit on an A-3 page (note I didn't use A3 : 
> A4 < A3 < A2... < A0 < A-1 < A-2 < A-3 etc...) (btw, the largest 
> available format for paper is A0 atm, and Ai = 2xA(i+1) ...)

Yeah - It would be nice to have one in a web page though.

> 
>     <snip/>
> 
>     <snip/>
> 
> 
>     My goal is to keep the 200M entries in Memory.
> 
> 
> Forget about the idea to store 200 M entries in memory. This is just 
> impossible. An entry is around 1Kbytes, and you won't ever have a 200 Gb 
> mem server ...

Yeah - I just wanted to use an extreme example because
I think that storing "1" 200 M times in memory
or "500" 200 M times in memory will result in about
half the memory consumed vs. storing something like:

1.434.434534.4353465, although I would have to run tests
to validate that.
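
A back-of-the-envelope version of that comparison, as a small Java program.
It counts only the raw key payload (UTF-8 text for the OID, 4 bytes for an
int) and deliberately ignores JVM object headers, references and any sharing
of identical strings, so it is an estimate to be checked by real tests, not a
measurement.  KeySizeEstimate is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

class KeySizeEstimate {
    /** Raw payload in bytes for `count` copies of a key of the given size. */
    static long totalBytes(long count, int bytesPerKey) {
        return count * bytesPerKey;
    }

    public static void main(String[] args) {
        long n = 200_000_000L;
        int oidBytes = "1.434.434534.4353465".getBytes(StandardCharsets.UTF_8).length;
        System.out.println(oidBytes);                      // 20
        System.out.println(totalBytes(n, oidBytes));       // 4000000000
        System.out.println(totalBytes(n, Integer.BYTES));  // 800000000
    }
}
```

On raw payload alone the integer form is five times smaller for this OID, not
two, but per-object overhead in the JVM can easily dominate either figure.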

SNIP
>  
> This is something that will be available in 2.0. We have discussed 
> it with Alex these last three months, and I think we will have a level 
> of indirection. Basically, if you have 500 attributeTypes, pointing to 
> 200M entries, then you will have an N-N relation between AT and entries. 
> This will be solved with an intermediate table, with Longs in it :
> AT-Long / entry-Long
> where AT-Long represents the AttributeType unique ID in ADS and the 
> very same for entry-Long
> 
> This is explained here : 
> http://cwiki.apache.org/confluence/display/DIRxSRVx11/Backend
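
One way to picture the intermediate table of Longs is the sketch below.  It is
a guess at the shape of the idea only: AttributeEntryLinkTable and its methods
are hypothetical, and the design on the wiki page may differ.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class AttributeEntryLinkTable {
    // attributeType ID -> IDs of entries that carry the attribute
    private final Map<Long, SortedSet<Long>> entriesByAttribute = new HashMap<>();
    // entry ID -> IDs of the attributeTypes it carries
    private final Map<Long, SortedSet<Long>> attributesByEntry = new HashMap<>();

    /** Records one (AT-Long, entry-Long) pair of the N-N relation. */
    void link(long attributeTypeId, long entryId) {
        entriesByAttribute.computeIfAbsent(attributeTypeId, k -> new TreeSet<>()).add(entryId);
        attributesByEntry.computeIfAbsent(entryId, k -> new TreeSet<>()).add(attributeTypeId);
    }

    SortedSet<Long> entriesWith(long attributeTypeId) {
        return entriesByAttribute.getOrDefault(attributeTypeId, Collections.emptySortedSet());
    }

    SortedSet<Long> attributesOf(long entryId) {
        return attributesByEntry.getOrDefault(entryId, Collections.emptySortedSet());
    }
}
```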

Awesome - That makes sense - I think between this, the material Alex 
sent, and the DAS I should be able to stay busy for a little while :-)

Thanks Emmanuel,
- Ole





Posted by Ole Ersoy <ol...@gmail.com>.

Emmanuel Lecharny wrote:
SNIP

>> Although once the DAS gets done, people could start using
>> ADS as an RDB.
> 
> An LDAP server is *not* an RDB. Whatever you use it for, this is a 
> hierarchical database.

OK Bro - Now you're preaching to the choir :-)

Your earlier point was that
an LDAP server is 99.99% read and less than 0.01% write.

My point is that when the DAS gets done, people have the
ability to use ADS just as an RDB and the read/write
ratio could hence change accordingly.

 From the point of view of a DAS there's no
difference between an RDB and LDAP, it's just
a datasource.

Some people will want to write a lot and some
will just continue to store mostly read data.

This naturally depends on how the server performs
with writes, which over time depends on the server's
road map.

If hsql or derby performs 5X better than ADS for writes
and is a little slower in reading then they may choose that
instead.

So it really depends on the types of usage /
performance scenarios ADS wants to be able to "Brag" about.

> 
>> <snip/>
>>
>>> So it is definitely worth the price to spend a *lot* of time writing 
>>> the data twice, in two different forms, rather than do a computation
>>> for each search. Adding entries in ADS is 10 to 20 times slower than
>>> reading them.
>>
>>
>> Is that because they are written to many different forms during the 
>> one write?
>>
>> It would be neat if it could just write one form per configuration,
>> assuming a certain usage scenario.
> 
> It's up to you : just don't add indices. But then performance will suck 
> big time. For every search not using an index, the cost is a full scan. 
> There is no free beer.

OK - Cool - Because there are essentially two DAS usage scenarios.

Scenario A where someone just wants to do regular CRUD
using the entire DataGraph instance (thus not searching) and B that 
searches/filters the members of a persisted DataGraph instance
during its recreation.

> 
> Emmanuel
> 
Thanks,
- Ole



Posted by Ole Ersoy <ol...@gmail.com>.

Emmanuel Lecharny wrote:
> Ole Ersoy a écrit :
> 
>> So the Normalized form is always minimized?
> 
> I think we already answered this question. This is really common sense.

Well when I asked what can be done to minimize the size of an entry
you said nothing.

To me that means that ADS stores entries in at least one form that is
minimized structurally (structurally meaning only the parts used to look up
data that is being retrieved), so I'm just confirming.

So what I would appreciate is just a "Yes", unless I'm missing something.

Cheers,
- Ole




Posted by Emmanuel Lecharny <el...@gmail.com>.
Ole Ersoy a écrit :

> So the Normalized form is always minimized?

I think we already answered this question. This is really common sense.

>
>> For performance reasons, we should not compute the normalized form 
>> each time we do a search operation; this would kill the server. Keep 
>> in mind that an LDAP server is 99.99% read, and less than 0.01% write. 
>
>
> Although once the DAS gets done, people could start using
> ADS as an RDB.

An LDAP server is *not* an RDB. Whatever you use it for, this is a 
hierarchical database.

> <snip/>
>
>> So it is definitely worth the price to spend a *lot* of time writing 
>> the data twice, in two different forms, rather than do a computation
>> for each search. Adding entries in ADS is 10 to 20 times slower than
>> reading them.
>
>
> Is that because they are written to many different forms during the 
> one write?
>
> It would be neat if it could just write one form per configuration,
> assuming a certain usage scenario.

It's up to you : just don't add indices. But then performance will suck 
big time. For every search not using an index, the cost is a full scan. 
There is no free beer.

Emmanuel


Posted by Ole Ersoy <ol...@gmail.com>.

Emmanuel Lecharny wrote:
SNIP
> well, the short answer is 'no'. As stated by alex, we should keep two 
> forms : the user provided form, and the normalized form. 

So the Normalized form is always minimized?

> For performance 
> reasons, we should not compute the normalized form each time we do a 
> search operation; this would kill the server. Keep in mind that an LDAP 
> server is 99.99% read, and less than 0.01% write. 

Although once the DAS gets done, people could start using
ADS as an RDB.

What I'm really trying to understand is whether
the server can be setup to do something like this,
because I think it would minimize an in
memory partition.

Right now I'm just thinking that I have an attribute
value I want to get.  I have the DN of the entry and
I know which attribute.

The DN: ou=blah ou=blah ou=blah

The Attribute Name: 
com.example.blah.blah.blah.blah.blah.blahblah.blahhhhhhh.something

So I tell JNDI to go and get this.

So it tells ADS.

Then ADS looks up
com.example.blah.blah.blah.blah.blah.blahblah.blahhhhhhh.something

in a map<name, key>
where name is the attribute name
com.example.blah.blah.blah.blah.blah.blahblah.blahhhhhhh.something
and the value is the key used to look up the value of

com.example.blah.blah.blah.blah.blah.blahblah.blahhhhhhh.something
in the entry.

So It uses the map to look this up,
and it gets a return value like "500".

Then it gets the entry with DN:
ou=blah ou=blah ou=blah

And uses the key "500"
to look up the value of
com.example.blah.blah.blah.blah.blah.blahblah.blahhhhhhh.something

This is just my impression of what would be fast and use little memory.
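
The lookup I have in mind, written out as a small Java sketch.  It is only my
mental model, not how ADS works: the class NameToKeyLookupSketch, the starting
key 500 and the in-memory maps are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class NameToKeyLookupSketch {
    private final Map<String, Integer> nameToKey = new HashMap<>();           // map<name, key>
    private final Map<String, Map<Integer, String>> entriesByDn = new HashMap<>();
    private int nextKey = 500;                                                // arbitrary start

    /** Stores a value under the short key assigned to the (long) attribute name. */
    void put(String dn, String attributeName, String value) {
        Integer key = nameToKey.get(attributeName);
        if (key == null) {
            key = nextKey++;
            nameToKey.put(attributeName, key);
        }
        entriesByDn.computeIfAbsent(dn, d -> new HashMap<>()).put(key, value);
    }

    /** name -> key, then DN -> entry, then key -> value. */
    String get(String dn, String attributeName) {
        Integer key = nameToKey.get(attributeName);
        Map<Integer, String> entry = entriesByDn.get(dn);
        return (key == null || entry == null) ? null : entry.get(key);
    }

    Integer keyOf(String attributeName) {
        return nameToKey.get(attributeName);
    }
}
```

The long attribute name is stored once in the map, and every entry only holds
the short integer key.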

I'm sorry, I have not really had time to understand the search 
aspects yet.  I've read Alex's mail twice already, but I need to break
it down more for the concepts to sink in.  So I hope it's OK that
I write it down as I understand it.  I just want to make sure I throw
it out there as clearly as I can in case it could be useful.

For the DAS just reading and writing DataGraphs I think this type
of partition architecture would perform really well, but for
search I can see how it can be very different.

> So it is definitely worth 
> the price to spend a *lot* of time writing the data twice, in two 
> different forms, rather than do a computation for each search. Adding
> entries in ADS is 10 to 20 times slower than reading them.

Is that because they are written to many different forms during the one 
write?

It would be neat if it could just write one form per configuration,
assuming a certain usage scenario.

> 
>>
>> So when I'm using JNDI to update an attribute, and the
>> key of my Attribute is 
>> "com.example.blah.blah.blah.blah.blah.blahblah.blahhhhhhh.something"
>> ApacheDS takes that key and turns it into the shortest possible number
>> it can before storing it?
> 
> yes. We have the exact equivalent of "sequences" in Oracle, and we use 
> them to give a Long for each entry, and for each indexed attribute.

OK - Now it sort of sounds like we are saying the same thing I think.

> 
>>
>> But is this really important ? Just think about the 80/20 rule
>>
>>> (and it's much closer to a 95/5) : 20 percent of all entries will be 
>>> accessed 80% of the time. A good cache will usually give you the same 
>>> result (or close to) as if you put everything in memory. This is very 
>>> basic IT theory...
>>
>>
>> Yes - Totally - For search operations that type of tweaking is awesome
>> and effective.  It applies to Supply Chain Applications some times, 
>> and other times all the data is fair game.  For instance
>> the application might be calculating Optimal Inventory Figures for all
>> SKUs and wants to do the run "Superfast", so it wants all the data
>> in memory.
> 
> Maybe in Supply Chain Apps. But an LDAP server is totally different. 
> Don't think like if you only have a hammer... In your case, LDAP being 
> very fast, even compared to an RDBMS, it might be interesting to use it. 
> But you should also consider other elements, like the cost of writing in 
> it, and the cost of a traversal (doing a full scan).

Yeah - I left that part of the DAS Design Guide out for now :-)
I'll start thinking through how searches are done as soon
as I have a prototype for just reading and writing DataGraph instances.

Thanks for putting up with all my "Brain Queries".

- Ole

SNIP


Posted by Emmanuel Lecharny <el...@gmail.com>.
On 4/7/07, Alex Karasulu <ak...@apache.org> wrote:
>
> Hi Emmanuel,
>
> You cited 10-20 times slower write operations compared to search in
> ApacheDS.  Is this from some performance metrics you did?
>

Yes, done last October. We were able to add up to 200 entries per second,
while we were able to read around 1000/s. But the write operation costs much
more if you add some indices, and also if you have big trees, because the
indices get bigger (due to the fact that we serialize them before
storing them).
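
As a quick sanity check on those two figures, here is the arithmetic as a
small Java sketch.  ThroughputEstimate is a hypothetical helper; the 200M
entry count is borrowed from the example used earlier in this thread and the
rates are the ones quoted above, without indices.

```java
class ThroughputEstimate {
    /** How many times faster reads are than adds. */
    static double ratio(double readsPerSecond, double addsPerSecond) {
        return readsPerSecond / addsPerSecond;
    }

    /** Seconds needed to add the given number of entries at a constant rate. */
    static double secondsToAdd(long entries, double addsPerSecond) {
        return entries / addsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println(ratio(1000.0, 200.0));                 // 5.0
        System.out.println(secondsToAdd(200_000_000L, 200.0));    // 1000000.0
    }
}
```

So the raw figures give a factor of 5, and the 10 to 20 times mentioned
earlier would correspond to the case with extra indices and bigger trees.  At
200 adds per second, loading 200M entries would take about a million seconds,
roughly 11.6 days.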

> I'd love to see those if you have them.  Perhaps
> we can put all this info onto a performance page.  I think people will be
> curious about
> this.
>

Very true. This is something we have to add somewhere on the site.

-- 
Cordialement,
Emmanuel Lécharny
www.iktek.com


Posted by Alex Karasulu <ak...@apache.org>.
Hi Emmanuel,

On 4/6/07, Emmanuel Lecharny <el...@gmail.com> wrote:
>
> Ole Ersoy a écrit :
>
> >
> > So when I'm using JNDI to update an attribute, and the
> > key of my Attribute is
> > "com.example.blah.blah.blah.blah.blah.blahblah.blahhhhhhh.something"
> > ApacheDS takes that key and turns it into the shortest possible number
> > it can before storing it?
>
> yes. We have the exact equivalent of "sequences" in Oracle, and we use
> them to give a Long for each entry, and for each indexed attribute.


Perhaps you're referring to the DN and the entryID sequence.  We use OIDs for
attributeTypes instead of alias names like cn and commonName, which may vary
along with their case.

You cited 10-20 times slower write operations compared to search in ApacheDS.
Is this from some performance metrics you did?  I'd love to see those if you
have them.  Perhaps we can put all this info onto a performance page.  I think
people will be curious about this.

Alex


Posted by Emmanuel Lecharny <el...@gmail.com>.
Ole Ersoy a écrit :

> Another thing just popped into my head.
>
> What if the DAS had a configuration option
> telling ADS what forms to write entries in.
>
> For instance sometimes the user will know
> that they only want the normalized form
> written...
>
> Thoughts?

Then the user has to create normalized entries. It won't change the way 
data is stored inside ADS. Even if the client sends normalized 
data to the server, it will still be stored in two forms : user 
provided (here, normalized) and normalized.
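
A tiny sketch of the "two forms" idea, in Java.  DualFormValue is a
hypothetical class and its normalizer (trim, collapse spaces, lower-case) is a
simplification; the point is only that the normalized form is computed once at
write time and reused by every later comparison.

```java
import java.util.Locale;

class DualFormValue {
    final String userProvided;  // returned to clients exactly as written
    final String normalized;    // computed once, used for every comparison

    DualFormValue(String userProvided) {
        this.userProvided = userProvided;
        this.normalized = userProvided.trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ROOT);
    }

    /** Compares against an assertion value that has already been normalized. */
    boolean matches(String normalizedAssertion) {
        return normalized.equals(normalizedAssertion);
    }
}
```

If the client already sends "ole ersoy", both fields simply hold the same
text; the storage layout does not change.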

Emmanuel


Posted by Ole Ersoy <ol...@gmail.com>.
Another thing just popped into my head.

What if the DAS had a configuration option
telling ADS what forms to write entries in.

For instance sometimes the user will know
that they only want the normalized form
written...

Thoughts?

Thanks,
- Ole






Posted by Emmanuel Lecharny <el...@gmail.com>.
Ole Ersoy a écrit :

> What I mean by that is what is the minimum
> size of an entry considering only the fixed parts
> or the keys of the entry effectively + anything
> else that is needed to manage the entry that has
> to be stored in memory.

Then it's around 1kb. Of course, this is a rough estimation.

> So what can be done to minimize the size?
>
>>
>> Nothing. 
>
>
> So what that then means is that the structural component of
> an entry is always minimized within ApacheDS?

Well, the short answer is 'no'. As stated by Alex, we should keep two 
forms : the user provided form, and the normalized form. For performance 
reasons, we should not compute the normalized form each time we do a 
search operation; this would kill the server. Keep in mind that an LDAP 
server is 99.99% read, and less than 0.01% write. So it is definitely worth 
the price to spend a *lot* of time writing the data twice, in two 
different forms, rather than do a computation for each search. Adding
entries in ADS is 10 to 20 times slower than reading them.

>
> So when I'm using JNDI to update an attribute, and the
> key of my Attribute is 
> "com.example.blah.blah.blah.blah.blah.blahblah.blahhhhhhh.something"
> ApacheDS takes that key and turns it into the shortest possible number
> it can before storing it?

Yes. We have the exact equivalent of "sequences" in Oracle, and we use 
them to give a Long for each entry, and for each indexed attribute.

>
> But is this really important ? Just think about the 80/20 rule
>
>> (and it's much closer to a 95/5) : 20 percent of all entries will be 
>> accessed 80% of the time. A good cache will usually give you the same 
>> result (or close to) as if you put everything in memory. This is very 
>> basic IT theory...
>
>
> Yes - Totally - For search operations that type of tweaking is awesome
> and effective.  It applies to Supply Chain Applications some times, 
> and other times all the data is fair game.  For instance
> the application might be calculating Optimal Inventory Figures for all
> SKUs and wants to do the run "Superfast", so it wants all the data
> in memory.

Maybe in Supply Chain Apps. But an LDAP server is totally different. 
Don't think like if you only have a hammer... In your case, LDAP being 
very fast, even compared to an RDBMS, it might be interesting to use it. 
But you should also consider other elements, like the cost of writing in 
it, and the cost of a traversal (doing a full scan).

> Yes - I usually go for 1 first, then pump it up :-)
> I like the band too.

I prefer dead kennedys (http://www.deadkennedys.com/), but I'm a psycho %-]

Emmanuel


Posted by Ole Ersoy <ol...@gmail.com>.

Emmanuel Lecharny wrote:
> Ole Ersoy a écrit :
> 
> It's not theoretical, it's real life.

Yes - The transactional data is :-)

I should have defined it a little better.

What I mean by that is what is the minimum
size of an entry considering only the fixed parts
or the keys of the entry effectively + anything
else that is needed to manage the entry that has
to be stored in memory.


> 
>>
>> So what can be done to minimize the size?
> 
> Nothing. 

So what that then means is that the structural component of
an entry is always minimized within ApacheDS?

So when I'm using JNDI to update an attribute, and the
key of my Attribute is 
"com.example.blah.blah.blah.blah.blah.blahblah.blahhhhhhh.something"
ApacheDS takes that key and turns it into the shortest possible number
it can before storing it?

> But is this really important ? Just think about the 80/20 rule
> (and it's much closer to a 95/5) : 20 percent of all entries will be 
> accessed 80% of the time. A good cache will usually give you the same 
> result (or close to) as if you put everything in memory. This is very 
> basic IT theory...

Yes - Totally - For search operations that type of tweaking is awesome
and effective.  It applies to Supply Chain Applications some times, and 
other times all the data is fair game.  For instance
the application might be calculating Optimal Inventory Figures for all
SKUs and wants to do the run "Superfast", so it wants all the data
in memory.
> 
>>
>> I think Alex's mail and your link dive
>> into that, so I'll try to consolidate those materials.
> 
> One more principle you should try to follow, as a general rule of thumb 
> : KISS (http://en.wikipedia.org/wiki/KISS_principle). Complexity and 
> over-engineered solutions simply fail to meet any target ... When it 
> works for 1000 entries, it should be possible to make it work for 10 
> 000. When it fails for 10 000, the odds are that it will also fail for 
> 1000...

Yes - I usually go for 1 first, then pump it up :-)
I like the band too.


> 
>>
>> If anyone else has thoughts, I'll take them!
> 
> Here you are !
>

Thanks Man,
- Ole



>>
>> Cheers,
>> - Ole
>>
> 
> 


Posted by Emmanuel Lecharny <el...@gmail.com>.
Ole Ersoy a écrit :

> Hmmm - Emmanuel - you said something
> interesting.
>
> The size of each entry is roughly 1kb.

It will depend on many parameters, but you have to keep in mind that you
have the external form (i.e., user provided) and the normalized form. For
instance, a DN is kept in both forms. If a user has a JpegPhoto, 1kb is
not a minimum, it's simply impossible... Even with objectclasses (around
3 values), operational attributes (creation date, creator, etc.),
description, SN, cn, etc., this will be at least a few hundred bytes.
Just look at the size of any mail VCF card :)

>
> When I start working on the ADS Design Guide
> there will be a section in there on the theoretical
> minimum size of an entry.

It's not theoretical, it's real life.

>
> So what can be done to minimize the size?

Nothing. But is this really important? Just think about the 80/20 rule
(and it's much closer to a 95/5): 20 percent of all entries will be
accessed 80% of the time. A good cache will usually give you the same
result (or close to it) as if you put everything in memory. This is very
basic IT theory...
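
A minimal sketch of the caching idea above, assuming a simple
least-recently-used policy. The class name EntryCache is hypothetical and
this is not how ApacheDS implements its caches; it only illustrates keeping
the "hot" fraction of entries in memory while the rest stays on disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache: keeps the most recently used entries, evicts the rest. */
final class EntryCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    EntryCache(int capacity) {
        // accessOrder = true: iteration runs from least to most recently accessed
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after each insertion; returning true drops the least recently used entry
        return size() > capacity;
    }
}
```

With a capacity sized to the frequently accessed fraction of the data, most
lookups would be served from memory and only the remainder would hit the
backing store.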

>
> I think the Alex's mail and your link drive
> into that, so I'll try to consolidate those materials.

One more principle you should try to follow, as a general rule of thumb:
KISS (http://en.wikipedia.org/wiki/KISS_principle). Complexity and
over-engineered solutions simply fail to meet any target ... When it
works for 1000 entries, it should be possible to make it work for 10
000. When it fails for 10 000, the odds are that it will also fail for
1000...

>
> If anyone else has thoughts, I'll take them!

Here you are !

>
> Cheers,
> - Ole
>


Re: [ApacheDS][Partition] Using surrogate keys for attributeType aliases and objectClass aliases (was Re: [SCHEMA] Can two different LDAP AttributeType's have the same name?)

Posted by Ole Ersoy <ol...@gmail.com>.
Hmmm - Emmanuel - you said something
interesting.

The size of each entry is roughly 1kb.

When I start working on the ADS Design Guide
there will be a section in there on the theoretical
minimum size of an entry.

So what can be done to minimize the size?

I think Alex's mail and your link go
into that, so I'll try to consolidate those materials.

If anyone else has thoughts, I'll take them!

Cheers,
- Ole





Re: [ApacheDS][Partition] Using surrogate keys for attributeType aliases and objectClass aliases (was Re: [SCHEMA] Can two different LDAP AttributeType's have the same name?)

Posted by Emmanuel Lecharny <el...@gmail.com>.
On 4/6/07, Ole Ersoy <ol...@gmail.com> wrote:
>
> <snip/>
> A little later I need to break it down further
> so that I understand the whole process from
> a sequence Diagram view point.


The sequence diagram will not fit on an A-3 page (note that I didn't write A3: A4
< A3 < A2... < A0 < A-1 < A-2 < A-3 etc.) (btw, the largest available
format for paper is A0 atm, and Ai = 2xA(i+1) in area ...)

<snip/>
>
> <snip/>


> My goal is to keep the 200M entries in Memory.


Forget about the idea of storing 200M entries in memory. This is just
impossible. An entry is around 1 KB, and you won't ever have a 200 GB memory
server ...
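
The arithmetic behind that figure, as a small sketch. It assumes a flat
1 KB (taken as 1000 bytes) per entry and ignores JVM object overhead and
indices, so the real footprint would be higher. The class name
MemoryEstimate is hypothetical.

```java
/** Hypothetical helper for a back-of-the-envelope memory estimate. */
final class MemoryEstimate {
    /** Raw payload size in bytes; throws ArithmeticException on overflow. */
    static long totalBytes(long entryCount, long bytesPerEntry) {
        return Math.multiplyExact(entryCount, bytesPerEntry);
    }

    public static void main(String[] args) {
        // 200M entries at roughly 1 KB each, as estimated in the message above
        long bytes = totalBytes(200_000_000L, 1_000L);
        System.out.println(bytes / 1_000_000_000L + " GB"); // prints "200 GB"
    }
}
```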


> So I want to have them as compact as possible.
> When I write the entries using JNDI I'm using
>
> org.apache.tuscany.DASConfig.baseDN as the attribute
> key for one of the entry values.
>
> However I would much rather store something shorter
> than this in memory, like "1".
>
> I think you are saying the OID name alias,
> org.apache.tuscany.DASConfig.baseDN, gets switched
> out with the OID by the server.
>
> So instead of storing
>
> [org.apache.tuscany.DASConfig.baseDN, myValue]
>
> in memory, it stores:
>
> [1.24l2.3.4.2.4, myValue]


This is something that will be available in 2.0. Alex and I have discussed it
over the last three months, and I think we will have a level of
indirection. Basically, if you have 500 attributeTypes pointing to 200M
entries, then you will have an N-N relation between ATs and entries. This will
be solved with an intermediate table with Longs in it:
AT-Long / Entry-Long
where AT-Long represents the AttributeType's unique ID in ADS, and likewise
Entry-Long for the entry.

This is explained here :
http://cwiki.apache.org/confluence/display/DIRxSRVx11/Backend
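
As a rough illustration of the intermediate table idea, here is an
in-memory sketch of an N-N link between AT-Long and Entry-Long IDs. The
class and method names are hypothetical and this is not the ApacheDS
backend design; a real partition would keep these pairs in on-disk indices
rather than in hash maps.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical N-N link table between attributeType IDs and entry IDs. */
final class AttributeEntryLinks {
    private final Map<Long, Set<Long>> entriesByAttributeType = new HashMap<>();
    private final Map<Long, Set<Long>> attributeTypesByEntry = new HashMap<>();

    /** Records one (AT-Long, Entry-Long) pair, indexed in both directions. */
    void link(long atId, long entryId) {
        entriesByAttributeType.computeIfAbsent(atId, k -> new TreeSet<>()).add(entryId);
        attributeTypesByEntry.computeIfAbsent(entryId, k -> new TreeSet<>()).add(atId);
    }

    /** All entry IDs that carry the given attributeType ID. */
    Set<Long> entriesUsing(long atId) {
        return entriesByAttributeType.getOrDefault(atId, Collections.emptySet());
    }

    /** All attributeType IDs present on the given entry ID. */
    Set<Long> attributeTypesOf(long entryId) {
        return attributeTypesByEntry.getOrDefault(entryId, Collections.emptySet());
    }
}
```

Only two longs are stored per pair, regardless of how long the
attributeType name or the entry DN is.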

-- 
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: [ApacheDS][Partition] Using surrogate keys for attributeType aliases and objectClass aliases (was Re: [SCHEMA] Can two different LDAP AttributeType's have the same name?)

Posted by Ole Ersoy <ol...@gmail.com>.
Alex,

So...um...Does it do the thing?

:-) Just kidding.

Wow - That's what I call an answer.

I think we need a performance design guide,
that's a sub book of a global design guide
for this type of "Smoookkkin" material.

A little later I need to break it down further
so that I understand the whole process from
a sequence Diagram view point.

Let me see if I can re-answer my question now
that I'm more enlightened.  I need to read
your material a few more times though for it to sink in
properly, so this is a "Trial" attempt.

I want to store 200M entries that all use the same
set of object classes to define
the set of entry attributes.

One of the attributes I'm storing has the OID
name alias org.apache.tuscany.DASConfig.baseDN.
The OID for this AttributeType is 1.24l2.3.4.2.4 (Just made it up).

My goal is to keep the 200M entries in Memory.
So I want to have them as compact as possible.
When I write the entries using JNDI I'm using

org.apache.tuscany.DASConfig.baseDN as the attribute
key for one of the entry values.

However I would much rather store something shorter
than this in memory, like "1".

I think you are saying the OID name alias,
org.apache.tuscany.DASConfig.baseDN, gets switched
out with the OID by the server.

So instead of storing

[org.apache.tuscany.DASConfig.baseDN, myValue]

in memory, it stores:

[1.24l2.3.4.2.4, myValue]

Am I getting warmer?

Then the other thing I was thinking was
that 1.24l2.3.4.2.4 is still pretty long.

If I had an in memory partition for all the entries
and the entire set of entries only used say "500"
unique AttributeTypes, I think using
surrogate keys numbered from 1 to 500 would
result in a lot of memory savings and a performance increase.

Because we are only storing some number X ranging
from 1 to 500, 200 million times, rather than a bigger string like
1.24l2.3.4.2.4 200 million times.

Then whenever a query takes place for
1.24l2.3.4.2.4, it's converted into "1",
and then "1" is used to look for entry attributes.
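
A minimal sketch of what such a surrogate key table might look like,
assuming IDs are handed out sequentially starting at 1. The class
SurrogateKeyRegistry and its methods are hypothetical and not part of
ApacheDS.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-way mapping between attributeType OIDs and small integer IDs. */
final class SurrogateKeyRegistry {
    private final Map<String, Integer> idByOid = new HashMap<>();
    private final List<String> oidById = new ArrayList<>();

    /** Returns the existing ID for an OID, or assigns the next one (starting at 1). */
    int idFor(String oid) {
        Integer existing = idByOid.get(oid);
        if (existing != null) {
            return existing;
        }
        oidById.add(oid);
        int id = oidById.size(); // the first OID registered gets 1
        idByOid.put(oid, id);
        return id;
    }

    /** Reverse lookup; throws IndexOutOfBoundsException for an unassigned ID. */
    String oidFor(int id) {
        return oidById.get(id - 1);
    }
}
```

Each OID string is then stored once in the registry, and the entries
themselves only need to carry the small integer.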

Does that make any sense?

Thanks,
- Ole

SNIP