You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@directory.apache.org by Emmanuel Lecharny <el...@gmail.com> on 2011/10/08 08:27:45 UTC

RDN index, oneLevel and sublevel index merge

Hi guys,

Stefan started to modify the code to get rid of the oneLevel and 
subLevel index, which are more or less useless as we already have the 
hierarchy stored into the rdn index.

However, this rdn index is not good enough as is to be use as a 
replacement for the two other indexes. Its structure forbid us to easily 
retrieve the children from a known entry.

The current RDN index structure is :
<parentId, RDN> -> Entry ID

The key is a tuple containing the parent ID to be able to rebuild the DN.

The reverse index is :
Entry ID -> <parentId, RDN>

We don't have duplicated values.

Now, when we have an entry ID, there is no simple way to get the list of 
all the children for this entry.

We will have to add a third index to deal with such searches :
ParentId -> <entryId, ....>

which will list all the children of a specific entry.

I'm going to investigate around this idea i the next few days.

Thoughts ?

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: RDN index, oneLevel and sublevel index merge

Posted by Emmanuel Lecharny <el...@gmail.com>.

On 10/19/11 12:01 AM, Stefan Seelmann wrote:
> On Sun, Oct 9, 2011 at 7:00 PM, Emmanuel Lecharny<el...@apache.org>  wrote:
>>
>>
>>> I try to update the branch tomorrow, but can't promise anything.
>>
>> ok, many thanks ! I tested the branch, there are some failing tests in
>> core-integ, but had little time to investigate.
> Yes, some tests are still failing. For example I observed some kind of
> deadlock when recursively deleting entries in a deep nested tree.
>
> But unfortunately I didn't manage to continue with the branch. I tried
> to merge the recent changes from trunk but come conflicts occured. I
> think it doesn't make sense to continue for me because due the the
> high trunk activity it takes too much time to merge in the trunk
> changes.

Yeah, unfortunatly, we have had to move around a hell lot of packages in 
order to do the OSGi changes. I think it's now stabilizing, but we 
probably will continue in this area for a few more days, so it's better 
not to try to merge anything yet.


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: RDN index, oneLevel and sublevel index merge

Posted by Stefan Seelmann <se...@apache.org>.

On Sun, Oct 9, 2011 at 7:00 PM, Emmanuel Lecharny <el...@apache.org> wrote:
>
>
> On Sat, Oct 8, 2011 at 2:52 PM, Stefan Seelmann <se...@apache.org> wrote:
>>
>> Hi Emmanuel,
>>
>> On Sat, Oct 8, 2011 at 8:27 AM, Emmanuel Lecharny <el...@gmail.com>
>> wrote:
>> > Hi guys,
>> >
>> > Stefan started to modify the code to get rid of the oneLevel and
>> > subLevel
>> > index, which are more or less useless as we already have the hierarchy
>> > stored into the rdn index.
>> >
>> > However, this rdn index is not good enough as is to be use as a
>> > replacement
>> > for the two other indexes. Its structure forbid us to easily retrieve
>> > the
>> > children from a known entry.
>> >
>> > The current RDN index structure is :
>> > <parentId, RDN> -> Entry ID
>> >
>> > The key is a tuple containing the parent ID to be able to rebuild the
>> > DN.
>> >
>> > The reverse index is :
>> > Entry ID -> <parentId, RDN>
>> >
>> > We don't have duplicated values.
>> >
>> > Now, when we have an entry ID, there is no simple way to get the list of
>> > all
>> > the children for this entry.
>> >
>> > We will have to add a third index to deal with such searches :
>> > ParentId -> <entryId, ....>
>> >
>> > which will list all the children of a specific entry.
>> >
>> > I'm going to investigate around this idea i the next few days.
>> >
>> > Thoughts ?
>>
>> Do we really need the additional index? My idea was to avoid this
>> additional index. Therefor I introduced a new cursor, the
>> RdnIndexTreeCursor [1] together with a RdnIndexHelper [2].
>>
>> Let's take the following example DIT:
>>
>> dc=example,dc=com (1)
>> * ou=people (3)
>> ** uid=elecharny (4)
>> ** uid=seelmann (6)
>> * ou=groups (2)
>> ** ou=directory (5)
>> ** ou=mina (7)
>>
>> The corresponding RDN index looks like this:
>>
>> <parentId, RDN> -> <entryId>
>> 0,dc=example,dc=com -> 1
>> 1,ou=groups -> 2
>> 1,ou=people -> 3
>> 2,ou=directory -> 5
>> 2,ou=mina -> 7
>> 3,uid=elecharny -> 4
>> 3,uid=seelmann -> 6
>>
>> One-level traversal works as follows: the RdnIndexTreeCursor is
>> initialzed with the parentId (say 2 for ou=groups). To be able to
>> limit the cursor's range a start element and a stop element are
>> created. Both are ParentIdAndRdn objects with the same parentId (in
>> our example 2). The start element's RDN is null. The stop elements's
>> RDN is a special RDN zzz=zzz. Those ements are used to position the
>> cursor with first()/beforeFirst()/last()/afterLast(). The next() and
>> previous() methods check that the range isn't exceeded.
>
>
> Do you mean that thos elements are created and injected in the RDN index for
> each entry having children ?

No, they are just used just used to position the cursor.

>> Sub-Level traversal is a bit more complex but still straight-forward.
>> As the search base is included in the search results the base
>> ParentIdAndRdn object is used as start and stop element (say
>> <1,ou=groups>). Depth-first traversal is used and a stack of cursors
>> is maintained. When advancing the cursor it checks if the current
>> entry has children, if that's the case a sub-cursor with a limited
>> range (like the one-level) is opened and put to the stack.
>>
>> The RdnIndexTreeCursor is used by OneLevelScopeCursor and SubScopeCursor.
>>
>> The OneLevelScopeEvaluator and SubScopeEvaluator use the
>> RdnIndexHelper, especially the methods isDirectDescendantOf(ID
>> ancestorId, ID descendantId) and isDescendantOf(ID ancestorId, ID
>> descendantId). They just do reverse lookups from the descendantId,
>> till the ancestor is found. It is possible to cache those lookups for
>> branch entries so that only one lookup is required.
>>
>> Finally the DefaultOptimizer uses the RdnIndexHelper to determine the
>> sub-level count. Therefor the ParentIdAndRdn now contains two
>> additional fields oneLevelCount and subLevelCount that are
>> incremented/decremented when adding/removing/moving entries to/from
>> the partition.
>>
>> Let's talk about the downside: This solution should work for flat
>> trees, but not for very deep trees. For very deep trees the
>> RdnIndexTreeCursor needs to maintain a large number of opened cursors.
>
>
> we can close the cursors when they are not used anyöore. Doing a width first
> search, we will only create as many cursor as we need for one single level.
> This should not be overkilling.
>
>>
>> Also the caching of the isDescendantOf information would require more
>> memory for deep trees.
>>
>> Do you think that is a feasible approach?
>
>
> it should work...
>>
>> I try to update the branch tomorrow, but can't promise anything.
>
>
> ok, many thanks ! I tested the branch, there are some failing tests in
> core-integ, but had little time to investigate.

Yes, some tests are still failing. For example I observed some kind of
deadlock when recursively deleting entries in a deep nested tree.

But unfortunately I didn't manage to continue with the branch. I tried
to merge the recent changes from trunk but come conflicts occured. I
think it doesn't make sense to continue for me because due the the
high trunk activity it takes too much time to merge in the trunk
changes.

Kind Regards,
Stefan

Re: RDN index, oneLevel and sublevel index merge

Posted by Stefan Seelmann <se...@apache.org>.

Hi Emmanuel,

On Sat, Oct 8, 2011 at 8:27 AM, Emmanuel Lecharny <el...@gmail.com> wrote:
> Hi guys,
>
> Stefan started to modify the code to get rid of the oneLevel and subLevel
> index, which are more or less useless as we already have the hierarchy
> stored into the rdn index.
>
> However, this rdn index is not good enough as is to be use as a replacement
> for the two other indexes. Its structure forbid us to easily retrieve the
> children from a known entry.
>
> The current RDN index structure is :
> <parentId, RDN> -> Entry ID
>
> The key is a tuple containing the parent ID to be able to rebuild the DN.
>
> The reverse index is :
> Entry ID -> <parentId, RDN>
>
> We don't have duplicated values.
>
> Now, when we have an entry ID, there is no simple way to get the list of all
> the children for this entry.
>
> We will have to add a third index to deal with such searches :
> ParentId -> <entryId, ....>
>
> which will list all the children of a specific entry.
>
> I'm going to investigate around this idea i the next few days.
>
> Thoughts ?

Do we really need the additional index? My idea was to avoid this
additional index. Therefor I introduced a new cursor, the
RdnIndexTreeCursor [1] together with a RdnIndexHelper [2].

Let's take the following example DIT:

dc=example,dc=com (1)
* ou=people (3)
** uid=elecharny (4)
** uid=seelmann (6)
* ou=groups (2)
** ou=directory (5)
** ou=mina (7)

The corresponding RDN index looks like this:

<parentId, RDN> -> <entryId>
0,dc=example,dc=com -> 1
1,ou=groups -> 2
1,ou=people -> 3
2,ou=directory -> 5
2,ou=mina -> 7
3,uid=elecharny -> 4
3,uid=seelmann -> 6

One-level traversal works as follows: the RdnIndexTreeCursor is
initialzed with the parentId (say 2 for ou=groups). To be able to
limit the cursor's range a start element and a stop element are
created. Both are ParentIdAndRdn objects with the same parentId (in
our example 2). The start element's RDN is null. The stop elements's
RDN is a special RDN zzz=zzz. Those ements are used to position the
cursor with first()/beforeFirst()/last()/afterLast(). The next() and
previous() methods check that the range isn't exceeded.

Sub-Level traversal is a bit more complex but still straight-forward.
As the search base is included in the search results the base
ParentIdAndRdn object is used as start and stop element (say
<1,ou=groups>). Depth-first traversal is used and a stack of cursors
is maintained. When advancing the cursor it checks if the current
entry has children, if that's the case a sub-cursor with a limited
range (like the one-level) is opened and put to the stack.

The RdnIndexTreeCursor is used by OneLevelScopeCursor and SubScopeCursor.

The OneLevelScopeEvaluator and SubScopeEvaluator use the
RdnIndexHelper, especially the methods isDirectDescendantOf(ID
ancestorId, ID descendantId) and isDescendantOf(ID ancestorId, ID
descendantId). They just do reverse lookups from the descendantId,
till the ancestor is found. It is possible to cache those lookups for
branch entries so that only one lookup is required.

Finally the DefaultOptimizer uses the RdnIndexHelper to determine the
sub-level count. Therefor the ParentIdAndRdn now contains two
additional fields oneLevelCount and subLevelCount that are
incremented/decremented when adding/removing/moving entries to/from
the partition.

Let's talk about the downside: This solution should work for flat
trees, but not for very deep trees. For very deep trees the
RdnIndexTreeCursor needs to maintain a large number of opened cursors.
Also the caching of the isDescendantOf information would require more
memory for deep trees.

Do you think that is a feasible approach?

I try to update the branch tomorrow, but can't promise anything.

Kind Regards,
Stefan

[1] http://svn.apache.org/repos/asf/directory/apacheds/branches/one-sub-level-index-removal/xdbm-partition/src/main/java/org/apache/directory/server/xdbm/search/impl/RdnIndexTreeCursor.java
[2] http://svn.apache.org/repos/asf/directory/apacheds/branches/one-sub-level-index-removal/xdbm-partition/src/main/java/org/apache/directory/server/xdbm/search/impl/RdnIndexHelper.java
[3] http://svn.apache.org/repos/asf/directory/apacheds/branches/one-sub-level-index-removal/xdbm-partition/src/main/java/org/apache/directory/server/xdbm/ParentIdAndRdn.java

Re: RDN index, oneLevel and sublevel index merge

Posted by Howard Chu <hy...@symas.com>.

Howard Chu wrote:
> Emmanuel Lecharny wrote:
>> Hi guys,
>>
>> Stefan started to modify the code to get rid of the oneLevel and
>> subLevel index, which are more or less useless as we already have the
>> hierarchy stored into the rdn index.
>>
>> However, this rdn index is not good enough as is to be use as a
>> replacement for the two other indexes. Its structure forbid us to easily
>> retrieve the children from a known entry.
>>
>> The current RDN index structure is :
>> <parentId, RDN>   ->   Entry ID
>>
>> The key is a tuple containing the parent ID to be able to rebuild the DN.
>>
>> The reverse index is :
>> Entry ID ->   <parentId, RDN>
>>
>> We don't have duplicated values.
>>
>> Now, when we have an entry ID, there is no simple way to get the list of
>> all the children for this entry.
>>
>> We will have to add a third index to deal with such searches :
>> ParentId ->   <entryId, ....>
>>
>> which will list all the children of a specific entry.
>>
>> I'm going to investigate around this idea i the next few days.
>>
>> Thoughts ?
>
> Currently in OpenLDAP back-hdb and back-mdb, the DN index contains
>
> Entry ID ->  <parentID, RDN>  [,<child ID, RDN>  ...]
>
> So each entry's RDN is stored twice, once under its own entryID, and once
> under its parent's entryID. This allows top-down DN to ID lookups and
> bottom-up ID to DN lookups from a single index.
>
> (This is a bit different from the original layout described in 2003
> http://www.openldap.org/conf/odd-sfo-2003/proceedings.html )
>
I forgot to mention... Onelevel lookups are obviously trivial with this 
scheme. back-hdb and back-mdb differ in how they perform subordinate lookups. 
back-hdb recursively builds a list of subordinate members (and caches this 
list for future queries). back-hdb is ridiculously slow for subordinate 
lookups that haven't been cached yet.

back-mdb is a bit more straightforward; given a list of candidate entry IDs, 
it simply walks up from the entryID until it hits the base of the desired 
subtree (meaning candidate is present in the result set) or until it hits the 
root of the tree (candidate is not present). As a further refinement (not yet 
implemented) if a search yields an all-entries candidate list (unindexed 
search), it can just recursively descend from the base of the desired subtree 
and ignore the candidate list.

It turns out that building the cached IDLs in back-hdb is much slower than 
just reading the index in back-mdb. (Not to mention being a memory pig.)

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Re: RDN index, oneLevel and sublevel index merge

Posted by Howard Chu <hy...@symas.com>.

Emmanuel Lecharny wrote:
> Hi guys,
>
> Stefan started to modify the code to get rid of the oneLevel and
> subLevel index, which are more or less useless as we already have the
> hierarchy stored into the rdn index.
>
> However, this rdn index is not good enough as is to be use as a
> replacement for the two other indexes. Its structure forbid us to easily
> retrieve the children from a known entry.
>
> The current RDN index structure is :
> <parentId, RDN>  ->  Entry ID
>
> The key is a tuple containing the parent ID to be able to rebuild the DN.
>
> The reverse index is :
> Entry ID ->  <parentId, RDN>
>
> We don't have duplicated values.
>
> Now, when we have an entry ID, there is no simple way to get the list of
> all the children for this entry.
>
> We will have to add a third index to deal with such searches :
> ParentId ->  <entryId, ....>
>
> which will list all the children of a specific entry.
>
> I'm going to investigate around this idea i the next few days.
>
> Thoughts ?

Currently in OpenLDAP back-hdb and back-mdb, the DN index contains

Entry ID -> <parentID, RDN> [,<child ID, RDN> ...]

So each entry's RDN is stored twice, once under its own entryID, and once 
under its parent's entryID. This allows top-down DN to ID lookups and 
bottom-up ID to DN lookups from a single index.

(This is a bit different from the original layout described in 2003 
http://www.openldap.org/conf/odd-sfo-2003/proceedings.html )

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/