You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@directory.apache.org by Emmanuel Lécharny <el...@gmail.com> on 2012/04/12 15:01:45 UTC

[Index] SubLevel removal

Hi guys,
I'm currently working on the SubLevelIndex removal. This is slightly 
more complex than the OneLevel  index, and there are some choice to be 
made in this area.

Basically, the idea is that we will first search the entry which is the 
parent, then start fetching all the descendant from this point.

Considering we have such a tree :

cn=A
   cn=B
   cn=C
     cn=D
       cn=G
     cn=E
   cn=F
     cn=H

then a search for entries using a SUBTREE scope and a base DN 'cn=C, 
cn=A' will return the following set of entries (the filter is 
objectClass=*, to keep it simple) :
{
   cn=C,cn=A
   cn=D,cn=C,cn=A
   cn=G,cn=D,cn=C,cn=A
   cn=E,cn=C,cn=A
}

Note that the same search with a ONE scope would return cn=D,cn=C,cn=A 
and cn=E,cn=C,cn=A, excluding the cn=C,cn=A entry, which is the base for 
this search).

Now, we have two ways to get those entries :
- depth-first traversal. We fetch every entry, and if it has some 
children, then we go down one level, and so on recursively, until we 
have exhausted all the descendant. The consequence is that the entries 
will be returned in this order :

   cn=C,cn=A
   cn=D,cn=C,cn=A
   cn=G,cn=D,cn=C,cn=A
   cn=E,cn=C,cn=A

- Breadth-first traversal. This time, we exhaust all the children for 
the current level, and then we go down one level, read all the entries, 
etc. We never go down one level if all the entries at the same level 
have not been processed. The entries will be returned in this order :

   cn=C,cn=A
   cn=D,cn=C,cn=A
   cn=E,cn=C,cn=A
   cn=G,cn=D,cn=C,cn=A

The problem with the breadth-first approach is that it puts way more 
pressure on the memory, as we have to keep in memory all the entries 
that have children. With the depth-first approach, we just proceed with 
a new cursor when we have at least one children, so we will have as many 
cursors in memory as the tree's depth (so if we have a 10 levels tree, 
we will keep 10 cursors max). OTOH, the order might be a bit strange 
(even if there is no guarantee whatsoever that the server will return 
the entry in any given order).

IMO, the depth-first approach is the one which offers the best balance 
between performance and memory consumption. Also the return order is not 
that critical assuming that we have implemented the Sort control (which 
is not yet part of our implementation).

Any better idea, or any comments ?

Thanks !

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com


Re: [Index] SubLevel removal

Posted by Alex Karasulu <ak...@apache.org>.
On Thu, Apr 12, 2012 at 4:01 PM, Emmanuel Lécharny <el...@gmail.com>wrote:

> Hi guys,
> I'm currently working on the SubLevelIndex removal. This is slightly more
> complex than the OneLevel  index, and there are some choice to be made in
> this area.
>
> Basically, the idea is that we will first search the entry which is the
> parent, then start fetching all the descendant from this point.
>
> Considering we have such a tree :
>
> cn=A
>  cn=B
>  cn=C
>    cn=D
>      cn=G
>    cn=E
>  cn=F
>    cn=H
>
> then a search for entries using a SUBTREE scope and a base DN 'cn=C, cn=A'
> will return the following set of entries (the filter is objectClass=*, to
> keep it simple) :
> {
>  cn=C,cn=A
>  cn=D,cn=C,cn=A
>  cn=G,cn=D,cn=C,cn=A
>  cn=E,cn=C,cn=A
> }
>
> Note that the same search with a ONE scope would return cn=D,cn=C,cn=A and
> cn=E,cn=C,cn=A, excluding the cn=C,cn=A entry, which is the base for this
> search).
>
> Now, we have two ways to get those entries :
> - depth-first traversal. We fetch every entry, and if it has some
> children, then we go down one level, and so on recursively, until we have
> exhausted all the descendant. The consequence is that the entries will be
> returned in this order :
>
>  cn=C,cn=A
>  cn=D,cn=C,cn=A
>  cn=G,cn=D,cn=C,cn=A
>  cn=E,cn=C,cn=A
>
> - Breadth-first traversal. This time, we exhaust all the children for the
> current level, and then we go down one level, read all the entries, etc. We
> never go down one level if all the entries at the same level have not been
> processed. The entries will be returned in this order :
>
>  cn=C,cn=A
>  cn=D,cn=C,cn=A
>  cn=E,cn=C,cn=A
>  cn=G,cn=D,cn=C,cn=A
>
> The problem with the breadth-first approach is that it puts way more
> pressure on the memory, as we have to keep in memory all the entries that
> have children. With the depth-first approach, we just proceed with a new
> cursor when we have at least one children, so we will have as many cursors
> in memory as the tree's depth (so if we have a 10 levels tree, we will keep
> 10 cursors max). OTOH, the order might be a bit strange (even if there is
> no guarantee whatsoever that the server will return the entry in any given
> order).
>
> IMO, the depth-first approach is the one which offers the best balance
> between performance and memory consumption. Also the return order is not
> that critical assuming that we have implemented the Sort control (which is
> not yet part of our implementation).
>
> Any better idea, or any comments ?
>

My initial thought right after reading the email was DFS definitely because
of the following reasons:

(1) As you say order is not important unless a sort control is used and
that's a separate issue
(2) DFS is less memory intensive than BFS since it allows the GC to free up
allocations when stack frames are popped.
(3) Most DIT's today are rather flat so the DFS cursors will not grow that
large and the BFS approach can blow out memory when you have millions of
sibling entries under a container.

Then I got the idea of making this configurable so we can play with
parameters but that might not be worth the pain. Thoughts?

-- 
Best Regards,
-- Alex