You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@directory.apache.org by Stefan Seelmann <se...@apache.org> on 2011/08/15 17:59:51 UTC

Re: HBase partition integration in trunks ?

On Sun, Jul 24, 2011 at 10:43 AM, Stefan Seelmann <se...@apache.org> wrote:
> On Sun, Jul 24, 2011 at 12:43 AM, Emmanuel Lecharny <el...@gmail.com> wrote:
>> I was just wondering if we should not add the HBase partition in trunks, now
>> that the partition code has been cleaned up ?
>>
>> What's is the status for this code ?
>
> The last time I updated the code was back in November 2010. In the
> meantime the ApacheDS and HBase code changed so there is some work to
> do. HBase is now at version 0.90.3 (0.90.4 is in the pipeline), the
> API is stable and available in maven central which is a good thing. I
> can try to update the code next week, then I can give a better status
> report.

As always it took a bit longer :-)

I updated the HBase partition code in my sandbox. Now HBase 0.90.3 API
is used. Also changes in shared and partition API are updated. Basic
tests are also running. So far so good.

Now I have to update the parts that are a bit special, let me explain:
In HBase partition I didn't use one-level and sub-level indices, but
use the RDN index table instead. I also extended the search engine in
that way that one-level and sub-level cursors get the search filter in
order to perform filtering within the store instead of returning all
candidates and evaluate them.

So instead of updating those parts I'd prefer to make such changes in
the xdbm-partition module.

The first step would be to remove the one-level and sub-level indices
and use the RDN index instead. I'll write a separate mail with a
proposal.

Kind Regards,
Stefan

Re: HBase partition integration in trunks ?

Posted by Stefan Seelmann <se...@apache.org>.
On Tue, Aug 16, 2011 at 9:53 AM, Emmanuel Lécharny <el...@apache.org> wrote:
> On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>>
>> Now I have to update the parts that are a bit special, let me explain:
>> In HBase partition I didn't use one-level and sub-level indices, but
>> use the RDN index table instead. I also extended the search engine in
>> that way that one-level and sub-level cursors get the search filter in
>> order to perform filtering within the store instead of returning all
>> candidates and evaluate them.
>
> Some toughts about this one-level/sub-level index.
>
> Using the Rdn index makes perfect sense : we have the Rdn -> parent relation
> plus the parent -> children relation in this index, so there is no need to
> have a one level index (all the children are already listed in the RDN index
> for a specific entry). I'm a bit more concerned about the sub-level
> processing : we have to recurse on all the children to get all the
> candidates. That's fine, we can easily implement that (and you already did),
> but what concerns me is that we don't have the count of all the entries, we
> will have to compute them. This count is necessary in the search engine to
> select the index we will use to walk the entries.
>
> One solution would be to store two more elements in the ParentIdAndRdn data
> structure : the number of children directly below the RDN, and the number of
> children and descendant. That would probably solve the issue I'm mentioning.

Yes, that is exactly what I did for the HBase partition. I also did
some changes in the xdbm-partition code. There are two counters that
track the one-level and sub-level children count. I'll create a branch
and commit what I have tonight.

> Of course, that also means we wil have to update all the RDN hierarchy from
> top to bottom (but affecting only the RDN part of the entry DN) each time we
> add/move/delete an entry. Note that we already do that for the oneLevel and
> Sublevel index.

Yes. Only the ParentIdAndRdn objects in the RDN index needs to be
updated, beginning from the modified entry up till the root. If we
expect a flat tree then only few updates are necessary. I also think
that those "branches" are hot-spots, I mean are often accessed, and
should be cached.

A special case is the rename/move operation. In that case we can't
just drop and add the RDN index entry because then we would loose the
counters. Instead the counters must be copied to the new RDN index
entry.

> All in all, I do think this is feasable, and you probably already have
> implemented such logic in the HBase partition.
>
> Can you tell me if what I wrote above makes sense for HBase but also for the
> whole system ?

Yes, I think it makes sense for all kind of partitons.

> If we could get rid of the one-level/sub-level index, we would speed-up the
> add/move/delete operation greatly (as we will spare two index update),
> saving probably 25% of the time needed to update the backend (we will just
> have 5 index to update instead of 7). It might also speed up the search
> marginally, as we won't have to do  look-up in the one-level or sub-level
> index to build the scope filter.

Kind Regards,
Stefan

Re: HBase partition integration in trunks ?

Posted by Alex Karasulu <ak...@apache.org>.
On Tue, Aug 16, 2011 at 4:38 PM, Emmanuel Lécharny <el...@apache.org>wrote:

> On 8/16/11 3:27 PM, Alex Karasulu wrote:
>
>> On Tue, Aug 16, 2011 at 10:53 AM, Emmanuel Lécharny<el...@apache.org>
>> **wrote:
>>
>>  On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>>>
>>>  Now I have to update the parts that are a bit special, let me explain:
>>>> In HBase partition I didn't use one-level and sub-level indices, but
>>>> use the RDN index table instead. I also extended the search engine in
>>>> that way that one-level and sub-level cursors get the search filter in
>>>> order to perform filtering within the store instead of returning all
>>>> candidates and evaluate them.
>>>>
>>>>  Some toughts about this one-level/sub-level index.
>>>
>>> Using the Rdn index makes perfect sense : we have the Rdn ->  parent
>>> relation plus the parent ->  children relation in this index, so there is
>>> no
>>> need to have a one level index (all the children are already listed in
>>> the
>>> RDN index for a specific entry). I'm a bit more concerned about the
>>> sub-level processing : we have to recurse on all the children to get all
>>> the
>>> candidates. That's fine, we can easily implement that (and you already
>>> did),
>>> but what concerns me is that we don't have the count of all the entries,
>>> we
>>> will have to compute them. This count is necessary in the search engine
>>> to
>>> select the index we will use to walk the entries.
>>>
>>> One solution would be to store two more elements in the ParentIdAndRdn
>>> data
>>> structure : the number of children directly below the RDN, and the number
>>> of
>>> children and descendant. That would probably solve the issue I'm
>>> mentioning.
>>> Of course, that also means we wil have to update all the RDN hierarchy
>>> from
>>> top to bottom (but affecting only the RDN part of the entry DN) each time
>>> we
>>> add/move/delete an entry. Note that we already do that for the oneLevel
>>> and
>>> Sublevel index.
>>>
>>>
>>>  Good idea Emmanuel.
>>
>
> Note that I just rephrased Stefan's idea here. It's not mine initially.
>
>
>> This would be a neat solution to handling the sub level count problem.
>> Let's
>> experiment with this and see if it does intact lead to a speedup which I
>> think it should but it's good to just see. I wish we had a nice lab for
>> this.
>>
> HBase work done by Stefa is already an excellent lab :)
>
>
Yes it is a very good exercise which shows the interfaces and design are
holding up pretty well if these relatively minor issues are all we have to
worry about.

But what I meant was a lab where machines are ready to run tests on our
nightly builds not just the experience of writing this partition :-). It
would be neat to see the progression with theses changes over time.

-- 
Best Regards,
-- Alex

Re: HBase partition integration in trunks ?

Posted by Emmanuel Lécharny <el...@apache.org>.
On 8/16/11 3:27 PM, Alex Karasulu wrote:
> On Tue, Aug 16, 2011 at 10:53 AM, Emmanuel Lécharny<el...@apache.org>wrote:
>
>> On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>>
>>> Now I have to update the parts that are a bit special, let me explain:
>>> In HBase partition I didn't use one-level and sub-level indices, but
>>> use the RDN index table instead. I also extended the search engine in
>>> that way that one-level and sub-level cursors get the search filter in
>>> order to perform filtering within the store instead of returning all
>>> candidates and evaluate them.
>>>
>> Some toughts about this one-level/sub-level index.
>>
>> Using the Rdn index makes perfect sense : we have the Rdn ->  parent
>> relation plus the parent ->  children relation in this index, so there is no
>> need to have a one level index (all the children are already listed in the
>> RDN index for a specific entry). I'm a bit more concerned about the
>> sub-level processing : we have to recurse on all the children to get all the
>> candidates. That's fine, we can easily implement that (and you already did),
>> but what concerns me is that we don't have the count of all the entries, we
>> will have to compute them. This count is necessary in the search engine to
>> select the index we will use to walk the entries.
>>
>> One solution would be to store two more elements in the ParentIdAndRdn data
>> structure : the number of children directly below the RDN, and the number of
>> children and descendant. That would probably solve the issue I'm mentioning.
>> Of course, that also means we wil have to update all the RDN hierarchy from
>> top to bottom (but affecting only the RDN part of the entry DN) each time we
>> add/move/delete an entry. Note that we already do that for the oneLevel and
>> Sublevel index.
>>
>>
> Good idea Emmanuel.

Note that I just rephrased Stefan's idea here. It's not mine initially.
>
> This would be a neat solution to handling the sub level count problem. Let's
> experiment with this and see if it does intact lead to a speedup which I
> think it should but it's good to just see. I wish we had a nice lab for
> this.
HBase work done by Stefa is already an excellent lab :)
>
> Also we need to update the docs a bit to show all the changes that took
> place. We still show the old upon and ndn indices of old. Something at the
> heart of the server like this should always be in sync with the
> documentation.
yes...


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com


Re: HBase partition integration in trunks ?

Posted by Alex Karasulu <ak...@apache.org>.
On Tue, Aug 16, 2011 at 10:53 AM, Emmanuel Lécharny <el...@apache.org>wrote:

> On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>
>> Now I have to update the parts that are a bit special, let me explain:
>> In HBase partition I didn't use one-level and sub-level indices, but
>> use the RDN index table instead. I also extended the search engine in
>> that way that one-level and sub-level cursors get the search filter in
>> order to perform filtering within the store instead of returning all
>> candidates and evaluate them.
>>
> Some toughts about this one-level/sub-level index.
>
> Using the Rdn index makes perfect sense : we have the Rdn -> parent
> relation plus the parent -> children relation in this index, so there is no
> need to have a one level index (all the children are already listed in the
> RDN index for a specific entry). I'm a bit more concerned about the
> sub-level processing : we have to recurse on all the children to get all the
> candidates. That's fine, we can easily implement that (and you already did),
> but what concerns me is that we don't have the count of all the entries, we
> will have to compute them. This count is necessary in the search engine to
> select the index we will use to walk the entries.
>
> One solution would be to store two more elements in the ParentIdAndRdn data
> structure : the number of children directly below the RDN, and the number of
> children and descendant. That would probably solve the issue I'm mentioning.
> Of course, that also means we wil have to update all the RDN hierarchy from
> top to bottom (but affecting only the RDN part of the entry DN) each time we
> add/move/delete an entry. Note that we already do that for the oneLevel and
> Sublevel index.
>
>
Good idea Emmanuel.

This would be a neat solution to handling the sub level count problem. Let's
experiment with this and see if it does intact lead to a speedup which I
think it should but it's good to just see. I wish we had a nice lab for
this.

Also we need to update the docs a bit to show all the changes that took
place. We still show the old upon and ndn indices of old. Something at the
heart of the server like this should always be in sync with the
documentation.

Best Regards,
-- Alex

Re: HBase partition integration in trunks ?

Posted by Stefan Seelmann <se...@apache.org>.
On Tue, Aug 16, 2011 at 3:14 PM, Kiran Ayyagari <ka...@apache.org> wrote:

>> If you think about what is done when we add an entry in the current code
>> base :
>> - add the entry in the MasterTable
>> - add the RDN/parent into the RdnIndex
>> - update the one-level index with the newly added entry reference,
>> increasing the number of children for the parent
>> - for each RDN in the parent, update the sub-level index with the newly
>> added entry reference, increasing the number of children for the parent
>>
>> If we compare what we would do if we remove the one-level/sub-level index :
>> - add the entry in the MasterTable
>> - add the RDN/parent into the RdnIndex
>> - for each RDN in the parent, update the rdn index with the newly added
>> entry reference, increasing the number of children for the parent
>>
> this count needs to be updated in more than one ParentAndRdn entry(for
> sublevel indexing feature only) depending on the level
> at which the child entry was added(cause each entry in the ancestor
> chain is a valid base to perform sublevel search)

Correct.

> and also when it comes to fetch all the sublevel entries we need to
> scan recursively.

Right. I used depth-first tree traversal. So in that case the maximum
number of open cursors is the max. depth of the DIT.

Kind Regards,
Stefan

Re: HBase partition integration in trunks ?

Posted by Alex Karasulu <ak...@apache.org>.
On Tue, Aug 16, 2011 at 4:33 PM, Emmanuel Lécharny <el...@apache.org>wrote:

> On 8/16/11 3:14 PM, Kiran Ayyagari wrote:
>
>> On Tue, Aug 16, 2011 at 6:27 PM, Emmanuel Lecharny<el...@gmail.com>
>>  wrote:
>>
>>> On 8/16/11 2:22 PM, Kiran Ayyagari wrote:
>>>
>>>> On Tue, Aug 16, 2011 at 1:23 PM, Emmanuel Lécharny<elecharny@apache.org
>>>> >
>>>>  wrote:
>>>>
>>>>> On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>>>>>
>>>>>> Now I have to update the parts that are a bit special, let me explain:
>>>>>> In HBase partition I didn't use one-level and sub-level indices, but
>>>>>> use the RDN index table instead. I also extended the search engine in
>>>>>> that way that one-level and sub-level cursors get the search filter in
>>>>>> order to perform filtering within the store instead of returning all
>>>>>> candidates and evaluate them.
>>>>>>
>>>>> Some toughts about this one-level/sub-level index.
>>>>>
>>>>> Using the Rdn index makes perfect sense : we have the Rdn ->    parent
>>>>> relation
>>>>> plus the parent ->    children relation in this index, so there is no
>>>>> need
>>>>> to
>>>>> have a one level index (all the children are already listed in the RDN
>>>>> index
>>>>> for a specific entry). I'm a bit more concerned about the sub-level
>>>>> processing : we have to recurse on all the children to get all the
>>>>> candidates. That's fine, we can easily implement that (and you already
>>>>> did),
>>>>> but what concerns me is that we don't have the count of all the
>>>>> entries,
>>>>> we
>>>>> will have to compute them. This count is necessary in the search engine
>>>>> to
>>>>> select the index we will use to walk the entries.
>>>>>
>>>>> One solution would be to store two more elements in the ParentIdAndRdn
>>>>> data
>>>>> structure : the number of children directly below the RDN, and the
>>>>> number
>>>>> of
>>>>> children and descendant. That would probably solve the issue I'm
>>>>> mentioning.
>>>>> Of course, that also means we wil have to update all the RDN hierarchy
>>>>> from
>>>>> top to bottom (but affecting only the RDN part of the entry DN) each
>>>>> time
>>>>> we
>>>>> add/move/delete an entry. Note that we already do that for the oneLevel
>>>>> and
>>>>> Sublevel index.
>>>>>
>>>>>  Just to make a point:
>>>> I think, in the case of achieving SubLevel index evaluation with RDN
>>>> index it becomes a costly and complex operation
>>>> (recursive scanning and updating) where as with the current sublevel
>>>> index it takes O(1) to fetch all the sublevel children of
>>>> an entry.
>>>> Not sure if HBase has any features to solve in an efficient manner
>>>>
>>> If you think about what is done when we add an entry in the current code
>>> base :
>>> - add the entry in the MasterTable
>>> - add the RDN/parent into the RdnIndex
>>> - update the one-level index with the newly added entry reference,
>>> increasing the number of children for the parent
>>> - for each RDN in the parent, update the sub-level index with the newly
>>> added entry reference, increasing the number of children for the parent
>>>
>>> If we compare what we would do if we remove the one-level/sub-level index
>>> :
>>> - add the entry in the MasterTable
>>> - add the RDN/parent into the RdnIndex
>>> - for each RDN in the parent, update the rdn index with the newly added
>>> entry reference, increasing the number of children for the parent
>>>
>>>  this count needs to be updated in more than one ParentAndRdn entry(for
>> sublevel indexing feature only) depending on the level
>> at which the child entry was added(cause each entry in the ancestor
>> chain is a valid base to perform sublevel search)
>> and also when it comes to fetch all the sublevel entries we need to
>> scan recursively.
>>
> Right, and this is why I added a 'for each RDN ' in the last operation.
> Btw, this is exactly the same kind of cost we already have to pay to update
> the sub-level index.


Yep the costs are not very different in terms of processing but the IO
advantage as you mentioned is huge. I think having most of the Rdn index in
cache this will fly big time.

Also just having numbers with some simple tests will show us the real facts.
But this approach sound healthy.

Best Regards,
-- Alex

Re: HBase partition integration in trunks ?

Posted by Emmanuel Lécharny <el...@apache.org>.
On 8/16/11 3:14 PM, Kiran Ayyagari wrote:
> On Tue, Aug 16, 2011 at 6:27 PM, Emmanuel Lecharny<el...@gmail.com>  wrote:
>> On 8/16/11 2:22 PM, Kiran Ayyagari wrote:
>>> On Tue, Aug 16, 2011 at 1:23 PM, Emmanuel Lécharny<el...@apache.org>
>>>   wrote:
>>>> On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>>>>> Now I have to update the parts that are a bit special, let me explain:
>>>>> In HBase partition I didn't use one-level and sub-level indices, but
>>>>> use the RDN index table instead. I also extended the search engine in
>>>>> that way that one-level and sub-level cursors get the search filter in
>>>>> order to perform filtering within the store instead of returning all
>>>>> candidates and evaluate them.
>>>> Some toughts about this one-level/sub-level index.
>>>>
>>>> Using the Rdn index makes perfect sense : we have the Rdn ->    parent
>>>> relation
>>>> plus the parent ->    children relation in this index, so there is no need
>>>> to
>>>> have a one level index (all the children are already listed in the RDN
>>>> index
>>>> for a specific entry). I'm a bit more concerned about the sub-level
>>>> processing : we have to recurse on all the children to get all the
>>>> candidates. That's fine, we can easily implement that (and you already
>>>> did),
>>>> but what concerns me is that we don't have the count of all the entries,
>>>> we
>>>> will have to compute them. This count is necessary in the search engine
>>>> to
>>>> select the index we will use to walk the entries.
>>>>
>>>> One solution would be to store two more elements in the ParentIdAndRdn
>>>> data
>>>> structure : the number of children directly below the RDN, and the number
>>>> of
>>>> children and descendant. That would probably solve the issue I'm
>>>> mentioning.
>>>> Of course, that also means we wil have to update all the RDN hierarchy
>>>> from
>>>> top to bottom (but affecting only the RDN part of the entry DN) each time
>>>> we
>>>> add/move/delete an entry. Note that we already do that for the oneLevel
>>>> and
>>>> Sublevel index.
>>>>
>>> Just to make a point:
>>> I think, in the case of achieving SubLevel index evaluation with RDN
>>> index it becomes a costly and complex operation
>>> (recursive scanning and updating) where as with the current sublevel
>>> index it takes O(1) to fetch all the sublevel children of
>>> an entry.
>>> Not sure if HBase has any features to solve in an efficient manner
>> If you think about what is done when we add an entry in the current code
>> base :
>> - add the entry in the MasterTable
>> - add the RDN/parent into the RdnIndex
>> - update the one-level index with the newly added entry reference,
>> increasing the number of children for the parent
>> - for each RDN in the parent, update the sub-level index with the newly
>> added entry reference, increasing the number of children for the parent
>>
>> If we compare what we would do if we remove the one-level/sub-level index :
>> - add the entry in the MasterTable
>> - add the RDN/parent into the RdnIndex
>> - for each RDN in the parent, update the rdn index with the newly added
>> entry reference, increasing the number of children for the parent
>>
> this count needs to be updated in more than one ParentAndRdn entry(for
> sublevel indexing feature only) depending on the level
> at which the child entry was added(cause each entry in the ancestor
> chain is a valid base to perform sublevel search)
> and also when it comes to fetch all the sublevel entries we need to
> scan recursively.
Right, and this is why I added a 'for each RDN ' in the last operation. 
Btw, this is exactly the same kind of cost we already have to pay to 
update the sub-level index.


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com


Re: HBase partition integration in trunks ?

Posted by Kiran Ayyagari <ka...@apache.org>.
On Tue, Aug 16, 2011 at 6:27 PM, Emmanuel Lecharny <el...@gmail.com> wrote:
> On 8/16/11 2:22 PM, Kiran Ayyagari wrote:
>>
>> On Tue, Aug 16, 2011 at 1:23 PM, Emmanuel Lécharny<el...@apache.org>
>>  wrote:
>>>
>>> On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>>>>
>>>> Now I have to update the parts that are a bit special, let me explain:
>>>> In HBase partition I didn't use one-level and sub-level indices, but
>>>> use the RDN index table instead. I also extended the search engine in
>>>> that way that one-level and sub-level cursors get the search filter in
>>>> order to perform filtering within the store instead of returning all
>>>> candidates and evaluate them.
>>>
>>> Some toughts about this one-level/sub-level index.
>>>
>>> Using the Rdn index makes perfect sense : we have the Rdn ->  parent
>>> relation
>>> plus the parent ->  children relation in this index, so there is no need
>>> to
>>> have a one level index (all the children are already listed in the RDN
>>> index
>>> for a specific entry). I'm a bit more concerned about the sub-level
>>> processing : we have to recurse on all the children to get all the
>>> candidates. That's fine, we can easily implement that (and you already
>>> did),
>>> but what concerns me is that we don't have the count of all the entries,
>>> we
>>> will have to compute them. This count is necessary in the search engine
>>> to
>>> select the index we will use to walk the entries.
>>>
>>> One solution would be to store two more elements in the ParentIdAndRdn
>>> data
>>> structure : the number of children directly below the RDN, and the number
>>> of
>>> children and descendant. That would probably solve the issue I'm
>>> mentioning.
>>> Of course, that also means we wil have to update all the RDN hierarchy
>>> from
>>> top to bottom (but affecting only the RDN part of the entry DN) each time
>>> we
>>> add/move/delete an entry. Note that we already do that for the oneLevel
>>> and
>>> Sublevel index.
>>>
>> Just to make a point:
>> I think, in the case of achieving SubLevel index evaluation with RDN
>> index it becomes a costly and complex operation
>> (recursive scanning and updating) where as with the current sublevel
>> index it takes O(1) to fetch all the sublevel children of
>> an entry.
>> Not sure if HBase has any features to solve in an efficient manner
>
> If you think about what is done when we add an entry in the current code
> base :
> - add the entry in the MasterTable
> - add the RDN/parent into the RdnIndex
> - update the one-level index with the newly added entry reference,
> increasing the number of children for the parent
> - for each RDN in the parent, update the sub-level index with the newly
> added entry reference, increasing the number of children for the parent
>
> If we compare what we would do if we remove the one-level/sub-level index :
> - add the entry in the MasterTable
> - add the RDN/parent into the RdnIndex
> - for each RDN in the parent, update the rdn index with the newly added
> entry reference, increasing the number of children for the parent
>
this count needs to be updated in more than one ParentAndRdn entry(for
sublevel indexing feature only) depending on the level
at which the child entry was added(cause each entry in the ancestor
chain is a valid base to perform sublevel search)
and also when it comes to fetch all the sublevel entries we need to
scan recursively.
> This is one less operation, one less index updated.
>
> Also we don't have to recurse to get the number of children for a specific
> entry when processing a SUB search, as we have this number in the
> RdnAndParentAndCount data structure. (this will be an extended structure for
> what is currently stored in the Rdn index).
>
> At least, this is what I understand we must do if we go this way...
>
> Did I miss something ?
>
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
>
>



-- 
Kiran Ayyagari

Re: HBase partition integration in trunks ?

Posted by Emmanuel Lecharny <el...@gmail.com>.
On 8/16/11 2:22 PM, Kiran Ayyagari wrote:
> On Tue, Aug 16, 2011 at 1:23 PM, Emmanuel Lécharny<el...@apache.org>  wrote:
>> On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>>> Now I have to update the parts that are a bit special, let me explain:
>>> In HBase partition I didn't use one-level and sub-level indices, but
>>> use the RDN index table instead. I also extended the search engine in
>>> that way that one-level and sub-level cursors get the search filter in
>>> order to perform filtering within the store instead of returning all
>>> candidates and evaluate them.
>> Some toughts about this one-level/sub-level index.
>>
>> Using the Rdn index makes perfect sense : we have the Rdn ->  parent relation
>> plus the parent ->  children relation in this index, so there is no need to
>> have a one level index (all the children are already listed in the RDN index
>> for a specific entry). I'm a bit more concerned about the sub-level
>> processing : we have to recurse on all the children to get all the
>> candidates. That's fine, we can easily implement that (and you already did),
>> but what concerns me is that we don't have the count of all the entries, we
>> will have to compute them. This count is necessary in the search engine to
>> select the index we will use to walk the entries.
>>
>> One solution would be to store two more elements in the ParentIdAndRdn data
>> structure : the number of children directly below the RDN, and the number of
>> children and descendant. That would probably solve the issue I'm mentioning.
>> Of course, that also means we wil have to update all the RDN hierarchy from
>> top to bottom (but affecting only the RDN part of the entry DN) each time we
>> add/move/delete an entry. Note that we already do that for the oneLevel and
>> Sublevel index.
>>
> Just to make a point:
> I think, in the case of achieving SubLevel index evaluation with RDN
> index it becomes a costly and complex operation
> (recursive scanning and updating) where as with the current sublevel
> index it takes O(1) to fetch all the sublevel children of
> an entry.
> Not sure if HBase has any features to solve in an efficient manner

If you think about what is done when we add an entry in the current code 
base :
- add the entry in the MasterTable
- add the RDN/parent into the RdnIndex
- update the one-level index with the newly added entry reference, 
increasing the number of children for the parent
- for each RDN in the parent, update the sub-level index with the newly 
added entry reference, increasing the number of children for the parent

If we compare what we would do if we remove the one-level/sub-level index :
- add the entry in the MasterTable
- add the RDN/parent into the RdnIndex
- for each RDN in the parent, update the rdn index with the newly added 
entry reference, increasing the number of children for the parent

This is one less operation, one less index updated.

Also we don't have to recurse to get the number of children for a 
specific entry when processing a SUB search, as we have this number in 
the RdnAndParentAndCount data structure. (this will be an extended 
structure for what is currently stored in the Rdn index).

At least, this is what I understand we must do if we go this way...

Did I miss something ?


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com


Re: HBase partition integration in trunks ?

Posted by Kiran Ayyagari <ka...@apache.org>.
On Tue, Aug 16, 2011 at 7:06 PM, Emmanuel Lecharny <el...@gmail.com> wrote:
> On 8/16/11 3:22 PM, Kiran Ayyagari wrote:
>>
>> On Tue, Aug 16, 2011 at 6:45 PM, Stefan Seelmann<se...@apache.org>
>>  wrote:
>>>
>>> On Tue, Aug 16, 2011 at 2:22 PM, Kiran Ayyagari<ka...@apache.org>
>>>  wrote:
>>>>>
>>>>> One solution would be to store two more elements in the ParentIdAndRdn
>>>>> data
>>>>> structure : the number of children directly below the RDN, and the
>>>>> number of
>>>>> children and descendant. That would probably solve the issue I'm
>>>>> mentioning.
>>>>> Of course, that also means we wil have to update all the RDN hierarchy
>>>>> from
>>>>> top to bottom (but affecting only the RDN part of the entry DN) each
>>>>> time we
>>>>> add/move/delete an entry. Note that we already do that for the oneLevel
>>>>> and
>>>>> Sublevel index.
>>>>>
>>>> Just to make a point:
>>>> I think, in the case of achieving SubLevel index evaluation with RDN
>>>> index it becomes a costly and complex operation
>>>> (recursive scanning and updating) where as with the current sublevel
>>>> index it takes O(1) to fetch all the sublevel children of
>>>> an entry.
>>>
>>> Hm, evalutation can easly be done by using the reverse RDN index table.
>>>
>> for one level it is straight forward, but for sublevel we still need
>> to use recursion, no?
>
> No, if we update all the parents when we add/move/delete an entry. That
> means we have to update more than one RdnAndparent element (in fact, as many
> as we have RDns but the namingContext in the DN). Nothing more though that
> what we already do in the sub-level context.
>
> Then we have a O(1) operation to get the number of children in the case of a
> SUB search.
>
if we are only talking about 'number of children' then this holds
good, but not when it comes to getting the
sublevel child entry ids, in case of existing sublevel index we get
*all* the child ids in one lookup
> Now, Stefan is right, fetching entries means we use more than one cursor for
> that.
>
yeap, the above step is where this many cursors will be opened to
collect the entry ids
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
>
>



-- 
Kiran Ayyagari

Re: HBase partition integration in trunks ?

Posted by Emmanuel Lecharny <el...@gmail.com>.
On 8/16/11 3:22 PM, Kiran Ayyagari wrote:
> On Tue, Aug 16, 2011 at 6:45 PM, Stefan Seelmann<se...@apache.org>  wrote:
>> On Tue, Aug 16, 2011 at 2:22 PM, Kiran Ayyagari<ka...@apache.org>  wrote:
>>>> One solution would be to store two more elements in the ParentIdAndRdn data
>>>> structure : the number of children directly below the RDN, and the number of
>>>> children and descendant. That would probably solve the issue I'm mentioning.
>>>> Of course, that also means we wil have to update all the RDN hierarchy from
>>>> top to bottom (but affecting only the RDN part of the entry DN) each time we
>>>> add/move/delete an entry. Note that we already do that for the oneLevel and
>>>> Sublevel index.
>>>>
>>> Just to make a point:
>>> I think, in the case of achieving SubLevel index evaluation with RDN
>>> index it becomes a costly and complex operation
>>> (recursive scanning and updating) where as with the current sublevel
>>> index it takes O(1) to fetch all the sublevel children of
>>> an entry.
>> Hm, evalutation can easly be done by using the reverse RDN index table.
>>
> for one level it is straight forward, but for sublevel we still need
> to use recursion, no?

No, if we update all the parents when we add/move/delete an entry. That 
means we have to update more than one RdnAndparent element (in fact, as 
many as we have RDns but the namingContext in the DN). Nothing more 
though that what we already do in the sub-level context.

Then we have a O(1) operation to get the number of children in the case 
of a SUB search.

Now, Stefan is right, fetching entries means we use more than one cursor 
for that.


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com


Re: HBase partition integration in trunks ?

Posted by Kiran Ayyagari <ka...@apache.org>.
On Tue, Aug 16, 2011 at 6:45 PM, Stefan Seelmann <se...@apache.org> wrote:
> On Tue, Aug 16, 2011 at 2:22 PM, Kiran Ayyagari <ka...@apache.org> wrote:
>>> One solution would be to store two more elements in the ParentIdAndRdn data
>>> structure : the number of children directly below the RDN, and the number of
>>> children and descendant. That would probably solve the issue I'm mentioning.
>>> Of course, that also means we wil have to update all the RDN hierarchy from
>>> top to bottom (but affecting only the RDN part of the entry DN) each time we
>>> add/move/delete an entry. Note that we already do that for the oneLevel and
>>> Sublevel index.
>>>
>> Just to make a point:
>> I think, in the case of achieving SubLevel index evaluation with RDN
>> index it becomes a costly and complex operation
>> (recursive scanning and updating) where as with the current sublevel
>> index it takes O(1) to fetch all the sublevel children of
>> an entry.
>
> Hm, evalutation can easly be done by using the reverse RDN index table.
>
for one level it is straight forward, but for sublevel we still need
to use recursion, no?


-- 
Kiran Ayyagari

Re: HBase partition integration in trunks ?

Posted by Stefan Seelmann <se...@apache.org>.
On Tue, Aug 16, 2011 at 2:22 PM, Kiran Ayyagari <ka...@apache.org> wrote:
>> One solution would be to store two more elements in the ParentIdAndRdn data
>> structure : the number of children directly below the RDN, and the number of
>> children and descendant. That would probably solve the issue I'm mentioning.
>> Of course, that also means we wil have to update all the RDN hierarchy from
>> top to bottom (but affecting only the RDN part of the entry DN) each time we
>> add/move/delete an entry. Note that we already do that for the oneLevel and
>> Sublevel index.
>>
> Just to make a point:
> I think, in the case of achieving SubLevel index evaluation with RDN
> index it becomes a costly and complex operation
> (recursive scanning and updating) where as with the current sublevel
> index it takes O(1) to fetch all the sublevel children of
> an entry.

Hm, evalutation can easly be done by using the reverse RDN index table.

Re: HBase partition integration in trunks ?

Posted by Kiran Ayyagari <ka...@apache.org>.
On Tue, Aug 16, 2011 at 1:23 PM, Emmanuel Lécharny <el...@apache.org> wrote:
> On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>>
>> Now I have to update the parts that are a bit special, let me explain:
>> In HBase partition I didn't use one-level and sub-level indices, but
>> use the RDN index table instead. I also extended the search engine in
>> that way that one-level and sub-level cursors get the search filter in
>> order to perform filtering within the store instead of returning all
>> candidates and evaluate them.
>
> Some toughts about this one-level/sub-level index.
>
> Using the Rdn index makes perfect sense : we have the Rdn -> parent relation
> plus the parent -> children relation in this index, so there is no need to
> have a one level index (all the children are already listed in the RDN index
> for a specific entry). I'm a bit more concerned about the sub-level
> processing : we have to recurse on all the children to get all the
> candidates. That's fine, we can easily implement that (and you already did),
> but what concerns me is that we don't have the count of all the entries, we
> will have to compute them. This count is necessary in the search engine to
> select the index we will use to walk the entries.
>
> One solution would be to store two more elements in the ParentIdAndRdn data
> structure : the number of children directly below the RDN, and the number of
> children and descendant. That would probably solve the issue I'm mentioning.
> Of course, that also means we wil have to update all the RDN hierarchy from
> top to bottom (but affecting only the RDN part of the entry DN) each time we
> add/move/delete an entry. Note that we already do that for the oneLevel and
> Sublevel index.
>
Just to make a point:
I think, in the case of achieving SubLevel index evaluation with RDN
index it becomes a costly and complex operation
(recursive scanning and updating) where as with the current sublevel
index it takes O(1) to fetch all the sublevel children of
an entry.
Not sure if HBase has any features to solve in an efficient manner
> All in all, I do think this is feasable, and you probably already have
> implemented such logic in the HBase partition.
>
> Can you tell me if what I wrote above makes sense for HBase but also for the
> whole system ?
>
> If we could get rid of the one-level/sub-level index, we would speed-up the
> add/move/delete operation greatly (as we will spare two index update),
> saving probably 25% of the time needed to update the backend (we will just
> have 5 index to update instead of 7). It might also speed up the search
> marginally, as we won't have to do  look-up in the one-level or sub-level
> index to build the scope filter.
>
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
>
>



-- 
Kiran Ayyagari

Re: HBase partition integration in trunks ?

Posted by Emmanuel Lécharny <el...@apache.org>.
On 8/15/11 5:59 PM, Stefan Seelmann wrote:
> Now I have to update the parts that are a bit special, let me explain:
> In HBase partition I didn't use one-level and sub-level indices, but
> use the RDN index table instead. I also extended the search engine in
> that way that one-level and sub-level cursors get the search filter in
> order to perform filtering within the store instead of returning all
> candidates and evaluate them.
Some toughts about this one-level/sub-level index.

Using the Rdn index makes perfect sense : we have the Rdn -> parent 
relation plus the parent -> children relation in this index, so there is 
no need to have a one level index (all the children are already listed 
in the RDN index for a specific entry). I'm a bit more concerned about 
the sub-level processing : we have to recurse on all the children to get 
all the candidates. That's fine, we can easily implement that (and you 
already did), but what concerns me is that we don't have the count of 
all the entries, we will have to compute them. This count is necessary 
in the search engine to select the index we will use to walk the entries.

One solution would be to store two more elements in the ParentIdAndRdn 
data structure : the number of children directly below the RDN, and the 
number of children and descendant. That would probably solve the issue 
I'm mentioning. Of course, that also means we wil have to update all the 
RDN hierarchy from top to bottom (but affecting only the RDN part of the 
entry DN) each time we add/move/delete an entry. Note that we already do 
that for the oneLevel and Sublevel index.

All in all, I do think this is feasable, and you probably already have 
implemented such logic in the HBase partition.

Can you tell me if what I wrote above makes sense for HBase but also for 
the whole system ?

If we could get rid of the one-level/sub-level index, we would speed-up 
the add/move/delete operation greatly (as we will spare two index 
update), saving probably 25% of the time needed to update the backend 
(we will just have 5 index to update instead of 7). It might also speed 
up the search marginally, as we won't have to do  look-up in the 
one-level or sub-level index to build the scope filter.


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com


Re: HBase partition integration in trunks ?

Posted by Emmanuel Lécharny <el...@apache.org>.
On 8/15/11 5:59 PM, Stefan Seelmann wrote:
> On Sun, Jul 24, 2011 at 10:43 AM, Stefan Seelmann<se...@apache.org>  wrote:
>> On Sun, Jul 24, 2011 at 12:43 AM, Emmanuel Lecharny<el...@gmail.com>  wrote:
>>> I was just wondering if we should not add the HBase partition in trunks, now
>>> that the partition code has been cleaned up ?
>>>
>>> What's is the status for this code ?
>> The last time I updated the code was back in November 2010. In the
>> meantime the ApacheDS and HBase code changed so there is some work to
>> do. HBase is now at version 0.90.3 (0.90.4 is in the pipeline), the
>> API is stable and available in maven central which is a good thing. I
>> can try to update the code next week, then I can give a better status
>> report.
> As always it took a bit longer :-)

Yeah, we all have this feeling all the time :)
>
> I updated the HBase partition code in my sandbox. Now HBase 0.90.3 API
> is used. Also changes in shared and partition API are updated. Basic
> tests are also running. So far so good.
Great !
>
> Now I have to update the parts that are a bit special, let me explain:
> In HBase partition I didn't use one-level and sub-level indices, but
> use the RDN index table instead. I also extended the search engine in
> that way that one-level and sub-level cursors get the search filter in
> order to perform filtering within the store instead of returning all
> candidates and evaluate them.
>
> So instead of updating those parts I'd prefer to make such changes in
> the xdbm-partition module.
>
> The first step would be to remove the one-level and sub-level indices
> and use the RDN index instead. I'll write a separate mail with a
> proposal.
Okie. FYI, I have started to cut a release for Apache DS 2.0-M2. It 
would be a good idea to start from this release once voted, so that we 
can add the HBase partition in M3.


Thanks a lot, Stefan !

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com