You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@directory.apache.org by Emmanuel Lecharny <el...@gmail.com> on 2011/06/23 18:20:21 UTC

Index reverse tables : are they useful ?

Hi guys,

as I was reviewing the Table interface, and the Index hierarchy, I had a 
bit of time to try to understand this part of the server I was not 
comfortable with. And I saw that each index is using two tables : a 
forward and a reverse table.

The forward table is obviously mandatory. It is used to retrieve the 
list of entry's ID from an AT value. Fine. But what about the reverse 
table ?

Alex told me that it's useful when dealing with such filters :
(&(cn=A)(sn=B)).

The optimization engine will get the index that provides the smallest 
set of candidate (say, CN here). We then grab from the CN forward table 
all the entry's ID associated with the 'A' value. Let's call it the 
CN-IDs set.

Now, if we were to grab all the entries from the MasterTable to compare 
their SN attribute with the 'B' value, t would be quite costly. We use 
the SN's reverse table. For each entry's ID, we check if the associated 
value in the SN's reverse table exists or not. If it exists, we have a 
valid candidate, otherwise, we simply discard the candidate.

As we can see, we never fetch the entry unless we have a valid candidate 
(of course, if the SN and CN AT are indexed).

However, I do think this reverse table is spurious. We can get the same 
result by using the SN's forward table : for each entry's ID, check in 
the SVn's forward table if the set of values (SN-IDs) contains the ID we 
pulled from the CN-IDs set.

We can even do an SN-IDs ^ CN-IDs ( '^' <=> intersection) to get the 
list of valid candidates. No need to do N fetches in the reverse table 
(where N is the number of ID in CN-IDs).

I think we can remove those reverse table, and I expect that it can 
speed up the search a bit, but mainly speed up greatly any modification 
operation (add, delete, modify) and even decrease the disk consumption.

May be I'm missing something though. I have created a branch 
(http://svn.apache.org/viewvc/directory/apacheds/branches/apacheds-no-reverse-index/) 
to play with this idea.


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Index reverse tables : are they useful ?

Posted by Alex Karasulu <ak...@apache.org>.

On Fri, Jun 24, 2011 at 11:04 AM, Alex Karasulu <ak...@apache.org> wrote:
> On Fri, Jun 24, 2011 at 11:00 AM, Emmanuel Lécharny
> <el...@apache.org> wrote:
>> On 6/24/11 9:51 AM, Alex Karasulu wrote:
>>>
>>>>> The reverse index has no duplicate keys. The only way to get a
>>>>> duplicate key in the reverse index is if the same entry (i.e. 37)
>>>>> contained the same value ('foo') for the same (sn) attribute. And this
>>>>> we know is not possible. So the lookups against the reverse table will
>>>>> be faster.
>>>>
>>>> I was thinking about something a bit different : as soon as you have
>>>> grabbed
>>>> the list of entry's ID from the first index, looking into the other
>>>> indexes
>>>> will also return a list of Entry's ID. Checking if those IDs are valid
>>>> candidate can then be done in one shot : do the intersection of the two
>>>> sets
>>>> (they are ordered, so it's a O(n) operation) and just get the matching
>>>> entries.
>>>>
>>>> Compared to the current processing (ie, accessing the reverse index for
>>>> *each* candidate), this will be way faster, IMO.
>>>
>>> This is a VERY interesting idea. Maybe we should create a separate
>>> thread for this and drive deeper into it. You got something I think
>>> here.
>>>
>> have a look at
>> https://cwiki.apache.org/confluence/display/DIRxSRVx11/Index+and+IndexEntry,
>> where I added some paragraphs explaining this idea. We can comment on this
>> page.
>
> Nice pictures - what did you use for that? Reading further ...
>
> Also if you're doing this in a branch, hence we're not yet committed
> on the approach, can you please do this on a separate page so you
> don't alter the existing documentation?

I may have spoke too soon. This stuff just seems to elaborate on how
indices works. Maybe we should use some idea sections for the new
concepts we're testing in this area.

Alex

Re: Index reverse tables : are they useful ?

Posted by Alex Karasulu <ak...@apache.org>.

On Fri, Jun 24, 2011 at 11:07 AM, Emmanuel Lécharny
<el...@apache.org> wrote:
> On 6/24/11 10:04 AM, Alex Karasulu wrote:
>>
>> On Fri, Jun 24, 2011 at 11:00 AM, Emmanuel Lécharny
>> <el...@apache.org>  wrote:
>>>
>>> On 6/24/11 9:51 AM, Alex Karasulu wrote:
>>>>>>
>>>>>> The reverse index has no duplicate keys. The only way to get a
>>>>>> duplicate key in the reverse index is if the same entry (i.e. 37)
>>>>>> contained the same value ('foo') for the same (sn) attribute. And this
>>>>>> we know is not possible. So the lookups against the reverse table will
>>>>>> be faster.
>>>>>
>>>>> I was thinking about something a bit different : as soon as you have
>>>>> grabbed
>>>>> the list of entry's ID from the first index, looking into the other
>>>>> indexes
>>>>> will also return a list of Entry's ID. Checking if those IDs are valid
>>>>> candidate can then be done in one shot : do the intersection of the two
>>>>> sets
>>>>> (they are ordered, so it's a O(n) operation) and just get the matching
>>>>> entries.
>>>>>
>>>>> Compared to the current processing (ie, accessing the reverse index for
>>>>> *each* candidate), this will be way faster, IMO.
>>>>
>>>> This is a VERY interesting idea. Maybe we should create a separate
>>>> thread for this and drive deeper into it. You got something I think
>>>> here.
>>>>
>>> have a look at
>>>
>>> https://cwiki.apache.org/confluence/display/DIRxSRVx11/Index+and+IndexEntry,
>>> where I added some paragraphs explaining this idea. We can comment on
>>> this
>>> page.
>>
>> Nice pictures - what did you use for that? Reading further ...
>
> YeD
>>
>> Also if you're doing this in a branch, hence we're not yet committed
>> on the approach, can you please do this on a separate page so you
>> don't alter the existing documentation?
>
> Right now, I just updated the existing doco with some extended explanation
> on the existing code. I can move suggestions to another page, sure.

Nah you did good - I spoke too soon.

Thanks,
Alex

Re: Index reverse tables : are they useful ?

Posted by Emmanuel Lécharny <el...@apache.org>.

On 6/24/11 10:04 AM, Alex Karasulu wrote:
> On Fri, Jun 24, 2011 at 11:00 AM, Emmanuel Lécharny
> <el...@apache.org>  wrote:
>> On 6/24/11 9:51 AM, Alex Karasulu wrote:
>>>>> The reverse index has no duplicate keys. The only way to get a
>>>>> duplicate key in the reverse index is if the same entry (i.e. 37)
>>>>> contained the same value ('foo') for the same (sn) attribute. And this
>>>>> we know is not possible. So the lookups against the reverse table will
>>>>> be faster.
>>>> I was thinking about something a bit different : as soon as you have
>>>> grabbed
>>>> the list of entry's ID from the first index, looking into the other
>>>> indexes
>>>> will also return a list of Entry's ID. Checking if those IDs are valid
>>>> candidate can then be done in one shot : do the intersection of the two
>>>> sets
>>>> (they are ordered, so it's a O(n) operation) and just get the matching
>>>> entries.
>>>>
>>>> Compared to the current processing (ie, accessing the reverse index for
>>>> *each* candidate), this will be way faster, IMO.
>>> This is a VERY interesting idea. Maybe we should create a separate
>>> thread for this and drive deeper into it. You got something I think
>>> here.
>>>
>> have a look at
>> https://cwiki.apache.org/confluence/display/DIRxSRVx11/Index+and+IndexEntry,
>> where I added some paragraphs explaining this idea. We can comment on this
>> page.
> Nice pictures - what did you use for that? Reading further ...
YeD
> Also if you're doing this in a branch, hence we're not yet committed
> on the approach, can you please do this on a separate page so you
> don't alter the existing documentation?
Right now, I just updated the existing doco with some extended 
explanation on the existing code. I can move suggestions to another 
page, sure.

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Index reverse tables : are they useful ?

Posted by Alex Karasulu <ak...@apache.org>.

On Fri, Jun 24, 2011 at 11:00 AM, Emmanuel Lécharny
<el...@apache.org> wrote:
> On 6/24/11 9:51 AM, Alex Karasulu wrote:
>>
>>>> The reverse index has no duplicate keys. The only way to get a
>>>> duplicate key in the reverse index is if the same entry (i.e. 37)
>>>> contained the same value ('foo') for the same (sn) attribute. And this
>>>> we know is not possible. So the lookups against the reverse table will
>>>> be faster.
>>>
>>> I was thinking about something a bit different : as soon as you have
>>> grabbed
>>> the list of entry's ID from the first index, looking into the other
>>> indexes
>>> will also return a list of Entry's ID. Checking if those IDs are valid
>>> candidate can then be done in one shot : do the intersection of the two
>>> sets
>>> (they are ordered, so it's a O(n) operation) and just get the matching
>>> entries.
>>>
>>> Compared to the current processing (ie, accessing the reverse index for
>>> *each* candidate), this will be way faster, IMO.
>>
>> This is a VERY interesting idea. Maybe we should create a separate
>> thread for this and drive deeper into it. You got something I think
>> here.
>>
> have a look at
> https://cwiki.apache.org/confluence/display/DIRxSRVx11/Index+and+IndexEntry,
> where I added some paragraphs explaining this idea. We can comment on this
> page.

Nice pictures - what did you use for that? Reading further ...

Also if you're doing this in a branch, hence we're not yet committed
on the approach, can you please do this on a separate page so you
don't alter the existing documentation?

Thanks,
Alex

Re: Index reverse tables : are they useful ?

Posted by Emmanuel Lécharny <el...@apache.org>.

On 6/24/11 9:51 AM, Alex Karasulu wrote:
>
>>> The reverse index has no duplicate keys. The only way to get a
>>> duplicate key in the reverse index is if the same entry (i.e. 37)
>>> contained the same value ('foo') for the same (sn) attribute. And this
>>> we know is not possible. So the lookups against the reverse table will
>>> be faster.
>> I was thinking about something a bit different : as soon as you have grabbed
>> the list of entry's ID from the first index, looking into the other indexes
>> will also return a list of Entry's ID. Checking if those IDs are valid
>> candidate can then be done in one shot : do the intersection of the two sets
>> (they are ordered, so it's a O(n) operation) and just get the matching
>> entries.
>>
>> Compared to the current processing (ie, accessing the reverse index for
>> *each* candidate), this will be way faster, IMO.
> This is a VERY interesting idea. Maybe we should create a separate
> thread for this and drive deeper into it. You got something I think
> here.
>
have a look at 
https://cwiki.apache.org/confluence/display/DIRxSRVx11/Index+and+IndexEntry, 
where I added some paragraphs explaining this idea. We can comment on 
this page.

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Index reverse tables : are they useful ?

Posted by Alex Karasulu <ak...@apache.org>.

On Fri, Jun 24, 2011 at 10:58 AM, Emmanuel Lécharny
<el...@apache.org> wrote:
> On 6/24/11 9:51 AM, Alex Karasulu wrote:
>>
>>> Note that it's true for the AND connector, but even for the OR connector,
>>> we
>>> don't have this kind of issue.
>>
>> Sorry don't understand connector? You probably mean the cursors?
>
> Sorry, from the mathematic semantic stand point, OR and AND are connectors
> (AFAIU). And /Or are "connecting" two terms together to form a new result.
>
> I should probably have said operators.

OK got it.

Re: Index reverse tables : are they useful ?

Posted by Emmanuel Lécharny <el...@apache.org>.

On 6/24/11 9:51 AM, Alex Karasulu wrote:
>
>> Note that it's true for the AND connector, but even for the OR connector, we
>> don't have this kind of issue.
> Sorry don't understand connector? You probably mean the cursors?

Sorry, from the mathematic semantic stand point, OR and AND are 
connectors (AFAIU). And /Or are "connecting" two terms together to form 
a new result.

I should probably have said operators.

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Index reverse tables : are they useful ?

Posted by Alex Karasulu <ak...@apache.org>.

On Fri, Jun 24, 2011 at 10:47 AM, Emmanuel Lécharny
<el...@apache.org> wrote:
> On 6/24/11 9:40 AM, Alex Karasulu wrote:
>>
>> On Thu, Jun 23, 2011 at 7:20 PM, Emmanuel Lecharny<el...@gmail.com>
>>  wrote:
>>>
>>> Hi guys,
>>>
>>> as I was reviewing the Table interface, and the Index hierarchy, I had a
>>> bit
>>> of time to try to understand this part of the server I was not
>>> comfortable
>>> with. And I saw that each index is using two tables : a forward and a
>>> reverse table.
>>>
>>> The forward table is obviously mandatory. It is used to retrieve the list
>>> of
>>> entry's ID from an AT value. Fine. But what about the reverse table ?
>>>
>>> Alex told me that it's useful when dealing with such filters :
>>> (&(cn=A)(sn=B)).
>>>
>>> The optimization engine will get the index that provides the smallest set
>>> of
>>> candidate (say, CN here). We then grab from the CN forward table all the
>>> entry's ID associated with the 'A' value. Let's call it the CN-IDs set.
>>>
>>> Now, if we were to grab all the entries from the MasterTable to compare
>>> their SN attribute with the 'B' value, t would be quite costly. We use
>>> the
>>> SN's reverse table. For each entry's ID, we check if the associated value
>>> in
>>> the SN's reverse table exists or not. If it exists, we have a valid
>>> candidate, otherwise, we simply discard the candidate.
>>>
>>> As we can see, we never fetch the entry unless we have a valid candidate
>>> (of
>>> course, if the SN and CN AT are indexed).
>>>
>>> However, I do think this reverse table is spurious. We can get the same
>>> result by using the SN's forward table : for each entry's ID, check in
>>> the
>>> SVn's forward table if the set of values (SN-IDs) contains the ID we
>>> pulled
>>> from the CN-IDs set.
>>>
>>> We can even do an SN-IDs ^ CN-IDs ( '^'<=>  intersection) to get the list
>>> of
>>> valid candidates. No need to do N fetches in the reverse table (where N
>>> is
>>> the number of ID in CN-IDs).
>>>
>>> I think we can remove those reverse table, and I expect that it can speed
>>> up
>>> the search a bit, but mainly speed up greatly any modification operation
>>> (add, delete, modify) and even decrease the disk consumption.
>>
>> I think this is possible. It would be interesting to test these two
>> different systems using different kinds of data. The number of
>> duplicate keys will obviously impact the performance: it will be a
>> trade off.
>>
>> To test an Evaluator (i.e. sn=foo) on a candidate entry say id 37, we
>> need to perform a reverse lookup on the sn index. The btree key is 37
>> and we check if it contains a tuple with a value of 'foo'. If it does
>> then the Evaluator returns true if not it returns false. Now since we
>> know we're looking for tuple ('foo', 37) and we only have the forward
>> table we can lookup the 'foo' entries, and the values from duplicate
>> keys would be sorted. We would just need to check if 37 were in these
>> values.
>>
>> The reverse index has no duplicate keys. The only way to get a
>> duplicate key in the reverse index is if the same entry (i.e. 37)
>> contained the same value ('foo') for the same (sn) attribute. And this
>> we know is not possible. So the lookups against the reverse table will
>> be faster.
>
> I was thinking about something a bit different : as soon as you have grabbed
> the list of entry's ID from the first index, looking into the other indexes
> will also return a list of Entry's ID. Checking if those IDs are valid
> candidate can then be done in one shot : do the intersection of the two sets
> (they are ordered, so it's a O(n) operation) and just get the matching
> entries.
>
> Compared to the current processing (ie, accessing the reverse index for
> *each* candidate), this will be way faster, IMO.

This is a VERY interesting idea. Maybe we should create a separate
thread for this and drive deeper into it. You got something I think
here.

> Note that it's true for the AND connector, but even for the OR connector, we
> don't have this kind of issue.

Sorry don't understand connector? You probably mean the cursors?

Regards,
Alex

Re: Index reverse tables : are they useful ?

Posted by Emmanuel Lécharny <el...@apache.org>.

On 6/24/11 9:40 AM, Alex Karasulu wrote:
> On Thu, Jun 23, 2011 at 7:20 PM, Emmanuel Lecharny<el...@gmail.com>  wrote:
>> Hi guys,
>>
>> as I was reviewing the Table interface, and the Index hierarchy, I had a bit
>> of time to try to understand this part of the server I was not comfortable
>> with. And I saw that each index is using two tables : a forward and a
>> reverse table.
>>
>> The forward table is obviously mandatory. It is used to retrieve the list of
>> entry's ID from an AT value. Fine. But what about the reverse table ?
>>
>> Alex told me that it's useful when dealing with such filters :
>> (&(cn=A)(sn=B)).
>>
>> The optimization engine will get the index that provides the smallest set of
>> candidate (say, CN here). We then grab from the CN forward table all the
>> entry's ID associated with the 'A' value. Let's call it the CN-IDs set.
>>
>> Now, if we were to grab all the entries from the MasterTable to compare
>> their SN attribute with the 'B' value, t would be quite costly. We use the
>> SN's reverse table. For each entry's ID, we check if the associated value in
>> the SN's reverse table exists or not. If it exists, we have a valid
>> candidate, otherwise, we simply discard the candidate.
>>
>> As we can see, we never fetch the entry unless we have a valid candidate (of
>> course, if the SN and CN AT are indexed).
>>
>> However, I do think this reverse table is spurious. We can get the same
>> result by using the SN's forward table : for each entry's ID, check in the
>> SVn's forward table if the set of values (SN-IDs) contains the ID we pulled
>> from the CN-IDs set.
>>
>> We can even do an SN-IDs ^ CN-IDs ( '^'<=>  intersection) to get the list of
>> valid candidates. No need to do N fetches in the reverse table (where N is
>> the number of ID in CN-IDs).
>>
>> I think we can remove those reverse table, and I expect that it can speed up
>> the search a bit, but mainly speed up greatly any modification operation
>> (add, delete, modify) and even decrease the disk consumption.
> I think this is possible. It would be interesting to test these two
> different systems using different kinds of data. The number of
> duplicate keys will obviously impact the performance: it will be a
> trade off.
>
> To test an Evaluator (i.e. sn=foo) on a candidate entry say id 37, we
> need to perform a reverse lookup on the sn index. The btree key is 37
> and we check if it contains a tuple with a value of 'foo'. If it does
> then the Evaluator returns true if not it returns false. Now since we
> know we're looking for tuple ('foo', 37) and we only have the forward
> table we can lookup the 'foo' entries, and the values from duplicate
> keys would be sorted. We would just need to check if 37 were in these
> values.
>
> The reverse index has no duplicate keys. The only way to get a
> duplicate key in the reverse index is if the same entry (i.e. 37)
> contained the same value ('foo') for the same (sn) attribute. And this
> we know is not possible. So the lookups against the reverse table will
> be faster.

I was thinking about something a bit different : as soon as you have 
grabbed the list of entry's ID from the first index, looking into the 
other indexes will also return a list of Entry's ID. Checking if those 
IDs are valid candidate can then be done in one shot : do the 
intersection of the two sets (they are ordered, so it's a O(n) 
operation) and just get the matching entries.

Compared to the current processing (ie, accessing the reverse index for 
*each* candidate), this will be way faster, IMO.

Note that it's true for the AND connector, but even for the OR 
connector, we don't have this kind of issue.
>> May be I'm missing something though. I have created a branch
>> (http://svn.apache.org/viewvc/directory/apacheds/branches/apacheds-no-reverse-index/)
>> to play with this idea.
> As long as we're careful and don't just push this into the main branch
> I'm very happy to experiment with this. Let's see really what the
> costs are. Because I can sit around and theorize all day but some
> benchmarks really show us what reality is about.
The main issue so far is with the RDN, oneLevel and Subtree indexes, 
which are using entry's ID, and are representing the tree hierarchy.

It impacts the way we delete entries, too, as we currently use the 
reverse index to quickly retrieve the values to remove from the forward 
index. However, we can just iterate on the entry's attribute to do the 
same thing, invoking the drop( K, Value) method instead of invoking the 
drop( ID ) method.

Very interesting experiment !

I'll commit what I'm coming with today.

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Index reverse tables : are they useful ?

Posted by Alex Karasulu <ak...@apache.org>.

On Thu, Jun 23, 2011 at 7:20 PM, Emmanuel Lecharny <el...@gmail.com> wrote:
> Hi guys,
>
> as I was reviewing the Table interface, and the Index hierarchy, I had a bit
> of time to try to understand this part of the server I was not comfortable
> with. And I saw that each index is using two tables : a forward and a
> reverse table.
>
> The forward table is obviously mandatory. It is used to retrieve the list of
> entry's ID from an AT value. Fine. But what about the reverse table ?
>
> Alex told me that it's useful when dealing with such filters :
> (&(cn=A)(sn=B)).
>
> The optimization engine will get the index that provides the smallest set of
> candidate (say, CN here). We then grab from the CN forward table all the
> entry's ID associated with the 'A' value. Let's call it the CN-IDs set.
>
> Now, if we were to grab all the entries from the MasterTable to compare
> their SN attribute with the 'B' value, t would be quite costly. We use the
> SN's reverse table. For each entry's ID, we check if the associated value in
> the SN's reverse table exists or not. If it exists, we have a valid
> candidate, otherwise, we simply discard the candidate.
>
> As we can see, we never fetch the entry unless we have a valid candidate (of
> course, if the SN and CN AT are indexed).
>
> However, I do think this reverse table is spurious. We can get the same
> result by using the SN's forward table : for each entry's ID, check in the
> SVn's forward table if the set of values (SN-IDs) contains the ID we pulled
> from the CN-IDs set.
>
> We can even do an SN-IDs ^ CN-IDs ( '^' <=> intersection) to get the list of
> valid candidates. No need to do N fetches in the reverse table (where N is
> the number of ID in CN-IDs).
>
> I think we can remove those reverse table, and I expect that it can speed up
> the search a bit, but mainly speed up greatly any modification operation
> (add, delete, modify) and even decrease the disk consumption.

I think this is possible. It would be interesting to test these two
different systems using different kinds of data. The number of
duplicate keys will obviously impact the performance: it will be a
trade off.

To test an Evaluator (i.e. sn=foo) on a candidate entry say id 37, we
need to perform a reverse lookup on the sn index. The btree key is 37
and we check if it contains a tuple with a value of 'foo'. If it does
then the Evaluator returns true if not it returns false. Now since we
know we're looking for tuple ('foo', 37) and we only have the forward
table we can lookup the 'foo' entries, and the values from duplicate
keys would be sorted. We would just need to check if 37 were in these
values.

The reverse index has no duplicate keys. The only way to get a
duplicate key in the reverse index is if the same entry (i.e. 37)
contained the same value ('foo') for the same (sn) attribute. And this
we know is not possible. So the lookups against the reverse table will
be faster.

> May be I'm missing something though. I have created a branch
> (http://svn.apache.org/viewvc/directory/apacheds/branches/apacheds-no-reverse-index/)
> to play with this idea.

As long as we're careful and don't just push this into the main branch
I'm very happy to experiment with this. Let's see really what the
costs are. Because I can sit around and theorize all day but some
benchmarks really show us what reality is about.

Regards,
Alex