Posted to dev@directory.apache.org by Emmanuel Lécharny <el...@gmail.com> on 2012/03/22 16:40:17 UTC

[index] Presence index usage

Hi guys,

I will split my mails about indexes into smaller mails, in order to
focus on one element at a time.

Let's first talk about the presence index.

Yesterday, I posted a question about the need to restrict this index to
hold only the AT which are indexed. Alex explained that this is
rational, as it costs less to store only the indexed AT in this index.

I now have a few more things to discuss about this index:

1) Do we need to have a reverse table for presence ?

IMO, no. We never use this reverse index in the current code, and I see
no way to use it that would bring any advantage. When we delete an
entry, we just have to remove all the <AT, entryID> tuples from the
forward table, which we already do; there is no need to also remove the
<entryID, *> tuples from the reverse table.

By removing this reverse table, we save some disk space, plus some CPU 
(as we spare the removal from the reverse table).
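
For illustration only, here is a minimal Python sketch of a forward-only
presence index. It is not the ApacheDS code: the class name
PresenceIndex, its methods and the sample attribute names are all
hypothetical. It assumes the entry being deleted is available at delete
time, so its own attribute list says which tuples to drop and no reverse
lookup is needed.

```python
from collections import defaultdict


class PresenceIndex:
    """Forward-only presence index: attribute type -> set of entry IDs.

    Hypothetical sketch, not the ApacheDS implementation.
    """

    def __init__(self, indexed_ats):
        self.indexed_ats = set(indexed_ats)
        self.forward = defaultdict(set)

    def add_entry(self, entry_id, attributes):
        # Only the AT configured as indexed get an <AT, entryID> tuple.
        for at in attributes:
            if at in self.indexed_ats:
                self.forward[at].add(entry_id)

    def delete_entry(self, entry_id, attributes):
        # The entry itself lists the AT it holds, so we can remove the
        # matching forward tuples without any <entryID, AT> reverse table.
        for at in attributes:
            ids = self.forward.get(at)
            if ids is not None:
                ids.discard(entry_id)
                if not ids:
                    del self.forward[at]

    def entries_with(self, at):
        return sorted(self.forward.get(at, ()))


index = PresenceIndex(indexed_ats={"mail", "uid"})
index.add_entry(1, {"cn", "mail", "uid"})
index.add_entry(2, {"cn", "mail"})
index.delete_entry(1, {"cn", "mail", "uid"})
```

After the delete, only entry 2 remains under "mail" and nothing remains
under "uid", with a single table touched.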

2) What about storing all the AT in this index?
From a performance POV, this is not a good idea. We have around 1000
different AT in a schema, and each entry with, say, N different AT will
need to update the forward table for every single one of those N AT.
Costly.

On the other hand, having all the AT stored in the index would let us
know which entries are impacted if an AT is removed from the schema, so
a complete index with all the AT could still be a good thing.
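
The trade-off can be sketched in a few lines of Python. This is only a
toy model under stated assumptions: the function names forward_writes
and impacted_entries, the sample attribute names and the dict-based
forward table are made up for the example. It counts forward-table
updates per added entry under both policies, then shows the lookup a
complete index would make possible.

```python
def forward_writes(entry_ats, indexed_ats=None):
    """Number of forward-table updates needed to add one entry.

    indexed_ats=None models the "store every AT" policy.
    """
    entry_ats = set(entry_ats)
    if indexed_ats is None:
        return len(entry_ats)
    return len(entry_ats & set(indexed_ats))


def impacted_entries(forward, at):
    """Entry IDs holding `at`, read directly from a complete index."""
    return sorted(forward.get(at, ()))


entry = {"objectClass", "cn", "sn", "mail", "uid", "telephoneNumber"}
writes_all = forward_writes(entry)
writes_indexed = forward_writes(entry, indexed_ats={"mail", "uid"})

# A complete forward table (every AT stored) for three made-up entries.
complete_forward = {"cn": {1, 2, 3}, "mail": {1, 2}, "fax": {3}}
fax_users = impacted_entries(complete_forward, "fax")
```

Here the complete index costs one write per AT of the entry instead of
one per indexed AT, in exchange for answering "who uses this AT?"
without a scan.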

Thoughts?

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com


Re: [index] Presence index usage

Posted by Alex Karasulu <ak...@apache.org>.
On Thu, Mar 22, 2012 at 6:38 PM, <hy...@symas.com> wrote:

> On Thu, Mar 22, 2012 at 05:28:50PM +0100, Emmanuel Lécharny wrote:
> > On 3/22/12 5:11 PM, hyc@symas.com wrote:
> > >On Thu, Mar 22, 2012 at 04:40:17PM +0100, Emmanuel Lécharny wrote
>
> > >>Now, if we consider the fact that having all the AT stored in the
> > >>index will allow us to know what will be the impacted entries if an
> > >>AT is removed from the schema, then it can be a good thing to have a
> > >>complete index with all the AT.
> > >It's an interesting idea, if the admin was going to index it anyway.
> > >Otherwise, IMO you're optimizing for a very infrequent case, which
> > >is self-defeating.
> > Here, it's not about optimization, really.
> >
> > The idea is much more about being able to see if an AT removal from
> > the schema is likely to impact the data, without doing a full scan.
>
> Yes... but "avoiding a full scan" is just a (coarse) optimization of
> the schema change.
>
> > Not sure it's a sane policy though: removing an AT from a
> > production server sounds like a bad idea...
>
> Agreed. And again, even if it's for a valid reason, it will occur once
> in a blue moon. Who cares how long it takes?
>
> If you're really concerned about this scenario, sounds like a refcount
> on the schema elements would be more straightforward.
>

+1, this would be easier to comprehend and maintain in the long run than
this mechanism, which couples the index to refcount-like functionality.
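
As a rough illustration of the refcount idea, here is a small Python
sketch. Everything in it is an assumption made for the example: the
class name SchemaRefCounts, its methods and the attribute names are
hypothetical, and a real server would also have to adjust the counters
on entry modifications.

```python
from collections import Counter


class SchemaRefCounts:
    """Per-AT usage counters, kept apart from any index (hypothetical)."""

    def __init__(self):
        self.counts = Counter()

    def entry_added(self, attributes):
        # One reference per distinct AT present in the entry.
        for at in set(attributes):
            self.counts[at] += 1

    def entry_deleted(self, attributes):
        for at in set(attributes):
            self.counts[at] -= 1
            if self.counts[at] <= 0:
                del self.counts[at]

    def can_remove(self, at):
        # An AT can be dropped from the schema when no entry uses it.
        return self.counts[at] == 0


refs = SchemaRefCounts()
refs.entry_added({"cn", "mail"})
refs.entry_added({"cn"})
refs.entry_deleted({"cn", "mail"})
```

At this point "mail" is no longer referenced while "cn" still is, and
no index table was involved in finding that out.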

-- 
Best Regards,
-- Alex

Re: [index] Presence index usage

Posted by hy...@symas.com.
On Thu, Mar 22, 2012 at 05:28:50PM +0100, Emmanuel Lécharny wrote:
> On 3/22/12 5:11 PM, hyc@symas.com wrote:
> >On Thu, Mar 22, 2012 at 04:40:17PM +0100, Emmanuel Lécharny wrote

> >>Now, if we consider the fact that having all the AT stored in the
> >>index will allow us to know what will be the impacted entries if an
> >>AT is removed from the schema, then it can be a good thing to have a
> >>complete index with all the AT.
> >It's an interesting idea, if the admin was going to index it anyway.
> >Otherwise, IMO you're optimizing for a very infrequent case, which
> >is self-defeating.
> Here, it's not about optimization, really.
> 
> The idea is much more about being able to see if an AT removal from
> the schema is likely to impact the data, without doing a full scan.

Yes... but "avoiding a full scan" is just a (coarse) optimization of
the schema change.

> Not sure it's a sane policy though: removing an AT from a
> production server sounds like a bad idea...

Agreed. And again, even if it's for a valid reason, it will occur once
in a blue moon. Who cares how long it takes?

If you're really concerned about this scenario, sounds like a refcount
on the schema elements would be more straightforward.
> 
> -- 
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
> 

Re: [index] Presence index usage

Posted by Emmanuel Lécharny <el...@gmail.com>.
On 3/22/12 5:11 PM, hyc@symas.com wrote:
> On Thu, Mar 22, 2012 at 04:40:17PM +0100, Emmanuel Lécharny wrote:
>> Hi guys,
>>
>> I will split my mails about indexes into smaller mails, in order to
>> focus on one element at a time.
>>
>> Let's first talk about the presence index.
>>
>> Yesterday, I posted a question about the need to restrict this index
>> to hold only the AT which are indexed. Alex explained that this is
>> rational, as it costs less to store only the indexed AT in this
>> index.
>>
>> I now have a few more things to discuss about this index:
>>
>> 2) What about storing all the AT in this index?
>> From a performance POV, this is not a good idea. We have around
>> 1000 different AT in a schema, and each entry with, say, N different
>> AT will need to update the forward table for every single one of
>> those N AT. Costly.
> My rule of thumb is only index attrs that are actually used in search
> filters, and only if their frequency in the db is low. (If an attr is
> present in 100% of entries, then the index is totally superfluous.)
Makes sense. The Admin must be smart, and think before creating the 
indexes...
>
>> Now, if we consider the fact that having all the AT stored in the
>> index will allow us to know what will be the impacted entries if an
>> AT is removed from the schema, then it can be a good thing to have a
>> complete index with all the AT.
> It's an interesting idea, if the admin was going to index it anyway.
> Otherwise, IMO you're optimizing for a very infrequent case, which
> is self-defeating.
Here, it's not about optimization, really.

The idea is much more about being able to see if an AT removal from the
schema is likely to impact the data, without doing a full scan.

Not sure it's a sane policy though: removing an AT from a production
server sounds like a bad idea...

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com


Re: [index] Presence index usage

Posted by hy...@symas.com.
On Thu, Mar 22, 2012 at 04:40:17PM +0100, Emmanuel Lécharny wrote:
> Hi guys,
> 
> I will split my mails about indexes into smaller mails, in order to
> focus on one element at a time.
> 
> Let's first talk about the presence index.
> 
> Yesterday, I posted a question about the need to restrict this index
> to hold only the AT which are indexed. Alex explained that this is
> rational, as it costs less to store only the indexed AT in this
> index.
> 
> I now have a few more things to discuss about this index:
> 
> 2) What about storing all the AT in this index?
> From a performance POV, this is not a good idea. We have around
> 1000 different AT in a schema, and each entry with, say, N different
> AT will need to update the forward table for every single one of
> those N AT. Costly.

My rule of thumb is only index attrs that are actually used in search
filters, and only if their frequency in the db is low. (If an attr is
present in 100% of entries, then the index is totally superfluous.)
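
That rule of thumb could be written down roughly as follows. This is
only a sketch: the function name worth_indexing and its parameters are
hypothetical, and the 0.5 cut-off for "low frequency" is an arbitrary
assumption, not a recommended value.

```python
def worth_indexing(at, filter_ats, entries_with_at, total_entries,
                   max_frequency=0.5):
    """Apply the rule of thumb to one attribute type.

    filter_ats: AT actually used in search filters.
    max_frequency: assumed cut-off for "low frequency" (arbitrary).
    """
    if at not in filter_ats:
        return False
    if total_entries == 0:
        return False
    return entries_with_at / total_entries <= max_frequency


# Used in filters and present in 10% of entries: worth indexing.
mail_ok = worth_indexing("mail", {"mail", "uid"}, 100, 1000)
# Present in 100% of entries: the index would be superfluous.
oc_ok = worth_indexing("objectClass", {"objectClass"}, 1000, 1000)
# Rare, but never used in a filter: no point indexing it.
fax_ok = worth_indexing("fax", {"mail", "uid"}, 10, 1000)
```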

> Now, if we consider the fact that having all the AT stored in the
> index will allow us to know what will be the impacted entries if an
> AT is removed from the schema, then it can be a good thing to have a
> complete index with all the AT.

It's an interesting idea, if the admin was going to index it anyway.
Otherwise, IMO you're optimizing for a very infrequent case, which
is self-defeating.
> 
> thoughts ?
> 
> -- 
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
>