Posted to dev@directory.apache.org by Alex Karasulu <ao...@bellsouth.net> on 2006/09/02 08:05:16 UTC
[ApacheDS] [Performance] Using indices to boost search performance
Hi all,
I've been testing the search performance boost gained from indexing
attributes before starting development on an optimization to improve
index performance and in-memory size. I thought I'd share the
dramatic results of my pre-optimization tests with you, since they
clearly show the benefits of indices.
Before I do this let me list the characteristics of the hardware used
and my configuration settings:
-------------
Machine Setup
-------------
CPU: Dual Athlon MP 1900
OS: Linux 2.6.15
Mem: 2GB 266MHz
--------------
ApacheDS Setup
--------------
ApacheDS: Stock RC4 (to be released pre-image) w/ modifications
- Using 1024MB Memory
- Indexed st and initials
----------
Data Setup
----------
Wrote a simple tool to generate random values for decent-sized entries.
The data looks something like this for a user entry:
dn: uid=user.1,ou=users,dc=example,dc=com
uid: user.1
initials: yq
description: cFocJATNuhlXisDCqGtY
pager: FYyimqyZRW
cn: HSGMzajYKmicUTe
postalcode: WiXXA
st: xy
street: kpCCqmrsCzkpdtHXWMfY
l: KqmAXFYTrI
objectclass: person
objectclass: organizationalPerson
objectclass: inetOrgPerson
sn: nuymgOwpm
homephone: PERamkCtsv
mobile: vkIviOGNTC
telephonenumber: 7248889026
mail: pYvEoOjSnEymcWD
givenname: IVHJZB
postaladdress: crObexKoUTIFdzNHcZMr
employeenumber: 1
userpassword:: cGFzc3dvcmQ=
I loaded a partition with these entries, 100,000 at a time, then
performed the following searches for all entries with initials 'aa':
(1) index on initials but no cached entries
(2) index on initials with cached entries
(3) no index without cached entries
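For context, the kind of query timed above can be issued from Java via JNDI. This is a minimal sketch, assuming a server at localhost:10389 and the base DN from the sample entry (the thread doesn't say which client was used); the live search call is left commented out so the snippet stands alone:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.SearchControls;

// Sketch of the benchmark query: a subtree search for (initials=aa).
// Host, port, and base DN are assumptions, not from the thread.
public class InitialsSearch {

    // Subtree scope, no size limit: the tests above count all matches.
    public static SearchControls controls() {
        SearchControls sc = new SearchControls();
        sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
        sc.setCountLimit(0);
        return sc;
    }

    public static String filter(String initials) {
        return "(initials=" + initials + ")";
    }

    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://localhost:10389"); // assumed

        // new javax.naming.directory.InitialDirContext(env)
        //     .search("ou=users,dc=example,dc=com", filter("aa"), controls());
        System.out.println(filter("aa"));
    }
}
```

Timing the commented-out search() call with and without a warm entry cache reproduces cases (1) and (2); dropping the index reproduces case (3).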
Here are the results at the various capacities:
---------------
100,000 Entries
---------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes         3.30
(2)   yes       yes         0.72
(3)   no        no         30.63
search results = 153 entries
---------------
200,000 Entries
---------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes         6.04
(2)   yes       yes         1.44
(3)   no        no         82
search results = 302 entries
---------------
300,000 Entries
---------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes         7.54
(2)   yes       yes         1.95
(3)   no        no        146
search results = 451 entries
---------------
400,000 Entries
---------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes         9.24
(2)   yes       yes         3.80
(3)   no        no        196
search results = 586 entries
---------------
500,000 Entries
---------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes        11.96
(2)   yes       yes         3.21
(3)   no        no        224
search results = 748 entries
Alex
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Emmanuel Lecharny <el...@gmail.com>.
Quanah Gibson-Mount wrote:
> slamd has a really nice tool called "MakeLDIF" that'll do this for
> you, too. Plus, how you made your LDIF then becomes publishable to
> others who may want to run the same benchmark, since all you have to
> do is provide the template MakeLDIF used. ;)
Yeah, we know that. We have used it many times, and the file you see
was in fact generated by MakeLDIF, with just a very slight modification,
because we wanted to test a very specific performance issue.
Anyway, thanks for the pointer, because I'm sure that many people
weren't aware of the existence of this tool :)
Emmanuel.
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Quanah Gibson-Mount <qu...@stanford.edu>.
--On Saturday, September 02, 2006 2:05 AM -0400 Alex Karasulu
<ao...@bellsouth.net> wrote:
> Hi all,
> ----------
> Data Setup
> ----------
>
> Wrote a simple tool to generate random values for decent-sized entries.
> The data looks something like this for a user entry:
slamd has a really nice tool called "MakeLDIF" that'll do this for you,
too. Plus, how you made your LDIF then becomes publishable to others who
may want to run the same benchmark, since all you have to do is provide
the template MakeLDIF used. ;)
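For anyone who hasn't used it, a MakeLDIF template for entries like the sample in this thread might look roughly like the following. This is a sketch from memory, not a tested template; tag names and the exact template syntax may differ across SLAMD versions, so check the SLAMD documentation:

```
define suffix=dc=example,dc=com

branch: ou=users,[suffix]
subordinateTemplate: person:100000

template: person
rdnAttr: uid
objectClass: inetOrgPerson
uid: user.<sequential:1>
initials: <random:alpha:2>
st: <random:alpha:2>
cn: <random:alpha:15>
sn: <random:alpha:9>
description: <random:alpha:20>
telephonenumber: <random:telephone>
userpassword: password
```

Publishing a template like this alongside the numbers would let others regenerate the same shape of data set.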
--Quanah
--
Quanah Gibson-Mount
Principal Software Developer
ITS/Shared Application Services
Stanford University
GnuPG Public Key: http://www.stanford.edu/~quanah/pgp.html
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Ajay Upadhyaya <aj...@gmail.com>.
Thanks, Alex. I'm also looking forward to the enhancement related to the
reindexing feature.
-Ajay
On 9/3/06, Alex Karasulu <ao...@bellsouth.net> wrote:
>
> Ajay Upadhyaya wrote:
> > Could you share the steps needed to do the indexing, or point to any
> > existing doc... I'm relatively new to ADS,
>
> Sure, no problem. I wrote up a little documentation on adding
> indices here. I just added the page very quickly, so it might be
> sparse.
>
> http://docs.safehaus.org/display/APACHEDS/Performance+Tuning
>
> Alex
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Alex Karasulu <ao...@bellsouth.net>.
Ajay Upadhyaya wrote:
> Could you share the steps needed to do the indexing, or point to any
> existing doc... I'm relatively new to ADS,
Sure, no problem. I wrote up a little documentation on adding
indices here. I just added the page very quickly, so it might be
sparse.
http://docs.safehaus.org/display/APACHEDS/Performance+Tuning
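For a rough idea of what that page covers: in the 1.0-era server.xml, the JDBM partition configuration accepts a set of attribute IDs to index. The snippet below is a hypothetical excerpt with assumed property names; the exact configuration format may differ by release, so follow the linked page rather than this sketch:

```xml
<!-- hypothetical excerpt: index the attributes used in the benchmarks -->
<property name="indexedAttributes">
  <set>
    <value>initials</value>
    <value>st</value>
  </set>
</property>
```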
Alex
> Thanks,
> Ajay
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Ajay Upadhyaya <aj...@gmail.com>.
Could you share the steps needed to set up indexing, or point to any
existing docs? I'm relatively new to ApacheDS.
Thanks,
Ajay
Ready for GA? (Was Re: [ApacheDS] [Performance] Using indices to boost search performance)
Posted by Alex Karasulu <ao...@bellsouth.net>.
I think with this last optimization for greater capacity the server is
ready to handle serious datasets and can be made generally available.
Thoughts?
Alex
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Alex Karasulu <ao...@bellsouth.net>.
Hi all,
As promised here are some more statistics as we fill up a single
partition up to 2 million entries.
---------------
750,000 Entries
---------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes         7.26
(2)   yes       yes         5.04
(3)   no        no        565 (9m25s)
#1-#2 = 2.22
search results = 1039 entries
NOTE:
-----
I started changing settings a little from here on. I did this because
the number of entries returned was way more than the cache we had
allocated. I changed the memory and cache settings from the defaults.
Here's what I did:
o upped memory from 384 MB to 1024 MB
o increased the initials index cache from 100 to 5000
o increased the entry cache from 100 to 5000
o increased the system index caches, except the alias indices, from 100 to 1000
I turned things up a bit so I can load the database as fast as possible
since we're increasing the amount added per run to 250K entries now.
Let's do the 750K again:
---------------
750,000 Entries
---------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes         8.22
(2)   yes       yes         2.61
(3)   no        no        stopped
#1-#2 = 5.61
search results = 1039 entries
-----------------
1,000,000 Entries
-----------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes        10.22
(2)   yes       yes         3.59
#1-#2 = 6.63
search results = 1373 entries
-----------------
1,250,000 Entries
-----------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes        12.71
(2)   yes       yes         4.50
#1-#2 = 8.21
search results = 1738 entries
-----------------
1,500,000 Entries
-----------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes        14.74
(2)   yes       yes        11.84
#1-#2 = 2.9
search results = 2123 entries
Now let's search this 1.5 million entry partition by limiting the number
of returned entries to 1000.
no cache: 8.10
w/ cache: 3.04
#1-#2 = 5.06
Very similar to the values we got for 0.75 million entries at 1039
entries returned.
For the last value I'll load the partition up to 2 million entries and
run the same tests again:
-----------------
2,000,000 Entries
-----------------
    [cached]  [indexed]  [time (seconds)]

(1)   no        yes        18.58
(2)   yes       yes        15.38
#1-#2 = 3.1
search results = 2851 entries
Now let's search this 2 million entry partition by limiting the number
of returned entries to 1000.
no cache: 8.04
w/ cache: 2.55
#1-#2 = 5.49
Wow, performance for returning 1000 entries is not diminishing as we
scale from 1 million to 2 million entries, for both cached and
non-cached returns.
Looking at these results I think we can easily accommodate 10 million
entries per partition. Perhaps with a capacity limit on the order of
100 million entries per partition. Note that you can have as many
partitions as you like in a single server.
Alex
Alex Karasulu wrote:
> Hi again,
>
> I've completed my optimization of indices under high partition
> capacities. The new results are listed inline, side by side with the
> old results. If you're not interested in the exact optimization that
> took place, skip down to the results.
>
>
> The problem:
> ------------
>
> Up to now, duplicate keys (keys with many values) were stored using a
> TreeSet within the B+Trees of indices. Indices usually map some
> attribute value to an ID in the master table for an entry. The ID is
> a sequentially assigned, unique value for the entry.
>
> So if 100 people have initials AA, there would be 100 values in the
> TreeSet for the key 'AA' in the initials index. This whole set would be
> serialized and deserialized to and from disk. At these low numbers
> there's really very little impact. However, certain indices were
> seriously impacted, like the objectClass index or the hierarchy-based
> system index. Just imagine having 1 million inetOrgPerson entries: the
> objectClass index would then have 1 million values in the TreeSet for
> the key 'inetOrgPerson'. This would seriously hurt performance from
> both a space and time perspective, essentially defeating the benefit
> of using BTrees in the first place with large numbers of entries.
>
>
> Solution:
> ---------
>
> What I did was add a configurable threshold parameter: once it is
> exceeded, instead of using a TreeSet to store duplicates, another
> B+Tree is created and used within the same index database file. Right
> now the default value, which these tests were conducted with, is 1000.
> So after 1000 duplicate values, the server switches from an in-memory
> TreeSet to an on-disk B+Tree for storing the values of that key alone.
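As a rough illustration of the scheme described above, here is a minimal sketch. The names are hypothetical (the real code lives in ApacheDS's JDBM partition), and a second TreeSet merely stands in for the on-disk JDBM B+Tree so the sketch stays self-contained:

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Per-key container for duplicate index values (entry IDs). Values start
// in an in-memory TreeSet that is (de)serialized as a whole; once the
// count crosses the threshold they migrate to a B+Tree, where each value
// is stored individually instead of rewriting the whole set.
public class DupsContainer {
    static final int DEFAULT_THRESHOLD = 1000; // default used in the tests above

    private final int threshold;
    private SortedSet<Long> treeSet = new TreeSet<>(); // small key
    private SortedSet<Long> btree;                     // stand-in for on-disk B+Tree

    public DupsContainer(int threshold) {
        this.threshold = threshold;
    }

    public void add(long entryId) {
        if (btree != null) {          // already migrated: cheap individual insert
            btree.add(entryId);
            return;
        }
        treeSet.add(entryId);
        if (treeSet.size() > threshold) {
            // migrate: adds no longer rewrite one big serialized set
            btree = new TreeSet<>(treeSet);
            treeSet = null;
        }
    }

    public boolean usesBTree() { return btree != null; }

    public int size() { return usesBTree() ? btree.size() : treeSet.size(); }

    public static void main(String[] args) {
        DupsContainer dups = new DupsContainer(3); // tiny threshold for the demo
        for (long id = 1; id <= 4; id++) {
            dups.add(id);
        }
        System.out.println(dups.usesBTree() + " " + dups.size()); // true 4
    }
}
```

This is why the objectClass index no longer degrades: a key like 'inetOrgPerson' with a million values pays one migration, after which each add touches only its own record.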
>
>
> Alex Karasulu wrote:
>> Here are the results at the various capacities:
>>
>> ---------------
>> 100,000 Entries
>> ---------------
>>
>> [cached] [indexed] [time (seconds)]
>>
>> (1) no yes 3.30
> 2.37
>> (2) yes yes 0.72
> 0.76
>> (3) no no 30.63
> 42.08
>
>> search results = 153 entries
> 146
>
> Notice that if #2 is the total time for cached entries, we can conclude
> that #1 - #2 is a decent measure of the time spent pulling from disk.
> So let's see what that number looks like:
>
> Old: 2.58
> New: 1.61
>
>>
>>
>> ---------------
>> 200,000 Entries
>> ---------------
>>
>> [cached] [indexed] [time (seconds)]
>>
>> (1) no yes 6.04
> 3.29
>> (2) yes yes 1.44
> 1.49
>> (3) no no 82
> 83
>>
>> search results = 302 entries
> 297
> #1 - #2 values:
>
> Old: 4.60
> New: 1.80
>
>>
>>
>> ---------------
>> 300,000 Entries
>> ---------------
>>
>> [cached] [indexed] [time (seconds)]
>>
>> (1) no yes 7.54
> 4.11
>> (2) yes yes 1.95
> 2.22
>> (3) no no 146
> 128
>>
>> search results = 451 entries
> 435
>
> #1 - #2 values:
>
> Old: 5.59
> New: 1.89
>
>
>
>>
>>
>> ---------------
>> 400,000 Entries
>> ---------------
>>
>> [cached] [indexed] [time (seconds)]
>>
>> (1) no yes 9.24
> 4.87
>> (2) yes yes 3.80
> 2.69
>> (3) no no 196
> 168
>>
>> search results = 586 entries
> 577
>
> #1 - #2 values:
>
> Old: 5.44
> New: 2.18
>
>>
>>
>> ---------------
>> 500,000 Entries
>> ---------------
>>
>> [cached] [indexed] [time (seconds)]
>>
>> (1) no yes 11.96
> 5.06
>> (2) yes yes 3.21
> 3.21
>> (3) no no 224
> 209
>>
>> search results = 748 entries
> 706
>
> #1 - #2 values:
>
> Old: 8.75
> New: 1.85
>
>
> Besides these numbers I've noticed that adds are now faster and total
> garbage collection is lower. These were eyeballed figures. For example
> the adds would grow in time up to 28 minutes per 100K entries for those
> between 400K-500K. This has dropped down to 18 minutes.
>
> Now it would be interesting to see what happens with these trends once
> the initials index converts from a TreeSet to a B+Tree for the values of
> the key 'aa'. So, just to see, we keep loading up the partition until
> there are over 1000 entries with initials 'aa'.
>
> I'll report on this tomorrow as the server loads these entries.
>
> I guess the conclusion here is that the optimization had a significant
> effect and is worth the additional complexity it introduced into the
> JDBM partition implementation. Here's the access times (#1-#2) as a
> function of the number of entries:
>
>
> Retrieval Times | 100K | 200K | 300K | 400K | 500K
> --------------------------------------------------
> Before: | 2.58 | 4.60 | 5.59 | 5.44 | 8.75
> After: | 1.61 | 1.80 | 1.89 | 2.18 | 1.85
>
>
> Before the optimization there seems to be linear growth in access time,
> whereas after the optimization the growth rate is almost unmeasurable
> above the noise.
>
> -Alex
>
>
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Alex Karasulu <ao...@bellsouth.net>.
Hi again,
I've completed my optimization of indices under high partition
capacities. The new results are inline, side by side with the old
ones. For those not interested in the exact optimization that took
place, skip down to the results.
The problem:
------------
Up to now, duplicate keys (keys with many values) had their values stored
in a TreeSet within the B+Trees of indices. Indices usually map some
attribute value to an ID in the master table for an entry. The ID is
a sequentially unique value assigned to the entry.
So if 100 people have initials AA then there would be 100 values in the
TreeSet for the key 'AA' in the initials index. This would be serialized
and deserialized to and from disk. At these low numbers there's really
very little impact. However, certain indices were seriously impacted,
like the objectClass index or the hierarchy-based system index. Just
imagine having 1 million inetOrgPerson entries. The objectClass
index would then have 1 million values in the TreeSet for the key
'inetOrgPerson'. This would seriously hurt performance from both a
space and time perspective. It would essentially negate the benefits
of using B+Trees in the first place with large numbers of entries.
Solution:
---------
What I did was add a configurable threshold parameter: once a key
accumulates more duplicate values than the threshold, another B+Tree is
created and used within the same index database file instead of a
TreeSet. Right now the default value, which these tests were conducted
with, is 1000. So after reaching 1000 duplicate values the server
switches from an in-memory TreeSet to an on-disk B+Tree for storing
the values of that key alone.
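The switch described above can be sketched roughly as follows. This is a hypothetical Python illustration of the idea only, not ApacheDS's actual JDBM-based implementation; the class and method names are invented for the example, and an ordinary in-process class stands in for the on-disk B+Tree.

```python
# Illustrative sketch (not ApacheDS code): an index that keeps duplicate
# values for a key in an in-memory set until a threshold is crossed, then
# migrates them into a stand-in for an on-disk B+Tree.

DUP_THRESHOLD = 1000  # default threshold mentioned above


class BTreeStandIn:
    """Stands in for the per-key on-disk B+Tree; just keeps sorted keys."""

    def __init__(self, values):
        self._values = {v: True for v in sorted(values)}

    def add(self, value):
        self._values[value] = True

    def values(self):
        return list(self._values)


class DupIndex:
    """Maps an attribute value (key) to the IDs of matching entries."""

    def __init__(self, threshold=DUP_THRESHOLD):
        self.threshold = threshold
        self.forward = {}  # key -> set of IDs, or BTreeStandIn

    def add(self, key, entry_id):
        container = self.forward.get(key)
        if container is None:
            self.forward[key] = {entry_id}
        elif isinstance(container, set):
            container.add(entry_id)
            if len(container) > self.threshold:
                # Too many duplicates: move this key's values out of
                # the in-memory set into the (simulated) B+Tree.
                self.forward[key] = BTreeStandIn(container)
        else:
            container.add(entry_id)

    def lookup(self, key):
        container = self.forward.get(key, set())
        if isinstance(container, set):
            return sorted(container)
        return container.values()


idx = DupIndex(threshold=5)
for i in range(8):
    idx.add("inetOrgPerson", i)
print(type(idx.forward["inetOrgPerson"]).__name__)  # BTreeStandIn
print(idx.lookup("inetOrgPerson"))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The point of the threshold is that small duplicate sets stay cheap to (de)serialize, while huge ones (objectClass, hierarchy indices) get the logarithmic behavior of a B+Tree instead of being rewritten wholesale on every update.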
Alex Karasulu wrote:
> Hi all,
>
> I've been testing the search performance boost gained from indexing
> attributes before starting development on an optimization to improve
> index performance and in memory size. I thought I'd share these
> dramatic results of my pre-optimization tests with you since they
> clearly show the benefits of indices.
>
> Before I do this let me list the characteristics of the hardware used
> and my configuration settings:
>
>
> -------------
> Machine Setup
> -------------
>
> CPU: Dual Athlon MP 1900
> OS: Linux 2.6.15
> Mem: 2GB 266Mhz
>
>
> --------------
> ApacheDS Setup
> --------------
>
> ApacheDS: Stock RC4 (to be released pre-image) w/ modifications
> - Using 1024MB Memory
> - Indexed st and initials
>
>
> ----------
> Data Setup
> ----------
>
> Wrote a simple tool to generate random values for decent-sized entries.
> The data sort of looks like this for a user entry:
>
> dn: uid=user.1,ou=users,dc=example,dc=com
> uid: user.1
> initials: yq
> description: cFocJATNuhlXisDCqGtY
> pager: FYyimqyZRW
> cn: HSGMzajYKmicUTe
> postalcode: WiXXA
> st: xy
> street: kpCCqmrsCzkpdtHXWMfY
> l: KqmAXFYTrI
> objectclass: person
> objectclass: organizationalPerson
> objectclass: inetOrgPerson
> sn: nuymgOwpm
> homephone: PERamkCtsv
> mobile: vkIviOGNTC
> telephonenumber: 7248889026
> mail: pYvEoOjSnEymcWD
> givenname: IVHJZB
> postaladdress: crObexKoUTIFdzNHcZMr
> employeenumber: 1
> userpassword:: cGFzc3dvcmQ=
>
> I started loading up a partition with these entries, 100,000 at a
> time, then performing the following searches for all entries with
> initials aa:
>
> (1) index on initials but no cached entries
> (2) index on initials with cached entries
> (3) no index without cached entries
>
> Here are the results at the various capacities:
>
> ---------------
> 100,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 3.30
2.37
> (2) yes yes 0.72
0.76
> (3) no no 30.63
42.08
> search results = 153 entries
146
Notice that if #2 is the total time for cached entries, we can conclude
that #1 - #2 is a decent measure of the time spent pulling from disk. So
let's see what that number looks like:
Old: 2.58
New: 1.61
>
>
> ---------------
> 200,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 6.04
3.29
> (2) yes yes 1.44
1.49
> (3) no no 82
83
>
> search results = 302 entries
297
#1 - #2 values:
Old: 4.60
New: 1.80
>
>
> ---------------
> 300,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 7.54
4.11
> (2) yes yes 1.95
2.22
> (3) no no 146
128
>
> search results = 451 entries
435
#1 - #2 values:
Old: 5.59
New: 1.89
>
>
> ---------------
> 400,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 9.24
4.87
> (2) yes yes 3.80
2.69
> (3) no no 196
168
>
> search results = 586 entries
577
#1 - #2 values:
Old: 5.44
New: 2.18
>
>
> ---------------
> 500,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 11.96
5.06
> (2) yes yes 3.21
3.21
> (3) no no 224
209
>
> search results = 748 entries
706
#1 - #2 values:
Old: 8.75
New: 1.85
Besides these numbers I've noticed that adds are now faster and total
garbage collection is lower. These were eyeballed figures. For example,
add times had grown to 28 minutes per 100K entries for those
between 400K and 500K. This has dropped down to 18 minutes.
Now it would be interesting to see what happens with these trends once
the initials index converts from a TreeSet to a B+Tree for the values of
the key 'aa'. So, just to see, we keep loading up the partition until
there are over 1000 entries with initials 'aa'.
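As a rough sanity check (assuming the generated initials are two uniformly random lowercase letters — an assumption on my part, the mail only says the values are random), the expected number of 'aa' entries is N/26^2 = N/676. That tracks the observed result counts fairly well and suggests the 1000-duplicate threshold would be crossed somewhere around 676K entries:

```python
# Expected 'aa' counts under the assumption of uniform random two-letter
# lowercase initials; observed counts are taken from the results above.
combos = 26 * 26  # 676 possible initials

observed = {100_000: 146, 200_000: 297, 300_000: 435,
            400_000: 577, 500_000: 706}
for n, seen in observed.items():
    print(n, round(n / combos), seen)  # capacity, expected, observed

# Capacity at which one key is expected to reach 1000 duplicate values:
print(1000 * combos)  # 676000
```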
I'll report on this tomorrow as the server loads these entries.
I guess the conclusion here is that the optimization had a significant
effect and is worth the additional complexity it introduced into the
JDBM partition implementation. Here are the access times (#1 - #2) as a
function of the number of entries:
Retrieval Times | 100K | 200K | 300K | 400K | 500K
--------------------------------------------------
Before: | 2.58 | 4.60 | 5.59 | 5.44 | 8.75
After: | 1.61 | 1.80 | 1.89 | 2.18 | 1.85
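The rows of this table can be recomputed directly from the per-capacity measurements quoted earlier in the thread; a quick check confirms the figures:

```python
# Recompute the retrieval times (#1 - #2) from the raw (uncached, cached)
# measurements reported earlier, for both the old and the new code.
old = [(3.30, 0.72), (6.04, 1.44), (7.54, 1.95), (9.24, 3.80), (11.96, 3.21)]
new = [(2.37, 0.76), (3.29, 1.49), (4.11, 2.22), (4.87, 2.69), (5.06, 3.21)]

before = [round(u - c, 2) for u, c in old]
after = [round(u - c, 2) for u, c in new]

print(before)  # [2.58, 4.6, 5.59, 5.44, 8.75]
print(after)   # [1.61, 1.8, 1.89, 2.18, 1.85]
```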
Before the optimization there seems to be linear growth in access time,
whereas after the optimization the growth rate is almost unmeasurable
above the noise.
-Alex
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Ole Ersoy <ol...@yahoo.com>.
Oh - Sorry for the confusion - I meant that as a
compliment - But nice insight addition,
- Ole
--- Emmanuel Lecharny <el...@gmail.com> wrote:
> Ole Ersoy wrote:
>
> >I'm going to read this email as soon as the smoke
> >clears...
> >
> >
> Nothing really complicated :)
>
> The idea is to do a performance measure before doing
> some optimisation
> on some part of the code, so Alex will have a clear
> vision of what kind
> of performance improvement he will obtain.
>
> Here, what we can say from those tests, is that the
> indices are not
> really optimal: it seems that the time to grab
> entries grows linearly
> with the growth of the database. For instance, from
> 300 000 entries to
> 400 000 entries, we have an increase in time to
> search for some entries
> which is similar to that from 200 000 to 300 000: 6.04
> sec for 200K, 7.54
> for 300K, 9.24 for 400K. We would expect the growth
> to be logarithmic.
>
> Alex is working on that currently.
>
> There is a page on confluence where we have put some
> thought about
> backend, and it would be cool to add any comment,
> thought, improvement,
> ideas :
> http://docs.safehaus.org/display/APACHEDS/Backend
>
> Emmanuel.
>
>
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Emmanuel Lecharny <el...@gmail.com>.
Ole Ersoy wrote:
>I'm going to read this email as soon as the smoke
>clears...
>
>
Nothing really complicated :)
The idea is to do a performance measure before doing some optimisation
on some part of the code, so Alex will have a clear vision of what kind
of performance improvement he will obtain.
Here, what we can say from those tests is that the indices are not
really optimal: it seems that the time to grab entries grows linearly
with the growth of the database. For instance, from 300 000 entries to
400 000 entries, we have an increase in search time similar to that
from 200 000 to 300 000: 6.04 sec for 200K, 7.54 for 300K, 9.24 for
400K. We would expect the growth to be logarithmic.
Alex is working on that currently.
There is a page on Confluence where we have put some thoughts about
the backend, and it would be cool to add any comments, thoughts,
improvements, or ideas:
http://docs.safehaus.org/display/APACHEDS/Backend
Emmanuel.
Re: [ApacheDS] [Performance] Using indices to boost search performance
Posted by Ole Ersoy <ol...@yahoo.com>.
I'm going to read this email as soon as the smoke
clears...
--- Alex Karasulu <ao...@bellsouth.net> wrote:
> Hi all,
>
> I've been testing the search performance boost
> gained from indexing
> attributes before starting development on an
> optimization to improve
> index performance and in memory size. I thought I'd
> share these
> dramatic results of my pre-optimization tests with
> you since they
> clearly show the benefits of indices.
>
> Before I do this let me list the characteristics of
> the hardware used
> and my configuration settings:
>
>
> -------------
> Machine Setup
> -------------
>
> CPU: Dual Athlon MP 1900
> OS: Linux 2.6.15
> Mem: 2GB 266Mhz
>
>
> --------------
> ApacheDS Setup
> --------------
>
> ApacheDS: Stock RC4 (to be released pre-image) w/
> modifications
> - Using 1024MB Memory
> - Indexed st and initials
>
>
> ----------
> Data Setup
> ----------
>
> Wrote a simple tool to generate random values for
> decent-sized entries.
> The data sort of looks like this for a user entry:
>
> dn: uid=user.1,ou=users,dc=example,dc=com
> uid: user.1
> initials: yq
> description: cFocJATNuhlXisDCqGtY
> pager: FYyimqyZRW
> cn: HSGMzajYKmicUTe
> postalcode: WiXXA
> st: xy
> street: kpCCqmrsCzkpdtHXWMfY
> l: KqmAXFYTrI
> objectclass: person
> objectclass: organizationalPerson
> objectclass: inetOrgPerson
> sn: nuymgOwpm
> homephone: PERamkCtsv
> mobile: vkIviOGNTC
> telephonenumber: 7248889026
> mail: pYvEoOjSnEymcWD
> givenname: IVHJZB
> postaladdress: crObexKoUTIFdzNHcZMr
> employeenumber: 1
> userpassword:: cGFzc3dvcmQ=
>
> I started loading up a partition with these
> entries, 100,000 at a
> time, then performing the following searches for all
> entries with
> initials aa:
>
> (1) index on initials but no cached entries
> (2) index on initials with cached entries
> (3) no index without cached entries
>
> Here are the results at the various capacities:
>
> ---------------
> 100,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 3.30
> (2) yes yes 0.72
> (3) no no 30.63
>
> search results = 153 entries
>
>
> ---------------
> 200,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 6.04
> (2) yes yes 1.44
> (3) no no 82
>
> search results = 302 entries
>
>
> ---------------
> 300,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 7.54
> (2) yes yes 1.95
> (3) no no 146
>
> search results = 451 entries
>
>
> ---------------
> 400,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 9.24
> (2) yes yes 3.80
> (3) no no 196
>
> search results = 586 entries
>
>
> ---------------
> 500,000 Entries
> ---------------
>
> [cached] [indexed] [time (seconds)]
>
> (1) no yes 11.96
> (2) yes yes 3.21
> (3) no no 224
>
> search results = 748 entries
>
>
> Alex
>
>
>