Posted to dev@directory.apache.org by Emmanuel Lécharny <el...@gmail.com> on 2014/06/22 17:26:03 UTC

Bulk load profiling

Hi Kiran,

I did a bit of profiling today, and was able to improve performance by 7%.
The method I sped up is PrepareString. I created a specific method
which does not create a new char[] when we are dealing with ASCII chars
only. The gain is huge.
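For illustration, here is a minimal sketch of that kind of ASCII fast path. The class and method names are hypothetical, and the actual PrepareString code in ApacheDS is more involved; this only shows the general idea of skipping the allocating Unicode path when the input is plain ASCII:

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative sketch of an ASCII fast path for value preparation.
// Names are hypothetical, not the actual ApacheDS PrepareString code.
public class AsciiFastPath {
    public static String prepare(String value) {
        // Scan once: if any char is non-ASCII, take the expensive path
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) > 0x7F) {
                // Slow path: full Unicode normalization, which allocates
                // intermediate buffers
                return Normalizer.normalize(value, Normalizer.Form.NFKC)
                                 .trim().toLowerCase(Locale.ROOT);
            }
        }
        // Fast path: pure ASCII, no normalization buffer needed
        return value.trim().toLowerCase(Locale.ROOT);
    }
}
```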

Otherwise, most of the time is -as expected- spent in the
deserialization of entries read from the MasterTable.

At this point, I think we should think about what we can do to avoid
such cost. Most of the time, we will have enough memory to load all the
elements that will be stored into an index. I'm wondering if it would
not be better to parse the LDIF once, gather what we can in memory
(without keeping the whole entry in memory) and build the index directly,
then process the master table.
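As a rough sketch of that single-pass idea (the names and the simplified entry representation are hypothetical, not the actual bulkloader API): while parsing, only the (key, entry id) pairs needed for an index are retained, never the whole entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: build one index directly from a parsed LDIF,
// keeping only (key, entry id) pairs in memory. Entries are modeled as
// plain attribute maps for illustration.
public class IndexFirstPass {
    public static Map<String, List<Long>> buildIndex(
            List<Map<String, String>> ldifEntries, String attribute) {
        Map<String, List<Long>> index = new TreeMap<>();
        long id = 0;
        for (Map<String, String> entry : ldifEntries) {
            id++;
            String key = entry.get(attribute);
            if (key != null) {
                // Only the key and the id are retained, not the entry itself
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(id);
            }
        }
        return index;
    }
}
```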

It's not easy, because we can't know how many elements we can store in
memory, and when we reach the memory limit, we have to do something
completely different. If we decide to deal with the memory
limitation from the beginning, we will pay the price and it will be
expensive. OTOH, most of the time we won't have to care about the memory,
for two reasons:
- either we have to deal with a limited number of entries in the ldif file
- or we have enough memory to handle the whole file (on my computer, I
can give 14 GB to the JVM, enough to process 5M entries if each one of
them is 1 KB)

I'm now thinking that it would be better to have two possible algorithms:
- an in-memory one, which does not care about what could happen when we
reach the end of the memory
- a 'smarter' one which takes control when we get an OOM

This can be done the same way we do with the DN parser: we have a fast
parser, which throws an exception if it sees a special case, and a full
parser. Same here, but we catch the OOM instead.
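In code, that optimistic/fallback pattern could look roughly like this. All names are hypothetical, and the simulated size limit merely stands in for a real OutOfMemoryError so the fallback is observable without exhausting a heap:

```java
import java.util.List;

// Sketch of the optimistic/fallback pattern: try the unbounded in-memory
// load first, and switch to the memory-bounded algorithm only when it
// blows up. The SIMULATED_LIMIT fakes heap exhaustion for illustration.
public class TwoModeLoader {
    static final int SIMULATED_LIMIT = 1000;

    // Optimistic path: assumes everything fits in memory
    static int inMemoryLoad(List<String> entries) {
        if (entries.size() > SIMULATED_LIMIT) {
            throw new OutOfMemoryError("simulated heap exhaustion");
        }
        return entries.size();
    }

    // Bounded path: a real loader would spill partial indexes to disk here
    static int incrementalLoad(List<String> entries) {
        return entries.size();
    }

    public static int bulkLoad(List<String> entries) {
        try {
            return inMemoryLoad(entries);
        } catch (OutOfMemoryError e) {
            // Same idea as the fast/full DN parsers, with OOM as the trigger
            return incrementalLoad(entries);
        }
    }
}
```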

Of course, we can probably try to 'predict' which one to use when we
start the bulk load, to avoid spending time with the in-memory process.
Or we can let the user decide.

Wdyt ?

Re: Bulk load profiling

Posted by Kiran Ayyagari <ka...@apache.org>.
On Sun, Jun 22, 2014 at 8:56 PM, Emmanuel Lécharny <el...@gmail.com>
wrote:

> Hi Kiran,
>
> I did a bit of profiling today, and was able to improve performance by 7%.
> The method I sped up is PrepareString. I created a specific method
> which does not create a new char[] when we are dealing with ASCII chars
> only. The gain is huge.
>
great, can you commit it?

>
> Otherwise, most of the time is -as expected- spent in the
> deserialization of entries read from the MasterTable.
>
ok

> At this point, I think we should think about what we can do to avoid
> such cost. Most of the time, we will have enough memory to load all the
> elements that will be stored into an index. I'm wondering if it would
> not be better to parse the LDIF once, gather what we can in memory (but
> not keeping the whole entry in memory) and build the index directly,
> then process the master table.
>
hmm, at least at one point we end up keeping the full entry

> It's not easy, because we can't know how many elements we can store in
>
yeah

> memory, and when we reach the memory limit, we have to do something
> completely different. If we decide to deal with the memory
> limitation from the beginning, we will pay the price and it will be
> expensive. OTOH, most of the time we won't have to care about the memory,
>
yep

> for two reasons:
> - either we have to deal with a limited number of entries in the ldif file
> - or we have enough memory to handle the whole file (on my computer, I
> can give 14 GB to the JVM, enough to process 5M entries if each one of
> them is 1 KB)
>
> I'm now thinking that it would be better to have two possible algorithms:
> - an in-memory one, which does not care about what could happen when we
> reach the end of the memory
> - a 'smarter' one which takes control when we get an OOM
>
+1

> This can be done the same way we do with the DN parser: we have a fast
> parser, which throws an exception if it sees a special case, and a full
> parser. Same here, but we catch the OOM instead.
>
> Of course, we can probably try to 'predict' which one to use when we
> start the bulk load, to avoid spending time with the in-memory process.
> Or we can let the user decide.
>
> Wdyt ?
>
yep, been thinking about the earlier ideas as well, but for now just moved
the bulkloader to its own module

-- 
Kiran Ayyagari
http://keydap.com