Posted to dev@directory.apache.org by Emmanuel Lécharny <el...@gmail.com> on 2014/07/19 08:52:25 UTC

Mavibot bulkloader

Hi guys,

I have a bit of time this weekend, so I'm reviewing the bulk loader code
and will try to improve it.

So far, we need to add a few improvements and features, AFAICS:

- unlimited number of entries. Currently, we are memory bound, as we
load all the DNs we read from the file into memory. A merge sort could
help here (I still need to think more about this problem)
- we need to pass the config to the bulkloader so that it knows which
user indexes to load. This is currently not done (although we do
have the list of indexes to process)
- speed up - if we can.
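
A rough sketch of what the merge sort idea in the first item could look like, assuming the DNs have been written one per line to a text file. ExternalSortSketch and everything in it are made-up names, not Mavibot code, and plain String ordering is only a placeholder for whatever DN ordering the bulk loader really needs:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper (not part of Mavibot): sorts lines using bounded memory. */
class ExternalSortSketch {

    /** One sorted run on disk, plus the line currently at its head. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;

        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            head = reader.readLine();
        }

        /** Moves to the next line; returns false once the run is exhausted. */
        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(Run other) {
            // Assumption: natural String order stands in for a DN-aware comparison.
            return head.compareTo(other.head);
        }
    }

    /** Sorts the lines of 'input' into 'output', holding at most 'maxInMemory' lines at once. */
    static void sort(Path input, Path output, int maxInMemory) throws IOException {
        List<Path> runFiles = new ArrayList<>();

        // Phase 1: cut the input into sorted runs, each one written to a temp file.
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() >= maxInMemory) {
                    runFiles.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                runFiles.add(writeRun(chunk));
            }
        }

        // Phase 2: k-way merge; the queue always yields the run with the smallest head.
        PriorityQueue<Run> queue = new PriorityQueue<>();
        for (Path runFile : runFiles) {
            queue.add(new Run(runFile)); // runs are never empty, so head is non-null
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!queue.isEmpty()) {
                Run smallest = queue.poll();
                out.write(smallest.head);
                out.newLine();
                if (smallest.advance()) {
                    queue.add(smallest);
                } else {
                    smallest.reader.close();
                }
            }
        }
        for (Path runFile : runFiles) {
            Files.deleteIfExists(runFile);
        }
    }

    /** Sorts one chunk in memory and writes it to a new temp file. */
    private static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path file = Files.createTempFile("dn-run-", ".txt");
        Files.write(file, chunk, StandardCharsets.UTF_8);
        return file;
    }
}
```

Memory use is then bounded by maxInMemory lines during the first phase and by one line per run during the merge, so the number of entries is only limited by disk space.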

If you have any other ideas or suggestions, they would be very welcome!

Re: Mavibot bulkloader

Posted by Emmanuel Lécharny <el...@gmail.com>.
On 20/07/2014 16:39, Emmanuel Lécharny wrote:
> FTR, I have fixed a potential problem on Windows (computation of the
> position while processing the lines)
Sadly I was a bit optimistic. I was trying to get the current position
from inside the BufferedReader using reflection, but that won't work in
all cases.

I see no other way than writing our own buffered reader on top of a
RandomAccessFile. Assuming we are processing a pure UTF-8 file (comments
may contain UTF-8 chars), this requires a bit of work.
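
To give an idea of the work involved, here is a minimal sketch of such a reader, assuming the file is UTF-8 and lines end with LF or CRLF. PositionedLineReader is a hypothetical name, not an existing class. It splits on raw bytes, which is safe in UTF-8 because the byte 0x0A never appears inside a multi-byte character, and it counts bytes rather than chars so the position stays correct on Windows line endings:

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical class: a buffered line reader that knows the byte offset of every line. */
class PositionedLineReader implements Closeable {
    private final RandomAccessFile file;
    private final byte[] buffer = new byte[8192];
    private int bufferLength = 0; // number of valid bytes in the buffer
    private int bufferIndex = 0;  // next byte to consume from the buffer
    private long position = 0;    // file offset of the next byte to consume

    PositionedLineReader(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
    }

    /** Byte offset, from the start of the file, of the next line to be read. */
    long getPosition() {
        return position;
    }

    /** Returns the next line without its terminator, or null at end of file. */
    String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean sawAnyByte = false;
        boolean endedWithLf = false;

        while (!endedWithLf) {
            if (bufferIndex == bufferLength) {
                // Buffer exhausted: refill it from the file.
                bufferLength = file.read(buffer);
                bufferIndex = 0;
                if (bufferLength <= 0) {
                    bufferLength = 0;
                    break; // end of file
                }
            }
            byte b = buffer[bufferIndex++];
            position++; // count every byte, terminators included
            sawAnyByte = true;
            if (b == '\n') {
                endedWithLf = true;
            } else {
                line.write(b);
            }
        }
        if (!sawAnyByte) {
            return null;
        }

        byte[] bytes = line.toByteArray();
        int length = bytes.length;
        if (endedWithLf && length > 0 && bytes[length - 1] == '\r') {
            length--; // drop the CR of a CRLF terminator
        }
        // Decode only once the whole line is available, so multi-byte chars are never cut.
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Calling getPosition() before each readLine() gives the offset at which that line starts, with no need for reflection on BufferedReader.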

Re: Mavibot bulkloader

Posted by Emmanuel Lécharny <el...@gmail.com>.
On 20/07/2014 07:28, Kiran Ayyagari wrote:
> On Sat, Jul 19, 2014 at 12:22 PM, Emmanuel Lécharny <el...@gmail.com>
> wrote:
>
>> Hi guys,
>>
>> I have a bit of time this weekend, so I'm reviewing the bulk loader code
>> and will try to improve it.
>>
>> So far, we need to add a few improvements and features, AFAICS:
>>
>> - unlimited number of entries. Currently, we are memory bound, as we
>> load all the DNs we read from the file into memory. A merge sort could
>> help here (I still need to think more about this problem)
>>
> I implemented this piece a while ago; I can send the code if
> needed

Sure, that would help!

FTR, I have fixed a potential problem on Windows (computation of the
position while processing the lines)


Re: Mavibot bulkloader

Posted by Kiran Ayyagari <ka...@apache.org>.
On Sat, Jul 19, 2014 at 12:22 PM, Emmanuel Lécharny <el...@gmail.com>
wrote:

> Hi guys,
>
> I have a bit of time this weekend, so I'm reviewing the bulk loader code
> and will try to improve it.
>
> So far, we need to add a few improvements and features, AFAICS:
>
> - unlimited number of entries. Currently, we are memory bound, as we
> load all the DNs we read from the file into memory. A merge sort could
> help here (I still need to think more about this problem)
>
I implemented this piece a while ago; I can send the code if
needed

> - we need to pass the config to the bulkloader so that it knows which
> user indexes to load. This is currently not done (although we do
> have the list of indexes to process)
> - speed up - if we can.
>
> If you have any other ideas or suggestions, they would be very welcome!
>



-- 
Kiran Ayyagari
http://keydap.com

Re: Mavibot bulkloader

Posted by Emmanuel Lécharny <el...@gmail.com>.
On 19/07/2014 08:52, Emmanuel Lécharny wrote:
> Hi guys,
>
> I have a bit of time this weekend, so I'm reviewing the bulk loader code
> and will try to improve it.
>
> So far, we need to add a few improvements and features, AFAICS:
>
> - unlimited number of entries. Currently, we are memory bound, as we
> load all the DNs we read from the file into memory. A merge sort could
> help here (I still need to think more about this problem)
> - we need to pass the config to the bulkloader so that it knows which
> user indexes to load. This is currently not done (although we do
> have the list of indexes to process)
> - speed up - if we can.
>
> If you have any other ideas or suggestions, they would be very welcome!

A small issue in the LdifReader class: we don't have a separate
initialization step (i.e., the initialization is done when the
constructor is called).

This has some impact on the way we parse the DN, as we validate the
first one twice, the flag being set to true by default. It also makes it
impossible to use the DnFactory, which caches some DNs and would be a
valuable thing to have.

It's minor, and does not impact performance in a detectable way,
but for the clarity of the code, it would be better to separate the
init() from the construction of any LdifReader class.
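
To illustrate the kind of separation I have in mind, here is a small sketch. TwoPhaseLdifReaderSketch is a made-up class, NOT the real LdifReader API: the constructor does nothing, the caller configures the reader (a plain Map stands in for a DnFactory-like cache, and I am assuming the flag mentioned above is the DN validation flag), and parsing only starts after init(). The "validation" itself is a trivial placeholder, and base64-encoded "dn::" lines are ignored for brevity:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reader, NOT the real LdifReader: construction and init() are separate steps. */
class TwoPhaseLdifReaderSketch {
    private boolean validateDn = true;                      // assumed to default to true, as in the thread
    private Map<String, String> dnCache = new HashMap<>();  // stand-in for a DnFactory-like cache
    private BufferedReader in;
    int validations = 0;                                    // exposed only to make the sketch observable

    /** The constructor does no I/O and no parsing. */
    TwoPhaseLdifReaderSketch() {
    }

    void setValidateDn(boolean validateDn) {
        this.validateDn = validateDn;
    }

    void setDnCache(Map<String, String> dnCache) {
        this.dnCache = dnCache;
    }

    /** Attaches the input; meant to be called once, after the reader has been configured. */
    void init(BufferedReader in) {
        this.in = in;
    }

    /** Returns the next DN found in the input, or null when the input is exhausted. */
    String nextDn() throws IOException {
        if (in == null) {
            throw new IllegalStateException("init() has not been called");
        }
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("dn:")) {
                String raw = line.substring(3).trim();
                // Each distinct DN is parsed (and validated) once; repeats come from the cache.
                return dnCache.computeIfAbsent(raw, this::parseDn);
            }
        }
        return null;
    }

    /** Placeholder parsing: only checks that the DN contains an '=' when validation is on. */
    private String parseDn(String raw) {
        if (validateDn) {
            validations++;
            if (raw.indexOf('=') < 0) {
                throw new IllegalArgumentException("Invalid DN: " + raw);
            }
        }
        return raw;
    }
}
```

With this shape, a caller can do new TwoPhaseLdifReaderSketch(), then setDnCache(...) or setValidateDn(false), then init(reader), and nothing gets parsed or validated before the configuration is in place.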