Posted to dev@directory.apache.org by Emmanuel Lécharny <el...@gmail.com> on 2015/02/20 21:32:36 UTC

[Mavibot] Status

Hi guys,

just a mail to explain what we are currently working on wrt Mavibot.

Recently, we have had a few user mails and issues filed for ApacheDS
where Mavibot is being used. A couple of big issues have been fixed:
typically, browsing a BTree was not possible in some corner cases, due
to a bad implementation of a cursor. We also expect to use Mavibot more
and more in the next few months. Fortress, for instance, which has a
huge test suite, would benefit from the speedup.

Now, a few big things have to be implemented. I'm currently working on
the bulkLoader, which works when we have tuples with single values, but
fails when we inject tuples with multiple values and hit the value
threshold (when we have more than 8 values, we don't store them in an
array, we use a sub-btree instead). It's a bit complex to implement, as
we have to call the bulkLoader inside the bulkLoader, and be careful not
to rewrite pages that we have already written (this is what is currently
happening: we create the sub-btree, which uses 3 pages, then we write
the pages, and afterwards we update the previously written sub-btree
pages, which is a waste).
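
Just to make the threshold handling concrete, here is a minimal, purely
illustrative sketch of the switch the bulkLoader has to deal with; the
names (VALUE_THRESHOLD, storeValues, buildSubBtree) are made up for this
sketch, not the actual Mavibot API:

    import java.util.List;

    public class ValueStoreSketch
    {
        /** Above this many values for a single key, values move to a sub-btree. */
        static final int VALUE_THRESHOLD = 8;

        /** Returns either an inline array of values, or a (bulk-loaded) sub-btree. */
        static <V> Object storeValues( List<V> values )
        {
            if ( values.size() <= VALUE_THRESHOLD )
            {
                // Small value sets stay inline, in a plain array
                return values.toArray();
            }

            // Larger value sets go into a sub-btree: this is where the bulk
            // loader has to call itself recursively, and where the sub-btree
            // pages must be written exactly once, not rewritten afterwards.
            return buildSubBtree( values );
        }

        /** Stand-in for the recursive bulk load of a sub-btree. */
        static <V> Object buildSubBtree( List<V> values )
        {
            return values;
        }
    }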

At this point, I think an improvement would be to merge the BtreeHeader
page and the BtreeInfo page, while taking care not to serialize the
BtreeInfo over and over. That would save some space on the disk,
especially for sub-btrees. Another option would be to avoid creating a
BtreeInfo for sub-btrees at all, as they inherit everything from their
associated parent btree.
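
To illustrate that second option, here is a tiny hypothetical sketch
(the class names are made up, not the real Mavibot layout) of a
sub-btree that simply reuses its parent's info instead of persisting a
BtreeInfo of its own:

    // Hypothetical names throughout; this only illustrates the idea.
    class BtreeInfoSketch
    {
        String name;
        int pageSize;
        // key and value serializer references would live here as well
    }

    class SubBtreeSketch
    {
        private final BtreeInfoSketch parentInfo;

        SubBtreeSketch( BtreeInfoSketch parentInfo )
        {
            // Nothing is serialized for the sub-btree itself: it simply keeps
            // a reference to its parent btree's info, saving one page per
            // sub-btree on disk.
            this.parentInfo = parentInfo;
        }

        BtreeInfoSketch getInfo()
        {
            return parentInfo;
        }
    }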

The next big step is to finish the implementation of the free page
management. This is really a must have. The first version should be a
simple one, with a thread in charge of reclaiming the free pages when
the oldest revision in use is released. This thread would then reclaim
all the associated pages, put them back in the free list, and keep
going, reclaiming every revision older than the new oldest in-use
revision.
Btw, this reclaimer thread must be the writer thread, as we don't want
to add locks between those two threads (we need to figure out what
would be the best mechanism here).
A better algorithm, where we would also reclaim unused pages in the
middle of still-used revisions, will have to be implemented later.
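
Here is a very rough sketch of what that first, simple reclaimer could
look like, run from the writer thread as suggested above; all the names
(copiedPagesPerRevision, freeList, reclaimUpTo) are assumptions for
illustration, not the existing Mavibot code:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class ReclaimerSketch
    {
        /** For each revision, the offsets of the pages it copied and no longer references. */
        private final NavigableMap<Long, List<Long>> copiedPagesPerRevision = new TreeMap<>();

        /** Offsets of pages the writer can reuse. */
        private final Deque<Long> freeList = new ArrayDeque<>();

        /**
         * Called from the writer thread (so no extra lock between writer and
         * reclaimer is needed) once the oldest held revision is released:
         * every revision strictly older than the new oldest in-use revision
         * has its copied pages moved back to the free list.
         */
        public void reclaimUpTo( long oldestRevisionInUse )
        {
            NavigableMap<Long, List<Long>> reclaimable =
                copiedPagesPerRevision.headMap( oldestRevisionInUse, false );

            for ( List<Long> pageOffsets : reclaimable.values() )
            {
                freeList.addAll( pageOffsets );
            }

            // The reclaimed revisions are removed from the tracking map as well
            reclaimable.clear();
        }
    }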

Last but not least, we have to add a transaction mechanism that allows
the updates to be done in batches, and in memory. I.e., we don't update
the disk, we just keep the modified pages in memory, so that when a page
has to be updated twice, we don't need to fetch it from the disk and
write it back again. A simple HashMap keeping the pages associated with
their offset should be enough, but that means we have to check this
HashMap before fetching a page from the disk.
This feature brings two critical improvements:
- first, it allows a set of updates made across many btrees to be seen
as one single atomic operation. For the LDAP server, this is absolutely
critical, as we always update many btrees when we update an entry.
- second, it would speed up the updates, as the root pages, the headers
and, more importantly, the free pages and the record manager header will
only be written once per batch, instead of once per individual update. I
think the gain could be around 20%, minimum.
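
To make the in-memory batching concrete, here is a minimal sketch of the
page map described above; the Page class and the readFromDisk/writeToDisk
methods are placeholders, not the actual Mavibot classes:

    import java.util.HashMap;
    import java.util.Map;

    public class BatchTransactionSketch
    {
        static class Page { /* serialized page content would live here */ }

        /** Pages modified by the current batch, keyed by their offset in the file. */
        private final Map<Long, Page> dirtyPages = new HashMap<>();

        /** Always check the in-memory map before going to the disk. */
        Page getPage( long offset )
        {
            Page page = dirtyPages.get( offset );

            return ( page != null ) ? page : readFromDisk( offset );
        }

        /** Updates stay in memory; a page modified twice is still written only once. */
        void updatePage( long offset, Page page )
        {
            dirtyPages.put( offset, page );
        }

        /** Commit the batch: one single write per dirty page, then clear the map. */
        void commit()
        {
            dirtyPages.forEach( this::writeToDisk );
            dirtyPages.clear();
        }

        Page readFromDisk( long offset ) { return new Page(); }

        void writeToDisk( long offset, Page page ) { /* single write per page */ }
    }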

Once those features are implemented, I'm quite sure that Mavibot 1.0 will
be in good shape.

Thoughts ?


Re: [Mavibot] Status

Posted by Emmanuel Lécharny <el...@gmail.com>.
On 03/03/15 19:50, Emmanuel Lécharny wrote:
> A quick update...
>
> I was not able to work much on Mavibot these last two weeks. Still,
> some progress has been made last week-end:
>
> - first, DIRSERVER_2047 has been investigated deeply. It was really
> problematic, as the browseFrom(K) method was producing various NPEs and
> wrong results. It took a long time to get it fixed. We have to thank Lin
> Zhao for having been patient and helpful while we got the problem fixed.
>
> - second, I ran a quick performance test to see what's going on.
> Browsing a B-tree with 500 keys, each one of them having 9 values (which
> resolves to a sub-btree), after having positioned the cursor before a
> key from 0 to 499, took 9 seconds, and 23 seconds when browsing backward.
> It was a bit slow, considering we are fetching 500 * 250 tuples (500
> iterations, with on average half of the tuples being fetched). That was
> around 14 000 tuple fetches per second.
Correction: we are actually talking about browsing 500 * 250 * 9 tuples
here, which amounts to around 125 000 tuples browsed per second (before
the fix), as we have 9 values per key, so each key produces 9 tuples.

That makes the Browse operation even faster after the cache fix, with
around 450 000 tuples fetched per second...
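
To spell out the arithmetic behind that correction (my own restatement
of the numbers above):

    500 starting keys * 250 keys browsed on average * 9 values per key
        = 1 125 000 tuples
    1 125 000 tuples / 9 s ~= 125 000 tuples per second (before the fix)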


Re: [Mavibot] Status

Posted by Emmanuel Lécharny <el...@gmail.com>.
On 03/03/15 19:50, Emmanuel Lécharny wrote:
> A quick update...
>
> I was not able to work much on Mavibot these last two weeks. Still,
> some progress has been made last week-end:
>
> - first, DIRSERVER_2047 has been investigated deeply. It was really
> problematic, as the browseFrom(K) method was producing various NPEs and
> wrong results. It took a long time to get it fixed. We have to thank Lin
> Zhao for having been patient and helpful while we got the problem fixed.
>
> - second, I ran a quick performance test to see what's going on.
> Browsing a B-tree with 500 keys, each one of them having 9 values (which
> resolves to a sub-btree), after having positioned the cursor before a
> key from 0 to 499, took 9 seconds, and 23 seconds when browsing backward.
> It was a bit slow, considering we are fetching 500 * 250 tuples (500
> iterations, with on average half of the tuples being fetched). That was
> around 14 000 tuple fetches per second.
>
> So I profiled the test. I was surprised to see that we were deserializing
> many, many keys and values, which was not expected, as we have a cache.
> After a bit of analysis, I saw that we were missing the cache many
> times. There was one reason for that: the cache size was set to 1!
> As soon as I set it to the default value (i.e. 1000), it took only 2.7
> seconds to browse forward and 3.8 seconds to browse backward, with no
> cache misses!
>
> Those two changes make me think we should release now. The next steps
> will be to move back the CPB btree, and manage the free pages.
>
> More to come later ...
>
Ok, some more, but bad news...

At some point, the file can get corrupted. Typically, the free page list
ends up cycling, leading to a quick OOM. This has to be investigated.

I suspect that concurrent reads and writes are causing this issue, which
would mean the btree is not protected against concurrent writes.



Re: [Mavibot] Status

Posted by Emmanuel Lécharny <el...@gmail.com>.
A quick update...

I was not able to work much on Mavibot these last two weeks. Still,
some progress has been made last week-end:

- first, DIRSERVER_2047 has been investigated deeply. It was really
problematic, as the browseFrom(K) method was producing various NPEs and
wrong results. It took a long time to get it fixed. We have to thank Lin
Zhao for having been patient and helpful while we got the problem fixed.

- second, I ran a quick performance test to see what's going on.
Browsing a B-tree with 500 keys, each one of them having 9 values (which
resolves to a sub-btree), after having positioned the cursor before a
key from 0 to 499, took 9 seconds, and 23 seconds when browsing backward.
It was a bit slow, considering we are fetching 500 * 250 tuples (500
iterations, with on average half of the tuples being fetched). That was
around 14 000 tuple fetches per second.

So I profiled the test. I was surprised to see that we were deserializing
many, many keys and values, which was not expected, as we have a cache.
After a bit of analysis, I saw that we were missing the cache many
times. There was one reason for that: the cache size was set to 1!
As soon as I set it to the default value (i.e. 1000), it took only 2.7
seconds to browse forward and 3.8 seconds to browse backward, with no
cache misses!
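
For illustration only (this is not the Mavibot cache implementation), a
tiny LRU map shows why a cache of size 1 turns nearly every page access
into a deserialization, while the default size of 1000 keeps the whole
working set of this test in memory:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LruCacheSketch<K, V> extends LinkedHashMap<K, V>
    {
        private final int maxEntries;

        public LruCacheSketch( int maxEntries )
        {
            // accessOrder = true turns the LinkedHashMap into an LRU map
            super( 16, 0.75f, true );
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry( Map.Entry<K, V> eldest )
        {
            // With maxEntries = 1, every new page evicts the previous one, so
            // alternating between pages misses (and deserializes) every time;
            // with maxEntries = 1000 the pages of this test all stay cached.
            return size() > maxEntries;
        }
    }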

Those two changes make me think we should release now. The next steps
will be to move back the CPB btree, and manage the free pages.

More to come later ...


Re: [Mavibot] Status

Posted by Kiran Ayyagari <ka...@apache.org>.
On Sat, Feb 21, 2015 at 4:32 AM, Emmanuel Lécharny <el...@gmail.com>
wrote:

> Hi guys,
>
> just a mail to explain what we are currently working on wrt Mavibot.
>
> Recently, we have had a few user mails and issues filed for ApacheDS
> where Mavibot is being used. A couple of big issues have been fixed:
> typically, browsing a BTree was not possible in some corner cases, due
> to a bad implementation of a cursor. We also expect to use Mavibot more
> and more in the next few months. Fortress, for instance, which has a
> huge test suite, would benefit from the speedup.
>
> Now, a few big things have to be implemented. I'm currently working on
> the bulkLoader, which works when we have tuples with single values, but
> fails when we inject tuples with multiple values and hit the value
> threshold (when we have more than 8 values, we don't store them in an
> array, we use a sub-btree instead). It's a bit complex to implement, as
> we have to call the bulkLoader inside the bulkLoader, and be careful not
> to rewrite pages that we have already written (this is what is currently
> happening: we create the sub-btree, which uses 3 pages, then we write
> the pages, and afterwards we update the previously written sub-btree
> pages, which is a waste).
>
> At this point, I think an improvement would be to merge the BtreeHeader
> page and the BtreeInfo page, while taking care not to serialize the
> BtreeInfo over and over. That would save some space on the disk,
> especially for sub-btrees. Another option would be to avoid creating a
> BtreeInfo for sub-btrees at all, as they inherit everything from their
> associated parent btree.
>
Isn't this crucial for maintaining versions of sub-btrees as well?

>
> The next big step is to finish the implementation of the free page
> management. This is really a must have. The first version should be a
> simple one, with a thread in charge of reclaiming the free pages when
> the oldest revision in use is released. This thread would then reclaim
> all the associated pages, put them back in the free list, and keep
> going, reclaiming every revision older than the new oldest in-use
> revision.
> Btw, this reclaimer thread must be the writer thread, as we don't want
> to add locks between those two threads (we need to figure out what
> would be the best mechanism here).
> A better algorithm, where we would also reclaim unused pages in the
> middle of still-used revisions, will have to be implemented later.
>
Blame me for this: the feature is currently sitting in a branch with a
critical bug, which prevents it from being called 'completed'.

>
> Last but not least, we have to add a transaction mechanism that allows
> the updates to be done in batches, and in memory. I.e., we don't update
> the disk, we just keep the modified pages in memory, so that when a page
> has to be updated twice, we don't need to fetch it from the disk and
> write it back again. A simple HashMap keeping the pages associated with
> their offset should be enough, but that means we have to check this
> HashMap before fetching a page from the disk.
>
+1

> This feature brings two critical improvements:
> - first, it allows a set of updates made across many btrees to be seen
> as one single atomic operation. For the LDAP server, this is absolutely
> critical, as we always update many btrees when we update an entry.
> - second, it would speed up the updates, as the root pages, the headers
> and, more importantly, the free pages and the record manager header will
> only be written once per batch, instead of once per individual update. I
> think the gain could be around 20%, minimum.
>
> Once those features are implemented, I'm quite sure that Mavibot 1.0 will
> be in good shape.
>
> Thoughts ?
>
>


-- 
Kiran Ayyagari
http://keydap.com