You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@directory.apache.org by Emmanuel Lécharny <el...@gmail.com> on 2015/09/13 23:44:57 UTC

[Mavibot] BtreeHeader management, transactions and other things...

Hi,

I have looked at the code today, and I found that the way we handle the
BtreeHeader is a bit complex, and does not fit some other ideas I had
regarding the management of transactions.

Currently, we store a map of BH where for each BTree, we have the latest
BH (ie, the one associated with the most recent revision). When we want
to update a btree, or read it, we first check this map and use the
returned BH to start updating or reading the BTREE.

This is not good, IMO.

Actually, we should always fetch the most recent revision for a given
BTree from the BOB. That change the implementation of the
getBtreeHeader() method.

Why should we do it differently, and how does it connect with teh TXNs ?
That simple (well, sort of). txns will hold in a working memory (WM) all
the pages that will be updated from teh beginning to the end of the
transaction, allowing us to avoid many updates on disk - currently, the
way we process transaction is pretty brutal : we write teh modified
pages on disk, until teh end of the txn, even if we might very well
modify one of those pages -.

So the 'new way' should update the pages we have in the WM. That is
possible if we reference pages using their offset, but then that changes
the way we process the pages (currently, we preemptively copy a page
that we are going to modify). We will *not* anymore copy a page if it's
present in the WM, we will just update it. At the end, teh WM will
contain all the modified pages, and we will just have to write them on
disc (or discard them) when we commit (or rollback) teh transaction.

But the current code has only two way to fetch a page :
- either it's in the cache, and we return it
- or we read the page from disk
(This is what the PersistedPageHolder.getValue() does)

We need to add a third possibility : to get the page from the WM, when
we are updating the BTree, and if it's not present in teh WM, then fetch
it (from the cache or the disk) and put it into the WM.
Then the update (insert or delete) must be done without creating a copy.

That is a huge change in the code... But thsi is necessary if we want to
have an efficient transaction handling. It also allow us to get rid of
those synchronized Maps containing the BTreeHeaders.

One more things (à la Apple) : we most certainly don't need to manage
multiple values with sub-btrees in Mavibot : As soon as we have a fully
working transaction system, we could perfectly expect the application to
deal with such a specific case : all in all, in a Btree<K, V>, where V
is the user's data structure, it's up to the user to make V a BTree, and
to deal with it. As we will have a cross- b-tree transaction system, it
won't be expensive, plus this is already what we do with JDBM, so the
ApacheDS code will not be difficult to port.

A bit of work in our plates ;-)

Thoughts ?

Re: [Mavibot] BtreeHeader management, transactions and other things...

Posted by Kiran Ayyagari <ka...@apache.org>.

On Mon, Sep 14, 2015 at 4:31 PM, Emmanuel Lécharny <el...@gmail.com>
wrote:

> Le 14/09/15 04:40, Kiran Ayyagari a écrit :
> > On Mon, Sep 14, 2015 at 5:44 AM, Emmanuel Lécharny <el...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> I have looked at the code today, and I found that the way we handle the
> >> BtreeHeader is a bit complex, and does not fit some other ideas I had
> >> regarding the management of transactions.
> >>
> >> Currently, we store a map of BH where for each BTree, we have the latest
> >> BH (ie, the one associated with the most recent revision). When we want
> >> to update a btree, or read it, we first check this map and use the
> >> returned BH to start updating or reading the BTREE.
> >>
> >> This is not good, IMO.
> >>
> >> Actually, we should always fetch the most recent revision for a given
> >> BTree from the BOB. That change the implementation of the
> >> getBtreeHeader() method.
> >>
> >> Why should we do it differently, and how does it connect with teh TXNs ?
> >> That simple (well, sort of). txns will hold in a working memory (WM) all
> >> the pages that will be updated from teh beginning to the end of the
> >> transaction, allowing us to avoid many updates on disk - currently, the
> >> way we process transaction is pretty brutal : we write teh modified
> >> pages on disk, until teh end of the txn, even if we might very well
> >> modify one of those pages -.
> >>
> >> So the 'new way' should update the pages we have in the WM. That is
> >> possible if we reference pages using their offset, but then that changes
> >> the way we process the pages (currently, we preemptively copy a page
> >> that we are going to modify). We will *not* anymore copy a page if it's
> >> present in the WM, we will just update it. At the end, teh WM will
> >> contain all the modified pages, and we will just have to write them on
> >> disc (or discard them) when we commit (or rollback) teh transaction.
> >>
> >> But the current code has only two way to fetch a page :
> >> - either it's in the cache, and we return it
> >> - or we read the page from disk
> >> (This is what the PersistedPageHolder.getValue() does)
> >>
> >> We need to add a third possibility : to get the page from the WM, when
> >> we are updating the BTree, and if it's not present in teh WM, then fetch
> >> it (from the cache or the disk) and put it into the WM.
> >> Then the update (insert or delete) must be done without creating a copy.
> >>
> >> That is a huge change in the code... But thsi is necessary if we want to
> >> have an efficient transaction handling. It also allow us to get rid of
> >> those synchronized Maps containing the BTreeHeaders.
> >>
> >> One more things (à la Apple) : we most certainly don't need to manage
> >> multiple values with sub-btrees in Mavibot : As soon as we have a fully
> >> working transaction system, we could perfectly expect the application to
> >> deal with such a specific case : all in all, in a Btree<K, V>, where V
> >> is the user's data structure, it's up to the user to make V a BTree, and
> >> to deal with it. As we will have a cross- b-tree transaction system, it
> >> won't be expensive, plus this is already what we do with JDBM, so the
> >> ApacheDS code will not be difficult to port.
> >>
> > we should move to explicit begin() and commit() to support the cross
> Btree
> > transactions,
>
> Absolutely. The current transaction support is less than half baked, it
> only support per-BTree transaction, which is pretty much useless.
>
> Actually, we do need to pass a context to each Btree operation, context
> that holds the revision of each B-trees being part of the transaction.
> That also mean we have one common revision for each transaction,
> revision which ties all the Btrees' revisions that were part of the
> transaction.
>
> Typically, in a ApacheDS context, when we update an entry, we nt only
> update the master BTree, but the RDN btree and every btree associated
> with each index. And all those updates must be visible as a whole when
> we want to fetch an entry or an index for this revision.
>
> the solution would be to use the transaction ID (which will be
> incremental) as the revision number for all the b-trees being updated
> during that transaction. That saves us the pain of keeping a track of
> the various B-trees revisions when applying an operation to a given
> revision of a b-tree.
>
> +1 that is a good idea



-- 
Kiran Ayyagari
http://keydap.com

Re: [Mavibot] BtreeHeader management, transactions and other things...

Posted by Emmanuel Lécharny <el...@gmail.com>.

Le 14/09/15 04:40, Kiran Ayyagari a écrit :
> On Mon, Sep 14, 2015 at 5:44 AM, Emmanuel Lécharny <el...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have looked at the code today, and I found that the way we handle the
>> BtreeHeader is a bit complex, and does not fit some other ideas I had
>> regarding the management of transactions.
>>
>> Currently, we store a map of BH where for each BTree, we have the latest
>> BH (ie, the one associated with the most recent revision). When we want
>> to update a btree, or read it, we first check this map and use the
>> returned BH to start updating or reading the BTREE.
>>
>> This is not good, IMO.
>>
>> Actually, we should always fetch the most recent revision for a given
>> BTree from the BOB. That change the implementation of the
>> getBtreeHeader() method.
>>
>> Why should we do it differently, and how does it connect with teh TXNs ?
>> That simple (well, sort of). txns will hold in a working memory (WM) all
>> the pages that will be updated from teh beginning to the end of the
>> transaction, allowing us to avoid many updates on disk - currently, the
>> way we process transaction is pretty brutal : we write teh modified
>> pages on disk, until teh end of the txn, even if we might very well
>> modify one of those pages -.
>>
>> So the 'new way' should update the pages we have in the WM. That is
>> possible if we reference pages using their offset, but then that changes
>> the way we process the pages (currently, we preemptively copy a page
>> that we are going to modify). We will *not* anymore copy a page if it's
>> present in the WM, we will just update it. At the end, teh WM will
>> contain all the modified pages, and we will just have to write them on
>> disc (or discard them) when we commit (or rollback) teh transaction.
>>
>> But the current code has only two way to fetch a page :
>> - either it's in the cache, and we return it
>> - or we read the page from disk
>> (This is what the PersistedPageHolder.getValue() does)
>>
>> We need to add a third possibility : to get the page from the WM, when
>> we are updating the BTree, and if it's not present in teh WM, then fetch
>> it (from the cache or the disk) and put it into the WM.
>> Then the update (insert or delete) must be done without creating a copy.
>>
>> That is a huge change in the code... But thsi is necessary if we want to
>> have an efficient transaction handling. It also allow us to get rid of
>> those synchronized Maps containing the BTreeHeaders.
>>
>> One more things (à la Apple) : we most certainly don't need to manage
>> multiple values with sub-btrees in Mavibot : As soon as we have a fully
>> working transaction system, we could perfectly expect the application to
>> deal with such a specific case : all in all, in a Btree<K, V>, where V
>> is the user's data structure, it's up to the user to make V a BTree, and
>> to deal with it. As we will have a cross- b-tree transaction system, it
>> won't be expensive, plus this is already what we do with JDBM, so the
>> ApacheDS code will not be difficult to port.
>>
> we should move to explicit begin() and commit() to support the cross Btree
> transactions, 

Absolutely. The current transaction support is less than half baked, it
only support per-BTree transaction, which is pretty much useless.

Actually, we do need to pass a context to each Btree operation, context
that holds the revision of each B-trees being part of the transaction.
That also mean we have one common revision for each transaction,
revision which ties all the Btrees' revisions that were part of the
transaction.

Typically, in a ApacheDS context, when we update an entry, we nt only
update the master BTree, but the RDN btree and every btree associated
with each index. And all those updates must be visible as a whole when
we want to fetch an entry or an index for this revision.

the solution would be to use the transaction ID (which will be
incremental) as the revision number for all the b-trees being updated
during that transaction. That saves us the pain of keeping a track of
the various B-trees revisions when applying an operation to a given
revision of a b-tree.

Re: [Mavibot] BtreeHeader management, transactions and other things...

Posted by Kiran Ayyagari <ka...@apache.org>.

On Mon, Sep 14, 2015 at 5:44 AM, Emmanuel Lécharny <el...@gmail.com>
wrote:

> Hi,
>
> I have looked at the code today, and I found that the way we handle the
> BtreeHeader is a bit complex, and does not fit some other ideas I had
> regarding the management of transactions.
>
> Currently, we store a map of BH where for each BTree, we have the latest
> BH (ie, the one associated with the most recent revision). When we want
> to update a btree, or read it, we first check this map and use the
> returned BH to start updating or reading the BTREE.
>
> This is not good, IMO.
>
> Actually, we should always fetch the most recent revision for a given
> BTree from the BOB. That change the implementation of the
> getBtreeHeader() method.
>
> Why should we do it differently, and how does it connect with teh TXNs ?
> That simple (well, sort of). txns will hold in a working memory (WM) all
> the pages that will be updated from teh beginning to the end of the
> transaction, allowing us to avoid many updates on disk - currently, the
> way we process transaction is pretty brutal : we write teh modified
> pages on disk, until teh end of the txn, even if we might very well
> modify one of those pages -.
>
> So the 'new way' should update the pages we have in the WM. That is
> possible if we reference pages using their offset, but then that changes
> the way we process the pages (currently, we preemptively copy a page
> that we are going to modify). We will *not* anymore copy a page if it's
> present in the WM, we will just update it. At the end, teh WM will
> contain all the modified pages, and we will just have to write them on
> disc (or discard them) when we commit (or rollback) teh transaction.
>
> But the current code has only two way to fetch a page :
> - either it's in the cache, and we return it
> - or we read the page from disk
> (This is what the PersistedPageHolder.getValue() does)
>
> We need to add a third possibility : to get the page from the WM, when
> we are updating the BTree, and if it's not present in teh WM, then fetch
> it (from the cache or the disk) and put it into the WM.
> Then the update (insert or delete) must be done without creating a copy.
>
> That is a huge change in the code... But thsi is necessary if we want to
> have an efficient transaction handling. It also allow us to get rid of
> those synchronized Maps containing the BTreeHeaders.
>
> One more things (à la Apple) : we most certainly don't need to manage
> multiple values with sub-btrees in Mavibot : As soon as we have a fully
> working transaction system, we could perfectly expect the application to
> deal with such a specific case : all in all, in a Btree<K, V>, where V
> is the user's data structure, it's up to the user to make V a BTree, and
> to deal with it. As we will have a cross- b-tree transaction system, it
> won't be expensive, plus this is already what we do with JDBM, so the
> ApacheDS code will not be difficult to port.
>
we should move to explicit begin() and commit() to support the cross Btree
transactions, this will impact ApacheDS code a bit cause now we need
a txn handle to pass around

>
> A bit of work in our plates ;-)
>
yep

>
> Thoughts ?
>
>


-- 
Kiran Ayyagari
http://keydap.com