You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-dev@db.apache.org by Dibyendu Majumdar <di...@mazumdar.demon.co.uk> on 2005/09/03 23:41:29 UTC

BTree package documentation

Hello,

Attached is some documentation for the package 
org.apache.derby.impl.store.access.btree.
Most of it is based upon information provided by Mike.

Let me know if there are any inaccuracies. I think that there are some 
areas where further details would be useful:

1) How are duplicate keys handled in secondary indexes.
2) Some more detail on logging and recovery would be useful.
3) Internal transactions and how they are used to manage SMOs.

I would like this document to be made available also as a link in the 
Papers section.

Thanks and Regards

Dibyendu

Re: BTree package documentation

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.

Mike Matrigali wrote:
> See comments below, once updated I will work on getting it into the
> codeline.

thanks, Mike, when it's in the codeline I'll replace the site doc with a 
link.

-jean

Re: BTree package documentation (revisited)

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.

On Tue, 06 Sep 2005 11:06:03 -0700 Mike Matrigali 
<mi...@sbcglobal.net> wrote:

> See comments below, once updated I will work on getting it into the
> codeline.

And Dibyendu Majumdar <di...@mazumdar.demon.co.uk> followed up with:

> ... 
> Attached is a revised package.html with Javadoc style links. I haven't 
> been able to test the links - but I think they should work.
> ... 

Links for both emails:
http://mail-archives.apache.org/mod_mbox/db-derby-dev/200509.mbox/%3c431DDA8B.7020009@sbcglobal.net%3e
http://mail-archives.apache.org/mod_mbox/db-derby-dev/200509.mbox/%3c431E2B62.8090908@mazumdar.demon.co.uk%3e

I completely lost track of this item. Did this get checked into the code 
tree? I'm finding a 
org/apache/derby/impl/store/access/btree/index/package.html , but not a 
org/apache/derby/impl/store/access/btree/package.html that corresponds 
to the content I initially put at 
http://db.apache.org/derby/papers/btree_package.html .

thanks (and apologies for the delay and confusion),

  -jean

Re: BTree package documentation

Posted by Mike Matrigali <mi...@sbcglobal.net>.

I agree there is much more space to save in leaf level compression.
Duplicate key and prefix compression are a couple of possible areas.

Dibyendu Majumdar wrote:

> Mike Matrigali wrote:
> 
>> One could create compressed branch entries if the RowLocation were
>> not needed to differentiate between 2 adjacent entries (ie.
>> for 2 adjacent branch entries: mike, row loc 1, leaf page 113; stan,
>> row loc 2, leaf page 112 -- for these 2 branch rows the minimum
>> suffix of mike/stan would be enough without the row location).
>> Because the page/row format in derby does not require constant number
>> of column rows such a change would not be too hard.  One could also
>> do suffix compression within a column - for instance in the previous
>> example only m/s is really necessary, but such a decision is very
>> datatype dependent and needs to be supported by the datatype itself -
>> not a decision by store looking at the bytes.
>>  
>>
> I guess it would be more fruitful to do compression of duplicate keys in
> non-unique indexes at leaf level pages, unless this is already done.
> 
> Attached is a revised package.html with Javadoc style links. I haven't
> been able to test the links - but I think they should work.
> 
> Regards
> 
> ------------------------------------------------------------------------
> 
> Implements BTree access method, which is the basis for SQL indexes.
> 
> 
>   Overview
> 
> Derby implements secondary SQL indexes as BTrees. The high level
> features of the BTree implementation are:
> 
>    1. Derby implements the standard B+Tree algorithm, in which all keys
>       are stored in leaf pages, and the upper levels are organized as a
>       redundant index for enabling rapid location of a particular key.
>       See The Ubiquitous B-Tree, Douglas Comer
>       <http://portal.acm.org/citation.cfm?id=356776> for a definition of
>       the B+Tree.
>    2. The BTree implementation supports concurrency using page level
>       latches, and recovery using a Write Ahead Log.
>    3. Derby uses data only locking for its logical row level locking.
>    4. Derby supports all four SQL isolation levels, serializable,
>       repeatable read, read committed and read uncommitted.
>    5. Derby uses logical key deletes. This enables it to perform undos
>       during rollbacks and restart recovery as single page operations.
> 
> 
>   High level structure of the B+Tree
> 
>     * The first row in all pages, branch or leaf, is a control row. In
>       branch pages, this is a {@link
>       org.apache.derby.impl.store.access.btree.BranchControlRow
>       BranchControlRow} and in leaf pages, it is a {@link
>       org.apache.derby.impl.store.access.btree.LeafControlRow
>       LeafControlRow}.
>     * In addition to the BranchControlRow, branch level pages contain
>       {@link org.apache.derby.impl.store.access.btree.BranchRow
>       BranchRows}.
>     * At branch level, a page is linked to is left and right siblings,
>       parent, as well as children. The number of child pages is 1
>       greater than the number of BranchRows in the page. The leftmost
>       child pointer is stored in the BranchControlRow itself. Each
>       BranchRow contains a child pointer as its last column.
>     * At leaf level also, a page is linked to its left and right
>       siblings, and the parent. In addition to the LeafControlRow, leaf
>       level pages contain {@link
>       org.apache.derby.impl.sql.execute.IndexRow IndexRows}. The last
>       column in an IndexRow is a {@link
>       org.apache.derby.iapi.types.RowLocation RowLocation}.
>     * IndexRows are generated by the {@link
>       org.apache.derby.iapi.sql.dictionary.IndexRowGenerator
>       IndexRowGenerator}.
> 
> 
>   Handling of Unique Keys
> 
> From one perspective the current BTree implementation assumes all rows
> it handles are unique. This is because the row always includes the
> address of the base table row (RowLocation). One could not use the
> current BTree implementation to store two rows which were exactly the
> same including the RowLocation. Having all rows as unique makes the code
> simpler as binary searches don't have to worry about edge cases of
> duplicate keys.
> 
> Having said that, the BTree implementation does support rejecting rows
> which are not unique given a subset of the columns of the row. Currently
> this is only used to compare whether index rows are unique, by comparing
> all columns other than the last one. So, to {@link
> org.apache.derby.impl.store.access.btree.BTree#create create} a unique
> SQL index you create a BTree telling it to enforce uniqueness on number
> of columns - 1; and to create a non-unique SQL index you create a BTree
> telling it to enforce uniqueness on all columns.
> 
> Note that currently branch rows also include the RowLocation. For most
> of the code RowLocation is basically opaque, so most of the code just
> treats the BTree as follows:
> 
>     * BTree is created with N columns
>     * Every row must be unique when comparing all N columns
>     * Branch rows contain N+1 columns - all the columns from the leaf
>       entry plus one to tell it which page is the sibling.
> 
> There is no reason that the branch rows always need the row location,
> the minimum requirement is that 2 branch rows must uniquely discriminate
> the keys on 2 different leaf pages. Again for simplicity it is assumed
> that the keys on a branch page are never duplicate, this simplifies
> binary search somewhat.
> 
> 
>   Latching implementation
> 
> Derby uses latches on pages to get exclusive access to the page while
> reading or writing the page (Derby only uses exclusive latches, no
> shared latches are used). In order to prevent deadlocks latches
> requested while holding other latches are always requested top/down and
> left to right. Btree splits are always left to right. If for any reason
> the code needs to break this protocol then it will first request the
> latch NOWAIT and if it can't get the latch it will release all current
> latches and wait for the latch it is trying to get, and then after
> obtaining it go about getting the other latches it needs for the
> particular operation. While traversing down the tree Derby may hold 2
> latches: one on parent and one on child. It then continues doing
> "ladder" latching down the tree releasing the highest node when it has
> successfully got a new lower node latch. Latches are short term, only
> held while reading/modifying the page, never held while an I/O is
> happening. Structure modifications are all isolated from other
> operations through the use of latches.
> 
> 
>   Locking and Isolation Levels
> 
> Derby uses data only locking for its logical row level locking. All
> isolation level implementation is done using logical locks (Derby does
> not support non-locking isolation such as multi-versioning).
> 
> In data only locking the key for all locks whether obtained in the BTree
> or in the base table is the address of the base table row. In the BTree
> case the address of the base table row ({@link
> org.apache.derby.iapi.types.RowLocation RowLocation}) is always the last
> column of the leaf index entry ({@link
> org.apache.derby.impl.sql.execute.IndexRow IndexRows}).
> 
> Serializable
>     Derby uses "previous key" locking to prevent phantoms from being
>     inserted into a range being accessed by a scan. A scan always locks
>     the index row just "previous" to the first entry that satisfies the
>     scan condition and then all index entries going forward until the
>     end of the scan. All row locks are held until end of transaction.
>     Inserts also get previous key locks, which will block if the
>     existing row is locked.
> Repeatable Read
>     Same as serializable except that no previous key locking is done.
> Read Committed
>     Write locks are held until end of transaction. Read locks are
>     released once the query in question no longer needs them.
> Read Uncommitted
>     Write locks are held until end of transaction. No read row locks are
>     acquired. The code still gets table level intent locks to prevent
>     concurrent DDL during the query.
> 
> 
>   BTree Structure Modifications
> 
> In Derby, SMOs (structure modification operations - ie. page splits),
> only happen top down. This is not as concurrent as bottom up in ARIES/IM
> <http://www.almaden.ibm.com/u/mohan/RJ6846.pdf>, but is simpler. As in
> ARIES/IM "Not more than 2 index pages are held latched simultaneously at
> anytime. In order to improve concurrency and to avoid deadlocks
> involving latches, even those latches are not held while waiting for a
> lock wich is not immediately grantable. No data page latch is held or
> acquired during an index access. Latch coupling is used while traversing
> the tree - ie. the latch on a parent page is held while requesting a
> latch on a child page."
> 
> In the case of a simple split, exclusive latch is held on P (parent),
> and C (child). If there is room in P, then new page C1 (child right) is
> created, new descriminator is inserted into P, and rows moved to C1.
> Latches are held on P and C for duration of split, and then released.
> The new page C1 is retrieved from a list of free pages which may have
> come from preallocation or from deleted space reclamation. If there is
> no free space, space is automatically added to the file and used.
> 
> The hard case is when P does not have room for descriminator key. In
> this case all latches are released, and Derby does a split pass from top
> to bottom, and will split the internal nodes that do not have room for
> the decrimator key. Note this may result in more splits than necessary
> for this particular insert, but the assumption is that the splits will
> have to happen eventually anyway. After this split pass is done, the
> search for the insert starts again from top down, but it must once again
> check for space because it has given up all its latches and some other
> transaction may have acquired the space in the meantime.
> 
> Optimization is possible to remember C and see if it is right location,
> and/or use sideway pointers to search right rather than do research of tree.
> 
> 
>   Logical Key Deletes
> 
> In both the BTree and the Heap, deletes are first executed by marking a
> "deleted" bit. This is to insure space on the page for abort, since row
> level locking will allow other rows on the page to be modified
> conncurrently with the transaction executing the delete. The system uses
> a background daemon to schedule work after commit to reclaim the space
> of the deleted rows. A row marked deleted can be "purged" if one can
> obtain a lock on it (if it was an uncommitted delete then the
> transaction doing the commit would still have an exclusive lock on the row).
> 
> 
>   Garbage Collection of deleted keys
> 
> Since rows are only marked as "deleted", and not physically removed, it
> is necessary to perform space reclamation on deleted rows.
> 
> 
>     Online during BTREE split
> 
> Whenever there is not enough room on a leaf to do an insert the code
> attempts to find space on the leaf, by checking if it can reclaim any
> committed deletes on that leaf. That work only requires the latch on the
> leaf and NOWAIT row write locks. It is expected that most of the space
> reclaim done in the BTree goes through this path. Most of this work is
> done in {@link
> org.apache.derby.impl.store.access.btree.BTreeController#reclaim_deleted_rows
> BTreeController.reclaim_deleted_rows}.
> 
> 
>     BTREE post commit work
> 
> Whenever a delete operation deletes the last row from a leaf page then a
> BtreePostCommit job is queued to be executed after the transaction which
> did the delete commits. This work currently requires a table level lock
> as page merges have not been implemented to be allowed concurrent with
> other operations. Many DBMSes don't even try to do page merges except
> when called from some sort of reorg utility. If all rows on page are
> purged, then the page will move to the free list and perform a merge to
> the tree.
> 
> It is expected that the normal space reclamation happens with row locks
> during btree split, which is why not much work has been done to optimize
> btree post commit path
> 
> 
>   Logging and Recovery
> 
> Derby uses physical redo and logical undo for BTree inserts and deletes.
> Logical undo is simplified as a result of using logical key deletes. If
> keys were physically removed during deletes, then the undo of a key
> delete would have required an insert operation which can potentially
> lead to page splits at various levels within the tree. Since the key is
> not physically removed, but only marked as "deleted", undoing a key
> delete is accomplished easily. However, since the page where the insert
> or delete should take place may have moved, it may be necessary to
> {@link org.apache.derby.impl.store.access.btree.index.B2IUndo#findUndo
> search} for the page.
> 
> Structure modification operations (SMOs) are handled as independent
> {@link
> org.apache.derby.iapi.store.access.conglomerate.TransactionManager#getInternalTransaction
> internal transactions} and commit separately from the transaction that
> initiated the SMO. Once an SMO has been completed successfully, it is
> not undone, even if the transaction that caused it decides to abort.
> During restart recovery undo phase, incomplete internal transactions are
> undone BEFORE any regular transactions. This ensures that the BTrees are
> structurally consistent before normal undo begins.
>

Re: BTree package documentation

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.

Mike Matrigali wrote:

>One could create compressed branch entries if the RowLocation were
>not needed to differentiate between 2 adjacent entries (ie.
>for 2 adjacent branch entries: mike, row loc 1, leaf page 113; stan,
>row loc 2, leaf page 112 -- for these 2 branch rows the minimum
>suffix of mike/stan would be enough without the row location).
>Because the page/row format in derby does not require constant number
>of column rows such a change would not be too hard.  One could also
>do suffix compression within a column - for instance in the previous
>example only m/s is really necessary, but such a decision is very
>datatype dependent and needs to be supported by the datatype itself -
>not a decision by store looking at the bytes.
>  
>
I guess it would be more fruitful to do compression of duplicate keys in 
non-unique indexes at leaf level pages, unless this is already done.

Attached is a revised package.html with Javadoc style links. I haven't 
been able to test the links - but I think they should work.

Regards

Re: BTree package documentation

Posted by Mike Matrigali <mi...@sbcglobal.net>.

No, currently branch rows also include the RowLocation.  For most of
the code RowLocation is basically opaque, so most of the code just
treats the btree as follows:
o btree is created with N columns
o Every row must be unique when comparing all N columns
o Branch rows contain N+1 columns - all the columns from the leaf
  entry plus one to tell it which page is the sibling.

There is no reason that the branch rows always need the row location,
the minimum requirement is that 2 branchrows must uniquely discriminate
the keys on 2 different leaf pages.  Again for simplicity it is assumed
that the keys on a branch page are never duplicate, this simplifies
binary search somewhat.

Imagine a one column SQL non-unique index for which all the rows have
the same value.  In the case of a delete on a row in the base table,
a search on the index which includes the RowLocation would not be
able to efficiently find the right row if the RowLocation was not
included in the branch rows.

One could create compressed branch entries if the RowLocation were
not needed to differentiate between 2 adjacent entries (ie.
for 2 adjacent branch entries: mike, row loc 1, leaf page 113; stan,
row loc 2, leaf page 112 -- for these 2 branch rows the minimum
suffix of mike/stan would be enough without the row location).
Because the page/row format in derby does not require constant number
of column rows such a change would not be too hard.  One could also
do suffix compression within a column - for instance in the previous
example only m/s is really necessary, but such a decision is very
datatype dependent and needs to be supported by the datatype itself -
not a decision by store looking at the bytes.

Dibyendu Majumdar wrote:

> Mike Matrigali wrote:
> 
>> See comments below, once updated I will work on getting it into the
>> codeline.
>>
>> Comments on unique key handling.
>>
>>> From one perspective the current btree implementation assumes all rows
>>
>> it handles are unique.  This is because the row includes the the last
>> column which is the address of the base table row.  One could not use
>> the current btree implementation to store 2 rows which were exactly
>> the same including the row location.  At this level having all rows
>> being unique makes the code simpler as binary searches don't have to
>> worry about edge cases of duplicate keys.
>>
>> Having said that the btree implementation does support rejecting rows
>> which are not unique given a subset of the columns of the row.
>> Currently this is only used to compare whether index rows are unique
>> comparing all columns other than the last one.  So to create a unique
>> SAL index you create a btree telling it to enforce uniqueness on
>> number of columns - 1; and to create a non-unique SQL index you create a
>> btree telling it to enforce uniqueness on all columns.
>>  
>>
> Mike,
> 
> I assume you mean that at the leaf level, no two IndexRows can contain
> the same set of keys
> and also the RowLocation. I assume my understanding is correct that
> BranchRows do not
> contain the RowLocation.
> 
> I will send you an updated html - do you want the links in Javadoc format?
> 
> Thanks
> 
> Dibyendu
> 
>

Re: BTree package documentation

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.

Mike Matrigali wrote:

>See comments below, once updated I will work on getting it into the
>codeline.
>
>Comments on unique key handling.
>>>From one perspective the current btree implementation assumes all rows
>it handles are unique.  This is because the row includes the the last
>column which is the address of the base table row.  One could not use
>the current btree implementation to store 2 rows which were exactly
>the same including the row location.  At this level having all rows
>being unique makes the code simpler as binary searches don't have to
>worry about edge cases of duplicate keys.
>
>Having said that the btree implementation does support rejecting rows
>which are not unique given a subset of the columns of the row.
>Currently this is only used to compare whether index rows are unique
>comparing all columns other than the last one.  So to create a unique
>SAL index you create a btree telling it to enforce uniqueness on
>number of columns - 1; and to create a non-unique SQL index you create a
>btree telling it to enforce uniqueness on all columns.
>  
>
Mike,

I assume you mean that at the leaf level, no two IndexRows can contain 
the same set of keys
and also the RowLocation. I assume my understanding is correct that 
BranchRows do not
contain the RowLocation.

I will send you an updated html - do you want the links in Javadoc format?

Thanks

Dibyendu

Re: BTree package documentation

Posted by Mike Matrigali <mi...@sbcglobal.net>.

See comments below, once updated I will work on getting it into the
codeline.

Comments on unique key handling.
>From one perspective the current btree implementation assumes all rows
it handles are unique.  This is because the row includes the the last
column which is the address of the base table row.  One could not use
the current btree implementation to store 2 rows which were exactly
the same including the row location.  At this level having all rows
being unique makes the code simpler as binary searches don't have to
worry about edge cases of duplicate keys.

Having said that the btree implementation does support rejecting rows
which are not unique given a subset of the columns of the row.
Currently this is only used to compare whether index rows are unique
comparing all columns other than the last one.  So to create a unique
SAL index you create a btree telling it to enforce uniqueness on
number of columns - 1; and to create a non-unique SQL index you create a
btree telling it to enforce uniqueness on all columns.

Dibyendu Majumdar wrote:

> Attached is an updated html. Let me know if there are still any issues.
> 
> This should also be saved as package.html under
> org.apache.derby.impl.store.access.btree.
> 
> Regards
> 
> ------------------------------------------------------------------------
> 
> Implements BTree access method, which is the basis for SQL indexes.
> 
> 
>   Overview
> 
> Derby implements secondary SQL indexes as BTrees. The high level
> features of the BTree implementation are:
> 
>    1. Derby implements the standard B+Tree algorithm, in which all keys
>       are stored in leaf pages, and the upper levels are organized as a
>       redundant index for enabling rapid location of a particular key.
>       See The Ubiquitous B-Tree, Douglas Comer
>       <http://portal.acm.org/citation.cfm?id=356776> for a definition of
>       the B+Tree.
>    2. The BTree implementation supports concurrency using page level
>       latches, and recovery using a Write Ahead Log.
>    3. Derby uses data only locking for its logical row level locking.
>    4. Derby supports all four SQL isolation levels, serializable,
>       repeatable read, read committed and read uncommitted.
>    5. Derby uses logical key deletes. This enables it to perform undos
>       during rollbacks and restart recovery as single page operations.
> 
> 
>   High level structure of the B+Tree
> 
>     * The first row in all pages, branch or leaf, is a control row. In
>       branch pages, this is a BranchControlRow
>       <http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/store/access/btree/BranchControlRow.html>
>       and in leaf pages, it is a LeafControlRow
>       <http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/store/access/btree/LeafControlRow.html>.
>     * In addition to the BranchControlRow, branch level pages contain
>       BranchRows
>       <http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/store/access/btree/BranchRow.html>.
> 
>     * At branch level, a page is linked to is left and right siblings,
>       parent, as well as children. The number of child pages is 1
>       greater than the number of BranchRows in the page. The leftmost
>       child pointer is stored in the BranchControlRow itself. Each
>       BranchRow contains a child pointer as its last column.
>     * At leaf level also, a page is linked to its left and right
>       siblings, and the parent. In addition to the LeafControlRow, leaf
>       level pages contain IndexRows
>       <http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/sql/execute/IndexRow.html>.
>       The last column in an IndexRow is a RowLocation
>       <http://db.apache.org/derby/javadoc/engine/org/apache/derby/iapi/types/RowLocation.html>.
>     * IndexRows are generated by the IndexRowGenerator
>       <http://db.apache.org/derby/javadoc/engine/org/apache/derby/iapi/sql/dictionary/IndexRowGenerator.html>.
> 
> 
>   Latching implementation
> 
> Derby uses latches on pages to get exclusive access to the page while
> reading or writing the page (Derby only uses exclusive latches, no
> shared latches are used). In order to prevent deadlocks latches
> requested while holding other latches are always requested top/down and
> left to right. Btree splits are always left to right. If for any reason
> the code needs to break this protocol then it will first request the
> latch NOWAIT and if it can't get the latch it will release all current
> latches and wait for the latch it is trying to get, and then after
> obtaining it go about getting the other latches it needs for the
> particular operation. While traversing down the tree Derby may hold 2
> latches: one on parent and one on child. It then continues doing
> "ladder" latching down the tree releasing the highest node when it has
> successfully got a new lower node latch. Latches are short term, only
> held while reading/modifying the page, never held while an I/O is
> happening. Structure modifications are all isolated from other
> operations through the use of latches.
> 
> 
>   Locking and Isolation Levels
> 
> Derby uses data only locking for its logical row level locking. All
> isolation level implementation is done using logical locks (Derby does
> not support non-locking isolation such as multi-versioning).
> 
> In data only locking the key for all locks whether obtained in the BTree
> or in the base table is the address of the base table row. In the BTree
> case the address of the base table row (RowLocation
> <http://db.apache.org/derby/javadoc/engine/org/apache/derby/iapi/types/RowLocation.html>)
> is always the last column of the leaf index entry (IndexRows
> <http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/sql/execute/IndexRow.html>).
> 
> Serializable
>     Derby uses "previous key" locking to prevent phantoms from being
>     inserted into a range being accessed by a scan. A scan always locks
>     the index row just "previous" to the first entry that satisfies the
>     scan condition and then all index entries going forward until the
>     end of the scan. All row locks are held until end of transaction.
>     Inserts also get previous key locks, which will block if the
>     existing row is locked.
> Repeatable Read
>     Same as serializable except that no previous key locking is done.
> Read Committed
>     Write locks are held until end of transaction. Read locks are
>     released once the query in question no longer needs them.
> Read Uncommitted
Write locks are held until end of transaction.  No read row locks are
acquired.
>     No row locks are acquired. The code still gets table level intent
>     locks to prevent concurrent DDL during the query.
> 
> 
>   BTree Structure Modifications
> 
> In Derby, SMOs (structure modification operations - ie. page splits),
> only happen top down. This is not as concurrent as bottom up in ARIES/IM
> <http://www.almaden.ibm.com/u/mohan/RJ6846.pdf>, but is simpler. As in
> ARIES/IM "Not more than 2 index pages are held latched simultaneously at
> anytime. In order to improve concurrency and to avoid deadlocks
> involving latches, even those latches are not held while waiting for a
> lock wich is not immediately grantable. No data page latch is held or
> acquired during an index access. Latch coupling is used while traversing
> the tree - ie. the latch on a parent page is held while requesting a
> latch on a child page."
> 
> In the case of a simple split, exclusive latch is held on P (parent),
> and C (child). If there is room in P, then new page C1 (child right) is
> created, new descriminator is inserted into P, and rows moved to C1.
> Latches are held on P and C for duration of split, and then released.
The new page is retrieved from a list of free pages which may have come
from preallocation or from deleted space reclamation, if there is no
free space space is automatically added to the file and used.
> 
> The hard case is when P does not have room for descriminator key. In
> this case all latches are released, and Derby does a split pass from top
> to bottom, and will split the internal nodes that do not have room for
> the decrimator key. Note this may result in more splits than necessary
> for this particular insert, but the assumption is that the splits will
> have to happen eventually anyway. After this split pass is done, the
> search for the insert starts again from top down, but it must once again
> check for space because it has given up all its latches and some other
> transaction may have acquired the space in the meantime.
> 
> Optimization is possible to remember C and see if it is right location,
> and/or use sideway pointers to search right rather than do research of tree.
> 
> 
>   Logical Key Deletes
> 
> In both the BTree and the Heap, deletes are first executed by marking a
> "deleted" bit. This is to insure space on the page for abort, since row
> level locking will allow other rows on the page to be modified
> conncurrently with the transaction executing the delete. The system uses
> a background daemon to schedule work after commit to reclaim the space
> of the deleted rows. A row marked deleted can be "purged" if one can
> obtain a lock on it (if it was an uncommitted delete then the
> transaction doing the commit would still have an exclusive lock on the row).
> 
> 
>   Garbage Collection of deleted keys
> 
> Since rows are only marked as "deleted", and not physically removed, it
> is necessary to perform space reclamation on deleted rows.
> 
> 
>     Online during BTREE split
> 
> Whenever there is not enough room on a leaf to do an insert the code
> attempts to find space on the leaf, by checking if it can reclaim any
> committed deletes on that leaf. That work only requires the latch on the
> leaf and NOWAIT row write locks. It is expected that most of the space
> reclaim done in the BTree goes through this path. Most of this work is
> done in BTreeController.reclaim_deleted_rows
> <http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/store/access/btree/BTreeController.html#reclaim_deleted_rows(org.apache.derby.impl.store.access.btree.OpenBTree,
> long)>.
> 
> 
>     BTREE post commit work
> 
> Whenever a delete operation deletes the last row from a leaf page then a
> BtreePostCommit job is queued to be executed after the transaction which
> did the delete commits. This work currently requires a table level lock
> as page merges have not been implemented to be allowed concurrent with
> other operations. Many DBMSes don't even try to do page merges except
> when called from some sort of reorg utility. If all rows on page are
> purged, then the page will move to the free list and perform a merge to
> the tree.
> 
> It is expected that the normal space reclamation happens with row locks
> during btree split, which is why not much work has been done to optimize
> btree post commit path
> 
> 
>   Logging and Recovery
> 
> Derby uses physical redo and logical undo for BTree inserts and deletes.
> Logical undo is simplified as a result of using logical key deletes. If
> keys were physically removed during deletes, then the undo of a key
> delete would have required an insert operation which can potentially
> lead to page splits at various levels within the tree. Since the key is
> not physically removed, but only marked as "deleted", undoing a key
> delete is accomplished easily. However, since the page where the insert
> or delete should take place may have moved, it may be necessary to
> search
> <http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/store/access/btree/index/B2IUndo.html#findUndo(org.apache.derby.iapi.store.raw.Transaction,
> org.apache.derby.iapi.store.raw.LogicalUndoable,
> org.apache.derby.iapi.services.io.LimitObjectInput)> for the page.
> 
> Structure modification operations (SMOs) are handled as independent
> internal transactions
> <http://db.apache.org/derby/javadoc/engine/org/apache/derby/iapi/store/access/conglomerate/TransactionManager.html#getInternalTransaction()>
> and commit separately from the transaction that initiated the SMO. Once
> an SMO has been completed successfully, it is not undone, even if the
> transaction that caused it decides to abort. During restart recovery
> undo phase, incomplete internal transactions are undone BEFORE any
> regular transactions. This ensures that the BTrees are structurally
> consistent before normal undo begins.
>

Re: BTree package documentation

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.

Dibyendu Majumdar wrote:
> Jean T. Anderson wrote:
> 
>> So, where do you prefer this document to reside? In the web site tree 
>> or code tree?
>  
> My preference is that it should go into the code tree, and get linked 
> from there to the website, just as the Type System page is. Only, it 
> would be better to change the links to Javadoc style ... I can do this 
> tomorrow, and send you another version.

I'm re-reading your original post and you did indeed request just a link 
from the Papers tab. Sorry! I automatically headed down a web site 
tangent -- and my apologies for the confusion.

I'm only handling commits for the site and doc trees. Somebody else will 
need to pick this doc up for the code tree.

-jean

Re: BTree package documentation

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.

Jean T. Anderson wrote:

> So, where do you prefer this document to reside? In the web site tree 
> or code tree?

My preference is that it should go into the code tree, and get linked 
from there to the website, just as the Type System page is. Only, it 
would be better to change the links to Javadoc style ... I can do this 
tomorrow, and send you another version.

Regards

Re: BTree package documentation

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.

Dibyendu Majumdar wrote:
> Attached is an updated html. Let me know if there are still any issues.
> 
> This should also be saved as package.html under 
> org.apache.derby.impl.store.access.btree.

Dibyendu,

I think this file should be in the web site tree or in the code tree, 
but not both. The problem is remembering to update both locations when 
the file changes.

Originally, the web site stored a copy of derby/code/trunk/BUILDING.txt, 
but it made more sense to link to 
http://svn.apache.org/repos/asf/db/derby/code/trunk/BUILDING.txt instead 
so people visiting the web site automatically see any updates.

There is admittedly an issue for some browsers with linking to html 
files in the code tree, but there is a workaround for that issue (save 
the file locally, then reopen locally) and it's preferable to 
duplicating the file.

So, where do you prefer this document to reside? In the web site tree or 
code tree?

  -jean

Re: BTree package documentation

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.

Attached is an updated html. Let me know if there are still any issues.

This should also be saved as package.html under 
org.apache.derby.impl.store.access.btree.

Regards

Re: BTree package documentation

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.

Jean T. Anderson wrote:
> Dibyendu Majumdar wrote:
> 
>> Jean T. Anderson wrote:
>>
>>> Committed revision 278552. I made one minor change of the h2 tags to 
>>> h1 so forrest would generate the table of contents. Please let know 
>>> if I goofed anything up.
>>
>>
>>
>> Thank you. One problems is that the links are in Javadoc format (e.g. 
>> {@link org.apache.derby.impl.sql.execute.IndexRow IndexRows}) - 
>> because I meant this to be processed by Javadoc. I'd be grateful if 
>> you could modify the links to the relevant Javadoc page, e.g. above 
>> should cahnge to <a 
>> href="http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/sql/execute/IndexRow.html">IndexRows</a>. 
>> Sorry for the additional work.
> 
> 
> done. Committed revision 278667.
> 
> Except for this one link:
> 
>   {@link 
> org.apache.derby.impl.store.access.btree.BTreeController.reclaim_deleted_rows} 
> 
> 
> When I switched the javadoc format for that link to the URL below, it 
> got a page not found error:
> 
> <a 
> href="http://db.apache.org/derby/javadoc/engine/org.apache.derby.impl.store.access.btree.BTreeController.reclaim_deleted_rows.html">reclaim_deleted_rows</a> 

I emailed too quickly. Here's the actual link I tried (in email forgot 
to replace those dots with slashes):

<a 
href="http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/store/access/btree/BTreeController/reclaim_deleted_rows.html">reclaim_deleted_rows</a>

When these changes go live, you'll see two subsections now in the 
"Garbage Collection of deleted keys" section. When I switched the h2 
tags to h1, I missed two h3 tags that needed to be switched to h2.

  -jean

Re: BTree package documentation

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.

Dibyendu Majumdar wrote:
> Jean T. Anderson wrote:
> 
>> Committed revision 278552. I made one minor change of the h2 tags to 
>> h1 so forrest would generate the table of contents. Please let know if 
>> I goofed anything up.
> 
> 
> Thank you. One problems is that the links are in Javadoc format (e.g. 
> {@link org.apache.derby.impl.sql.execute.IndexRow IndexRows}) - because 
> I meant this to be processed by Javadoc. I'd be grateful if you could 
> modify the links to the relevant Javadoc page, e.g. above should cahnge 
> to <a 
> href="http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/sql/execute/IndexRow.html">IndexRows</a>. 
> Sorry for the additional work.

done. Committed revision 278667.

Except for this one link:

   {@link 
org.apache.derby.impl.store.access.btree.BTreeController.reclaim_deleted_rows}

When I switched the javadoc format for that link to the URL below, it 
got a page not found error:

<a 
href="http://db.apache.org/derby/javadoc/engine/org.apache.derby.impl.store.access.btree.BTreeController.reclaim_deleted_rows.html">reclaim_deleted_rows</a>

I didn't have time to chase it down, so left it in the @link form until 
someone has time to chase down the actual link.

> The other alternative would be to generate the appropriate Javadoc and 
> put in a link to that from the Papers section, similar to what you have 
> done for the Type System.

I think those pages would need to be generated in the code tree. And 
currently there's a problem with displaying html pages from the code 
tree for some browsers. --I documented a klunky workaround for 
displaying the Type System page on 
http://db.apache.org/derby/papers/index.html .

  -jean

Re: BTree package documentation

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.

Jean T. Anderson wrote:

> Committed revision 278552. I made one minor change of the h2 tags to 
> h1 so forrest would generate the table of contents. Please let know if 
> I goofed anything up.

Thank you. One problems is that the links are in Javadoc format (e.g. 
{@link org.apache.derby.impl.sql.execute.IndexRow IndexRows}) - because 
I meant this to be processed by Javadoc. I'd be grateful if you could 
modify the links to the relevant Javadoc page, e.g. above should cahnge 
to <a 
href="http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/sql/execute/IndexRow.html">IndexRows</a>. 
Sorry for the additional work.

The other alternative would be to generate the appropriate Javadoc and 
put in a link to that from the Papers section, similar to what you have 
done for the Type System.

Regards

Dibyendu

Re: BTree package documentation

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.

Dibyendu Majumdar wrote:
> Hello,
> 
> Attached is some documentation for the package 
> org.apache.derby.impl.store.access.btree.
> Most of it is based upon information provided by Mike.
> 
> Let me know if there are any inaccuracies. I think that there are some 
> areas where further details would be useful:
> 
> 1) How are duplicate keys handled in secondary indexes.
> 2) Some more detail on logging and recovery would be useful.
> 3) Internal transactions and how they are used to manage SMOs.
> 
> I would like this document to be made available also as a link in the 
> Papers section.

Committed revision 278552. I made one minor change of the h2 tags to h1 
so forrest would generate the table of contents. Please let know if I 
goofed anything up.

thanks for the contribution, Dibyendu.

  -jean