Posted to derby-dev@db.apache.org by Dibyendu Majumdar <di...@mazumdar.demon.co.uk> on 2005/01/20 23:08:17 UTC

Re: Derby architecture/design documents

"Mike Matrigali" wrote:

> I am happy to answer any specific questions about the store module you
> have, or can find someone to answer them if I don't know.  I
> think questions should be directed to the mail list so that everyone
> can share in the discussion.  If I miss something feel free to ping
> me to get an answer.  I am an IBM employee still working on the derby
> codeline and was one of the original developers at Cloudscape when it
> first began as a startup.

Thank you very much, this will really help me to get started.

> Is there any model/standard in apache or opensource that we could follow
> or at least look at for a format for this kind of information.

I have some thoughts on how I want to do this - these are given below.

> I personally really like design information embedded in the actual code
> it relates to, rather than having separate documents.

I think that well documented code is always a good idea. And there is an
argument that documentation closer to the code is more likely to be kept
up-to-date. It is also a great help to anyone fixing the code.

However, separate documents are also very useful, especially because you can
illustrate concepts and design with nice diagrams and flow charts. It also
enables people to grasp the design more easily, as things can be explained
top-down rather than jumping straight to implementation details.

> Is there any more specific area you are looking at, rather than just
> store.

> I usually divide the store into the following, some overlap:
> o store module interface (java protocol used by other modules to
> interact with store, this is actually fairly well documented in the
> javadoc already available)
> o access methods (btree/heap existing, future might be gist/rtree/...)
> o recovery (both restart and crash recovery, logging, xa recovery,  ...)
> o transactions (locking, isolation levels, xa transactions)
> o data (actual physical storage of rows/columns on pages, page
> management, cache  - the format of rows/columns/pages is reasonably
> documented in the code - I think in the javadoc, but might just be in
> comments)
> o misc (sorting, encryption, ...)

Well, I shall be documenting the whole of it, but will start with a
particular item, and then move on to other items.

I think that documentation should be at two levels.

The more useful level is where ideas and concepts are explained. I am sure
that Cloudscape was built upon ideas and concepts that had been invented
before by others, and perhaps built its own innovations upon those
foundations.
So I want to explore the origins of those ideas, where did the ideas come
from? How did Cloudscape implementation use those ideas? I'd like to give
some examples to illustrate what I mean.

If we take the transaction manager for a start. As far as I understand,
Derby uses the ARIES algorithm for write-ahead logging. What I'd like to
understand and document is how the Derby implementation of ARIES is
different from the published algorithms. Does Derby use nested top-level
actions or sub transactions to manage internal atomic operations? I think
I saw a comment in the code somewhere that Derby does not yet
implement the transaction list in checkpoint records. I assume that Derby
does page-oriented redo during recovery redo pass, and then
page-oriented or logical undo during the undo pass.

For the BTREE implementation, I'd like to understand whether Derby
implements ARIES/IM or some other algorithm. How does it handle
structural consistency of the tree during concurrent updates? What are
the locking/latching implications of the algorithm used? Are accesses to
the entire tree or subtree serialised during structural modifications?
How is repeatable read ensured? Does Derby use data row locking or
index row locking? Does Derby delete rows logically first, and
physically afterwards? How does it determine that a deleted row can
be physically removed, i.e., the transaction that deleted the row has
committed?

I understand that Derby uses Log Actions to perform both normal
operations and recovery operations. Thus a key insert is first encoded as
a log operation, logged, and then executed. This is very neat.
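
As a toy illustration of that "log first, then execute" pattern (all names below are invented for illustration and are not Derby's actual interfaces): the same operation object both goes into the log and applies the change, so normal execution and recovery redo share one code path.

```java
// Toy sketch of "log first, then execute".  Hypothetical names, not
// Derby's actual interfaces.
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class LogActionSketch {
    // A logged operation knows how to apply itself, both at normal run
    // time and during recovery redo.
    interface LogOperation {
        void doMe(TreeMap<Integer, String> tree);
    }

    static class KeyInsertOperation implements LogOperation {
        final int key; final String value;
        KeyInsertOperation(int key, String value) { this.key = key; this.value = value; }
        public void doMe(TreeMap<Integer, String> tree) { tree.put(key, value); }
    }

    // Stand-in for the write-ahead log.
    static final List<LogOperation> log = new ArrayList<>();

    // Normal operation: encode the insert as a log record, append it to
    // the log, and only then execute it against the in-memory structure.
    static void insert(TreeMap<Integer, String> tree, int key, String value) {
        LogOperation op = new KeyInsertOperation(key, value);
        log.add(op);      // write-ahead: log before applying
        op.doMe(tree);    // execute by running the very same operation
    }

    // Recovery redo: replay the log into an empty tree.
    static TreeMap<Integer, String> recover() {
        TreeMap<Integer, String> tree = new TreeMap<>();
        for (LogOperation op : log) op.doMe(tree);
        return tree;
    }
}
```

The point of the pattern is that redo needs no separate implementation: both paths just call doMe on the logged operation.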

I also get the impression by reading the code that a BTREE is
implemented on top of a regular table like structure. This is also
very neat as it reuses all the code for handling rows, locking, etc.

These are a few examples, and I could list several more. What would be
very useful is if you and other IBM developers could identify the algorithms
that Derby uses. Perhaps you could provide references to published
research papers, books, and patents. I could start by writing a high level
overview of the concepts used by Derby, and gradually build up the
detail by reading the code, and by asking questions in this forum.

The second level of documentation can be built from the comments
in the code. This is the implementation level where I can describe how
Derby actually implements something, what the file formats are, how the
write ahead log is implemented, what a conglomerate is, what a
record id is and how it differs from a slot id, etc.

In terms of starting this level of documentation, I would first like to
concentrate on the "raw" module, which I suspect is at the heart
of all other modules.

Thanks and Regards

Dibyendu



Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
Thanks, Dibyendu, for taking the lead on this technical documentation. The
results of this effort will be a valuable contribution.

If I can help facilitate your efforts in any way, either online or offline,
let me know.

>
> Well, I shall be documenting the whole of it, but will start with a
> particular item, and then move on to other items.
>
.....
>
> In terms of starting this level of documentation, I would first like to
> concentrate on the "raw" module, which I suspect is at the heart
> of all other modules.
>


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
"Mike Matrigali" wrote:

> Ok, that's a lot of questions - not going to get to all, but here's a
> start.  This posting is going to try to answer the btree questions.
> I'll do the recovery questions next.  The compare/contrast with ARIES
> papers will take some time, may not get to that right away.

Thank you - your post was very generous. It will take me some time to
absorb it and come back with questions.

Regards



Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
> > Is there any way I can set up a debug version of Derby that I can step
> > through? I tried doing this with the default jar file, but the debugger
> > complains that there is no line number information. I tried to find out
> > how to build Derby with debug enabled, but cannot find the relevant
> > information.
> >

RPost wrote:

>
> I got it working in JBuilder6 by adding all of the jars in the jars\insane
> directory to the project properties. JBuilder then lets me add the path to
> the source as derby\engine. Then I can step into all of the Derby source,
> set breakpoints in it, etc.
>
> Don't know how to do it in Eclipse. I'm trying to follow the discourse
> between the two of you and keep winding up in the bowels myself. But
> unless you think I can help I was planning on staying out of you guys'
> way since you are making pretty good progress in spite of the
> difficulties.

I managed to build Derby with debug on. I added the following line to
tools/ant/properties/defaultcompiler.properties:
debug=on
I used Sun's JDK to build Derby.

I am now able to set breakpoints in Derby when I use Eclipse.
All I had to do was add derby.jar to the classpath, and let Eclipse
know where to find the source.

Regards

Dibyendu



Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> Dibyendu
>
> Is there any way I can set up a debug version of Derby that I can step
> through? I tried doing this with the default jar file, but the debugger
> complains that there is no line number information. I tried to find out
> how to build Derby with debug enabled, but cannot find the relevant
> information.
>

I got it working in JBuilder6 by adding all of the jars in the jars\insane
directory to the project properties. JBuilder then lets me add the path to
the source as derby\engine. Then I can step into all of the Derby source,
set breakpoints in it, etc.

Don't know how to do it in Eclipse. I'm trying to follow the discourse
between the two of you and keep winding up in the bowels myself. But unless
you think I can help I was planning on staying out of you guys' way since
you are making pretty good progress in spite of the difficulties.




Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike Matrigali wrote:

> As you are seeing it is hard to keep to one topic as they are all
> related.

I guess that jumping straight into the BTree implementation is going to be
tough going as it is built upon so many other things. However, I did not
want to stop the flow ...

Do you think that it is better to start with the lower level stuff and
gradually move up?

One of the problems of looking at comments in a large system like
Derby is that following references is difficult. I find that it helps to use
Eclipse because when the mouse is pointed at a method or a class name,
Eclipse shows the relevant Javadoc. I just ignore the compile errors and
use Eclipse as a class browser.

Is there any way I can set up a debug version of Derby that I can step
through? I tried doing this with the default jar file, but the debugger
complains that there is no line number information. I tried to find out how
to build Derby with debug enabled, but cannot find the relevant information.

Regards

Dibyendu



Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.

As you are seeing, it is hard to keep to one topic as they are all
related.  I didn't want to get into the space stuff, but since
you asked, here are some comments directed at those:

In both the btree and the heap, rows are only marked deleted and then
sometime later the space is reclaimed.  Here is what triggers that work.

"online" during BTREE split:
    Whenever there is not enough room on a leaf to do an insert the
    code attempts to find space on the leaf, by seeing if it can reclaim
    any committed deletes on that leaf.  That work only requires the
    latch on the leaf and row locks.  It is expected that most of the
    space reclaim done in the btree goes through this path.  This is
    the code you see in BTreeController.reclaim_deleted_rows.

BTREE post commit work:
    Whenever a delete operation deletes the last row from a leaf page,
    a BtreePostCommit job is queued to be executed after the transaction
    which did the delete commits.  This work currently requires a
    table level lock as page merges have not been implemented to be
    allowed concurrent with other operations.  I believe many db's don't
    even try to do page merges except when called from some sort of
    reorg utility.  If all rows on the page are purged, then it will move
    to the free list and a merge is performed on the tree.

    It is expected that the normal space reclamation happens with row
    locks during btree split, which is why not much work has been done
    to optimize the btree post commit path.

Heap post commit work:
    When a delete operation deletes the last row from any heap page,
    a HeapPostCommit job is queued to be executed after the transaction
    which did the delete commits.  This work requires a latch on the
    page being processed and row locks on the rows being deleted.  If all
    rows are purged then the page is put on a free list; if not, the page
    is marked as having more space for future inserts.
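
The purge rule underlying all three paths can be rendered as a toy example (hypothetical names, not Derby's code): a deleted row may be purged only if an exclusive lock on it is granted NOWAIT, since a refused lock means the deleting transaction is still open and its delete may yet be rolled back.

```java
// Toy sketch of the reclaim rule: purge a deleted row only if its
// row lock can be obtained NOWAIT.  Hypothetical names, not Derby's.
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class ReclaimSketch {
    static class Row {
        final int id;
        boolean deleted;   // rows are marked deleted, not removed
        Row(int id) { this.id = id; }
    }

    // lockedRows stands in for the lock manager: ids whose deleting
    // transaction has not yet committed (and so still holds the lock).
    static int reclaimDeletedRows(List<Row> page, Set<Integer> lockedRows) {
        int purged = 0;
        for (Iterator<Row> it = page.iterator(); it.hasNext(); ) {
            Row r = it.next();
            // NOWAIT lock attempt: if the lock is held, skip the row.
            if (r.deleted && !lockedRows.contains(r.id)) {
                it.remove();   // purge: the delete is known committed
                purged++;
            }
        }
        return purged;
    }
}
```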



Dibyendu Majumdar wrote:
| Mike,
|
| My initial comments/questions follow, more to come ...
|
|
|>Btree's are based on ARIES.
|
|
| Do you mean that BTree implementation uses ARIES style logging and
| recovery, or does it implement the concurrency and recovery logic of
| ARIES/IM? Or both?
|
| Reading the code, I cannot see that ARIES/IM is used apart from the
| data row locking.
|
|
|>Structure modifications are all
|>isolated from other operations through the use of latches.
|
|
| I am trying to understand the splitting logic which requires structural
| change. It seems to me that it works as follows:
|
| If a key cannot be inserted in a page, then check if the parent has
| enough space to insert the split key.
| If not, go to root and split the tree top-down along the path that
| leads to the page that the new key is being inserted at.
| Retry inserting the key.
|
| The logic repeats until the key is inserted.
|
| The basic idea seems to be that the split is deferred until it is ensured
| that space is available in the parent page.
| The exact mechanism is not clear to me as I am unable to follow the logic
| which uses recursion, and seems quite complex.
|
| The split seems to be managed as an "internal transaction" which is
| described as follows:
|
| "Start an internal transaction.  An internal transaction is a completely
| separate transaction from the current user transaction.  All work done
| in the internal transaction must be physical (ie. it can be undone
|  physically by the rawstore at the page level, rather than logically
|  undone like btree insert/delete operations).  The rawstore guarantee's
| that in the case of a system failure all open Internal transactions are
| first undone in reverse order, and then other transactions are undone
| in reverse order.
| Internal transactions are meant to implement operations which, if
|  interupted before completion will cause logical operations like tree
| searches to fail.  This special undo order insures that the state of
| the tree is restored to a consistent state before any logical undo
| operation which may need to search the tree is performed."
|
| FYI, I am looking at the following methods:
|
| org.apache.derby.impl.store.access.btree.BTreeController.start_xact_and_dosplit()
| org.apache.derby.impl.store.access.btree.BranchControlRow.restartSplitFor()
| org.apache.derby.impl.store.access.btree.BranchControlRow.splitFor()
| org.apache.derby.impl.store.access.btree.LeafControlRow.splitFor()
|
|
|>In both the btree and the heap, delete are first executed by marking
|>a "deleted" bit.  This is to insure space on the page for abort, since
|>row level locking will allow other rows on the page to be modified
|>conncurrently with transaction executing the delete.  The system uses
|>a background daemon to schedule work after commit to reclaim the space
|>of the deleted rows.  A row marked deleted can be "purged" if one can
|>obtain a lock on it (if it was an uncommitted delete then the
|>transaction doing the commit would still have an exclusive lock on the
|>row).
|
|
| I assume you are referring to:
|
| org.apache.derby.impl.store.access.btree.BTreePostCommit.
|
| The comment in the code says:
|
| "The current space reclamation algorithm requires a table level
| lock on the btree - this is mostly because the shrink algorithm
| is not multi-user.  This lock is requested NOWAIT as it does
| not want to impedede normal operation on the table.  If the lock
| were to wait then the current lock manager livelock algorithm
| would block all subsequent lock requests on this btree even if
| they are compatible with the current holder of the lock."
|
| But I also find that deleted rows are also reclaimed during normal insert
| operations. For example, in:
|
| org.apache.derby.impl.store.access.btree.BTreeController.reclaim_deleted_rows().
|
| The comment for this method says:
|
| "Attempt to reclaim committed deleted rows from the page.
| Get exclusive latch on page, and then loop backward through
| page searching for deleted rows which are committed.  The routine
| assumes that it is called from a transaction which cannot have
| deleted any rows on the page.  For each deleted row on the page
| it attempts to get an exclusive lock on the deleted row, NOWAIT.
| If it succeeds, and since this row did not delete the row then the
| row must have been deleted by a transaction which has committed, so
| it is safe to purge the row.  It then purges the row from the page.
| Note that this routine may remove all rows from the page, it will not
| attempt a merge in this situation.  This is because this routine is
| called from split which is attempting an insert on the given page, so
| it would be a waste to merge the page only to split it again to allow
| the insert of the row causing the split."
|
|
| Regards
|
| Dibyendu
|
|
|

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike,

My initial comments/questions follow, more to come ...

> Btree's are based on ARIES.

Do you mean that BTree implementation uses ARIES style logging and
recovery, or does it implement the concurrency and recovery logic of
ARIES/IM? Or both?

Reading the code, I cannot see that ARIES/IM is used apart from the
data row locking.

> Structure modifications are all
> isolated from other operations through the use of latches.

I am trying to understand the splitting logic which requires structural
change. It seems to me that it works as follows:

If a key cannot be inserted in a page, then check if the parent has enough space
to insert the split key.
If not, go to root and split the tree top-down along the path that leads to
the page that the new key is being inserted at.
Retry inserting the key.

The logic repeats until the key is inserted.

The basic idea seems to be that the split is deferred until it is ensured
that space is available in the parent page.
The exact mechanism is not clear to me as I am unable to follow the logic
which uses recursion, and seems quite complex.
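
As far as I can tell, the retry loop might be sketched like this (invented names and a deliberately trivial stand-in tree, not Derby's control flow):

```java
// Toy sketch of the deferred-split retry loop: attempt the insert; if a
// page is full, do a top-down split for this key and retry until the
// insert succeeds.  Hypothetical names, not Derby's code.
public class SplitRetrySketch {
    interface BTree {
        boolean tryInsert(int key);   // fails if the target page is full
        void splitFor(int key);       // top-down split along the path to key
    }

    // The loop from the description above: split only after an insert
    // attempt fails, then retry; repeat until the key goes in.
    static void insert(BTree tree, int key) {
        while (!tree.tryInsert(key)) {
            tree.splitFor(key);
        }
    }

    // Stand-in tree: a single "page" whose capacity grows on each split.
    static class ToyTree implements BTree {
        final java.util.List<Integer> keys = new java.util.ArrayList<>();
        int capacity = 2;
        int splits = 0;
        public boolean tryInsert(int key) {
            if (keys.size() >= capacity) return false;  // page full
            keys.add(key);
            return true;
        }
        public void splitFor(int key) { capacity *= 2; splits++; }
    }
}
```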

The split seems to be managed as an "internal transaction" which is
described as follows:

"Start an internal transaction.  An internal transaction is a completely
separate transaction from the current user transaction.  All work done
in the internal transaction must be physical (ie. it can be undone
 physically by the rawstore at the page level, rather than logically
 undone like btree insert/delete operations).  The rawstore guarantee's
that in the case of a system failure all open Internal transactions are
first undone in reverse order, and then other transactions are undone
in reverse order.
Internal transactions are meant to implement operations which, if
 interupted before completion will cause logical operations like tree
searches to fail.  This special undo order insures that the state of
the tree is restored to a consistent state before any logical undo
operation which may need to search the tree is performed."

FYI, I am looking at the following methods:

org.apache.derby.impl.store.access.btree.BTreeController.start_xact_and_dosplit()
org.apache.derby.impl.store.access.btree.BranchControlRow.restartSplitFor()
org.apache.derby.impl.store.access.btree.BranchControlRow.splitFor()
org.apache.derby.impl.store.access.btree.LeafControlRow.splitFor()

> In both the btree and the heap, delete are first executed by marking
> a "deleted" bit.  This is to insure space on the page for abort, since
> row level locking will allow other rows on the page to be modified
> conncurrently with transaction executing the delete.  The system uses
> a background daemon to schedule work after commit to reclaim the space
> of the deleted rows.  A row marked deleted can be "purged" if one can
> obtain a lock on it (if it was an uncommitted delete then the
> transaction doing the commit would still have an exclusive lock on the
> row).

I assume you are referring to:

org.apache.derby.impl.store.access.btree.BTreePostCommit.

The comment in the code says:

"The current space reclamation algorithm requires a table level
lock on the btree - this is mostly because the shrink algorithm
is not multi-user.  This lock is requested NOWAIT as it does
not want to impedede normal operation on the table.  If the lock
were to wait then the current lock manager livelock algorithm
would block all subsequent lock requests on this btree even if
they are compatible with the current holder of the lock."

But I also find that deleted rows are also reclaimed during normal insert
operations. For example, in:

org.apache.derby.impl.store.access.btree.BTreeController.reclaim_deleted_rows().

The comment for this method says:

"Attempt to reclaim committed deleted rows from the page.
Get exclusive latch on page, and then loop backward through
page searching for deleted rows which are committed.  The routine
assumes that it is called from a transaction which cannot have
deleted any rows on the page.  For each deleted row on the page
it attempts to get an exclusive lock on the deleted row, NOWAIT.
If it succeeds, and since this row did not delete the row then the
row must have been deleted by a transaction which has committed, so
it is safe to purge the row.  It then purges the row from the page.
Note that this routine may remove all rows from the page, it will not
attempt a merge in this situation.  This is because this routine is
called from split which is attempting an insert on the given page, so
it would be a waste to merge the page only to split it again to allow
the insert of the row causing the split."


Regards

Dibyendu



Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike Matrigali wrote:


> Some information to expand on the topics below (not all of them, but a
> start).  Can I suggest you start to maintain some sort of TODO list for
> gathering information?  I will update it as I get time:

Will do. I hope to produce some stuff this week, containing all the material
you have described so far.

Thank you again for taking the time to explain all this.

I will follow up with questions once I have converted what is already
available into some sort of documentation.

Regards

Dibyendu



Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
I am going to defer on the page version vs. LSN question, but will at least
mention some guesses, not researched info.  You are right
about what exists.  I spoke with some colleagues and the best we can
come up with is that the implementor wanted to separate the page and
the log, in case we ever did a different log format.  I will try to
do some more research here.  I also vaguely remember the implementor
mentioning that if we ever wanted to implement the LSN on the page, we
had space to do so.  It may simply have been easier to code the page
versions, since in the current system the LSN is the address in the log
file (and it may not be available when the code wants to write it
on the page).

As you say, in derby all the log records are associated with a page, and
thus have a page version number.  That page version number in the log
is compared with the page version number of the page during redo to
determine if the log record needs to be applied.  This has also helped
us in the past to catch some bugs, as we can sanity check during redo
that the page is progressing along the expected path, i.e. it is a bug
during redo to be applying a page version 10 log record to a page that is
at page version 8.  I haven't seen this sanity check fire in many years,
but it was useful when the product was first coded.
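
That redo decision could be sketched like this (illustrative only; the class names and the one-field "page" are invented, not Derby's actual structures):

```java
// Toy sketch of page-version-based redo: a log record carries the page
// version it produces, and redo applies it only when the page is exactly
// one version behind.  Hypothetical names, not Derby's code.
public class RedoCheckSketch {
    static class Page { long pageVersion; int value; }

    static class PageLogRecord {
        final long pageVersion;  // version the page has AFTER this record
        final int newValue;
        PageLogRecord(long v, int val) { pageVersion = v; newValue = val; }
    }

    // Returns true if the record was applied.  A record more than one
    // version ahead of the page indicates a redo bug (the sanity check
    // mentioned above): redo must progress a page one version at a time.
    static boolean redo(Page p, PageLogRecord r) {
        if (r.pageVersion <= p.pageVersion) return false; // already applied
        if (r.pageVersion != p.pageVersion + 1)
            throw new IllegalStateException("redo gap: page at "
                + p.pageVersion + ", record for " + r.pageVersion);
        p.value = r.newValue;
        p.pageVersion = r.pageVersion;
        return true;
    }
}
```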

The original requirement for the identifier of a record in cloudscape
was that it must be a unique id, associated with the row, for the
lifetime of the row.  The identifier was meant to survive table
reorganization also.  Originally, cloudscape was being designed as
an object relational system, and the plan was that these record id's
might be used as object id's (or some part of an object id), that might
be exported outside the db.  This use really never came to pass.  For
row level locking, the record identifier is used as the lock, so it is
important that the id never get reused.

In derby the extra level of indirection was added to meet the original
design goals.  The slot table entry could be used if we never reclaimed
slot table entries; currently in derby the slot table entries are
automatically reclaimed as part of the space management algorithms.
During runtime the system maintains runtime information mapping id's to
slot table entries - but as you say it would be more efficient if the
id pointed directly at the offset table.  It is a space vs. runtime
tradeoff.
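
That indirection can be pictured with a toy page (hypothetical structure, not Derby's page format): record ids are stable and never reused, while the slot a record occupies can change as slots are reclaimed.

```java
// Toy sketch of record-id-to-slot indirection.  A stable record id
// (usable as a lock name) maps to a slot-table position that can change
// as rows are purged.  Hypothetical structure, not Derby's.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecordIdSketch {
    static class Page {
        final List<String> slots = new ArrayList<>();            // slot table
        final Map<Integer, Integer> idToSlot = new HashMap<>();  // indirection
        int nextRecordId = 1;

        int insert(String row) {
            int id = nextRecordId++;          // never reused
            slots.add(row);
            idToSlot.put(id, slots.size() - 1);
            return id;
        }

        // The extra hop mentioned above: id -> slot -> row.
        String fetch(int recordId) {
            Integer slot = idToSlot.get(recordId);
            return slot == null ? null : slots.get(slot);
        }

        // Purging a slot shifts later slots down; record ids stay valid
        // because only the mapping is updated.
        void purge(int recordId) {
            int slot = idToSlot.remove(recordId);
            slots.remove(slot);
            for (Map.Entry<Integer, Integer> e : idToSlot.entrySet())
                if (e.getValue() > slot) e.setValue(e.getValue() - 1);
        }
    }
}
```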



Dibyendu Majumdar wrote:

> Mike,
> 
> A couple of questions for you.
> 
> While documenting the page formats, I noticed that Derby does not persist
> the LogInstant (LSN) associated with a page. In a classic ARIES
> implementation, each page is supposed to be tagged with the LSN of the log
> record that last updated it, so that during recovery redo, the LSN can be
> used to determine if a particular log record has already been applied to the
> page.
> 
> Derby appears to use PageVersion for this purpose, which is persisted and is
> also present in log records. Any reason why PageVersion is used and not
> LogInstant?
> 
> Another thing that has puzzled me is the use of RecordIds. I would have
> assumed that the purpose of the slot table in a page is to provide a stable
> TupleId (as described in the classic TPCT book by Jim Gray and Andreas
> Reuter), so that there is freedom to move records around in the page, and
> still have a stable TupleId that does not change over time. Also, the
> slot-no allows a particular record to be located efficiently. Given that a
> page in Derby has a slot table, why is the slot-no not used as part of the
> RecordId? Using a separate id is obviously inefficient, as it must be
> searched for.
> 
> Thanks and Regards
> 
> Dibyendu
> 

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike,

A couple of questions for you.

While documenting the page formats, I noticed that Derby does not persist
the LogInstant (LSN) associated with a page. In a classic ARIES
implementation, each page is supposed to be tagged with the LSN of the log
record that last updated it, so that during recovery redo, the LSN can be
used to determine if a particular log record has already been applied to the
page.

Derby appears to use PageVersion for this purpose, which is persisted and is
also present in log records. Any reason why PageVersion is used and not
LogInstant?

Another thing that has puzzled me is the use of RecordIds. I would have
assumed that the purpose of the slot table in a page is to provide a stable
TupleId (as described in the classic TPCT book by Jim Gray and Andreas
Reuter), so that there is freedom to move records around in the page, and
still have a stable TupleId that does not change over time. Also, the
slot-no allows a particular record to be located efficiently. Given that a
page in Derby has a slot table, why is the slot-no not used as part of the
RecordId? Using a separate id is obviously inefficient, as it must be
searched for.

Thanks and Regards

Dibyendu





Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Some information to expand on the topics below (not all of them, but a
start).  Can I suggest you start to maintain some sort of TODO list for
gathering information?  I will update it as I get time:

container/conglomerate/table - This terminology comes from the original
modular design of the system.  The store was really 2 modules: the
lower level raw store and then the upper level access.

Raw store uses containers to store rows.  Currently each container
maps to a single file in the seg0 directory of the database.

Access provides conglomerates as the interface to its clients for
storing rows.  Currently there is a one to one mapping between a
conglomerate and a container.

The language level implements SQL tables and indexes using Access
provided Conglomerates.

The layers allow for some decision in the future if derby is ever to
support tables and/or indexes spread across multiple disks.  The
implementation could either happen at the language layer, by spreading
the table across multiple conglomerates; or it could happen at the store
level by spreading it across multiple containers.  My opinion is that
if the table is fragmented by key, then language should do it as it best
understands the keys and indexes involved.  If a raw partitioning of
the data is desired then it would be best done in the store (but I
actually think that a raw partitioning is best done below the database
itself, by the OS or hardware).

Latches - Latches are used in derby.  They are short term locks on the
page in the buffer cache.  They are requested anytime the raw store
needs to read/write information on the page.  They are short term and
never held during any operation which can wait (like an I/O or a lock
request).  Latches are implemented using the Derby lock manager.
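
The latch discipline described above can be sketched with a toy page latch (illustrative only; as the text says, Derby actually implements latches via its lock manager, not with java.util.concurrent):

```java
// Toy sketch of latch discipline: a short-term exclusive lock on a page,
// held only across in-memory reads/writes and never across anything that
// can wait (I/O, lock requests).  Hypothetical, not Derby's code.
import java.util.concurrent.locks.ReentrantLock;

public class LatchSketch {
    static class Page {
        private final ReentrantLock latch = new ReentrantLock();
        int data;

        // Latch, do the short in-memory work, unlatch.  Nothing inside
        // the critical section may block on I/O or a lock request.
        int readData() {
            latch.lock();
            try { return data; }
            finally { latch.unlock(); }
        }

        void writeData(int v) {
            latch.lock();
            try { data = v; }
            finally { latch.unlock(); }
        }
    }
}
```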

page, row and field formats - check out comments at top of
java/engine/org/apache/derby/impl/store/raw/data/StoredPage.java and
StoredFieldHeader.java, let me know if you need more info.

Handling of large rows -
The terminology used in raw store is long rows and long columns.  A
column is long if it can't fit on a single page.  A row is long if all
of its columns can't fit on a single page.  A long column is marked as
long in the base row, and its field contains a pointer to a chain of
other rows in the same container which contain the data of the column.
Each of the subsequent rows is on a page to itself.  Each subsequent row,
except for the last piece, has 2 columns: the first is the next segment
of the column and the second is the pointer to the following segment.
The last segment only has the data segment.

Similarly for a long row, the segment of the row which fits on the page
is left there, and a pointer column is added at the end of the row.  It
points to another row in the same container on a different page.  That
row will contain the next set of columns and a continuation pointer if
necessary.  The overflow portion will be on an "overflow" page, and that
page may have overflow portions of other rows on it (unlike overflow
columns).
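
The chain structure for a long column might be pictured like this (hypothetical, not Derby's on-disk format): each piece carries a data segment and a pointer to the next piece, and reading the column is a walk down the chain.

```java
// Toy sketch of a long-column overflow chain: the base row points at the
// first piece; each piece holds a segment and a pointer to the next.
// Hypothetical structure, not Derby's on-disk format.
public class OverflowChainSketch {
    static class Piece {
        final String segment;
        final Piece next;   // null on the last piece
        Piece(String segment, Piece next) { this.segment = segment; this.next = next; }
    }

    // Reassemble a long column by walking the chain of pieces.
    static String readLongColumn(Piece first) {
        StringBuilder sb = new StringBuilder();
        for (Piece p = first; p != null; p = p.next) sb.append(p.segment);
        return sb.toString();
    }

    // Split a value into page-sized segments, building the chain
    // back-to-front so each piece can point at the one after it.
    static Piece writeLongColumn(String value, int pageCapacity) {
        Piece next = null;
        for (int end = value.length(); end > 0; end -= pageCapacity) {
            int start = Math.max(0, end - pageCapacity);
            next = new Piece(value.substring(start, end), next);
        }
        return next;
    }
}
```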




Dibyendu Majumdar wrote:

> Here are some ideas on how I would like to structure the documentation.
> 
> 
> For every topic, I'd like to create:
> 
> Introduction to concepts - A general introduction to the topic.
> Derby implementation details - This will be the main substance of the
> document, where I will describe how Derby works.
> References - these will point to published papers, books, etc. that discuss
> the topic in question.
> 
> In terms of topics, here is what I have come up with (let me know if things
> should be added):
> 
> Terms - this will be a glossary of Derby specific terms so that I don't have
> to keep explaining the same terms in every document.
> 
> Row ID - How rows are identified - RecordId and SlotId.
> 
> Row management within a page - storage format of rows, slot table, etc.
> 
> Handling of large rows - how does Derby handle rows that won't fit into one
> page.
> 
> Container - what is a container?
> 
> Space management in Containers - how is space management implemented? How
> does Derby locate an empty page?
> 
> Latches - are latches used by Derby? How are they implemented?
> 
> Lock management - description of lock management, lock conflict resolution,
> deadlocks, lock escalation.
> 
> Buffer cache - how does Derby implement the buffer cache, and what is
> interaction between Buffer cache and Log, and Buffer cache and Transaction
> manager.
> 
> Write ahead log - description of how the log is implemented - this would
> mainly cover the format of log records, how log records are represented in
> memory and in disk, log archiving, checkpointing, etc.
> 
> Transactions - how is an Xid allocated? What is the representation of a
> transaction? How is a transaction related to a thread?
> 
> Transaction Manager - description of how Derby implements ARIES. What
> happens at system restart. How rollbacks and commits work. Different types
> of log records used by the transaction manager - such as do, redo, undo,
> compensation, checkpoint, etc.
> 
> Row locking in tables - how are rows locked? What happens when a row spans
> pages?
> 
> Row recovery - Do/Redo/Undo actions for rows - inserts, updates, deletes.
> 
> BTree - page organisation and structure
> 
> BTree - concurrency - how does Derby handle concurrent updates to the tree -
> inserts and deletes? How are structural changes serialised? Do updates block
> readers (not as in locking but while the change is being made) or can they
> progress concurrently?
> 
> BTree - locking - data row locking or index row loacking? Is next-key
> locking used for serializable reads?
> 
> BTree - recovery - Do/Redo/Undo actions for key inserts, updates, deletes.
> 
> Row scans on tables - is this handled by "store"?
> 
> Row scans in BTrees - is this handled by "store"?
> 
> Conglomerates - what is a Conglomerate?
> 
> 
> 
> 
> 
> 
> 

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Here are some ideas on how I would like to structure the documentation.


For every topic, I'd like to create:

Introduction to concepts - A general introduction to the topic.
Derby implementation details - This will be the main substance of the document,
where I will describe how Derby works.
References - these will point to published papers, books, etc. that discuss
the topic in question.

In terms of topics, here is what I have come up with (let me know if things
should be added):

Terms - this will be a glossary of Derby specific terms so that I don't have
to keep explaining the same terms in every document.

Row ID - How rows are identified - RecordId and SlotId.

Row management within a page - storage format of rows, slot table, etc.

Handling of large rows - how does Derby handle rows that won't fit into one
page.

Container - what is a container?

Space management in Containers - how is space management implemented? How
does Derby locate an empty page?

Latches - are latches used by Derby? How are they implemented?

Lock management - description of lock management, lock conflict resolution,
deadlocks, lock escalation.

Buffer cache - how does Derby implement the buffer cache, and what is the
interaction between the buffer cache and the log, and between the buffer
cache and the transaction manager.

Write ahead log - description of how the log is implemented - this would
mainly cover the format of log records, how log records are represented in
memory and on disk, log archiving, checkpointing, etc.

Transactions - how is an Xid allocated? What is the representation of a
transaction? How is a transaction related to a thread?

Transaction Manager - description of how Derby implements ARIES. What
happens at system restart. How rollbacks and commits work. Different types
of log records used by the transaction manager - such as do, redo, undo,
compensation, checkpoint, etc.

Row locking in tables - how are rows locked? What happens when a row spans
pages?

Row recovery - Do/Redo/Undo actions for rows - inserts, updates, deletes.

BTree - page organisation and structure

BTree - concurrency - how does Derby handle concurrent updates to the tree -
inserts and deletes? How are structural changes serialised? Do updates block
readers (not as in locking but while the change is being made) or can they
progress concurrently?

BTree - locking - data row locking or index row locking? Is next-key
locking used for serializable reads?

BTree - recovery - Do/Redo/Undo actions for key inserts, updates, deletes.

Row scans on tables - is this handled by "store"?

Row scans in BTrees - is this handled by "store"?

Conglomerates - what is a Conglomerate?







Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.

I am not sure whether it starts from the root or recurses up the tree 
one level at a time. Mike, please help!

Regards


Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> "Dibyendu Majumdar" wrote:

> The question is what happens if parent P is also full and needs to be
> split. This is where things get really complicated. If my understanding
> is correct, Derby solves this in a simple and elegant way.
>
> It releases the latches on C and P, and starts the split at P with the
> separator key. So, at that point, P becomes C, and P's parent, let us
> say Q, becomes P.
>
> I think that this is repeated recursively up the tree until the original
> P has been split with room to insert the separator key. Then Derby
> proceeds with the original split.
>

This description would work and may indeed be what Mike meant.

For some reason I interpreted Mike's comments to mean that Derby would not
obtain any latches until AFTER it had recursed up the tree to find the
highest node that needed to be split.

Then it would:

1. obtain latches on that highest-most parent and child node
2. perform the split as an internal transaction
3. release the latches on the parent and child (why not keep the child latch
if it is needed)
4. then move back down the tree to the child, which now becomes the parent
and repeats the process at step 1 above until completed.

This approach only obtains latches immediately before they are used to
perform a split.

Is it necessary to latch a page in order to determine if the page is full
and must be split? If not, why obtain latches before recursing to the top
page that must be split?


Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
I will be submitting compare/contrast paper with ARIES/IM.  hopefully
the following will answer this question.

Unlike ARIES/IM, SMOs (structure modification operations - ie. splits)
only happen top down.  This is not as concurrent as bottom-up
ARIES/IM, but was simpler.  Do note though that, as in ARIES/IM, "Not more
than 2 index pages are held latched simultaneously at anytime.  In order
to improve concurrency and to avoid deadlocks involving latches, even
those latches are not held while waiting for a lock which is not
immediately grantable.  No data page latch is held or acquired during an
index access.  Latch coupling is used while traversing the tree - ie.
the latch on a parent page is held while requesting a latch on a child
page."

So in the simple case of a split, an exclusive latch is held on P (parent)
and C (child).  If there is room in P, then a new page C1 (child right) is
created, a new discriminator is inserted into P, and rows are moved to C1.
Latches are held on P and C for the duration of the split, and then released.

The hard case is when P does not have room for the discriminator key.  In
this case all latches are released, and Derby does a split pass from top to
bottom, splitting each internal node that does not have room for the
discriminator key.  Note this may result in more splits than necessary for
this particular insert, but the assumption is that the splits would have
to happen eventually anyway.  After this split pass is done, the search
for the insert starts again from the top down, but it must once again check
for space because it has given up all its latches and someone else may
have gotten to the space first.

An optimization is possible: remember C and see if it is still the right
location, and/or use sideways pointers to search right rather than
re-search the tree from the root.
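The control flow of the hard case can be sketched with a toy model where each node on the root-to-leaf search path just tracks how many entries it holds.  Everything here is hypothetical (FANOUT, the path representation) and only illustrates the retry logic, not Derby's code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopDownSplitSketch {
    static final int FANOUT = 4;   // toy capacity of every node

    // Split node i on the path: it sheds half its entries to a new right
    // sibling and posts one discriminator entry into its parent (if any).
    static int split(List<Integer> path, int i) {
        path.set(i, path.get(i) - path.get(i) / 2);
        if (i > 0) path.set(i - 1, path.get(i - 1) + 1);
        return 1;
    }

    // Insert one entry at the leaf (the last element of the path), using a
    // top-down split pass when an ancestor lacks room; returns split count.
    static int insert(List<Integer> path) {
        int splits = 0;
        while (true) {
            int leaf = path.size() - 1;
            if (path.get(leaf) < FANOUT) {          // room: do the insert
                path.set(leaf, path.get(leaf) + 1);
                return splits;
            }
            if (leaf == 0 || path.get(leaf - 1) < FANOUT) {
                splits += split(path, leaf);        // easy case: parent has room
                continue;                           // retry the insert
            }
            // Hard case: the parent is full too.  "Release all latches" and
            // run a split pass from the top, splitting every full node on the
            // path so each discriminator is guaranteed to fit; then re-search.
            for (int i = 0; i < path.size(); i++)
                if (path.get(i) >= FANOUT) splits += split(path, i);
        }
    }

    public static void main(String[] args) {
        // Root, internal node and leaf all full: one insert causes 3 splits.
        List<Integer> path = new ArrayList<>(Arrays.asList(4, 4, 4));
        System.out.println(insert(path) + " " + path);   // 3 [3, 3, 3]
    }
}
```

Because the pass runs top down, each parent has already been given room by the time its child posts a discriminator into it, which is what guarantees the pass never needs more than the one retry.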


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Philip Bohannon wrote some time back:
> The classic problem here is a split taking place during a search.  Say  a
> page C contains 50-100, and C's parent in the tree is P, which has an
> entry for 50 and 101, with the entry for 50 pointing to C.
> 
> Now an inserter comes in, inserts 51 and fills up C (or fills up C's
> child, causes a split, and that fills up C).  So C splits into C and C',
> which is added to the right of C.
> 
> Since the inserter can't get a lock on P while holding a lock on C
> (deadlock), it releases the lock on P.  At this point, a reader comes down
> the tree looking for 99, looks at P and gets a pointer to C not C'. Now
> inserter starts waiting for a lock on P.  When reader gets C, 99 has been
> moved off to C'.
> 
> It would be interesting to know how Derby handles this (for example, [Yao]
> proposes having a pointer from C to C', and I forget what Aries IM does at
> the moment, but I think it starts over at the top if you fail to find a
> key at the rightmost point in the page...).

Philip,

I don't fully understand how Derby works, but from what I have seen so
far, I think that Derby avoids the above situation as follows:

It latches the parent P first and then the child C exclusively - and 
only then starts the split. The searcher has to wait until the split is 
over.

In Mike's words:

[In order to prevent deadlocks latches requested while holding other 
latches are always requested top/down and left to right.  Btree splits 
are always left to right.  If for any reason the code needs to break 
this protocol then it will first request the latch NOWAIT and if it 
can't get the latch it will release all current latches and wait for the 
latch it is trying to get, and then after obtaining it go about getting 
the other latches it needs for the particular operation.  While 
traversing down the tree Derby may hold 2 latches: one on parent and one 
on child.]

The question is what happens if parent P is also full and needs to be 
split. This is where things get really complicated. If my understanding 
is correct, Derby solves this in a simple and elegant way.

It releases the latches on C and P, and starts the split at P with the 
separator key. So, at that point, P becomes C, and P's parent, let us 
say Q, becomes P.

I think that this is repeated recursively up the tree until the original 
P has been split with room to insert the separator key. Then Derby 
proceeds with the original split.

Perhaps Mike or someone else can confirm or fill in with more details.

I have not seen the above algorithm in any of the papers I have read - but
then, my reading is limited to the following papers:

C. Mohan and Frank Levine. ARIES/IM: An Efficient and High Concurrency
Index Management Method with Write-Ahead Logging. ACM SIGMOD Record, V21
N2, p371-380, June 1 1992.

Jim Gray and Andreas Reuter. Chapter 15: Access Paths. Transaction
Processing: Concepts and Techniques. Morgan Kaufmann Publishers, 1993.

Philip L. Lehman and S. Bing Yao. Efficient Locking for Concurrent
Operations on B-Trees. Readings in Database Systems, Third Edition,
1998. Morgan Kaufmann Publishers.

David Lomet and Betty Salzberg. Access Method Concurrency with Recovery.
ACM SIGMOD, V21 N2, p351-360, June 1 1992.


Regards

Dibyendu


RE: Derby architecture/design documents

Posted by Philip Bohannon <bo...@lucent.com>.
<delurk>
Wow, super help.

(Hi, my name is Philip Bohannon.  I work at Bell Labs and have built and
debugged some multi-level concurrency control & recovery schemes in the
past.  Thanks to Mike for peppering you with so many coherent
questions...)

I am adding a couple of questions and comments below.  One general comment
is that it might help to add an occasional pointer to the code; that is,
mention which class/methods implement the algorithm you are outlining.
With an overview like this, it will be vastly more pleasant perusing
around!


> -----Original Message-----
> From: Mike Matrigali [mailto:mikem_app@sbcglobal.net]
> Sent: Friday, January 21, 2005 2:37 PM
> To: Derby Development
> Subject: Re: Derby architecture/design documents
>
>
>
> Ok, that's a lot of questions - not going to get to all, but here's a
> start.  This posting is going to try to answer the btree questions.
> I'll do the recovery questions next.  The compare/contrast with ARIES
> papers will take some time, may not get to that right away.
>
> Your suggestion of starting at high level, comparing/contrasting with
> published research should work well with the btree and recovery design.
> As you stated both of these areas in derby are based on the ARIES
> algorithms.
> I will schedule some time in the future to look carefully at those
> papers and produce the compare/contrast for those 2 areas.
>
> Here are some answers off the top of my head for the btree questions,
> that area is most familiar to me:
>
> Btrees are based on ARIES.  The btree is a tree of pages, linked to
> one another in the standard way.  Derby uses latches on pages to get
> exclusive access to the page while reading or writing the page (Derby
> only uses exclusive latches, no shared latches are used).  In order
> to prevent deadlocks latches requested while holding other latches are
> always requested top/down and left to right.  Btree splits are always
> left to right.  If for any reason the code needs to break this protocol
> then it will first request the latch NOWAIT and if it can't get the
> latch it will release all current latches and wait for the latch it is
> trying to get, and then after obtaining it go about getting the other
> latches it needs for the particular operation.  While traversing down
> the tree Derby may hold 2 latches: one on parent and one on child.  It
> then continues doing "ladder" latching down the tree releasing the
> highest node when it has successfully got a new lower node latch.
> Latches are short term, only held while reading/modifying the page,
> never held while an I/O is happening.  Structure modifications are all
> isolated from other operations through the use of latches.

The classic problem here is a split taking place during a search.  Say  a
page C contains 50-100, and C's parent in the tree is P, which has an
entry for 50 and 101, with the entry for 50 pointing to C.

Now an inserter comes in, inserts 51 and fills up C (or fills up C's
child, causes a split, and that fills up C).  So C splits into C and C',
which is added to the right of C.

Since the inserter can't get a lock on P while holding a lock on C
(deadlock), it releases the lock on P.  At this point, a reader comes down
the tree looking for 99, looks at P and gets a pointer to C not C'. Now
inserter starts waiting for a lock on P.  When reader gets C, 99 has been
moved off to C'.

It would be interesting to know how Derby handles this (for example, [Yao]
proposes having a pointer from C to C', and I forget what Aries IM does at
the moment, but I think it starts over at the top if you fail to find a
key at the rightmost point in the page...).

>
> Derby uses data only locking for its logical row level locking.  All
> isolation level implementation is done using logical locks (Derby has
> no non-locking isolation support like versioning).  A quick rundown of
> how locks are used to implement isolation level follows, for this
> discussion I will use the jdbc isolation notation.  In data only locking
> the key for all locks whether obtained in the btree or in the base table
> is the address of the base table row.  In the btree case the address of
> the base table row is always the last column of the leaf index entry.
>
> serializable: Derby uses "previous key" locking to prevent phantoms from
>               being inserted into a range being accessed by a scan.  A
>               scan always locks the index row just "previous" to the
>               first entry that satisfies the scan condition, and then all
>               index entries going forward until the end of the scan.
>               All row locks are held until end of transaction.  Inserts
>               also get previous key locks, which will block if the
>               existing row is locked.
>
> repeatable read:  Same as serializable except that no previous key
>               locking is done.
>
> read committed: write locks held until end of transaction.  Read locks
>               are released once the query in question no longer needs
>               them.
>
> read uncommitted: no row locks are gotten.  The code still gets table
> level intent locks to prevent concurrent ddl during the query.
>
> In both the btree and the heap, deletes are first executed by marking
> a "deleted" bit.  This is to ensure space on the page for abort, since
> row level locking will allow other rows on the page to be modified
> concurrently with the transaction executing the delete.  The system uses
> a background daemon to schedule work after commit to reclaim the space
> of the deleted rows.  A row marked deleted can be "purged" if one can
> obtain a lock on it (if it was an uncommitted delete then the
> transaction doing the delete would still have an exclusive lock on the
> row).

One small note: at recovery time such locks are usually not held.  Thus it
would be necessary not to restart this garbage collector until all aborted
transactions are rolled back.  And of course, the garbage collector would
need to get the lock, then re-check the bit to see if it is still set.

Of course, I have no reason to think Derby doesn't handle such cases, but
asking the questions will at least help me learn :).

</delurk>

Philip Bohannon
Computing Sciences Research Center
Lucent Technologies -- Bell Laboratories

>
>
>
> Dibyendu Majumdar wrote:
> | "Mike Matrigali" wrote:
> |
> |
> |>I am happy to answer any specific questions about the store module you
> |>have, or can find someone to answer them if I don't know.  I
> |>think questions should be directed to the mail list so that everyone
> |>can share in the discussion.  If I miss something feel free to ping
> |>me to get an answer.  I am an IBM employee still working on the derby
> |>codeline and was one of the original developers at Cloudscape when it
> |>first began as a startup.
> |
> |
> | Thank you very much, this will really help me to get started.
> |
> |
> |>Is there any model/standard in apache or opensource that we could follow
> |>or at least look at for a format for this kind of information.
> |
> |
> | I have some thoughts on how I want to do this - these are given below.
> |
> |
> |>I personally really like design information embedded in the actual code
> |>it relates to, rather than having separate documents.
> |
> |
> | I think that well documented code is always a good idea. And there is an
> | argument that documentation closer to the code is more likely to be kept
> | up-to-date. It is also a great help to anyone fixing the code.
> |
> | However, separate documents are also very useful. Specially because you can
> | illustrate concepts and design with nice diagrams and flow charts. It also
> | enables people to grasp the design more easily, as things can be explained
> | top-down rather than jumping straight to implementation details.
> |
> |
> |>Is there any more specific area you are looking at, rather than just
> |>store.
> |
> |
> |>I usually divide the store into the following, some overlap:
> |>o store module interface (java protocol used by other modules to
> |>interact with store, this is actually fairly well documented in the
> |>javadoc already available)
> |>o access methods (btree/heap existing, future might be gist/rtree/...)
> |>o recovery (both restart and crash recovery, logging, xa recovery, ...)
> |>o transactions (locking, isolation levels, xa transactions)
> |>o data (actual physical storage of rows/columns on pages, page
> |>management, cache  - the format of rows/columns/pages is reasonably
> |>documented in the code - I think in the javadoc, but might just be in
> |>comments)
> |>o misc (sorting, encryption, ...)
> |
> |
> | Well, I shall be documenting the whole of it, but will start with a
> | particular item, and then move on to other items.
> |
> | I think that documentation should be at two levels.
> |
> | The more useful level is where ideas and concepts are explained. I am sure
> | that Cloudscape was built upon ideas and concepts that had been invented
> | before by others, and perhaps built its own innovations upon those
> | foundations.
> | So I want to explore the origins of those ideas: where did the ideas come
> | from? How did the Cloudscape implementation use those ideas? I'd like to
> | give some examples to illustrate what I mean.
> |
> | If we take the transaction manager for a start: as far as I understand,
> | Derby uses the ARIES algorithm for write-ahead logging. What I'd like to
> | understand and document is how the Derby implementation of ARIES is
> | different from the published algorithms. Does Derby use nested top-level
> | actions or sub transactions to manage internal atomic operations? I think
> | I saw a comment in the code somewhere that Derby does not yet
> | implement the transaction list in checkpoint records. I assume that Derby
> | does page-oriented redo during the recovery redo pass, and then
> | page-oriented or logical undo during the undo pass.
> |
> | For the BTREE implementation, I'd like to understand whether Derby
> | implements ARIES/IM or some other algorithm. How does it handle
> | structural consistency of the tree during concurrent updates? What are
> | the locking/latching implications of the algorithm used? Are accesses to
> | the entire tree or subtree serialised during structural modifications?
> | How is repeatable read ensured? Does Derby use data row locking or
> | index row locking? Does Derby delete rows logically first, and
> | physically afterwards? How does it determine that a deleted row can
> | be physically removed, ie, the transaction that deleted the row has
> | committed?
> |
> | I understand that Derby uses Log Actions to perform both normal
> | operations and recovery operations. Thus a key insert is first encoded as
> | a log operation, logged, and then executed. This is very neat.
> |
> | I also get the impression by reading the code that a BTREE is
> | implemented on top of a regular table like structure. This is also
> | very neat as it reuses all the code for handling rows, locking, etc.
> |
> | These are a few examples, and I could list several more. What would be
> | very useful is if you and other IBM developers could identify the
> | algorithms that Derby uses. Perhaps you could provide references to
> | published research papers, books, and patents. I could start by writing
> | a high level overview of the concepts used by Derby, and gradually build
> | up the detail by reading the code, and by asking questions in this forum.
> |
> | The second level of documentation can be built from the comments
> | in the code. This is the implementation level where I can describe how
> | Derby actually implements something, what the file formats are, how the
> | write ahead log is implemented, what a conglomerate is, what a
> | record id is and how it is different from a slot id, etc.
> |
> | In terms of starting this level of documentation, I would first like to
> | concentrate on the "raw" module, which I suspect is at the heart
> | of all other modules.
> |
> | Thanks and Regards
> |
> | Dibyendu
> |
> |
> |
>

Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Ok, that's a lot of questions - not going to get to all, but here's a
start.  This posting is going to try to answer the btree questions.
I'll do the recovery questions next.  The compare/contrast with ARIES
papers will take some time, may not get to that right away.

Your suggestion of starting at high level, comparing/contrasting with
published research should work well with the btree and recovery design.
As you stated both of these areas in derby are based on the ARIES
algorithms.
I will schedule some time in the future to look carefully at those
papers and produce the compare/contrast for those 2 areas.

Here are some answers off the top of my head for the btree questions,
that area is most familiar to me:

Btrees are based on ARIES.  The btree is a tree of pages, linked to
one another in the standard way.  Derby uses latches on pages to get
exclusive access to the page while reading or writing the page (Derby
only uses exclusive latches, no shared latches are used).  In order
to prevent deadlocks latches requested while holding other latches are
always requested top/down and left to right.  Btree splits are always
left to right.  If for any reason the code needs to break this protocol
then it will first request the latch NOWAIT and if it can't get the
latch it will release all current latches and wait for the latch it is
trying to get, and then after obtaining it go about getting the other
latches it needs for the particular operation.  While traversing down
the tree Derby may hold 2 latches: one on parent and one on child.  It
then continues doing "ladder" latching down the tree releasing the
highest node when it has successfully got a new lower node latch.
Latches are short term, only held while reading/modifying the page,
never held while an I/O is happening.  Structure modifications are all
isolated from other operations through the use of latches.
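The "ladder" latch-coupling described above can be sketched with plain java.util.concurrent locks standing in for exclusive page latches.  The class and field names are illustrative, not Derby's; the point is that at most two latches are held while descending, and the parent's latch is dropped as soon as the child's is held:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LatchCouplingSketch {
    static final class Page {
        final ReentrantLock latch = new ReentrantLock(); // exclusive-only latch
        final int key;          // toy discriminator: left < key <= right
        Page left, right;
        Page(int key) { this.key = key; }
    }

    // Descend to the leaf for searchKey, holding at most two latches at once.
    // The leaf is returned still latched; the caller must unlock it.
    static Page descend(Page root, int searchKey) {
        Page parent = root;
        parent.latch.lock();
        while (true) {
            Page child = (searchKey < parent.key) ? parent.left : parent.right;
            if (child == null) return parent;   // reached a leaf
            child.latch.lock();                 // latch child while parent held
            parent.latch.unlock();              // then release the parent
            parent = child;
        }
    }

    public static void main(String[] args) {
        Page root = new Page(10);
        root.left = new Page(5);
        root.right = new Page(20);
        Page leaf = descend(root, 7);
        System.out.println(leaf.key);                            // 5
        System.out.println(leaf.latch.isHeldByCurrentThread());  // true
        leaf.latch.unlock();
    }
}
```

Note the deadlock-avoidance point in the text: because every descent latches top-down, two threads coupling down the same tree can never hold latches in conflicting orders.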

Derby uses data only locking for its logical row level locking.  All
isolation level implementation is done using logical locks (Derby has
no non-locking isolation support like versioning).  A quick rundown of
how locks are used to implement isolation level follows, for this
discussion I will use the jdbc isolation notation.  In data only locking
the key for all locks whether obtained in the btree or in the base table
is the address of the base table row.  In the btree case the address of
the base table row is always the last column of the leaf index entry.

serializable: Derby uses "previous key" locking to prevent phantoms from
              being inserted into a range being accessed by a scan.  A
              scan always locks the index row just "previous" to the
              first entry that satisfies the scan condition, and then all
              index entries going forward until the end of the scan.
              All row locks are held until end of transaction.  Inserts
              also get previous key locks, which will block if the
              existing row is locked.

repeatable read:  Same as serializable except that no previous key
              locking is done.

read committed: write locks held until end of transaction.  Read locks
              are released once the query in question no longer needs
              them.

read uncommitted: no row locks are gotten.  The code still gets table
level intent locks to prevent concurrent ddl during the query.
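Since the rundown uses the jdbc isolation notation, the mapping can be written down against the standard java.sql constants.  The summary strings are just a restatement of the behaviour above, and no live Derby connection is assumed here:

```java
import java.sql.Connection;

public class IsolationLevels {
    // Restate the lock behaviour described above for each JDBC level.
    static String lockBehaviour(int level) {
        switch (level) {
            case Connection.TRANSACTION_SERIALIZABLE:
                return "previous-key locks + row locks held to end of transaction";
            case Connection.TRANSACTION_REPEATABLE_READ:
                return "row locks held to end of transaction, no previous-key locks";
            case Connection.TRANSACTION_READ_COMMITTED:
                return "write locks to end of transaction, read locks released early";
            case Connection.TRANSACTION_READ_UNCOMMITTED:
                return "no row locks, table-level intent locks only";
            default:
                return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(lockBehaviour(Connection.TRANSACTION_SERIALIZABLE));
        // On a live connection the level is chosen with the standard call:
        //   conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
    }
}
```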

In both the btree and the heap, deletes are first executed by marking
a "deleted" bit.  This is to ensure space on the page for abort, since
row level locking allows other rows on the page to be modified
concurrently with the transaction executing the delete.  The system uses
a background daemon to schedule work after commit to reclaim the space
of the deleted rows.  A row marked deleted can be "purged" if one can
obtain a lock on it (if it were an uncommitted delete, the transaction
doing the delete would still hold an exclusive lock on the
row).
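
The mark-then-purge scheme can be sketched roughly as follows.  This is
an illustration only, with invented names, not Derby's page code; the
boolean lock flag stands in for the row lock held until commit:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "mark deleted, purge later": a delete only
// flips a bit (the slot survives on the page so abort can undo it),
// and a background pass purges slots whose delete has committed,
// i.e. slots whose row lock can now be obtained.
public class DeletedBitPage {
    static class Slot {
        final String row;
        boolean deleted;         // set by the deleting transaction
        boolean lockedExclusive; // stands in for the row lock held until commit
        Slot(String row) { this.row = row; }
    }

    final List<Slot> slots = new ArrayList<>();

    void delete(Slot s) {
        s.lockedExclusive = true; // exclusive lock held until commit
        s.deleted = true;         // space stays on the page for abort
    }

    void commit(Slot s) {
        s.lockedExclusive = false; // lock released at end of transaction
    }

    // Background daemon: purge a deleted row only if its lock is free.
    void purgeCommittedDeletes() {
        slots.removeIf(s -> s.deleted && !s.lockedExclusive);
    }
}
```

Note that the purge of an uncommitted delete simply fails to get the
lock and is retried later, which matches the behavior described above.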



Dibyendu Majumdar wrote:
| "Mike Matrigali" wrote:
|
|
|>I am happy to answer any specific questions about the store module you
|>have, or can find someone to answer them if I don't know.  I
|>think questions should be directed to the mail list so that everyone
|>can share in the discussion.  If I miss something feel free to ping
|>me to get an answer.  I am an IBM employee still working on the derby
|>codeline and was one of the original developers at Cloudscape when it
|>first began as a startup.
|
|
| Thank you very much, this will really help me to get started.
|
|
|>Is there any model/standard in apache or opensource that we could follow
|>or at least look at for a format for this kind of information.
|
|
| I have some thoughts on how I want to do this - these are given below.
|
|
|>I personally really like design information embedded in the actual code
|>it relates to, rather than having separate documents.
|
|
| I think that well documented code is always a good idea. And there is an
| argument that documentation closer to the code is more likely to be kept
| up-to-date. It is also a great help to anyone fixing the code.
|
| However, separate documents are also very useful, especially because
| you can
| illustrate concepts and design with nice diagrams and flow charts. It also
| enables people to grasp the design more easily, as things can be explained
| top-down rather than jumping straight to implementation details.
|
|
|>Is there any more specific area you are looking at, rather than just
|>store.
|
|
|>I usually divide the store into the following, some overlap:
|>o store module interface (java protocol used by other modules to
|>interact with store, this is actually fairly well documented in the
|>javadoc already available)
|>o access methods (btree/heap existing, future might be gist/rtree/...)
|>o recovery (both restart and crash recovery, logging, xa recovery,  ...)
|>o transactions (locking, isolation levels, xa transactions)
|>o data (actual physical storage of rows/columns on pages, page
|>management, cache  - the format of rows/columns/pages is reasonably
|>documented in the code - I think in the javadoc, but might just be in
|>comments)
|>o misc (sorting, encryption, ...)
|
|
| Well, I shall be documenting the whole of it, but will start with a
| particular item, and then move on to other items.
|
| I think that documentation should be at two levels.
|
| The more useful level is where ideas and concepts are explained. I am sure
| that Cloudscape was built upon ideas and concepts that had been invented
| before by others, and perhaps built its own innovations upon those
| foundations.
| So I want to explore the origins of those ideas, where did the ideas come
| from? How did Cloudscape implementation use those ideas? I'd like to give
| some examples to illustrate what I mean.
|
| If we take the transaction manager for a start. As far as I understand,
| Derby uses the ARIES algorithm for write-ahead logging. What I'd like to
| understand and document is how the Derby implementation of ARIES is
| different from the published algorithms. Does Derby use nested top-level
| actions or sub transactions to manage internal atomic operations? I think
| I saw a comment in the code somewhere that Derby does not yet
| implement the transaction list in checkpoint records. I assume that Derby
| does page-oriented redo during recovery redo pass, and then
| page-oriented or logical undo during the undo pass.
|
| For the BTREE implementation, I'd like to understand whether Derby
| implements ARIES/IM or some other algorithm. How does it handle
| structural consistency of the tree during concurrent updates? What are
| the locking/latching implications of the algorithm used? Are accesses to
| the entire tree or subtree serialised during structural modifications?
| How is repeatable read ensured? Does Derby use data row locking or
| index row locking? Does Derby delete rows logically first, and
| physically afterwards? How does it determine that a deleted row can
| be physically removed, ie, the transaction that deleted the row has
| committed?
|
| I understand that Derby uses Log Actions to perform both normal
| operations and recovery operations. Thus a key insert is first encoded as
| a log operation, logged, and then executed. This is very neat.
|
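
[Interjecting an illustration here: the logged-operation pattern
described in the quoted paragraph above can be sketched as below.  All
of the interface and class names are invented for the sketch and are
not Derby's actual interfaces.]

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the write-ahead "log action" pattern: an operation
// is encoded as a loggable object, appended to the log first, and only
// then executed. Recovery replays the same doMe() method, so normal
// operation and redo share one code path.
interface Loggable {
    void doMe(List<String> btree); // apply (and redo) the operation
}

class InsertOperation implements Loggable {
    final String key;
    InsertOperation(String key) { this.key = key; }
    public void doMe(List<String> btree) { btree.add(key); }
}

public class LogActionDemo {
    static final List<Loggable> log = new ArrayList<>();
    static final List<String> btree = new ArrayList<>();

    static void logAndDo(Loggable op) {
        log.add(op);    // write-ahead: the log record comes first
        op.doMe(btree); // then the operation is executed
    }

    public static void main(String[] args) {
        logAndDo(new InsertOperation("k1"));
        // Crash recovery can rebuild state by replaying the log:
        List<String> recovered = new ArrayList<>();
        for (Loggable op : log) op.doMe(recovered);
        System.out.println(recovered.equals(btree)); // → true
    }
}
```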
| I also get the impression by reading the code that a BTREE is
| implemented on top of a regular table like structure. This is also
| very neat as it reuses all the code for handling rows, locking, etc.
|
| These are a few examples, and I could list several more. What would be
| very useful is if you and other IBM developers could identify the
| algorithms
| that Derby uses. Perhaps you could provide references to published
| research papers, books, and patents. I could start by writing a high level
| overview of the concepts used by Derby, and gradually build up the
| detail by reading  the code, and by asking questions in this forum.
|
| The second level of documentation can be built from the comments
| in the code. This is the implementation level where I can describe how
| Derby actually implements something, what the file formats are, how the
| write ahead log is implemented, what a conglomerate is, what a
| record id is and how it is different from a slot id, etc.
|
| In terms of starting this level of documentation, I would first like to
| concentrate on the "raw" module, which I suspect is at the heart
| of all other modules.
|
| Thanks and Regards
|
| Dibyendu
|
|
|