Posted to derby-dev@db.apache.org by Ashish Srivastava <de...@gmail.com> on 2005/01/18 06:23:01 UTC

Derby architecture/design documents

Hi,
I have been familiar with Cloudscape since it was with Informix, and I
have kept track of each version of Cloudscape since then. But so far I
have only been a user of Derby (Cloudscape). Now I feel I have enough
Java/DB knowledge to participate in Derby as a developer.

Can anyone point me in the right direction for getting started as a
Derby developer, and can anyone provide a link to the architecture and
design documents for Derby?

Thanks in anticipation,

Re: Derby architecture/design documents

Posted by Jeffrey Lichtman <sw...@rcn.com>.
>What little documentation exists for the store is in comments in
>the code.  As a startup we did not do much in the way of design documents.

We did produce a big architecture document. Does it still exist? What would 
have to be done to bring it up to date?


                        -        Jeff Lichtman
                                 swazoo@rcn.com
                                 Check out Swazoo Koolak's Web Jukebox at
                                 http://swazoo.com/ 


Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
Thanks, Dibyendu, for taking the lead on this technical documentation. The
results of this effort will be a valuable contribution.

If I can help facilitate your efforts in any way, either online or offline,
let me know.

>
> Well, I shall be documenting the whole of it, but will start with a
> particular item, and then move on to other items.
>
.....
>
> In terms of starting this level of documentation, I would first like to
> concentrate on the "raw" module, which I suspect is at the heart
> of all other modules.
>


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
"Mike Matrigali" wrote:

> Ok, that's a lot of questions - not going to get to all, but here's a
> start.  This posting is going to try to answer the btree questions.
> I'll do the recovery questions next.  The compare/contrast with ARIES
> papers will take some time, may not get to that right away.

Thank you - your post was very generous. It will take me some time to
absorb it and come back with questions.

Regards



Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
> > Is there any way I can set up a debug version of Derby that I can step
> > through? I tried doing this with the default jar file, but the debugger
> > complains that there is no line number information. I tried to find out
> > how to build Derby with debug enabled, but cannot find the relevant
> > information.
> >

RPost wrote:

>
> I got it working in JBuilder6 by adding all of the jars in the jars\insane
> directory to the project properties. JBuilder then lets me add the path to
> the source as derby\engine. Then I can step into all of the Derby source,
> set breakpoints in it, etc.
>
> Don't know how to do it in Eclipse. I'm trying to follow the discourse
> between the two of you and keep winding up in the bowels myself. But
> unless you think I can help I was planning on staying out of you guys'
> way, since you are making pretty good progress in spite of the
> difficulties.

I managed to build Derby with debug on, by adding the following line to
tools/ant/properties/defaultcompiler.properties:
debug=on
I used Sun's JDK to build Derby.

I am now able to set breakpoints in Derby when I use Eclipse.
All I had to do was add derby.jar to the classpath, and let Eclipse
know where to find the source.
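
A quick way to confirm that a jar carries line number information is to
provoke an exception from Derby and inspect the stack trace (a rough sketch;
the driver class name is the standard embedded one, and the URL is only an
example):

    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class DebugInfoCheck {
        public static void main(String[] args) throws Exception {
            // Pre-JDBC4 driver registration for the embedded driver.
            Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
            try {
                // Deliberately fails: the database does not exist.
                DriverManager.getConnection("jdbc:derby:noSuchDb");
            } catch (SQLException e) {
                // Frames from classes compiled without debug=on report a
                // negative line number.
                for (StackTraceElement el : e.getStackTrace()) {
                    if (el.getClassName().startsWith("org.apache.derby")) {
                        System.out.println(el + " (line "
                                + el.getLineNumber() + ")");
                    }
                }
            }
        }
    }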

Regards

Dibyendu



Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> Dibyendu
>
> Is there any way I can set up a debug version of Derby that I can step
> through? I tried doing this with the default jar file, but the debugger
> complains that there is no line number information. I tried to find out
> how to build Derby with debug enabled, but cannot find the relevant
> information.
>

I got it working in JBuilder6 by adding all of the jars in the jars\insane
directory to the project properties. JBuilder then lets me add the path to
the source as derby\engine. Then I can step into all of the Derby source,
set breakpoints in it, etc.

Don't know how to do it in Eclipse. I'm trying to follow the discourse
between the two of you and keep winding up in the bowels myself. But unless
you think I can help I was planning on staying out of you guys' way, since
you are making pretty good progress in spite of the difficulties.




Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike Matrigali wrote:

> As you are seeing it is hard to keep to one topic as they are all
> related.

I guess that jumping straight into the BTree implementation is going to be
tough going, as it is built upon so many other things. However, I did not
want to stop the flow ...

Do you think that it is better to start with the lower level stuff and
gradually move up?

One of the problems of looking at comments in a large system like
Derby is that following references is difficult. I find that it helps to use
Eclipse because when the mouse is pointed at a method or a class name,
Eclipse shows the relevant Javadoc. I just ignore the compile errors and
use Eclipse as a class browser.

Is there any way I can set up a debug version of Derby that I can step
through? I tried doing this with the default jar file, but the debugger
complains that there is no line number information. I tried to find out how
to build Derby with debug enabled, but cannot find the relevant information.

Regards

Dibyendu



Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
As you are seeing, it is hard to keep to one topic as they are all
related.  I didn't want to get into the space stuff, but since
you asked, here are some comments directed at those:

In both the btree and the heap, rows are only marked deleted, and the
space is reclaimed sometime later.  Here is what triggers that work.

"online" during BTREE split:
~    Whenever there is not enough room on a leaf to do an insert the
~    code attempts to find space on the leaf, by seeing if it can reclaim
~    any committed deletes on that leaf.  That work only requires the
~    latch on the leaf and row locks.  It is expected that most of the
~    space reclaim done in the btree goes through this path.  This is
~    the code you see in BTreeController.reclaim_deleted_rows.

BTREE post commit work:
~    Whenever a delete operation deletes the last row from a leaf
~    page then a
~    BtreePostCommit job is queued to be executed after the transaction
~    which did the delete commits.  This work currently requires a
~    table level lock as page merges have not been implemented to be
~    allowed concurrent with other operations.  I believe many db's don't
~    even try to do page merges except when called from some sort of
~    reorg utility.  If all rows on page are purged, then it will move
~    to the free list and perform a merge to the tree.

~    It is expected that the normal space reclamation happens with row
~    locks during btree split, which is why not much work has been done
~    to optimize btree post commit path.

Heap post commit work:
~    When a delete operation deletes the last row from any heap page then
~    a HeapPostCommit job is queued to be executed after the transaction
~    which did the delete commits.  This work requires latch on the
~    page being processed and row locks on rows being deleted.  If all
~    rows are purged then page is put on a free list, if not then page
~    is marked as having more space for future inserts.
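
In sketch form, the queuing decision looks roughly like this (invented names
throughout; only BTreeController.reclaim_deleted_rows, BtreePostCommit and
HeapPostCommit correspond to real Derby code):

    import java.util.ArrayDeque;
    import java.util.Queue;

    class SpaceReclamationSketch {
        interface Page {
            int nonDeletedRowCount();
            boolean isBtreeLeaf();
        }
        // Work items executed by the background daemon after the deleting
        // transaction commits.
        interface PostCommitWork { void doWork(); }

        private final Queue<PostCommitWork> daemonQueue = new ArrayDeque<>();

        // Called while the deleting transaction still holds the page latch.
        void afterDeleteMark(Page page) {
            if (page.nonDeletedRowCount() > 0) {
                // Nothing queued: most btree space is reclaimed inline during
                // a split, by purging committed deletes on the full leaf
                // (the reclaim_deleted_rows path, latch plus row locks only).
                return;
            }
            if (page.isBtreeLeaf()) {
                // Needs a table level lock (requested NOWAIT), since page
                // merges are not concurrent with other operations.
                daemonQueue.add(() -> { /* BtreePostCommit-style work */ });
            } else {
                // Heap: only needs the page latch plus row locks; empty pages
                // go to the free list, partly empty pages are marked as
                // having space for future inserts.
                daemonQueue.add(() -> { /* HeapPostCommit-style work */ });
            }
        }
    }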



Dibyendu Majumdar wrote:
| Mike,
|
| My initial comments/questions follow, more to come ...
|
|
|>Btrees are based on ARIES.
|
|
| Do you mean that the BTree implementation uses ARIES style logging and
| recovery, or does it implement the concurrency and recovery logic of
| ARIES/IM? Or both?
|
| Reading the code, I cannot see that ARIES/IM is used apart from the
| data row locking.
|
|
|>Structure modifications are all
|>isolated from other operations through the use of latches.
|
|
| I am trying to understand the splitting logic which requires structural
| change. It seems to me that it works as follows:
|
| If a key cannot be inserted in a page, then check if the parent has
| enough space to insert the split key.
| If not, go to the root and split the tree top-down along the path that
| leads to the page that the new key is being inserted at.
| Retry inserting the key.
|
| The logic repeats until the key is inserted.
|
| The basic idea seems to be that the split is deferred until it is ensured
| that space is available in the parent page.
| The exact mechanism is not clear to me as I am unable to follow the logic
| which uses recursion, and seems quite complex.
|
| The split seems to be managed as an "internal transaction" which is
| described as follows:
|
| "Start an internal transaction.  An internal transaction is a completely
| separate transaction from the current user transaction.  All work done
| in the internal transaction must be physical (ie. it can be undone
|  physically by the rawstore at the page level, rather than logically
|  undone like btree insert/delete operations).  The rawstore guarantees
| that in the case of a system failure all open Internal transactions are
| first undone in reverse order, and then other transactions are undone
| in reverse order.
| Internal transactions are meant to implement operations which, if
|  interrupted before completion will cause logical operations like tree
| searches to fail.  This special undo order ensures that the state of
| the tree is restored to a consistent state before any logical undo
| operation which may need to search the tree is performed."
|
| FYI, I am looking at the following methods:
|
|
| org.apache.derby.impl.store.access.btree.BTreeController.start_xact_and_dosplit()
| org.apache.derby.impl.store.access.btree.BranchControlRow.restartSplitFor()
| org.apache.derby.impl.store.access.btree.BranchControlRow.splitFor()
| org.apache.derby.impl.store.access.btree.LeafControlRow.splitFor()
|
|
|>In both the btree and the heap, deletes are first executed by marking
|>a "deleted" bit.  This is to ensure space on the page for abort, since
|>row level locking will allow other rows on the page to be modified
|>concurrently with the transaction executing the delete.  The system uses
|>a background daemon to schedule work after commit to reclaim the space
|>of the deleted rows.  A row marked deleted can be "purged" if one can
|>obtain a lock on it (if it was an uncommitted delete then the
|>transaction doing the commit would still have an exclusive lock on the
|>row).
|
|
| I assume you are referring to:
|
| org.apache.derby.impl.store.access.btree.BTreePostCommit.
|
| The comment in the code says:
|
| "The current space reclamation algorithm requires a table level
| lock on the btree - this is mostly because the shrink algorithm
| is not multi-user.  This lock is requested NOWAIT as it does
| not want to impede normal operation on the table.  If the lock
| were to wait then the current lock manager livelock algorithm
| would block all subsequent lock requests on this btree even if
| they are compatible with the current holder of the lock."
|
| But I find that deleted rows are also reclaimed during normal insert
| operations. For example, in:
|
| org.apache.derby.impl.store.access.btree.BTreeController.reclaim_deleted_rows().
|
| The comment for this method says:
|
| "Attempt to reclaim committed deleted rows from the page.
| Get exclusive latch on page, and then loop backward through
| page searching for deleted rows which are committed.  The routine
| assumes that it is called from a transaction which cannot have
| deleted any rows on the page.  For each deleted row on the page
| it attempts to get an exclusive lock on the deleted row, NOWAIT.
| If it succeeds, then since this transaction did not delete the row, the
| row must have been deleted by a transaction which has committed, so
| it is safe to purge the row.  It then purges the row from the page.
| Note that this routine may remove all rows from the page, it will not
| attempt a merge in this situation.  This is because this routine is
| called from split which is attempting an insert on the given page, so
| it would be a waste to merge the page only to split it again to allow
| the insert of the row causing the split."
|
|
| Regards
|
| Dibyendu
|
|
|

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike,

My initial comments/questions follow, more to come ...

> Btrees are based on ARIES.

Do you mean that the BTree implementation uses ARIES style logging and
recovery, or does it implement the concurrency and recovery logic of
ARIES/IM? Or both?

Reading the code, I cannot see that ARIES/IM is used apart from the
data row locking.

> Structure modifications are all
> isolated from other operations through the use of latches.

I am trying to understand the splitting logic which requires structural
change. It seems to me that it works as follows:

If a key cannot be inserted in a page, then check if the parent has enough
space to insert the split key.
If not, go to the root and split the tree top-down along the path that leads
to the page that the new key is being inserted at.
Retry inserting the key.

The logic repeats until the key is inserted.

The basic idea seems to be that the split is deferred until it is ensured
that space is available in the parent page.
The exact mechanism is not clear to me as I am unable to follow the logic
which uses recursion, and seems quite complex.

The split seems to be managed as an "internal transaction" which is
described as follows:

"Start an internal transaction.  An internal transaction is a completely
separate transaction from the current user transaction.  All work done
in the internal transaction must be physical (ie. it can be undone
 physically by the rawstore at the page level, rather than logically
 undone like btree insert/delete operations).  The rawstore guarantees
that in the case of a system failure all open Internal transactions are
first undone in reverse order, and then other transactions are undone
in reverse order.
Internal transactions are meant to implement operations which, if
 interrupted before completion will cause logical operations like tree
searches to fail.  This special undo order ensures that the state of
the tree is restored to a consistent state before any logical undo
operation which may need to search the tree is performed."

FYI, I am looking at the following methods:

org.apache.derby.impl.store.access.btree.BTreeController.start_xact_and_dosplit()
org.apache.derby.impl.store.access.btree.BranchControlRow.restartSplitFor()
org.apache.derby.impl.store.access.btree.BranchControlRow.splitFor()
org.apache.derby.impl.store.access.btree.LeafControlRow.splitFor()

> In both the btree and the heap, deletes are first executed by marking
> a "deleted" bit.  This is to ensure space on the page for abort, since
> row level locking will allow other rows on the page to be modified
> concurrently with the transaction executing the delete.  The system uses
> a background daemon to schedule work after commit to reclaim the space
> of the deleted rows.  A row marked deleted can be "purged" if one can
> obtain a lock on it (if it was an uncommitted delete then the
> transaction doing the commit would still have an exclusive lock on the
> row).

I assume you are referring to:

org.apache.derby.impl.store.access.btree.BTreePostCommit.

The comment in the code says:

"The current space reclamation algorithm requires a table level
lock on the btree - this is mostly because the shrink algorithm
is not multi-user.  This lock is requested NOWAIT as it does
not want to impede normal operation on the table.  If the lock
were to wait then the current lock manager livelock algorithm
would block all subsequent lock requests on this btree even if
they are compatible with the current holder of the lock."

But I find that deleted rows are also reclaimed during normal insert
operations. For example, in:

org.apache.derby.impl.store.access.btree.BTreeController.reclaim_deleted_rows().

The comment for this method says:

"Attempt to reclaim committed deleted rows from the page.
Get exclusive latch on page, and then loop backward through
page searching for deleted rows which are committed.  The routine
assumes that it is called from a transaction which cannot have
deleted any rows on the page.  For each deleted row on the page
it attempts to get an exclusive lock on the deleted row, NOWAIT.
If it succeeds, then since this transaction did not delete the row, the
row must have been deleted by a transaction which has committed, so
it is safe to purge the row.  It then purges the row from the page.
Note that this routine may remove all rows from the page, it will not
attempt a merge in this situation.  This is because this routine is
called from split which is attempting an insert on the given page, so
it would be a waste to merge the page only to split it again to allow
the insert of the row causing the split."
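
In sketch form, the loop described by that comment looks roughly like this
(invented page and lock-manager interfaces; only the method name corresponds
to real Derby code):

    class ReclaimDeletedRowsSketch {
        interface Page {
            int recordCount();
            boolean isDeleted(int slot);
            void purgeRow(int slot);
            Object recordId(int slot);
        }
        interface LockManager {
            boolean lockRowNoWait(Object recordId); // false = would block
            void unlock(Object recordId);
        }

        // Caller holds the exclusive latch on the page and is known not to
        // have deleted any rows on it itself.
        int reclaimDeletedRows(Page page, LockManager locks) {
            int purged = 0;
            // Loop backward so purging a slot does not shift the slots that
            // are still to be examined.
            for (int slot = page.recordCount() - 1; slot >= 0; slot--) {
                if (!page.isDeleted(slot)) continue;
                Object id = page.recordId(slot);
                if (locks.lockRowNoWait(id)) {
                    // Lock granted, and this transaction did not do the
                    // delete, so the deleter must have committed: safe to
                    // purge.  An uncommitted delete would still hold an
                    // exclusive lock and the NOWAIT request would fail.
                    page.purgeRow(slot);
                    purged++;
                    locks.unlock(id);
                }
            }
            return purged; // may empty the page; no merge is attempted here
        }
    }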


Regards

Dibyendu



Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike Matrigali wrote:


> Some information to expand on the topics below (not all of them, but a
> start). Can I suggest you start to maintain some sort of TODO list for
> gathering information? I will update it as I get time:

Will do. I hope to produce some stuff this week, containing all the material
you have described so far.

Thank you again for taking the time to explain all this.

I will follow up with questions once I have converted what is already
available into some sort of documentation.

Regards

Dibyendu



Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
I am going to defer on the page version vs. LSN question, but will at
least mention some guesses, not researched info.  You are right
about what exists.  I spoke with some colleagues and the best we can
come up with is that the implementor wanted to separate the page and
the log, in case we ever did a different log format.  I will try to
do some more research here.  I also vaguely remember the implementor
mentioning that if we ever wanted to implement the LSN on the page, we
had space to do so.  It may simply have been easier to code the page
versions, since in the current system the LSN is the address in the log
file (and it may not be available when the code wants to write it
on the page).

As you say, in derby all the log records are associated with a page, and
thus have a page version number.  That page version number in the log
is compared with the page version number of the page during redo to
determine if the log record needs to be applied.  This has also helped
us in the past to catch some bugs, as we can sanity check during redo
that the page is progressing along the expected path, ie. it is a bug
during redo to be applying a page version 10 log record to a page that is
at page version 8.  I haven't seen this sanity check fire in many years,
but it was useful when the product was first coded.
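
In sketch form, the redo decision described here looks roughly like this
(invented names; the exact version-advance convention is a guess from the
description above):

    class RedoSketch {
        interface LogRecord { long pageVersion(); void applyTo(Page p); }
        interface Page { long pageVersion(); }

        void redoOne(LogRecord rec, Page page) {
            if (rec.pageVersion() <= page.pageVersion()) {
                return; // change already reflected on the page: skip
            }
            // Sanity check: during redo a page advances one version at a
            // time, so applying a version-10 record to a version-8 page
            // means a record was lost or misdirected.
            if (rec.pageVersion() > page.pageVersion() + 1) {
                throw new IllegalStateException("redo gap: log record v"
                        + rec.pageVersion() + " vs page v"
                        + page.pageVersion());
            }
            rec.applyTo(page); // applying it advances the page's version
        }
    }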

The original requirement for the identifier of a record in cloudscape
was that it must be a unique id, associated with the row, for the
lifetime of the row.  The identifier was meant to survive table
reorganization also.  Originally, cloudscape was being designed as
an object relational system, and the plan was that these record id's
might be used as object id's (or some part of an object id), that might
be exported outside the db.  This use really never came to pass.  For
row level locking, the record identifier is used as the lock, so it is
important that the id never get reused.

In derby the extra level of indirection was added to meet the original
design goals.  The slot table entry could be used if we never reclaimed
a slot table entry; currently in derby the slot table entries are
automatically reclaimed as part of the space management algorithms.
During runtime the system maintains runtime information mapping id's to
slot table entries - but as you say it would be more efficient if the
id pointed directly at the offset table.  It is a space vs. runtime
tradeoff.
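
A sketch of that indirection (invented names; Derby's real structures
differ):

    import java.util.HashMap;
    import java.util.Map;

    class RecordIdSketch {
        // Runtime-only map: record id -> current slot number.  The id is
        // stable for the life of the row (it is also the lock key), while
        // the slot can change as rows move and slots are reclaimed.
        private final Map<Integer, Integer> idToSlot = new HashMap<>();

        void recordMoved(int recordId, int newSlot) {
            // Locks keyed on the record id stay valid across the move.
            idToSlot.put(recordId, newSlot);
        }

        // The runtime cost of the space-vs-runtime tradeoff: a scheme that
        // used the slot number as the id would skip this lookup.
        int slotFor(int recordId) {
            return idToSlot.get(recordId);
        }
    }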



Dibyendu Majumdar wrote:

> Mike,
> 
> A couple of questions for you.
> 
> While documenting the page formats, I noticed that Derby does not persist
> the LogInstant (LSN) associated with a page. In a classic ARIES
> implementation, each page is supposed to be tagged with the LSN of the log
> record that last updated it, so that during recovery redo, the LSN can be
> used to determine if a particular log record has already been applied to the
> page.
> 
> Derby appears to use PageVersion for this purpose, which is persisted and is
> also present in log records. Any reason why PageVersion is used and not
> LogInstant?
> 
> Another thing that has puzzled me is the use of RecordIds. I would have
> assumed that the purpose of the slot table in a page is to provide a stable
> TupleId (as described in the classic TPCT book by Jim Gray and Andreas
> Reuter), so that there is freedom to move records around in the page, and
> still have a stable TupleId that does not change over time. Also, the
> slot-no allows a particular record to be located efficiently. Given that a
> page in Derby has a slot table, why is the slot-no not used as part of the
> RecordId? Using a separate id is obviously inefficient, as it must be
> searched for.
> 
> Thanks and Regards
> 
> Dibyendu
> 
> 
> 
> 
> 

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike,

A couple of questions for you.

While documenting the page formats, I noticed that Derby does not persist
the LogInstant (LSN) associated with a page. In a classic ARIES
implementation, each page is supposed to be tagged with the LSN of the log
record that last updated it, so that during recovery redo, the LSN can be
used to determine if a particular log record has already been applied to the
page.

Derby appears to use PageVersion for this purpose, which is persisted and is
also present in log records. Any reason why PageVersion is used and not
LogInstant?

Another thing that has puzzled me is the use of RecordIds. I would have
assumed that the purpose of the slot table in a page is to provide a stable
TupleId (as described in the classic TPCT book by Jim Gray and Andreas
Reuter), so that there is freedom to move records around in the page, and
still have a stable TupleId that does not change over time. Also, the
slot-no allows a particular record to be located efficiently. Given that a
page in Derby has a slot table, why is the slot-no not used as part of the
RecordId? Using a separate id is obviously inefficient, as it must be
searched for.

Thanks and Regards

Dibyendu





Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Some information to expand on the topics below (not all of them, but a
start). Can I suggest you start to maintain some sort of TODO list for
gathering information? I will update it as I get time:

container/conglomerate/table - This terminology comes from the original
modular design of the system.  The store was really 2 modules: the
lower level raw store and then the upper level access.

Raw store uses containers to store rows.  Currently these containers
always map to a single file in the seg0 directory of the database.

Access provides conglomerates as the interface to its clients for
storing rows.  Currently there is a one to one mapping between a
conglomerate and a container.

The language level implements SQL tables and indexes using Access
provided Conglomerates.

The layers allow for some decisions in the future if derby is ever to
support tables and/or indexes spread across multiple disks.  The
implementation could either happen at the language layer, by spreading
the table across multiple conglomerates; or it could happen at the store
 level by spreading it across multiple containers.  My opinion is that
if the table is fragmented by key, then language should do it, as it best
understands the keys and indexes involved.  If a raw partitioning of
the data is desired then it would be best done in the store (but I
actually think that a raw partitioning is best done below the database
itself by the OS or hardware).

Latches - Latches are used in derby.  They are short term locks on the
page in the buffer cache.  They are requested anytime the raw store
needs to read/write information on the page.  They are short term and
never held during any operation which can wait (like an I/O or a lock
request).  Latches are implemented using the Derby lock manager.
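
In sketch form (invented names; the real latches are built on the Derby
lock manager):

    class LatchSketch {
        interface Page { void readOrWriteContents(); }
        interface LockManager {
            void latchExclusive(Page p); // Derby has no shared latches
            void unlatch(Page p);
        }

        void touchPage(LockManager locks, Page page) {
            locks.latchExclusive(page);
            try {
                // Short critical section: no I/O and no lock waits may
                // happen while the latch is held.
                page.readOrWriteContents();
            } finally {
                locks.unlatch(page); // released before anything that blocks
            }
        }
    }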

page, row and field formats - check out comments at top of
java/engine/org/apache/derby/impl/store/raw/data/StoredPage.java and
StoredFieldHeader.java, let me know if you need more info.

Handling of large rows -
The terminology used in raw store is long rows and long columns.  A
column is long if it can't fit on a single page.  A row is long if all
of its columns can't fit on a single page.  A long column is marked as
long in the base row, and its field contains a pointer to a chain of
other rows in the same container which contain the data of the column.
Each of the subsequent rows is on a page to itself.  Each subsequent row,
except for the last piece, has 2 columns: the first is the next segment
of the data and the second is the pointer to the following segment.
The last segment only has the data segment.

Similarly for a long row, the segment of the row which fits on the page
is left there, and a pointer column is added at the end of the row.  It
points to another row in the same container on a different page.  That
row will contain the next set of columns and a continuation pointer if
necessary.  The overflow portion will be on an "overflow" page, and that
page may have overflow portions of other rows on it (unlike overflow
columns).
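
A sketch of reading a long column back from its chain (invented types; in
Derby each overflow piece lives on a page of its own, as described above):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    class LongColumnSketch {
        static class OverflowPiece {
            byte[] segment;      // first "column": this piece of the data
            OverflowPiece next;  // second "column": null on the last piece
        }

        byte[] readLongColumn(OverflowPiece head) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // Follow the chain, reassembling the column segment by segment.
            for (OverflowPiece p = head; p != null; p = p.next) {
                out.write(p.segment);
            }
            return out.toByteArray();
        }
    }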




Dibyendu Majumdar wrote:

> Here are some ideas on how I would like to structure the documentation.
> 
> 
> For every topic, I'd like to create:
> 
> Introduction to concepts - A general introduction to the topic.
> Derby implementation details - This will be the main substance of the document
> where I will describe how Derby works.
> References - these will point to published papers, books, etc. that discuss
> the topic in question.
> 
> In terms of topics, here is what I have come up with (let me know if things
> should be added):
> 
> Terms - this will be a glossary of Derby specific terms so that I don't have
> to keep explaining the same terms in every document.
> 
> Row ID - How rows are identified - RecordId and SlotId.
> 
> Row management within a page - storage format of rows, slot table, etc.
> 
> Handling of large rows - how does Derby handle rows that won't fit into one
> page.
> 
> Container - what is a container?
> 
> Space management in Containers - how is space management implemented? How
> does Derby locate an empty page?
> 
> Latches - are latches used by Derby? How are they implemented?
> 
> Lock management - description of lock management, lock conflict resolution,
> deadlocks, lock escalation.
> 
> Buffer cache - how does Derby implement the buffer cache, and what is the
> interaction between the Buffer cache and the Log, and the Buffer cache and
> the Transaction manager.
> 
> Write ahead log - description of how the log is implemented - this would
> mainly cover the format of log records, how log records are represented in
> memory and on disk, log archiving, checkpointing, etc.
> 
> Transactions - how is an Xid allocated? What is the representation of a
> transaction? How is a transaction related to a thread?
> 
> Transaction Manager - description of how Derby implements ARIES. What
> happens at system restart. How rollbacks and commits work. Different types
> of log records used by the transaction manager - such as do, redo, undo,
> compensation, checkpoint, etc.
> 
> Row locking in tables - how are rows locked? What happens when a row spans
> pages?
> 
> Row recovery - Do/Redo/Undo actions for rows - inserts, updates, deletes.
> 
> BTree - page organisation and structure
> 
> BTree - concurrency - how does Derby handle concurrent updates to the tree -
> inserts and deletes? How are structural changes serialised? Do updates block
> readers (not as in locking but while the change is being made) or can they
> progress concurrently?
> 
> BTree - locking - data row locking or index row locking? Is next-key
> locking used for serializable reads?
> 
> BTree - recovery - Do/Redo/Undo actions for key inserts, updates, deletes.
> 
> Row scans on tables - is this handled by "store"?
> 
> Row scans in BTrees - is this handled by "store"?
> 
> Conglomerates - what is a Conglomerate?
> 
> 
> 
> 
> 
> 
> 

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Here are some ideas on how I would like to structure the documentation.


For every topic, I'd like to create:

Introduction to concepts - A general introduction to the topic.
Derby implementation details - This will be the main substance of the document
where I will describe how Derby works.
References - these will point to published papers, books, etc. that discuss
the topic in question.

In terms of topics, here is what I have come up with (let me know if things
should be added):

Terms - this will be a glossary of Derby specific terms so that I don't have
to keep explaining the same terms in every document.

Row ID - How rows are identified - RecordId and SlotId.

Row management within a page - storage format of rows, slot table, etc.

Handling of large rows - how does Derby handle rows that won't fit into one
page.

Container - what is a container?

Space management in Containers - how is space management implemented? How
does Derby locate an empty page?

Latches - are latches used by Derby? How are they implemented?

Lock management - description of lock management, lock conflict resolution,
deadlocks, lock escalation.

Buffer cache - how does Derby implement the buffer cache, and what is the
interaction between the Buffer cache and the Log, and the Buffer cache and
the Transaction manager.

Write ahead log - description of how the log is implemented - this would
mainly cover the format of log records, how log records are represented in
memory and on disk, log archiving, checkpointing, etc.

Transactions - how is an Xid allocated? What is the representation of a
transaction? How is a transaction related to a thread?

Transaction Manager - description of how Derby implements ARIES. What
happens at system restart. How rollbacks and commits work. Different types
of log records used by the transaction manager - such as do, redo, undo,
compensation, checkpoint, etc.

Row locking in tables - how are rows locked? What happens when a row spans
pages?

Row recovery - Do/Redo/Undo actions for rows - inserts, updates, deletes.

BTree - page organisation and structure

BTree - concurrency - how does Derby handle concurrent updates to the tree -
inserts and deletes? How are structural changes serialised? Do updates block
readers (not as in locking but while the change is being made) or can they
progress concurrently?

BTree - locking - data row locking or index row locking? Is next-key
locking used for serializable reads?

BTree - recovery - Do/Redo/Undo actions for key inserts, updates, deletes.

Row scans on tables - is this handled by "store"?

Row scans in BTrees - is this handled by "store"?

Conglomerates - what is a Conglomerate?







Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
RPost wrote:
> For some reason I interpreted Mike's comments to mean that Derby would not
> obtain any latches until AFTER it had recursed up the tree to find the
> highest node that needed to be split.
> 
> Then it would:
> 
> 1. obtain latches on that highest-most parent and child node
> 2. perform the split as an internal transaction
> 3. release the latches on the parent and child (why not keep the child latch
> if it is needed)
> 4. then move back down the tree to the child, which now becomes the parent
> and repeats the process at step 1 above until completed.
> 
> This approach only obtains latches immediately before they are used to
> perform a split.

I am not sure whether it starts from the root or recurses up the tree 
one level at a time. Mike, please help!

Regards


Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> "Dibyendu Majumdar" wrote:

> The question is what happens if parent P is also full and needs to be
> split. This is where things get really complicated. If my understanding
> is correct, Derby solves this in a simple and elegant way.
>
> It releases the latches on C and P, and starts the split at P with the
> separator key. So, at that point, P becomes C, and P's parent, let us
> say Q, becomes P.
>
> I think that this is repeated recursively up the tree until the original
> P has been split with room to insert the separator key. Then Derby
> proceeds with the original split.
>

This description would work and may indeed be what Mike meant.

For some reason I interpreted Mike's comments to mean that Derby would not
obtain any latches until AFTER it had recursed up the tree to find the
highest node that needed to be split.

Then it would:

1. obtain latches on that highest-most parent and child node
2. perform the split as an internal transaction
3. release the latches on the parent and child (why not keep the child latch
if it is needed)
4. then move back down the tree to the child, which now becomes the parent
and repeats the process at step 1 above until completed.

This approach only obtains latches immediately before they are used to
perform a split.

Is it necessary to latch a page in order to determine if the page is full
and must be split? If not, why obtain latches before recursing to the top
page that must be split?


Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
I will be submitting a compare/contrast paper with ARIES/IM.  Hopefully
the following will answer this question.

Unlike ARIES/IM, SMOs (structure modification operations - ie. splits)
only happen top down.  This is not as concurrent as bottom-up
ARIES/IM, but was simpler.  Do note though that, as in ARIES/IM, "Not more
than 2 index pages are held latched simultaneously at anytime.  In order
to improve concurrency and to avoid deadlocks involving latches, even
those latches are not held while waiting for a lock which is not
immediately grantable.  No data page latch is held or acquired during an
index access.  Latch coupling is used while traversing the tree - ie.
the latch on a parent page is held while requesting a latch on a child
page."

So in the simple case of a split, an exclusive latch is held on P (parent)
and C (child).  If there is room in P, then a new page C1 (child right) is
created, a new discriminator is inserted into P, and rows are moved to C1.
Latches are held on P and C for the duration of the split, and then
released.

The hard case is when P does not have room for the discriminator key.  In
this case all latches are released, and derby does a split pass from top to
bottom, splitting every internal node that does not have room for the
discriminator key.  Note this may result in more splits than necessary for
this particular insert, but the assumption is that the splits will have
to happen eventually anyway.  After this split pass is done, the search
for the insert starts again from the top down, but it must once again check
for space because it has given up all its latches and someone else may
have gotten to the space first.

An optimization is possible: remember C and see if it is still the right
location, and/or use sideways pointers to search right rather than redo
the search of the tree.
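
In sketch form, the insert/split loop looks roughly like this (Java-flavored
pseudocode with invented names; the real entry points are BTreeController's
insert path and the splitFor() methods mentioned elsewhere in this thread):

    class TopDownSplitSketch {
        interface Leaf {
            boolean hasRoomFor(byte[] key); // after purging committed deletes
            void insert(byte[] key);
            void unlatch();
        }
        interface Btree {
            Leaf searchToLeaf(byte[] key); // ladder-latches down the tree
            // Inside an internal transaction: split every node on the path
            // that lacks room for the discriminator key, then release all
            // latches.
            void splitPassTopDown(byte[] key);
        }

        void insert(Btree tree, byte[] key) {
            while (true) {
                Leaf leaf = tree.searchToLeaf(key);
                if (leaf.hasRoomFor(key)) {
                    leaf.insert(key);
                    leaf.unlatch();
                    return;
                }
                // Give up all latches, split top-down, then re-search from
                // the root.  Space must be re-checked on the retry: another
                // transaction may have consumed it in the meantime.
                leaf.unlatch();
                tree.splitPassTopDown(key);
            }
        }
    }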

Dibyendu Majumdar wrote:

> Philip Bohannon wrote some time back:
> 
>> The classic problem here is a split taking place during a search.  Say  a
>> page C contains 50-100, and C's parent in the tree is P, which has an
>> entry for 50 and 101, with the entry for 50 pointing to C.
>>
>> Now an inserter comes in, inserts 51 and fills up C (or fills up C's
>> child, causes a split, and that fills up C).  So C splits into C and C',
>> which is added to the right of C.
>>
>> Since the inserter can't get a lock on P while holding a lock on C
>> (deadlock), it releases the lock on P.  At this point, a reader comes
>> down
>> the tree looking for 99, looks at P and gets a pointer to C not C'. Now
>> inserter starts waiting for a lock on P.  When reader gets C, 99 has been
>> moved off to C'.
>>
>> It would be interesting to know how Derby handles this (for example,
>> [Yao]
>> proposes having a pointer from C to C', and I forget what Aries IM
>> does at
>> the moment, but I think it starts over at the top if you fail to find a
>> key at the rightmost point in the page...).
> 
> 
> Philip,
> 
> I don't fully understand how Derby works, but from what I have seen so
> far, I think that Derby avoids the above situation as follows:
> 
> It latches the parent P first and then the child C exclusively - and
> only then starts the split. The searcher has to wait until the split is
> over.
> 
> In Mike's words:
> 
> [In order to prevent deadlocks latches requested while holding other
> latches are always requested top/down and left to right.  Btree splits
> are always left to right.  If for any reason the code needs to break
> this protocol then it will first request the latch NOWAIT and if it
> can't get the latch it will release all current latches and wait for the
> latch it is trying to get, and then after obtaining it go about getting
> the other latches it needs for the particular operation.  While
> traversing down the tree Derby may hold 2 latches: one on parent and one
> on child.]
> 
> The question is what happens if parent P is also full and needs to be
> split. This is where things get really complicated. If my understanding
> is correct, Derby solves this in a simple and elegant way.
> 
> It releases the latches on C and P, and starts the split at P with the
> separator key. So, at that point, P becomes C, and P's parent, let us
> say Q, becomes P.
> 
> I think that this is repeated recursively up the tree until the original
> P has been split with room to insert the separator key. Then Derby
> proceeds with the original split.
> 
> Perhaps Mike or someone else can confirm or fill in with more details.
> 
> I have not seen the above algorithm in any of the papers I have read - but
> then, my reading is limited to the following papers:
> 
> C.Mohan and Frank Levine. ARIES/IM: an efficient and high concurrency
> index management method with write-ahead logging. ACM SIGMOD Record. V21
> N2, P371-380, June 1 1992.
> 
> Jim Gray and Andreas Reuter. Chapter 15: Access Paths. Transaction
> Processing: Concepts and Techniques. Morgan Kaufmann Publishers, 1993
> 
> Philip L. Lehman and S.Bing Yao. Efficient Locking for Concurrent
> Operations on B-Trees. Readings in Database Systems, Third Edition,
> 1998. Morgan Kaufmann Publishers.
> 
> David Lomet and Betty Salzburg. Access method concurrency with recovery.
> ACM SIGMOD, V21 N2, p351-360, June 1 1992.
> 
> 
> Regards
> 
> Dibyendu
> 
> 

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Philip Bohannon wrote some time back:
> The classic problem here is a split taking place during a search.  Say  a
> page C contains 50-100, and C's parent in the tree is P, which has an
> entry for 50 and 101, with the entry for 50 pointing to C.
> 
> Now an inserter comes in, inserts 51 and fills up C (or fills up C's
> child, causes a split, and that fills up C).  So C splits into C and C',
> which is added to the right of C.
> 
> Since the inserter can't get a lock on P while holding a lock on C
> (deadlock), it releases the lock on P.  At this point, a reader comes down
> the tree looking for 99, looks at P and gets a pointer to C not C'. Now
> inserter starts waiting for a lock on P.  When reader gets C, 99 has been
> moved off to C'.
> 
> It would be interesting to know how Derby handles this (for example, [Yao]
> proposes having a pointer from C to C', and I forget what Aries IM does at
> the moment, but I think it starts over at the top if you fail to find a
> key at the rightmost point in the page...).

Philip,

I don't fully understand how Derby works, but from what I have seen so 
far, I think that Derby avoids the above situation as follows:

It latches the parent P first and then the child C exclusively - and 
only then starts the split. The searcher has to wait until the split is 
over.

In Mike's words:

[In order to prevent deadlocks latches requested while holding other 
latches are always requested top/down and left to right.  Btree splits 
are always left to right.  If for any reason the code needs to break 
this protocol then it will first request the latch NOWAIT and if it 
can't get the latch it will release all current latches and wait for the 
latch it is trying to get, and then after obtaining it go about getting 
the other latches it needs for the particular operation.  While 
traversing down the tree Derby may hold 2 latches: one on parent and one 
on child.]

The question is what happens if parent P is also full and needs to be 
split. This is where things get really complicated. If my understanding 
is correct, Derby solves this in a simple and elegant way.

It releases the latches on C and P, and starts the split at P with the 
separator key. So, at that point, P becomes C, and P's parent, let us 
say Q, becomes P.

I think that this is repeated recursively up the tree until the original 
P has been split with room to insert the separator key. Then Derby 
proceeds with the original split.

Perhaps Mike or someone else can confirm or fill in with more details.

I have not seen the above algorithm in any of the papers I have read - but
then, my reading is limited to the following papers:

C.Mohan and Frank Levine. ARIES/IM: an efficient and high concurrency 
index management method with write-ahead logging. ACM SIGMOD Record. V21 
N2, P371-380, June 1 1992.

Jim Gray and Andreas Reuter. Chapter 15: Access Paths. Transaction 
Processing: Concepts and Techniques. Morgan Kaufmann Publishers, 1993

Philip L. Lehman and S.Bing Yao. Efficient Locking for Concurrent 
Operations on B-Trees. Readings in Database Systems, Third Edition, 
1998. Morgan Kaufmann Publishers.

David Lomet and Betty Salzburg. Access method concurrency with recovery. 
ACM SIGMOD, V21 N2, p351-360, June 1 1992.


Regards

Dibyendu


RE: Derby architecture/design documents

Posted by Philip Bohannon <bo...@lucent.com>.
<delurk>
Wow, super help.

(Hi, my name is Philip Bohannon.  I work at Bell Labs and have built and
debugged some multi-level concurrency control & recovery schemes in the
past.  Thanks to Mike for peppering you with so many coherent
questions...)

I am adding a couple of questions and comments below.  One general comment
is that it might help to add an occasional pointer to the code; that is,
mention which class/methods implement the algorithm you are outlining.
With an overview like this, perusing the code will be vastly more pleasant!


> -----Original Message-----
> From: Mike Matrigali [mailto:mikem_app@sbcglobal.net]
> Sent: Friday, January 21, 2005 2:37 PM
> To: Derby Development
> Subject: Re: Derby architecture/design documents
>
>
> Ok, that's a lot of questions - not going to get to all, but here's a
> start.  This posting is going to try to answer the btree questions.
> I'll do the recovery questions next.  The compare/contrast with ARIES
> papers will take some time, may not get to that right away.
>
> Your suggestion of starting at high level, comparing/contrasting with
> published research should work well with the btree and recovery design.
> As you stated both of these areas in derby are based on the ARIES
> algorithms.
> I will schedule some time in the future to look carefully at those
> papers and produce the compare/contrast for those 2 areas.
>
> Here are some answers off the top of my head for the btree questions,
> that area is most familiar to me:
>
> Btrees are based on ARIES.  The btree is a tree of pages, linked to
> one another in the standard way.  Derby uses latches on pages to get
> exclusive access to the page while reading or writing the page (Derby
> only uses exclusive latches, no shared latches are used).  In order
> to prevent deadlocks latches requested while holding other latches are
> always requested top/down and left to right.  Btree splits are always
> left to right.  If for any reason the code needs to break this protocol
> then it will first request the latch NOWAIT and if it can't get the
> latch it will release all current latches and wait for the latch it is
> trying to get, and then after obtaining it go about getting the other
> latches it needs for the particular operation.  While traversing down
> the tree Derby may hold 2 latches: one on parent and one on child.  It
> then continues doing "ladder" latching down the tree releasing the
> highest node when it has successfully got a new lower node latch.
> Latches are short term, only held while reading/modifying the page,
> never held while an I/O is happening.  Structure modifications are all
> isolated from other operations through the use of latches.

The classic problem here is a split taking place during a search.  Say  a
page C contains 50-100, and C's parent in the tree is P, which has an
entry for 50 and 101, with the entry for 50 pointing to C.

Now an inserter comes in, inserts 51 and fills up C (or fills up C's
child, causes a split, and that fills up C).  So C splits into C and C',
which is added to the right of C.

Since the inserter can't get a lock on P while holding a lock on C
(deadlock), it releases the lock on P.  At this point, a reader comes down
the tree looking for 99, looks at P and gets a pointer to C not C'. Now
inserter starts waiting for a lock on P.  When reader gets C, 99 has been
moved off to C'.

It would be interesting to know how Derby handles this (for example, [Yao]
proposes having a pointer from C to C', and I forget what Aries IM does at
the moment, but I think it starts over at the top if you fail to find a
key at the rightmost point in the page...).

>
> Derby uses data only locking for its logical row level locking.  All
> isolation level implementation is done using logical locks (Derby has
> no non-locking isolation support like versioning).  A quick rundown of
> how locks are used to implement isolation levels follows; for this
> discussion I will use the jdbc isolation notation.  In data only locking
> the key for all locks, whether obtained in the btree or in the base table,
> is the address of the base table row.  In the btree case the address of
> the base table row is always the last column of the leaf index entry:
>
> serializable: Derby uses "previous key" locking to prevent phantoms from
>               being inserted into a range being accessed by a scan.  A
>               scan always locks the index row just "previous" to the
>               first entry that satisfies the scan condition, and then all
>               index entries going forward until the end of the scan.
>               All row locks are held until end of transaction.  Inserts
>               also get previous key locks, which will block if the
>               existing row is locked.
>
> repeatable read:  Same as serializable except that no previous key
>                   locking is done.
>
> read committed: write locks held until end of transaction.  Read locks
>                 are released once the query in question no longer needs
>                 them.
>
> read uncommitted: no row locks are gotten.  The code still gets table
> level intent locks to prevent concurrent ddl during the query.
>
> In both the btree and the heap, deletes are first executed by marking
> a "deleted" bit.  This is to ensure space on the page for abort, since
> row level locking will allow other rows on the page to be modified
> concurrently with the transaction executing the delete.  The system uses
> a background daemon to schedule work after commit to reclaim the space
> of the deleted rows.  A row marked deleted can be "purged" if one can
> obtain a lock on it (if it was an uncommitted delete then the
> transaction doing the commit would still have an exclusive lock on the
> row).

One small note: at recovery time such locks are usually not held.  Thus it
would be required not to restart this garbage collector until all aborted
transactions are rolled back.  And of course, the garbage collector would
need to get the lock, then re-check the bit to see if it is still 1.

Of course, I have no reason to think Derby doesn't handle such cases, but
asking the questions will at least help me learn :).

</delurk>

Philip Bohannon
Computing Sciences Research Center
Lucent Technologies -- Bell Laboratories

>
>
>
> Dibyendu Majumdar wrote:
> | "Mike Matrigali" wrote:
> |
> |
> |>I am happy to answer any specific questions about the store module you
> |>have, or can find someone to answer them if I don't know.  I
> |>think questions should be directed to the mail list so that everyone
> |>can share in the discussion.  If I miss something feel free to ping
> |>me to get an answer.  I am an IBM employee still working on the derby
> |>codeline and was one of the original developers at Cloudscape when it
> |>first began as a startup.
> |
> |
> | Thank you very much, this will really help me to get started.
> |
> |
> |>Is there any model/standard in apache or opensource that we could follow
> |>or at least look at for a format for this kind of information.
> |
> |
> | I have some thoughts on how I want to do this - these are given below.
> |
> |
> |>I personally really like design information embedded in the actual code
> |>it relates to, rather than having separate documents.
> |
> |
> | I think that well documented code is always a good idea. And there is an
> | argument that documentation closer to the code is more likely to be kept
> | up-to-date. It is also a great help to anyone fixing the code.
> |
> | However, separate documents are also very useful, especially because you
> | can illustrate concepts and design with nice diagrams and flow charts. It
> | also enables people to grasp the design more easily, as things can be
> | explained top-down rather than jumping straight to implementation details.
> |
> |
> |>Is there any more specific area you are looking at, rather than just
> |>store.
> |
> |
> |>I usually divide the store into the following, some overlap:
> |>o store module interface (java protocol used by other modules to
> |>interact with store, this is actually fairly well documented in the
> |>javadoc already available)
> |>o access methods (btree/heap existing, future might be gist/rtree/...)
> |>o recovery (both restart and crash recovery, logging, xa recovery, ...)
> |>o transactions (locking, isolation levels, xa transactions)
> |>o data (actual physical storage of rows/columns on pages, page
> |>management, cache  - the format of rows/columns/pages is reasonably
> |>documented in the code - I think in the javadoc, but might just be in
> |>comments)
> |>o misc (sorting, encryption, ...)
> |
> |
> | Well, I shall be documenting the whole of it, but will start with a
> | particular item, and then move on to other items.
> |
> | I think that documentation should be at two levels.
> |
> | The more useful level is where ideas and concepts are explained. I am
> | sure that Cloudscape was built upon ideas and concepts that had been
> | invented before by others, and perhaps built its own innovations upon
> | those foundations.
> | So I want to explore the origins of those ideas: where did the ideas
> | come from? How did the Cloudscape implementation use those ideas? I'd
> | like to give some examples to illustrate what I mean.
> |
> | If we take the transaction manager for a start: as far as I understand,
> | Derby uses the ARIES algorithm for write-ahead logging. What I'd like
> | to understand and document is how the Derby implementation of ARIES is
> | different from the published algorithms. Does Derby use nested
> | top-level actions or sub transactions to manage internal atomic
> | operations? I think I saw a comment in the code somewhere that Derby
> | does not yet implement the transaction list in checkpoint records. I
> | assume that Derby does page-oriented redo during the recovery redo
> | pass, and then page-oriented or logical undo during the undo pass.
> |
> | For the BTREE implementation, I'd like to understand whether Derby
> | implements ARIES/IM or some other algorithm. How does it handle
> | structural consistency of the tree during concurrent updates? What are
> | the locking/latching implications of the algorithm used? Are accesses
> | to the entire tree or subtree serialised during structural
> | modifications? How is repeatable read ensured? Does Derby use data row
> | locking or index row locking? Does Derby delete rows logically first, and
> | physically afterwards? How does it determine that a deleted row can
> | be physically removed, ie, the transaction that deleted the row has
> | committed?
> |
> | I understand that Derby uses Log Actions to perform both normal
> | operations and recovery operations. Thus a key insert is first encoded
> | as a log operation, logged, and then executed. This is very neat.
> |
> | I also get the impression by reading the code that a BTREE is
> | implemented on top of a regular table like structure. This is also
> | very neat as it reuses all the code for handling rows, locking, etc.
> |
> | These are a few examples, and I could list several more. What would be
> | very useful is if you and other IBM developers could identify the
> | algorithms that Derby uses. Perhaps you could provide references to
> | published research papers, books, and patents. I could start by writing
> | a high level overview of the concepts used by Derby, and gradually build
> | up the detail by reading the code, and by asking questions in this forum.
> |
> | The second level of documentation can be built from the comments
> | in the code. This is the implementation level where I can describe how
> | Derby actually implements something, what the file formats are, how
> | the write ahead log is implemented, what a conglomerate is, what a
> | record id is and how it is different from a slot id, etc.
> |
> | In terms of starting this level of documentation, I would first like
> | to concentrate on the "raw" module, which I suspect is at the heart
> | of all other modules.
> |
> | Thanks and Regards
> |
> | Dibyendu
> |
> |
> |

Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Ok, that's a lot of questions - not going to get to all, but here's a
start.  This posting is going to try to answer the btree questions.
I'll do the recovery questions next.  The compare/contrast with ARIES
papers will take some time, may not get to that right away.

Your suggestion of starting at high level, comparing/contrasting with
published research should work well with the btree and recovery design.
As you stated both of these areas in derby are based on the ARIES
algorithms.
I will schedule some time in the future to look carefully at those
papers and produce the compare/contrast for those 2 areas.

Here are some answers off the top of my head for the btree questions,
that area is most familiar to me:

Btrees are based on ARIES.  The btree is a tree of pages, linked to
one another in the standard way.  Derby uses latches on pages to get
exclusive access to the page while reading or writing the page (Derby
only uses exclusive latches, no shared latches are used).  In order
to prevent deadlocks latches requested while holding other latches are
always requested top/down and left to right.  Btree splits are always
left to right.  If for any reason the code needs to break this protocol
then it will first request the latch NOWAIT and if it can't get the
latch it will release all current latches and wait for the latch it is
trying to get, and then after obtaining it go about getting the other
latches it needs for the particular operation.  While traversing down
the tree Derby may hold 2 latches: one on parent and one on child.  It
then continues doing "ladder" latching down the tree releasing the
highest node when it has successfully got a new lower node latch.
Latches are short term, only held while reading/modifying the page,
never held while an I/O is happening.  Structure modifications are all
isolated from other operations through the use of latches.
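
To make the "ladder" latching descent concrete, here is a minimal Java
sketch; the Page interface is a hypothetical stand-in, not Derby's
actual store API:

    // Hypothetical stand-in for a latchable btree page; not Derby's API.
    interface Page {
        void latch();              // exclusive latch (no shared latches)
        void unlatch();
        boolean isLeaf();
        Page childFor(byte[] key); // child page whose key range covers 'key'
    }

    final class LadderDescent {
        // Descend from the root holding at most two latches at a time:
        // latch the child first, then release the parent (top/down order).
        static Page descendToLeaf(Page root, byte[] key) {
            root.latch();
            Page parent = root;
            while (!parent.isLeaf()) {
                Page child = parent.childFor(key);
                child.latch();     // latches requested top-down, left to right
                parent.unlatch();  // drop the higher node once the lower is held
                parent = child;
            }
            return parent;         // latched leaf whose range covers 'key'
        }
    }

Latching the child before unlatching the parent is what keeps a
concurrent split from slipping in between the two steps.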

Derby uses data-only locking for its logical row-level locking.  All
isolation level implementation is done using logical locks (Derby has
no non-locking isolation support like versioning).  A quick rundown of
how locks are used to implement each isolation level follows; for this
discussion I will use the JDBC isolation notation.  In data-only locking
the key for all locks, whether obtained in the btree or in the base table,
is the address of the base table row.  In the btree case the address of
the base table row is always the last column of the leaf index entry.

serializable: Derby uses "previous key" locking to prevent phantoms from
    being inserted into a range being accessed by a scan.  A scan always
    locks the index row just "previous" to the first entry that satisfies
    the scan condition, and then all index entries going forward until
    the end of the scan.  All row locks are held until end of transaction.
    Inserts also get previous key locks, which will block if the existing
    row is locked.

repeatable read: Same as serializable, except that no previous key
    locking is done.

read committed: Write locks are held until end of transaction.  Read
    locks are released once the query in question no longer needs them.

read uncommitted: No row locks are gotten.  The code still gets table
    level intent locks to prevent concurrent ddl during the query.
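
For reference, here is how those levels are requested through the
standard JDBC API; a minimal sketch (the database name and table are
made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class IsolationDemo {
        public static void main(String[] args) throws Exception {
            Connection conn =
                DriverManager.getConnection("jdbc:derby:demoDB;create=true");
            conn.setAutoCommit(false);
            Statement s = conn.createStatement();
            s.executeUpdate("CREATE TABLE t(id INT)");
            conn.commit();

            // serializable: the scan below takes previous-key locks, so a
            // concurrent insert into the scanned range blocks until commit.
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            ResultSet rs = s.executeQuery(
                "SELECT id FROM t WHERE id BETWEEN 10 AND 20");
            while (rs.next()) { }   // row locks accumulate as the scan moves
            conn.commit();          // locks released at end of transaction

            // read committed: read locks are released as soon as the query
            // no longer needs them, so phantoms are possible on re-read.
            conn.setTransactionIsolation(
                Connection.TRANSACTION_READ_COMMITTED);
            conn.close();
        }
    }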

In both the btree and the heap, deletes are first executed by marking
a "deleted" bit.  This is to ensure space on the page for abort, since
row level locking will allow other rows on the page to be modified
concurrently with the transaction executing the delete.  The system uses
a background daemon to schedule work after commit to reclaim the space
of the deleted rows.  A row marked deleted can be "purged" if one can
obtain a lock on it (if it were an uncommitted delete, the
transaction that did the delete would still hold an exclusive lock on
the row).
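
A schematic sketch of that purge test; all of the types below are
hypothetical stand-ins, not Derby's daemon or lock manager interfaces:

    // Hypothetical stand-ins; not Derby's real store or lock interfaces.
    interface Row { boolean isDeleted(); Object rowLocation(); }
    interface DataPage { Iterable<Row> records(); void purge(Row r); }
    interface LockManager { boolean lockNoWait(Object owner, Object row); }

    final class PostCommitPurger {
        // If an exclusive lock on a deleted row can be granted without
        // waiting, the deleting transaction must have committed, so the
        // space is safe to reclaim; a refused lock means the delete is
        // still uncommitted and the row is skipped.
        static void purgeDeletedRows(DataPage page, LockManager locks,
                                     Object daemonTxn) {
            for (Row r : page.records()) {
                if (r.isDeleted()
                        && locks.lockNoWait(daemonTxn, r.rowLocation())) {
                    page.purge(r);
                }
            }
        }
    }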



Dibyendu Majumdar wrote:
| "Mike Matrigali" wrote:
|
|
|>I am happy to answer any specific questions about the store module you
|>have, or can find someone to answer them if I don't know.  I
|>think questions should be directed to the mail list so that everyone
|>can share in the discussion.  If I miss something feel free to ping
|>me to get an answer.  I am an IBM employee still working on the derby
|>codeline and was one of the original developers at Cloudscape when it
|>first began as a startup.
|
|
| Thank you very much, this will really help me to get started.
|
|
|>Is there any model/standard in apache or opensource that we could follow
|>or at least look at for a format for this kind of information?
|
|
| I have some thoughts on how I want to do this - these are given below.
|
|
|>I personally really like design information embedded in the actual code
|>it relates to, rather than having separate documents.
|
|
| I think that well documented code is always a good idea. And there is an
| argument that documentation closer to the code is more likely to be kept
| up-to-date. It is also a great help to anyone fixing the code.
|
| However, separate documents are also very useful, especially because
| you can
| illustrate concepts and design with nice diagrams and flow charts. It also
| enables people to grasp the design more easily, as things can be explained
| top-down rather than jumping straight to implementation details.
|
|
|>Is there any more specific area you are looking at, rather than just
|>store?
|
|
|>I usually divide the store into the following, some overlap:
|>o store module interface (java protocol used by other modules to
|>interact with store, this is actually fairly well documented in the
|>javadoc already available)
|>o access methods (btree/heap existing, future might be gist/rtree/...)
|>o recovery (both restart and crash recovery, logging, xa recovery,  ...)
|>o transactions (locking, isolation levels, xa transactions)
|>o data (actual physical storage of rows/columns on pages, page
|>management, cache  - the format of rows/columns/pages is reasonably
|>documented in the code - I think in the javadoc, but might just be in
|>comments)
|>o misc (sorting, encryption, ...)
|
|
| Well, I shall be documenting the whole of it, but will start with a
| particular item, and then move on to other items.
|
| I think that documentation should be at two levels.
|
| The more useful level is where ideas and concepts are explained. I am sure
| that Cloudscape was built upon ideas and concepts that had been invented
| before by others, and perhaps built its own innovations upon those
| foundations.
| So I want to explore the origins of those ideas, where did the ideas come
| from? How did Cloudscape implementation use those ideas? I'd like to give
| some examples to illustrate what I mean.
|
| If we take the transaction manager for a start. As far as I understand,
| Derby uses the ARIES algorithm for write-ahead logging. What I'd like to
| understand and document is how the Derby implementation of ARIES is
| different from the published algorithms. Does Derby use nested top-level
| actions or sub transactions to manage internal atomic operations? I think
| I saw a comment in the code somewhere that Derby does not yet
| implement the transaction list in checkpoint records. I assume that Derby
| does page-oriented redo during recovery redo pass, and then
| page-oriented or logical undo during the undo pass.
|
| For the BTREE implementation, I'd like to understand whether Derby
| implements ARIES/IM or some other algorithm. How does it handle
| structural consistency of the tree during concurrent updates? What are
| the locking/latching implications of the algorithm used? Are accesses to
| the entire tree or subtree serialised during structural modifications?
| How is repeatable read ensured? Does Derby use data row locking or
| index row locking? Does Derby delete rows logically first, and
| physically afterwards? How does it determine that a deleted row can
| be physically removed, i.e., that the transaction that deleted the row has
| committed?
|
| I understand that Derby uses Log Actions to perform both normal
| operations and recovery operations. Thus a key insert is first encoded as
| a log operation, logged, and then executed. This is very neat.
|
| I also get the impression by reading the code that a BTREE is
| implemented on top of a regular table like structure. This is also
| very neat as it reuses all the code for handling rows, locking, etc.
|
| These are a few examples, and I could list several more. What would be
| very useful is if you and other IBM developers could identify the
| algorithms
| that Derby uses. Perhaps you could provide references to published
| research papers, books, and patents. I could start by writing a high level
| overview of the concepts used by Derby, and gradually build up the
| detail by reading the code, and by asking questions in this forum.
|
| The second level of documentation can be built from the comments
| in the code. This is the implementation level where I can describe how
| Derby actually implements something, what the file formats are, how the
| write ahead log is implemented, what a conglomerate is, what a
| record id is and how it is different from a slot id, etc.
|
| In terms of starting this level of documentation, I would first like to
| concentrate on the "raw" module, which I suspect is at the heart
| of all other modules.
|
| Thanks and Regards
|
| Dibyendu
|
|
|

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
"Mike Matrigali" wrote:

> I am happy to answer any specific questions about the store module you
> have, or can find someone to answer them if I don't know.  I
> think questions should be directed to the mail list so that everyone
> can share in the discussion.  If I miss something feel free to ping
> me to get an answer.  I am an IBM employee still working on the derby
> codeline and was one of the original developers at Cloudscape when it
> first began as a startup.

Thank you very much, this will really help me to get started.

> Is there any model/standard in apache or opensource that we could follow
> or at least look at for a format for this kind of information?

I have some thoughts on how I want to do this - these are given below.

> I personally really like design information embedded in the actual code
> it relates to, rather than having separate documents.

I think that well documented code is always a good idea. And there is an
argument that documentation closer to the code is more likely to be kept
up-to-date. It is also a great help to anyone fixing the code.

However, separate documents are also very useful, especially because you can
illustrate concepts and design with nice diagrams and flow charts. It also
enables people to grasp the design more easily, as things can be explained
top-down rather than jumping straight to implementation details.

> Is there any more specific area you are looking at, rather than just
> store?

> I usually divide the store into the following, some overlap:
> o store module interface (java protocol used by other modules to
> interact with store, this is actually fairly well documented in the
> javadoc already available)
> o access methods (btree/heap existing, future might be gist/rtree/...)
> o recovery (both restart and crash recovery, logging, xa recovery,  ...)
> o transactions (locking, isolation levels, xa transactions)
> o data (actual physical storage of rows/columns on pages, page
> management, cache  - the format of rows/columns/pages is reasonably
> documented in the code - I think in the javadoc, but might just be in
> comments)
> o misc (sorting, encryption, ...)

Well, I shall be documenting the whole of it, but will start with a
particular item, and then move on to other items.

I think that documentation should be at two levels.

The more useful level is where ideas and concepts are explained. I am sure
that Cloudscape was built upon ideas and concepts that had been invented
before by others, and perhaps built its own innovations upon those
foundations.
So I want to explore the origins of those ideas, where did the ideas come
from? How did Cloudscape implementation use those ideas? I'd like to give
some examples to illustrate what I mean.

If we take the transaction manager for a start. As far as I understand,
Derby uses the ARIES algorithm for write-ahead logging. What I'd like to
understand and document is how the Derby implementation of ARIES is
different from the published algorithms. Does Derby use nested top-level
actions or sub transactions to manage internal atomic operations? I think
I saw a comment in the code somewhere that Derby does not yet
implement the transaction list in checkpoint records. I assume that Derby
does page-oriented redo during recovery redo pass, and then
page-oriented or logical undo during the undo pass.

For the BTREE implementation, I'd like to understand whether Derby
implements ARIES/IM or some other algorithm. How does it handle
structural consistency of the tree during concurrent updates? What are
the locking/latching implications of the algorithm used? Are accesses to
the entire tree or subtree serialised during structural modifications?
How is repeatable read ensured? Does Derby use data row locking or
index row locking? Does Derby delete rows logically first, and
physically afterwards? How does it determine that a deleted row can
be physically removed, i.e., that the transaction that deleted the row has
committed?

I understand that Derby uses Log Actions to perform both normal
operations and recovery operations. Thus a key insert is first encoded as
a log operation, logged, and then executed. This is very neat.

I also get the impression by reading the code that a BTREE is
implemented on top of a regular table like structure. This is also
very neat as it reuses all the code for handling rows, locking, etc.

These are a few examples, and I could list several more. What would be
very useful is if you and other IBM developers could identify the algorithms
that Derby uses. Perhaps you could provide references to published
research papers, books, and patents. I could start by writing a high level
overview of the concepts used by Derby, and gradually build up the
detail by reading the code, and by asking questions in this forum.

The second level of documentation can be built from the comments
in the code. This is the implementation level where I can describe how
Derby actually implements something, what the file formats are, how the
write ahead log is implemented, what a conglomerate is, what a
record id is and how it is different from a slot id, etc.

In terms of starting this level of documentation, I would first like to
concentrate on the "raw" module, which I suspect is at the heart
of all other modules.

Thanks and Regards

Dibyendu



Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.

I am happy to answer any specific questions about the store module you
have, or can find someone to answer them if I don't know.  I
think questions should be directed to the mail list so that everyone
can share in the discussion.  If I miss something feel free to ping
me to get an answer.  I am an IBM employee still working on the derby
codeline and was one of the original developers at Cloudscape when it
first began as a startup.

Is there any model/standard in apache or opensource that we could follow
or at least look at for a format for this kind of information?

What little documentation exists for the store is in comments in
the code.  As a startup we did not do much in the way of design documents.

I personally really like design information embedded in the actual code
it relates to, rather than having separate documents.  All the store
files are javadoc enabled so it may be possible to expand on what is
already there and then extract it into the form you are looking for -
maybe just a web page with links to extracted javadoc?

Is there any more specific area you are looking at, rather than just
store?  I usually divide the store into the following, some overlap:

o store module interface (java protocol used by other modules to
interact with store, this is actually fairly well documented in the
javadoc already available)
o access methods (btree/heap existing, future might be gist/rtree/...)
o recovery (both restart and crash recovery, logging, xa recovery,  ...)
o transactions (locking, isolation levels, xa transactions)
o data (actual physical storage of rows/columns on pages, page
management, cache  - the format of rows/columns/pages is reasonably
documented in the code - I think in the javadoc, but might just be in
comments)
o misc (sorting, encryption, ...)

Mike Matrigali


Dibyendu Majumdar wrote:
| Hello,
|
| If it is okay to do this, I'd like to volunteer to extract information out
| of the Java source files and create design documents. Can I also borrow
| stuff from the presentations? Where I get stuck, if IBM developers can
| answer questions, that will be great.
|
| I am interested in the "store" module so I can start from there. Is there
| any place where work-in-progress documents can be stored ?
|
| BTW I have been using Derby as one of the databases I test against in my
| project SimpleJTA - http://www.simplejta.org. I have to say that the Derby
| XA implementation seems more stable and correct than Oracle's. I am eagerly
| looking forward to forthcoming support for XA in the Network Server.
|
| Regards
|
|
|

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike Matrigali wrote:

> As part of your project are you developing any sort of jdbc XA test
> suites that you could possibly donate to derby?  The derby XA testing
> has always been on the light side, not having had a lot of real world use
> in house.  Any more real world testing that can be added can only help.

Okay, I will do, once I have them in a form that is reusable.

> I saw on your site you mention an issue with derby, XA, branch and
> single phase commit.  Could you file a derby issue for this if you
> believe the derby implementation does not meet jdbc and XA specification
> - especially if you have a jdbc example showing the problem.

I will file an issue. I have a test case that demonstrates the problem, and
I will dig up the relevant XA specification.

Thanks and Regards

Dibyendu



XA One-Phase commit - was Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike wrote:

> I saw on your site you mention an issue with derby, XA, branch and
> single phase commit.  Could you file a derby issue for this if you
> believe the derby implementation does not meet jdbc and XA specification

I have checked the XA specification and to my chagrin (for I hate to be
wrong ;-), Derby is right and I am wrong. The documentation for xa_commit()
says:

"All associations for *xid must have been ended by using xa_end() with
TMSUCCESS set in flags."
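
In JDBC terms the required sequence looks like the minimal sketch below,
using the standard javax.transaction.xa interfaces and Derby's embedded
XA data source (the database name and the trivial Xid implementation are
illustrative assumptions):

    import javax.sql.XAConnection;
    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;
    import org.apache.derby.jdbc.EmbeddedXADataSource;

    public class OnePhaseDemo {
        // Trivial Xid, for illustration only.
        static final class SimpleXid implements Xid {
            public int getFormatId() { return 4660; }
            public byte[] getGlobalTransactionId() { return new byte[] { 1 }; }
            public byte[] getBranchQualifier() { return new byte[] { 1 }; }
        }

        public static void main(String[] args) throws Exception {
            EmbeddedXADataSource ds = new EmbeddedXADataSource();
            ds.setDatabaseName("demoDB");
            ds.setCreateDatabase("create");
            XAConnection xac = ds.getXAConnection();
            XAResource xar = xac.getXAResource();
            Xid xid = new SimpleXid();

            xar.start(xid, XAResource.TMNOFLAGS);
            xac.getConnection().createStatement()
               .executeUpdate("CREATE TABLE t(i INT)");
            xar.end(xid, XAResource.TMSUCCESS);  // end the association first
            xar.commit(xid, true);               // true = one-phase commit
        }
    }

Calling commit() before ending the association is exactly what the spec
passage above forbids; a resource manager is entitled to reject it as a
protocol error.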

The good thing is that it builds my confidence in Derby's XA implementation.
As I test more of Derby's XA functionality, I will report my findings. I
will remove the erroneous statement from my issues list.

Regards

Dibyendu






Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.

As part of your project are you developing any sort of jdbc XA test
suites that you could possibly donate to derby?  The derby XA testing
has always been on the light side, not having had a lot of real world use
in house.  Any more real world testing that can be added can only help.

Any tests you could donate would help ensure that the network server
version will work well with your project in the future.

I saw on your site you mention an issue with derby, XA, branch and
single phase commit.  Could you file a derby issue for this if you
believe the derby implementation does not meet jdbc and XA specification
- especially if you have a jdbc example showing the problem.

Dibyendu Majumdar wrote:
| Hello,
|
| If it is okay to do this, I'd like to volunteer to extract information out
| of the Java source files and create design documents. Can I also borrow
| stuff from the presentations? Where I get stuck, if IBM developers can
| answer questions, that will be great.
|
| I am interested in the "store" module so I can start from there. Is there
| any place where work-in-progress documents can be stored ?
|
| BTW I have been using Derby as one of the databases I test against in my
| project SimpleJTA - http://www.simplejta.org. I have to say that the Derby
| XA implementation seems more stable and correct than Oracle's. I am eagerly
| looking forward to forthcoming support for XA in the Network Server.
|
| Regards
|
|
|

Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Yes, SMOs happen in nested transactions and commit separately from
the transaction that initiated the split.  Once the split is
committed it stays, even if the transaction which caused the split
decides to abort.
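
Schematically (all names below are hypothetical, not Derby's internal
transaction API):

    // A split runs in a nested top transaction and commits on its own;
    // the commit is durable regardless of what the parent later does.
    Txn ntt = parentTxn.startNestedTopTransaction();
    try {
        splitPage(ntt, leafPage);  // allocate a page, move rows, fix links
        ntt.commit();              // independent of the parent's outcome
    } catch (Exception e) {
        ntt.abort();               // physical undo restores the tree
        throw e;
    }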

Dibyendu Majumdar wrote:

> Mike Matrigali wrote:
> 
>> As you note undo of structural changes are
>> physical.  So at the high level recovery goes as follows:
>>
>> Find the oldest log record that needs to be looked at.
>> Go forward from that point and redo each record in log file
>>     (if version of page is later than log record then no work to do)
>> Undo all nested top transactions which must only require physical undo,
>>     which are basically structural changes
>>     like btree split and/or page allocations.  Now all tree structures
>>     are intact so that logical undo is possible.
> 
> 
> Mike,
> 
> About structural changes - I assume that only incomplete structural
> changes are undone? If a structural change is complete, then even if the
> transaction that caused that structural change aborts, there should be
> no need to undo the structural change itself, although the key insert or
> delete obviously has to be undone.
> 
> If this is not the case, then how does Derby handle the situation where
> a structural change has been made, and subsequently other transactions
> have made more changes, and then the original transaction that made the
> structural change, decides to abort?
> 
> Regards
> 
> Dibyendu
> 
> 

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike Matrigali wrote:
> As you note undo of structural changes are
> physical.  So at the high level recovery goes as follows:
> 
> Find the oldest log record that needs to be looked at.
> Go forward from that point and redo each record in log file
>     (if version of page is later than log record then no work to do)
> Undo all nested top transactions which must only require physical undo,
>     which are basically structural changes
>     like btree split and/or page allocations.  Now all tree structures
>     are intact so that logical undo is possible.

Mike,

About structural changes - I assume that only incomplete structural 
changes are undone? If a structural change is complete, then even if the 
transaction that caused that structural change aborts, there should be 
no need to undo the structural change itself, although the key insert or
delete obviously has to be undone.

If this is not the case, then how does Derby handle the situation where 
a structural change has been made, and subsequently other transactions 
have made more changes, and then the original transaction that made the 
structural change, decides to abort?

Regards

Dibyendu


Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
In the terms that derby uses, all redo log records are physical, and by
this I mean they apply to one page and the work on them can be redone by
looking at the log record and applying the work to the version of the
page just before the one noted in the log record.  No datatype
comparisons are necessary.

Currently the only logical action is in btrees for undo of insert or
delete operations.  As you note undo of structural changes are
physical.  So at the high level recovery goes as follows:

Find the oldest log record that needs to be looked at.
Go forward from that point and redo each record in log file
    (if version of page is later than log record then no work to do)
Undo all nested top transactions which must only require physical undo,
    which are basically structural changes
    like btree split and/or page allocations.  Now all tree structures
    are intact so that logical undo is possible.

Finally, undo any other outstanding transactions; now that trees are
structurally sound, undo can be physical and/or logical.

XA adds some extra work and should be documented separately.
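
Schematically, with hypothetical stand-ins for the log and page types
(these are not Derby's real recovery classes), the three phases look
like:

    interface Page { long version(); }
    interface LogRecord {
        long pageId();
        long pageVersion();
        void redo(Page p);
    }
    interface Txn { void undoPhysical(); void undo(); }
    interface LogScan {
        Iterable<LogRecord> fromOldestNeeded();   // redo scan
        Iterable<Txn> incompleteNestedTopTxns();  // e.g. btree splits
        Iterable<Txn> incompleteTxns();           // everything else
    }
    interface PageCache { Page get(long pageId); }

    final class Recovery {
        static void recover(LogScan log, PageCache cache) {
            // 1. Redo pass: replay forward from the oldest needed record;
            //    skip records already reflected in the page's version.
            for (LogRecord r : log.fromOldestNeeded()) {
                Page p = cache.get(r.pageId());
                if (p.version() < r.pageVersion()) {
                    r.redo(p);
                }
            }
            // 2. Physically undo incomplete nested top transactions,
            //    restoring structural consistency (splits, allocations).
            for (Txn ntt : log.incompleteNestedTopTxns()) {
                ntt.undoPhysical();
            }
            // 3. Undo remaining incomplete transactions; the trees are
            //    now sound, so undo may be physical or logical (e.g. a
            //    btree key re-search).
            for (Txn t : log.incompleteTxns()) {
                t.undo();
            }
        }
    }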

Dibyendu Majumdar wrote:

> Mike,
> 
> A quick question about logging and recovery:
> 
> Are redo log records always physical - i.e. page orientated?
> 
> I am guessing that undo records are sometimes logical and sometimes
> physical. My guess is that it is physical for structural changes, but
> logical for row inserts/deletes, etc. Is that correct?
> 
> I am going to try and describe logging and recovery next.
> 
> Thanks for your comments on the previous document - I will update and
> resubmit.
> 
> Regards
> 
> 
> 

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Mike,

A quick question about logging and recovery:

Are redo log records always physical - i.e. page orientated?

I am guessing that undo records are sometimes logical and sometimes
physical. My guess is that it is physical for structural changes, but
logical for row inserts/deletes, etc. Is that correct?

I am going to try and describe logging and recovery next.

Thanks for your comments on the previous document - I will update and
resubmit.

Regards



Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
answers/comments embedded below


Dibyendu Majumdar wrote:

> Jean,
> 
> Attached is an initial document on Derby page formats. I used forrest 0.6,
> so it should be easy to integrate.
> 
> Is it possible to put this up without publishing it - so that others can
> review and comment on it?
> 
> BTW, it is not complete ....
> 
> Regards
> 
> Dibyendu
> 
> 
> ------------------------------------------------------------------------
> 
> <?xml version="1.0"?>
> <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
> <document> 
>   <header> 
>     <title>Derby On Disk Page Format</title>
>     <abstract>This document describes the storage format of Derby disk pages. 
>     </abstract>
>   </header>
>   <body>
>     <section id="introduction"> 
>       <title> Introduction </title>
>       <p>Derby stores table and index data in Containers, which currently map 
>         to files in the 
>         <code>seg0</code>
>         directory of the database. Data is stored in pages within the container.</p>
>       <fixme author="Dibyendu Majumdar"> Do all containers map to a single file, or does each container map 
>         to a file? </fixme>
In the current derby implementation there is a 1 to 1 mapping of
containers to files: 2 containers never map to a single file, and 1
container never maps to multiple files.
>       <p>A page contains a set of records, which can be accessed by "slot", which 
>         defines the order of the records on the page, or by "id" which defines 
>         the identity of the records on the page. Clients access records by both 
>         slot and id, depending on their needs.</p>
>       <p>There are two types of pages - Raw Stored Pages which hold data, and 
>         Raw Stored Alloc Pages which hold page allocation information.</p>
I usually describe the system as having 3 types of pages, though the
Header Page is actually just a specialized version of the Alloc Page.
It looks like you include this later - up to you on how best to describe
the distinction.  I know I always draw the following kind of picture

----------
| header |
----------
| data   |
----------
| data   |
----------
| ...    |
----------
| alloc  |
----------
| data   |
----------

Header Page - This is currently always page 0 of the container.  It
contains information that raw store needs to maintain about the
container once per container; it is currently implemented as an Alloc
Page which "borrows" space from the alloc page for its information.
The original decision was that we did not want to waste a whole page for
hdr information, so we used part of it and then put the 1st allocation
map on the 2nd half of it.  See AllocPage.java for info about layout and
borrowing.

Allocation Page - After page 0, all subsequent Allocation pages only
have allocation bit maps.

Data page - as you describe
>       <p>A Table or a BTree index provides a row-based access mechanism (row-based 
>         access interface is known as conglomerate). Rows are mapped to records 
>         in pages; in the case of a table, a single row can span multiple records in 
>         multiple pages.</p>
>     </section>
>     <section id="storedpage"> 
>       <title>Data Page Format</title>
>       <p>A data page is broken into five sections. 
>         <img src="page-format.png" alt=""/>
>       </p>
>       <section id="formatid"> 
>         <title>Format Id </title>
>         <p> The formatId is a 4 byte array; it contains the format Id of this 
>           page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.</p>
>       </section>
>       <section id="pageheader"> 
>         <title> Page Header </title>
>         <p> The page header is a fixed size, 56 bytes. </p>
>           <table> 
>             <tr> 
>               <th>Size</th>
>               <th>Type</th>
>               <th>Description</th>
>             </tr>
>             <tr> 
>               <td>1 byte</td>
>               <td>boolean</td>
>               <td>is page an overflow page</td>
>             </tr>
>             <tr> 
>               <td>1 byte</td>
>               <td>byte</td>
>               <td><p>page status is either VALID_PAGE or INVALID_PAGE (a field 
>                   maintained in base page)</p>
>                 <p>page goes through the following transitions: 
>                   <br/>
>                   VALID_PAGE &lt;-&gt; deallocated page -&gt; free page &lt;-&gt; 
>                   VALID_PAGE</p>
>                 <p>deallocated and free page are both INVALID_PAGE as far as BasePage 
>                   is concerned. 
>                   <br/>
>                   When a page is deallocated, it is transitioned from VALID_PAGE 
>                   to INVALID_PAGE. 
>                   <br/>
>                   When a page is allocated, it is transitioned from INVALID_PAGE 
>                   to VALID_PAGE.</p></td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>long</td>
>               <td>pageVersion (a field maintained in base page)</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>unsigned short</td>
>               <td>number of slots in slot offset table</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>integer</td>
>               <td>next record identifier</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>integer</td>
>               <td>generation number of this page (Future Use)</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>integer</td>
>               <td>previous generation of this page (Future Use)</td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>bipLocation</td>
>               <td>the location of the beforeimage page (Future Use)</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>unsigned short</td>
>               <td>number of deleted rows on page. (new release 2.0)</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>unsigned short</td>
>               <td>% of the page to keep free for updates</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>short</td>
>               <td>spare for future use</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>long</td>
>               <td>spare for future use (encryption uses to write random bytes 
>                 here).</td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>long</td>
>               <td>spare for future use</td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>long</td>
>               <td>spare for future use</td>
>             </tr>
>           </table>
>           <note>Spare space is guaranteed to be written with "0", so that future 
>             use of a field should either not use "0" as a valid data item or 
>             pick 0 as a valid default value, so that on the fly upgrade can assume 
>             that 0 means the field was never assigned. </note>
>         
>       </section>
>       <section id="records"> 
>         <title> Records </title>
>         
>           <p>The records section contains zero or more records. Each record starts 
>             with a Record Header</p>
>           <table> 
>             <caption>Record Header</caption>
>             <tr> 
>               <th>Type</th>
>               <th>Description</th>
>             </tr>
>             <tr> 
>               <td>1 byte</td>
>               <td> <p>Status bits for the record header</p>
>                 <table> 
>                   <tr> 
>                     <td>RECORD_INITIAL</td>
>                     <td>used when record header is first initialized</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_DELETED</td>
>                     <td>used to indicate the record has been deleted</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_OVERFLOW</td>
>                     <td>used to indicate the record has been overflowed, it will 
>                       point to the overflow page and ID</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_HAS_FIRST_FIELD</td>
>                     <td>used to indicate that firstField will be stored. 
>                       When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both are 
>                       set, part of record is on the page, the record header also 
>                       stores the overflow point to the next part of the record.</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_VALID_MASK</td>
>                     <td>A mask of valid bits that can be set currently, such that 
>                       the following assert can be made: </td>
>                   </tr>
>                 </table></td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>record identifier</td>
>             </tr>
>             <tr> 
>               <td>compressed long</td>
>               <td>overflow page only if RECORD_OVERFLOW is set</td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>overflow id only if RECORD_OVERFLOW is set</td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>first field only if RECORD_HAS_FIRST_FIELD is set - otherwise 
>                 0</td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>number of fields in this portion - only if RECORD_OVERFLOW is 
>                 false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0</td>
>             </tr>
>           </table>
>           <note label="Long Rows"> A row is long if all of it's columns can't fit on a single page. 
>             When storing a long row, the segment of the row which fits on the 
>             page is left there, and a pointer column is added at the end of the 
>             row. It points to another row in the same container on a different 
>             page. That row will contain the next set of columns and a continuation 
>             pointer if necessary. The overflow portion will be on an "overflow" 
>             page, and that page may have overflow portions of other rows on it 
>             (unlike overflow columns). </note>
>           <p>The Record Header is followed by one or more fields. Each field contains 
>             a Field Header and optional Field Data.</p>
>           <table> 
>             <caption>Stored Field Header Format</caption>
>             <tr> 
>               <td>status</td>
>               <td> <p> The status is 1 byte; it indicates the state of the field. 
>                   A FieldHeader can be in the following states: </p>
>                   <table> 
>                     <tr> 
>                       <td>NULL</td>
>                       <td>if the field is NULL, no field data length is stored</td>
>                     </tr>
>                     <tr> 
>                       <td>OVERFLOW</td>
>                       <td>indicates the field has been overflowed to another page. 
>                         overflow page and overflow ID is stored at the end of 
>                         the user data. field data length must be a number greater 
>                         than or equal to 0, indicating the length of the field that 
>                         is stored on the current page. The format looks like this: 
>                         <img src="field-header-overflow.png" alt=""/>
>                         overflowPage will be written as compressed long, overflowId 
>                         will be written as compressed Int</td>
>                     </tr>
>                     <tr> 
>                       <td>NONEXISTENT</td>
>                       <td>the field no longer exists, e.g. column has been dropped 
>                         during an alter table</td>
>                     </tr>
>                     <tr> 
>                       <td>EXTENSIBLE</td>
>                       <td>the field is of user defined data type. The field may 
>                         be tagged.</td>
>                     </tr>
>                     <tr> 
>                       <td>TAGGED</td>
>                       <td>the field is TAGGED if and only if it is EXTENSIBLE.</td>
>                     </tr>
>                     <tr> 
>                       <td>FIXED</td>
>                       <td>the field is FIXED if and only if it is used in the 
>                         log records for version 1.2 and higher.</td>
>                     </tr>
>                   </table>
>                 </td>
>             </tr>
>             <tr> 
>               <td>fieldDataLength</td>
>               <td> The fieldDataLength is only set if the field is not NULL. It 
>                 is the length of the field that is stored on the current page. 
>                 The fieldDataLength is a variable length CompressedInt. </td>
>             </tr>
>             <tr> 
>               <td>fieldData</td>
>               <td><p> Overflow page and overflow id are stored as field data. If 
>                 the overflow bit in status is set, the field data is the overflow 
>                 information. When the overflow bit is not set in status, then 
>                 fieldData is the actual user data for the field. That means the 
>                 field header consists of only the field status and field data length. 
>                 <br/>
>                 A non-overflow field: 
>                 <br/> <img src="field-header-non-overflow.png" alt=""/> <br/>
>                 An overflow field: 
>                 <br/> <img src="field-header-overflow.png" alt=""/> <br/> <strong>overflowPage 
>                   and overflowID</strong> <br/>
>                 The overflowPage is a variable length CompressedLong, overflowID 
>                 is a variable Length CompressedInt. They are only stored when 
>                 the field state is OVERFLOW. And they are not stored in the field 
>                 header. Instead, they are stored at the end of the field data. 
>                 The reason we do that is to save a copy if the field has to overflow. </p>
>               </td>
>             </tr>
>           </table>
>           <note label="Long Columns"> A column is long if it can't fit on a single page. A long column 
>             is marked as long in the base row, and its field contains a pointer 
>             to a chain of other rows in the same container which contain the data 
>             of the row. Each of the subsequent rows is on a page to itself. Each 
>             subsequent row, except for the last piece, has 2 columns: the first 
>             is the next segment of the row and the second is the pointer to 
>             the following segment. The last segment only has the data segment. 
>           </note>
>         
>       </section>
>       <section id="slottable"> 
>         <title>Slot Offset Table</title>
>         <p>The slot offset table is a table of 6 or 12 bytes per record, depending 
>           on whether the pageSize is less than or greater than 64K: </p>
>           <table> 
>             <caption>Slot Table Record</caption>
>             <tr> 
>               <th>Size</th>
>               <th>Content</th>
>             </tr>
>             <tr> 
>               <td>2 bytes (unsigned short) or 4 bytes (int)</td>
>               <td>page offset for the record that is assigned to the slot</td>
>             </tr>
>             <tr> 
>               <td>2 bytes (unsigned short) or 4 bytes (int)</td>
>               <td>the length of the record on this page.</td>
>             </tr>
>             <tr> 
>               <td>2 bytes (unsigned short) or 4 bytes (int)</td>
>               <td>the length of the reserved number of bytes for this record on 
>                 this page.</td>
>             </tr>
>           </table>
> 		  <p>
>           First slot is slot 0. The slot table grows backwards. Slots are never 
>           left empty. </p>
>       </section>
>       <section id="checksum"> 
>         <title>Checksum</title>
>         <p>8 bytes of a java.util.zip.CRC32 checksum of the entire page's contents 
>           without the 8 bytes representing the checksum.</p>
>       </section>
> 	</section>
>     <section id="allocpage"> 
>       <title>Allocation Page</title>
>       <p> An allocation page of the file container extends a normal Stored page, 
>         with the exception that a hunk of space may be 'borrowed' by the file 
>         container to store the file header.</p>
>       <p> The borrowed space is not visible to the alloc page even though it is 
>         present in the page data array. It is accessed directly by the FileContainer. 
>         Any change made to the borrowed space is not managed or seen by the allocation 
>         page.</p>
>       <p> The reason for having this borrowed space is so that the container header 
>         does not need to have a page of its own. </p>
>       <p> 
>         <strong>Page Format</strong>
>         <br/>
>         An allocation page extends a stored page, the on disk format is different 
>         from a stored page in that N bytes are 'borrowed' by the container and 
>         the page header of an allocation page will be slightly bigger than a normal 
>         stored page. These N bytes are stored between the page header and the record 
>         space.</p>
>       <p> The reason why these N bytes can't simply be a row is that they need 
>         to be statically accessible by the container object, to avoid a chicken 
>         and egg problem of the container object needing to instantiate an alloc 
>         page object before it can be objectified, and an alloc page object needing 
>         to instantiate a container object before it can be objectified. So these 
>         N bytes must be stored outside of the normal record interface, yet they must 
>         be settable, because only the first alloc page has this borrowed space. 
>         Other (non-first) alloc pages have N == 0. 
>         <br/>
>         <img src="alloc-page.png" alt=""/>
>       </p>
>     </section>
>   </body>
>   <footer> 
>     <legal>This is a legal notice, so it is 
>       <strong>important</strong>
>       .</legal>
>   </footer>
> </document>
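
One concrete footnote on the checksum section above: verification can be
sketched with the real java.util.zip.CRC32 class. That the stored value
is a big-endian long occupying the last 8 bytes of the page is an
assumption based on the layout described in the document:

    import java.util.zip.CRC32;

    public class PageChecksum {
        // Returns true if the trailing 8-byte checksum matches a CRC32
        // computed over the rest of the page.
        static boolean verify(byte[] page) {
            CRC32 crc = new CRC32();
            crc.update(page, 0, page.length - 8); // exclude the checksum itself
            long stored = 0;
            for (int i = page.length - 8; i < page.length; i++) {
                stored = (stored << 8) | (page[i] & 0xFF); // big-endian long
            }
            return stored == crc.getValue();
        }
    }

A mismatch on read is how a torn or corrupted page write would be
detected.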

Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

>From: "Jean T. Anderson" <jt...@bristowhill.com>
>  
>
>>Dibyendu Majumdar's doc on "Derby On Disk Page Format" is now live at 
>>http://incubator.apache.org/derby/papers/pageformats.html.
>>    
>>
>Thanks. Attached is another document :-)
>  
>
"Derby Write Ahead Log Format" is committed and is available at 
http://incubator.apache.org/derby/papers/logformats.html . Thanks for 
the contribution.

-jean



Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> "Mike Matrigali" wrote:

> 2) I would say derby supports media recovery.  One can make a backup of
>    the data and store it off line.  Logs can be stored on a separate
>    disk from the data, and if you lose your data disk then you can use
>    rollforward recovery on the existing logs and the copy of the backup
>    to bring your database up to the current point in time.
>
> 3) Derby does not support "point in time recovery".  Someone may want to
>    look at this in the future.  Technically I don't think it would be
>    very hard as the logging system has the stuff to solve the hard
>    problems.  It does not have an idea about "time" - it just knows log
>    sequence numbers, so need to figure out what kind of interface a user
>    really wants.  A very user unfriendly interface would not be very
>    hard to implement which would be recover to a specific log sequence
>    number.  Anyone interested in this feature should add it to jira -
>    I'll be happy to add technical comments on what needs to be done.
>
> 4) A reasonable next step in derby recovery progress would be to add a
>    way to automatically move/copy log files offline as they are not
>    needed by crash recovery and only needed for media recovery.  Some
>    sort of java stored procedure callout would seem most appropriate.
>

It sounds like the ability to do "point in time recovery" (3 above) may only
require the log to know about "time" and then perform a "media recovery" (2
above) using the appropriate backup copy and the relevant "time" related
portions of the log.

Is the entire log needed or can compensation and other log entries be
stripped out to leave only the pure committed changes?

Re move/copy log files offline (4 above):

Would it be possible to achieve the equivalent of Oracle's "change data
capture" (CDC) facility with this? CDC allows users to save changes made to
the database. Oracle's version uses a PUBLISH/SUBSCRIBE mechanism but I
think we could use your move/copy log file approach very effectively.

A major issue/problem for many current data warehouse applications is
extracting the "changed" data from source systems so that the changes can be
applied via ETL to the target data warehouse tables. The implementations I
have worked on typically include "audit" columns (created by, created date,
modified by, modified date) on the tables; PeopleSoft uses this approach.
There are major problems finding "changed" data using these columns: 1) the
columns are often not accurate because, without triggers, data can be
updated without updating the audit columns, 2) you need indexes on the audit
columns to find the changed records, 3) audit columns cannot provide
information about deleted rows, 4) audit columns cannot provide information
about "all" changes to "all" columns in any given row but only the last
change to each column made prior to the time data was extracted.

I think Mike's simple approach to move/copy of log data could be easily
modified to add "change data capture" functionality to Derby and put Derby
at the forefront of usability for data warehouse support. This would allow
the log files to be "data mined" to extract only the data of tables of
interest while capturing "all" of the changes for time dimension modeling.



RE: Derby architecture/design documents

Posted by Philip Bohannon <bo...@lucent.com>.
Sounds good.  Small comments below.

Philip Bohannon
Computing Sciences Research Center
Lucent Technologies -- Bell Laboratories

> -----Original Message-----
> From: Mike Matrigali [mailto:mikem_app@sbcglobal.net]
> Sent: Tuesday, February 01, 2005 6:45 PM
> To: Derby Development
> Subject: Re: Derby architecture/design documents
>
>
> I think the only hope here is to define the terms and then document
> what derby does in reference to them.  Philip's definitions seem
> reasonable.  Here is what derby supports in those terms.
>
> 1) Derby fully supports crash recovery, it uses java to correctly
>    sync the log file to support this.
>
> 2) I would say derby supports media recovery.  One can make a backup of
>    the data and store it off line.  Logs can be stored on a separate
>    disk from the data, and if you lose your data disk then you can use
>    rollforward recovery on the existing logs and the copy of the backup
>    to bring your database up to the current point in time.

One issue here is to easily be able to tell from a copy of a
checkpoint/backup what logs are needed to restore it.

Also an issue we ran into (in the database that we built once-upon-a-time)
is that old checkpoints and media recovery may mean "more logs than will
fit on your disk" have to be run. Thus recovery needs to be friendly and
say something like "I need logXXXXXXX, please tell me where it is?", and
wait for it to be mounted, untar'd etc. (Quickly glanced through docs and
didn't see mention of these small, but sometimes practical, points.)

>
> 3) Derby does not support "point in time recovery".  Someone may want to
>    look at this in the future.  Technically I don't think it would be
>    very hard as the logging system has the stuff to solve the hard
>    problems.  It does not have an idea about "time" - it just knows log
>    sequence numbers, so need to figure out what kind of interface a user
>    really wants.  A very user unfriendly interface would not be very
>    hard to implement which would be recover to a specific log sequence
>    number.  Anyone interested in this feature should add it to jira -
>    I'll be happy to add technical comments on what needs to be done.

I believe we added a real-time stamp in the begin transaction record, a
command line parameter to recovery:

recover -timestamp XXXX

and a short 'if' statement in the main recovery loop.

The feature tests, of course, were a lot more trouble :-/.
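
The cutoff itself is little more than the following sketch (schematic
only; none of these types are Derby's):

    interface LogRecord {
        boolean isBeginTransaction();
        long timestamp();
        void redo();
    }
    interface LogScan { LogRecord first(); LogRecord next(); }

    final class PointInTimeRedo {
        // Replay the log, but stop at the first begin-transaction record
        // stamped later than the requested recovery time.
        static void redoUpTo(LogScan log, long recoverToMillis) {
            for (LogRecord r = log.first(); r != null; r = log.next()) {
                if (r.isBeginTransaction()
                        && r.timestamp() > recoverToMillis) {
                    break;
                }
                r.redo();
            }
        }
    }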

-----

If the user has access to some stable transaction identifier, it may be
helpful to let the user recover to the commit point of a known
transaction.  This can help when the DB is cooperating with other
processes, but not running XA for whatever reason.

(I may try to create a jira entry, but if someone beats me to it, no
problem.)

>
> 4) A reasonable next step in derby recovery progress would be to add a
>    way to automatically move/copy log files offline as they are not
>    needed by crash recovery and only needed for media recovery.  Some
>    sort of java stored procedure callout would seem most appropriate.
>
> /mikem
>
> Philip Bohannon wrote:
>
> > I believe "crash recovery" means all volatile buffers and non-flushed
disk
> > pages are lost, as well as, depending on your failure model, the total
> > contents of any physical disk page in the process of being written at
the
> > time of the crash.
> >
> > "Media recovery" means that we lost some part of the checkpoint, but
we
> > have archived versions of the log, and some offline copy of an old
> > checkpoint.
> >
> > If you lose your logs (and there is a crash), AFAIK, there is no hope
to
> > recover information past the first loss, assuming it is after the most
> > recent checkpoint.
> >
> > Also a desirable feature is "point in time recovery" against logical
> > corruption - i.e., some angry/confused user started deleting all the
> > customer records Monday morning.  Can I start with an old checkpoint
and
> > run the recovery log until Sunday night to get some consistent state
from
> > around that time back?  (This is usually straightforward to
implement.)
> >
> > Cheers,
> >
> > Philip Bohannon
> > Computing Sciences Research Center
> > Lucent Technologies -- Bell Laboratories
> >
> >
> >>-----Original Message-----
> >>From: Dibyendu Majumdar [mailto:dibyendu@mazumdar.demon.co.uk]
> >>Sent: Tuesday, February 01, 2005 2:43 PM
> >>To: Derby Development
> >>Subject: Re: Derby architecture/design documents
> >>
> >>
> >>From: "Suresh Thalamati" <su...@gmail.com>
> >>
> >>>Derby implements the Write Ahead Log using a non-circular file system
> >>>file. At present, there is no support for incremental log backup or
> >>>media recovery. Only crash recovery is supported.
> >>>
> >>>I think derby does support simple media recovery.  It has support for
> >>>full backup/restore and a very basic form of rollforward recovery
> >>>(replay of logs using backup and archived log files).
> >>
> >>Hi Suresh,
> >>
> >>Thanks for the feedback. I suppose that we should define media recovery
> >>and crash recovery. My understanding is that media recovery is when you
> >>have lost your logs as well, whereas crash recovery is when the logs are
> >>intact. In case of media recovery, does Derby know how to locate the
> >>last checkpoint record/log file?
> >>
> >>Regards
> >>
> >>
> >>
>

Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
I think the only hope here is to define the terms and then document
what derby does in reference to them.  Philip's definitions seem
reasonable.  Here is what derby supports in those terms.

1) Derby fully supports crash recovery, it uses java to correctly
   sync the log file to support this.

2) I would say derby supports media recovery.  One can make a backup of
   the data and store it off line.  Logs can be stored on a separate
   disk from the data, and if you lose your data disk then you can use
   rollforward recovery on the existing logs and the copy of the backup
   to bring your database up to the current point in time.

3) Derby does not support "point in time recovery".  Someone may want to
   look at this in the future.  Technically I don't think it would be
   very hard as the logging system has the stuff to solve the hard
   problems.  It does not have an idea about "time" - it just knows log
   sequence numbers, so need to figure out what kind of interface a user
   really wants.  A very user unfriendly interface would not be very
   hard to implement which would be recover to a specific log sequence
   number.  Anyone interested in this feature should add it to jira -
   I'll be happy to add technical comments on what needs to be done.

4) A reasonable next step in derby recovery progress would be to add a
   way to automatically move/copy log files offline as they are not
   needed by crash recovery and only needed for media recovery.  Some
   sort of java stored procedure callout would seem most appropriate.
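
Concretely, the backup-plus-rollforward flow in (2) looks roughly like
the sketch below. The system procedure and the rollForwardRecoveryFrom
URL attribute are Derby's backup/rollforward hooks (worth verifying the
exact names against the docs for the release at hand); the paths and
database name are made up:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class BackupDemo {
        public static void main(String[] args) throws Exception {
            // Take a base backup and switch on log archiving (second
            // argument 0 = do not delete previously archived logs).
            Connection conn =
                DriverManager.getConnection("jdbc:derby:demoDB");
            conn.createStatement().execute(
                "CALL SYSCS_UTIL.SYSCS_BACKUP_DATABASE_AND_ENABLE_LOG_ARCHIVE_MODE("
                + "'/backups', 0)");
            conn.close();

            // After losing the data disk: restore from the backup and
            // roll forward through the logs to the current point in time.
            DriverManager.getConnection(
                "jdbc:derby:demoDB;rollForwardRecoveryFrom=/backups/demoDB")
                .close();
        }
    }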

/mikem

Philip Bohannon wrote:

> I believe "crash recovery" means all volatile buffers and non-flushed disk
> pages are lost, as well as, depending on your failure model, the total
> contents of any physical disk page in the process of being written at the
> time of the crash.
> 
> "Media recovery" means that we lost some part of the checkpoint, but we
> have archived versions of the log, and some offline copy of an old
> checkpoint.
> 
> If you lose your logs (and there is a crash), AFAIK, there is no hope to
> recover information past the first loss, assuming it is after the most
> recent checkpoint.
> 
> Also a desirable feature is "point in time recovery" against logical
> corruption - i.e., some angry/confused user started deleting all the
> customer records Monday morning.  Can I start with an old checkpoint and
> run the recovery log until Sunday night to get some consistent state from
> around that time back?  (This is usually straightforward to implement.)
> 
> Cheers,
> 
> Philip Bohannon
> Computing Sciences Research Center
> Lucent Technologies -- Bell Laboratories
> 
> 
>>-----Original Message-----
>>From: Dibyendu Majumdar [mailto:dibyendu@mazumdar.demon.co.uk]
>>Sent: Tuesday, February 01, 2005 2:43 PM
>>To: Derby Development
>>Subject: Re: Derby architecture/design documents
>>
>>
>>From: "Suresh Thalamati" <su...@gmail.com>
>>
>>> >>Derby implements the Write Ahead Log using a non-circular file
> 
> system
> 
>>>file. At present, there is no support for incremental log backup or
>>>media recovery. Only crash recovery is supported.
>>>
>>>I think derby does support simple media recovery.  It has support for
>>>full backup/restore and very basic form of  rollforward recovery
> 
> (replay
> 
>>>of logs using backup and archived log files).
>>
>>Hi Suresh,
>>
>>Thanks for the feedback. I suppose that we should define media recovery
> 
> and
> 
>>crash recovery. My understanding is that media recovery is when you have
>>lost your logs as
>>well, whereas crash recovery is when the logs are intact. In case of
> 
> media
> 
>>recovery,
>>does Derby know how to locate the last checkpoint record/log file?
>>
>>Regards
>>
>>
>>

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Philip Bohannon wrote:
> I believe "crash recovery" means all volatile buffers and non-flushed disk
> pages are lost, as well as, depending on your failure model, the total
> contents of any physical disk page in the process of being written at the
> time of the crash.

In other words, the software program/server has crashed but persistent 
storage (disks etc.) is still intact. So anything that was in memory is 
lost, and anything that was being written to disk at the time of crash 
may have been lost.

> 
> "Media recovery" means that we lost some part of the checkpoint, but we
> have archived versions of the log, and some offline copy of an old
> checkpoint.

"Media failure" implies loss of persistent storage - maybe a disk crash.

I suppose that media failure could result in loss of logs and/or loss of 
database files. So it implies that either data files or logs or both may 
need to be restored from backups.

> If you lose your logs (and there is a crash), AFAIK, there is no hope to
> recover information past the first loss, assuming it is after the most
> recent checkpoint.
> 
Of course.

> Also a desirable feature is "point in time recovery" against logical
> corruption - i.e., some angry/confused user started deleting all the
> customer records Monday morning.  Can I start with an old checkpoint and
> run the recovery log until Sunday night to get some consistent state from
> around that time back?  (This is usually straightforward to implement.)

I don't think that Derby currently allows you to arbitrarily rollback to 
a point in time.


Regards

Dibyendu


RE: Derby architecture/design documents

Posted by Philip Bohannon <bo...@lucent.com>.
I believe "crash recovery" means all volatile buffers and non-flushed disk
pages are lost, as well as, depending on your failure model, the total
contents of any physical disk page in the process of being written at the
time of the crash.

"Media recovery" means that we lost some part of the checkpoint, but we
have archived versions of the log, and some offline copy of an old
checkpoint.

If you lose your logs (and there is a crash), AFAIK, there is no hope to
recover information past the first loss, assuming it is after the most
recent checkpoint.

Also a desirable feature is "point in time recovery" against logical
corruption - i.e., some angry/confused user started deleting all the
customer records Monday morning.  Can I start with an old checkpoint and
run the recovery log until Sunday night to get some consistent state from
around that time back?  (This is usually straightforward to implement.)

Cheers,

Philip Bohannon
Computing Sciences Research Center
Lucent Technologies -- Bell Laboratories

> -----Original Message-----
> From: Dibyendu Majumdar [mailto:dibyendu@mazumdar.demon.co.uk]
> Sent: Tuesday, February 01, 2005 2:43 PM
> To: Derby Development
> Subject: Re: Derby architecture/design documents
>
>
> From: "Suresh Thalamati" <su...@gmail.com>
>
> >  >>Derby implements the Write Ahead Log using a non-circular file
system
> > file. At present, there is no support for incremental log backup or
> > media recovery. Only crash recovery is supported.
> >
> > I think derby does support simple media recovery.  It has support for
> > full backup/restore and very basic form of  rollforward recovery
(replay
> > of logs using backup and archived log files).
>
> Hi Suresh,
>
> Thanks for the feedback. I suppose that we should define media recovery
and
> crash recovery. My understanding is that media recovery is when you have
> lost your logs as
> well, whereas crash recovery is when the logs are intact. In case of
media
> recovery,
> does Derby know how to locate the last checkpoint record/log file?
>
> Regards
>
>
>

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
From: "Suresh Thalamati" <su...@gmail.com>

>  >>Derby implements the Write Ahead Log using a non-circular file system
> file. At present, there is no support for incremental log backup or
> media recovery. Only crash recovery is supported.
>
> I think derby does support simple media recovery.  It has support for
> full backup/restore and very basic form of  rollforward recovery (replay
> of logs using backup and archived log files).

Hi Suresh,

Thanks for the feedback. I suppose that we should define media recovery and
crash recovery. My understanding is that media recovery is when you have
lost your logs as
well, whereas crash recovery is when the logs are intact. In case of media
recovery,
does Derby know how to locate the last checkpoint record/log file?

Regards



Re: Derby architecture/design documents

Posted by Suresh Thalamati <su...@gmail.com>.
Hi  Dibyendu,

Nice writeup.  I think the comments are out of date in a couple of cases:

 >>Derby implements the Write Ahead Log using a non-circular file system 
file. At present, there is no support for incremental log backup or 
media recovery. Only crash recovery is supported.


I think derby does support simple media recovery.  It has support for 
full backup/restore and a very basic form of rollforward recovery (replay 
of logs using backup and archived log files). 


 >>Every time a checkpoint is taken, a new log file is created and all 
subsequent log records will go to the new log file. After a checkpoint 
is taken, old and useless log files will be deleted.

 I think a log switch does not always happen on a checkpoint; with the 
default values a log switch happens when a log file grows beyond 1MB and
 a checkpoint happens when the amount of log written since the last 
checkpoint is 10MB or more.

LogToFile.java:
    private static final int DEFAULT_LOG_SWITCH_INTERVAL = 1024*1024;    
   
    private static final int CHECKPOINT_INTERVAL_MAX     = 128*1024*1024;

 >> 1. The log file grows beyond a certain size (configurable, default 
100K bytes)
Looks like the comments are out of date.  Default is 1MB.
LogToFile.java :     private static final int 
DEFAULT_LOG_SWITCH_INTERVAL = 1024*1024;


Thanks
-suresht



Dibyendu Majumdar wrote:

>From: "Jean T. Anderson" <jt...@bristowhill.com>
>
>  
>
>>Dibyendu Majumdar's doc on "Derby On Disk Page Format" is now live at 
>>http://incubator.apache.org/derby/papers/pageformats.html.
>>    
>>
>
>Thanks. Attached is another document :-)
>
>Regards
>
>------------------------------------------------------------------------
>
><?xml version="1.0"?>
><!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
><document> 
>  <header> 
>    <title>Derby Write Ahead Log Format</title>
>    <abstract>This document describes the storage format of Derby Write Ahead Log. This is a work-in-progress derived from Javadoc comments 
>    and from explanations Mike Matrigali posted to the Derby lists. 
>    Please post questions, comments, and corrections to derby-dev@db.apache.org.
>    </abstract>
>  </header>
>  <body>
>    <section id="introduction"> 
>      <title> Introduction </title>
>	    <p>
>        Derby implements the Write Ahead Log using a non-circular file system file.
>        At present, there is no support for incremental log backup or media recovery. 
>        Only crash recovery is supported.  
>		</p>
>        <p>
>        The 'log' is a stream of log records.  The 'log' is implemented as
>        a series of numbered log files.  These numbered log files are logically
>        continuous so a transaction can have log records that span multiple log files.
>        A single log record cannot span more than one log file.  The log file number
>        is monotonically increasing.
>		</p>
>        <p>
>        The log belongs to a log factory of a RawStore.  In the current implementation,
>        each RawStore only has one log factory, so each RawStore only has one log
>        (which is composed of multiple log files).
>        At any given time, a log factory only writes new log records to one log file,
>        this log file is called the 'current log file'.
>		</p>
>		<p>
>        A log file is named log<em>logNumber</em>.dat
>		</p>
>        <p>
>        Every time a checkpoint is taken, a new log file is created and all subsequent
>        log records will go to the new log file.  After a checkpoint is taken, old
>        and useless log files will be deleted.
>		</p>
>        <p>
>        RawStore exposes a checkpoint method which clients can call, or a checkpoint is
>        taken automatically by the RawStore when:
>		</p>
>	      <ol>
>	      <li> The log file grows beyond a certain size (configurable, default 100K bytes)</li>
>          <li> RawStore is shutdown and a checkpoint hasn't been done "for a while"</li>
>          <li> RawStore is recovered and a checkpoint hasn't been done "for a while"</li>
>          </ol>
>    </section>
>	<section>
>		<title>
>			Format of Write Ahead Log
>		</title>
>		<p>
>	      An implementation of the file-based log is <code>org.apache.derby.impl.store.raw.log.LogToFile</code>.
>		This LogFactory is responsible for the formats of 2 kinds of file: the log
>		file and the log control file.  And it is responsible for the format of the
>		log record wrapper.
>		</p>
>		<section>
>			<title>Format of Log Control File</title>
>			<p>The log control file contains information about which log files
>			   are present and where the last checkpoint log record is located.</p>
>			<table>
>				<tr>
>					<th>Type</th>
>					<th>Description</th>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>format id set to FILE_STREAM_LOG_FILE</td>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>obsolete log file version</td>
>				</tr>
>				<tr>
>					<td>long</td>
>					<td>the log instant (LogCounter) of the last completed checkpoint</td>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>JBMS (older name for Cloudscape/Derby) version</td>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>checkpoint interval</td>
>				</tr>
>				<tr>
>					<td>long</td>
>					<td>spare (value set to 0)</td>
>				</tr>
>				<tr>
>					<td>long</td>
>					<td>spare (value set to 0)</td>
>				</tr>
>				<tr>
>					<td>long</td>
>					<td>spare (value set to 0)</td>
>				</tr>
>			</table>
>		</section>
>		<section>
>			<title>Format of the log file</title>
>			<p>The log file contains log records which record all the changes
>		    	to the database.  The complete transaction log is composed of a series of
>		    	log files.</p>
>			<table>
>				<tr>
>					<th>Type</th>
>					<th>Description</th>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>Format id of this log file, set to FILE_STREAM_LOG_FILE.</td>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>Obsolete log file version - not used</td>
>				</tr>
>				<tr>
>					<td>long</td>
>					<td>Log file number - this number orders the log files in a
>						series to form the complete transaction log
>					</td>
>				</tr>		 
>				<tr>
>					<td>long</td>
>					<td>PrevLogRecord - log instant of the previous log record, in the
>	    				previous log file.</td>
>				</tr>
>				<tr>
>					<td>[log record wrapper]*</td>
>					<td>one or more log records with wrapper</td>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>EndMarker - value of zero.  The beginning of a log record wrapper
>						is the length of the log record, therefore it is never zero
>					</td>
>				</tr>
>				<tr>
>					<td>[int fuzzy end]*</td>
>					<td>zero or more ints of value 0, in case this log file
>					has been recovered and any incomplete log record has been set to zero.
>					</td>
>				</tr>
>			</table>
>		</section>
>		<section>
>			<title>Format of the log record wrapper</title>
>			<p>The log record wrapper provides information for the log scan.</p>
>			<table>
>				<tr>
>					<th>Type</th>
>					<th>Description</th>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>length - length of the log record (for forward scan)</td>
>				</tr>
>				<tr>
>					<td>long</td>
>					<td>instant - LogInstant of the log record</td>
>				</tr>
>				<tr>
>					<td>byte[length]</td>
>					<td>logRecord - byte array that is written by the FileLogger</td>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>length - length of the log record (for backward scan)</td>
>				</tr>
>			</table>
>		</section>
>		<section>
>			<title>The format of a log record</title>
>			<p>The log record describes every change to the persistent store</p>
>			<table>
>				<tr>
>					<th>Type</th>
>					<th>Description</th>
>				</tr>
>				<tr>
>					<td>int</td>
>					<td>format_id, set to LOG_RECORD. The formatId is written by FormatIdOutputStream 
>                                  when this object is written out by writeObject
>					</td>
>				</tr>
>				<tr>
>					<td>CompressedInt</td>
>					<td><p>loggable group - the loggable's group value.</p>
>						<p>	
>						Each loggable belongs to one or more groups of similar functionality.
>						</p>
>						<p>
>						Grouping is a way to quickly sort out log records that are interesting
>						to different modules or different implementations.
>						</p>
>						<p>
>						When a module makes a loggable and sends it to the log file, it must mark
>						the loggable with one or more of the following groups. 
>						If none fit, or if the loggable encompasses functionality that is not
>						described in existing groups, then a new group should be introduced.  
>						</p>
>						<p>
>						Grouping has no effect on how the record is logged or how it is treated
>						in rollback or recovery.
>						</p>
>						<p>
>						The following groups are defined. This list serves as the registry of
>						all loggable groups.
>						</p>
>						<table>
>							<caption>Loggable Groups</caption>
>							<tr>
>								<th>Name</th>
>								<th>Value</th>
>								<th>Description</th>
>							</tr>
>							<tr>
>								<td>FIRST</td>
>								<td>0x1</td>
>								<td>The first operation of a transaction.</td>
>							</tr>
>							<tr>
>								<td>LAST</td>
>								<td>0x2</td>
>								<td>The last operation of a transaction.</td>
>							</tr>
>							<tr>
>								<td>COMPENSATION</td>
>								<td>0x4</td>
>								<td>A compensation log record.</td>
>							</tr>
>							<tr>
>								<td>BI_LOG</td>
>								<td>0x8</td>
>								<td>A BeforeImage log record.</td>
>							</tr>	
>							<tr>
>								<td>COMMIT</td>
>								<td>0x10</td>
>								<td>The transaction committed.</td>
>							</tr>
>							<tr>
>								<td>ABORT</td>
>								<td>0x20</td>
>								<td>The transaction aborted.</td>
>							</tr>
>							<tr>
>								<td>PREPARE</td>
>								<td>0x40</td>
>								<td>The transaction prepared.</td>
>							</tr>
>							<tr>
>								<td>XA_NEEDLOCK</td>
>								<td>0x80</td>
>								<td>Need to reclaim locks associated with this log record during XA prepared xact recovery.</td>
>							</tr>
>							<tr>
>								<td>RAWSTORE</td>
>								<td>0x100</td>
>								<td>A log record generated by the raw store.</td>
>							</tr>
>							<tr>
>								<td>FILE_RESOURCE</td>
>								<td>0x400</td>
>								<td>Related to "non-transactional" files.</td>
>							</tr>
>						</table>
>					</td>
>				</tr>
>				<tr>
>					<td>TransactionId</td>
>					<td>xactId - The Transaction this log belongs to.</td>
>				</tr>
>				<tr>
>					<td>Loggable</td>
>					<td>op - the log operation</td>
>				</tr>
>			</table>
>		</section>
>	</section>
>  </body>
>  <footer> 
>	<legal>
>	</legal>
>  </footer>
></document>
>
>
>  
>


Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
I think putting this information somewhere would be good, not sure
what way would be best.  javadoc tries to do some of it.  Here is
the high level:

With Derby, start with the "module"; derby is made up of a set of modules
like language, store, access, log, and data.  So for the logging system
there is (iapi is the java interface side, impl is the implementation side):

opensource/java/engine/org/apache/derby/iapi/store/raw/log/LogFactory.java
    the java interface for logging system module

opensource/java/engine/org/apache/derby/impl/store/raw/log/LogToFile.java
    the implementation of LogFactory.java, also implementing Module;
this is the one with the recovery code.

Quick note on other files in the impl/store/raw/log directory - comments
should be in the files:
CheckpointOperation.java - code to implement the checkpoint log record
D_FlushedScan.java - debugging file
D_LogToFile.java - debugging file
FileLogger.java - stuff dealing with putting log records to disk
FlushedScan.java - stuff dealing with scanning the log file
FlushedScanHandle.java - more stuff dealing with scanning the log file
LogAccessFile.java - lowest level putting log records to disk
LogAccessFileBuffer.java - utility for LogAccessFile
LogCounter.java - log sequence number
LogRecord.java - log record
ReadOnly.java - an alternate read only implementation of LogFactory
Scan.java - more scan log file stuff
StreamLogScan.java - more scan log file stuff


RPost wrote:
> re logformats.xml
> 
> I know I'm dating myself, but in I Love Lucy here's where Ricky would say:
> 
> "Lucy, you got some splainin' to do!"
> 
> The last line of this xml file is:
> 
>       Loggable op - the log operation
> 
> 
> At some point could we get an explanation as to how all of the log related
> *.java files are pieced together?
> 
> These files are difficult to piece together out of context because they are
> a series of interfaces, abstract classes and implementations. I haven't been
> able to sort it out.
> 
> For example:
> 
> public final class DeleteOperation extends LogicalPageOperation
> 
>     public abstract class LogicalPageOperation extends PageBasicOperation
> implements LogicalUndoable
> 
>         public abstract class PageBasicOperation implements Loggable,
> RePreparable
> 
>             public interface Loggable extends Formatable {
> 
> 1. How do all of these pieces fit together?
> 
> 2. What, exactly, gets written to the log record for a delete operation?
> That is, what is 'Loggable op' for this case?
> 
> 3. What are the set of implemented classes that write to the log? The
> DeleteOperation class appears to be one, what are the others?
> 
> 
> 

Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
re logformats.xml

I know I'm dating myself, but in I Love Lucy here's where Ricky would say:

"Lucy, you got some splainin' to do!"

The last line of this xml file is:

      Loggable op - the log operation


At some point could we get an explanation as to how all of the log related
*.java files are pieced together?

These files are difficult to piece together out of context because they are
a series of interfaces, abstract classes and implementations. I haven't been
able to sort it out.

For example:

public final class DeleteOperation extends LogicalPageOperation

    public abstract class LogicalPageOperation extends PageBasicOperation
implements LogicalUndoable

        public abstract class PageBasicOperation implements Loggable,
RePreparable

            public interface Loggable extends Formatable {

1. How do all of these pieces fit together?

2. What, exactly, gets written to the log record for a delete operation?
That is, what is 'Loggable op' for this case?

3. What are the set of implemented classes that write to the log? The
DeleteOperation class appears to be one, what are the others?



Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
From: "Jean T. Anderson" <jt...@bristowhill.com>

> Dibyendu Majumdar's doc on "Derby On Disk Page Format" is now live at 
> http://incubator.apache.org/derby/papers/pageformats.html.

Thanks. Attached is another document :-)

Regards

Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

> ...
> Attached is an updated recovery.xml. Please note that when you build 
> this in forrest you will get error messages about broken links. This 
> is because I have inserted links to Javadoc documentation. I think 
> that this is okay - when the Javadoc stuff is uploaded, the links can 
> be fixed. In the meantime, I have added a warning about the broken links.
> ...

Thanks for the heads up! It's far better to point to javadoc outside the 
forrest tree than to have forrest build the javadoc (gack). And, in that 
case, there's often no way to avoid broken link errors.

Has there been any decision on where to put the javadoc? I apologize -- 
I've been inattentive to this thread.

The web site changes are live.

regards,

 -jean



Re: Derby architecture/design documents

Posted by scott hutinger <s-...@wiu.edu>.
Thanks for the good work on this stuff.  Very good!

The only feedback I can give: I didn't see specifics on failure.  I 
think in some instances this needs to be modified, such as in a known 
hostile/always-changing environment.  Possibly this doesn't belong here?

scott

Dibyendu Majumdar wrote:

> Jean,
>
> Attached is an updated recovery.xml. Please note that when you build 
> this in forrest you will get error messages about broken links. This 
> is because I have inserted links to Javadoc documentation. I think 
> that this is okay - when the Javadoc stuff is uploaded, the links can 
be fixed. In the meantime, I have added a warning about the broken links.
>
> All,
>
> As this is a complex area to describe, I would be grateful for your 
> comments and feedback.
>
> Thanks and Regards
>
> Dibyendu

<snip>

Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
The page version number is stored in both the in memory and on disk
version of the page.

It is definitely possible for changes to pages to be written to disk
before the changes are committed, but not before the log records
describing the changes to the pages have been forced to disk.  As you
say, the most likely cause for this is the page cache becoming full.
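
A toy model of that redo decision, with hypothetical names (this is the
shape of the comparison, not Derby's actual code):

    class RedoSketch {
        static class Page   { long pageVersion; }
        static class LogRec { long pageVersion; Runnable redoOp; }

        // Called for each log record during the redo pass.
        static void redo(LogRec rec, Page page) {
            if (rec.pageVersion > page.pageVersion) {
                rec.redoOp.run();                   // change never reached disk: reapply
                page.pageVersion = rec.pageVersion;
            }
            // else the flushed page already contains this change: skip it
        }
    }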

RPost wrote:

>>"Dibyendu Majumdar" wrote:
> 
> 
>>Although Derby uses LogInstant, it does not save this with the page data.
> 
> Instead, a page version >number is stored. The page version number is also
> stored in the log records associated with the page
> 
> Mike's comments "As you say in derby all the log records are associated with
> a page, and thus have a page version number. That page version number in the
> log is compared with the page version number of the page during redo to
> determine if the log record needs to be applied."
> 
> Does this mean that the page version number is stored both in the 'in memory' image
> of the page and also in the version of the page written to disk?
> 
> Are changes to pages ever written to disk before the changes are actually
> committed? For example, if the page cache is full?
> 
> 
> 

Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> "Dibyendu Majumdar" wrote:

> Although Derby uses LogInstant, it does not save this with the page data.
Instead, a page version >number is stored. The page version number is also
stored in the log records associated with the page

Mike's comments "As you say in derby all the log records are associated with
a page, and thus have a page version number. That page version number in the
log is compared with the page version number of the page during redo to
determine if the log record needs to be applied."

Does this mean that the page version number is stored both in the 'in memory' image
of the page and also in the version of the page written to disk?

Are changes to pages ever written to disk before the changes are actually
committed? For example, if the page cache is full?



Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
The simple answer is that every updating operation that comes through
store first creates a log record and writes it to the logging system,
and then applies the change using the log record to the page in
question.  What looks like a single sql operation may be many updating
operations in the store, so in many cases if an error happens there
may already be log records generated, and part of the error handling will
be to undo all the log records associated with the current statement.

Let me explain what I think happens with a unique key constraint, since
I know that one.  I am not as sure about the foreign key and check
constraints, as those are mostly enforced above store.

unique key constraints are enforced in derby by creating a unique
btree index.

1) To update a column from A to B, language would first find the row
in question, either by scanning or by using an index if the query allowed.
2) language would call store to update the column in the base table
row from A to B; this will generate a log record.
3) language will now check to see if any indexes need to be updated; in
this case there will be a unique index on the column.
4) language will delete the row in the btree for this index; this will
generate a log record.
5) language will attempt to insert a new row into the btree with the new
column value; this insert will fail as there is already one.  This will
not generate a log record, as the btree insert fails before it gets that
far.
6) language gets the error and generates a statement level error, and
will generate a statement level abort which will cause undo of this
statement.  These will generate compensation log records.
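
The same sequence is easy to watch from JDBC.  A small sketch (table and
database names made up); note that only the statement is rolled back -
the transaction itself survives the statement-level abort and can still
commit:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class UniqueKeySketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
            Connection conn =
                DriverManager.getConnection("jdbc:derby:demo;create=true");
            conn.setAutoCommit(false);
            Statement s = conn.createStatement();
            s.execute("CREATE TABLE t (c INT NOT NULL UNIQUE)");
            s.execute("INSERT INTO t VALUES 1, 2");
            try {
                // steps 1-5 above: base row updated, old index row deleted,
                // duplicate insert into the unique btree fails
                s.execute("UPDATE t SET c = 2 WHERE c = 1");
            } catch (SQLException e) {
                // step 6: statement-level abort; compensation log records
                // have already undone the base row and index changes
                System.out.println(e.getSQLState());   // 23505, duplicate key
            }
            conn.commit();   // the transaction is still perfectly usable
            conn.close();
        }
    }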




RPost wrote:

>>"Dibyendu Majumdar" wrote:
> 
> 
>>. . . The forward change from A to B (redo) and the reversal of B to A
> 
> (undo) are both recorded as updates to the system.
> 
> At what point does a change from A to B cause the creation of a log/recovery
> record?
> 
> For example, if a column has a current value of A and the change to B would
> cause a violation of a check, unique or foreign key constraint will a log
> record be created or will the change have already been rejected?
> 
> There will potentially be more complicated scenarios where a change causes
> a trigger, or other side effect, to be executed. Generally speaking, at what
> point do the log/recovery records for these multiple changes get created?
> Only if the attempted changes meet all of the criteria for acceptance? That
> is, only if they would be accepted were a COMMIT to be performed?
> 
> 

Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> "Dibyendu Majumdar" wrote:

> . . . The forward change from A to B (redo) and the reversal of B to A
(undo) are both recorded as updates to the system.

At what point does a change from A to B cause the creation of a log/recovery
record?

For example, if a column has a current value of A and the change to B would
cause a violation of a check, unique or foreign key constraint will a log
record be created or will the change have already been rejected?

There will potentially be more complicated scenarios where a change causes
a trigger, or other side effect, to be executed. Generally speaking, at what
point do the log/recovery records for these multiple changes get created?
Only if the attempted changes meet all of the criteria for acceptance? That
is, only if they would be accepted were a COMMIT to be performed?


Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> "Mike Matrigali" wrote:

> The mapping of number to class is
> supported through RegisteredFormatIds.java in same directory.

Thanks, Mike. That was a missing piece.

I saw this comment in StoredFormatIds.java:

  <P>An object should implement the Formatable interface to support a
  stored format. In addition, the module which contains the object
  should register the object's class with the Monitor (See
  FormatIdUtil.register.)

But there is no FormatIdUtil.register method and I couldn't find where the
mapping was.


Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Yes, the log record is serialized out, but it does not use generic
java serialization.  Derby implements its own serialization scheme
for objects that it owns.  The reason for this is to avoid the overhead
of serialization.  Derby basically uses a single number to identify
the object and knows how to create that object when it is read back
in.  The number is known in derby as a format id.

For log records the work is done by using a FormatIdOutputStream.

So the type is actually something like an InsertOperation (Undoable is
an interface implemented by most if not all of the log records).  Search
for LOGOP in
opensource/java/engine/org/apache/derby/iapi/services/io/StoredFormatIds.java
 to see list of log operations.  The mapping of number to class is
supported through RegisteredFormatIds.java in same directory.
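
A stripped-down model of the scheme may make it clearer.  The names here
are hypothetical stand-ins, not Derby's actual classes:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Each persistable object knows its format id and writes/reads only
    // its own fields -- no generic java.io serialization involved.
    interface FormatableSketch {
        int getFormatId();
        void writeBody(DataOutput out) throws IOException;
        void readBody(DataInput in) throws IOException;
    }

    class FormatRegistry {
        // the analogue of RegisteredFormatIds: format id -> class
        static final Map<Integer, Class<?>> REGISTRY =
            new HashMap<Integer, Class<?>>();

        static void write(DataOutput out, FormatableSketch f)
                throws IOException {
            out.writeInt(f.getFormatId());   // one int identifies the class
            f.writeBody(out);
        }

        static FormatableSketch read(DataInput in) throws Exception {
            int id = in.readInt();
            FormatableSketch f =
                (FormatableSketch) REGISTRY.get(id).newInstance();
            f.readBody(in);                  // object restores its own fields
            return f;
        }
    }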

Dibyendu Majumdar wrote:

>> RPost wrote:
>>
>>> 1. Where in the format does a log record indicate the type of record
>>> (redo,
>>> undo, etc)?
>>
>>
> Mike Matrigali wrote:
> 
>> log records are not of type redo, undo, ...   Redo is the recovery phase
>> and during redo you ask every log record to perform redo.  And during
>> undo you ask it to do undo.
> 
> 
> Mike,
> 
> Is my understanding correct that the type (Undoable, for example) of the
> log record is part of the Loggable hierarchy, and that Loggable objects
> are serialized as user data in the LogRecord, and de-serialized back as
> objects when they are being read? Therefore, the correct type of
> Loggable object is automatically created? Relevant code from LogRecord
> given below:
> 
>         Object obj = input.readObject();
> 
>         if (SanityManager.DEBUG) {
>             if ( ! (obj instanceof Loggable))
>                 SanityManager.THROWASSERT(
>                     "log record not getting expected Loggable: got : " +
>                     obj.getClass().getName());
>         }
>         op = (Loggable)obj;
> 
> 
> Regards
> 
> Dibyendu
> 
> 

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
> RPost wrote:
>>1. Where in the format does a log record indicate the type of record (redo,
>>undo, etc)?
> 
Mike Matrigali wrote:
> log records are not of type redo, undo, ...   Redo is the recovery phase
> and during redo you ask every log record to perform redo.  And during
> undo you ask it to do undo.

Mike,

Is my understanding correct that the type (Undoable, for example) of the 
log record is part of the Loggable hierarchy, and that Loggable objects 
are serialized as user data in the LogRecord, and de-serialized back as 
objects when they are being read? Therefore, the correct type of 
Loggable object is automatically created? Relevant code from LogRecord 
given below:

		Object obj = input.readObject();

		if (SanityManager.DEBUG) {
			if ( ! (obj instanceof Loggable))
				SanityManager.THROWASSERT(
					"log record not getting expected Loggable: got : " +
					obj.getClass().getName());
		}
		op = (Loggable)obj;


Regards

Dibyendu


Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
comments below

RPost wrote:
>>"Dibyendu Majumdar" wrote:
> 
> 
>>As this is a complex area to describe . . .
> 
> 
> You got that right!
> 
> 1. Where in the format does a log record indicate the type of record (redo,
> undo, etc)?
log records are not of type redo, undo, ...   Redo is the recovery phase
and during redo you ask every log record to perform redo.  And during
undo you ask it to do undo.
> 
> 2. Is the type of record at all related to the "Groups" defined in
> iapi\store\raw\Loggable.java?

Groups are mainly used so that a scan can ask for just that kind of log
record, or, during a scan, to make it easier to code for a particular
type of log record which might need special handling - for instance a
FIRST log record indicates the first log record of a transaction.
Grouping also allows log records to be logged that the store need never
look at.
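
Concretely, the groups are bit flags (the values are in the loggable
groups table of the log format write-up), so a scan filters with a
simple mask.  A sketch:

    class GroupFilterSketch {
        // values from the loggable groups registry
        static final int FIRST  = 0x1;
        static final int LAST   = 0x2;
        static final int COMMIT = 0x10;
        static final int ABORT  = 0x20;

        // e.g. a scan that only wants transaction boundary records
        static boolean wanted(int group) {
            return (group & (FIRST | COMMIT | ABORT)) != 0;
        }
    }
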
> 
> 3. If not, how do these groups fit into things?
> 
> 4. Are log records always flushed to disk? I have seen references in the
> code to the possible use of an "in memory" database where log records might
> not be flushed. Also, in the simple case of a change to a single field on a
> single page being made by a transaction and then the transaction rolled back
> is it possible that 'no' log record will be flushed to disk or will two log
> records always be flushed even though no commit was ever performed?
There are 2 important flush guarantees:
    1) before any change to a data page goes to disk the associated log
record will be flushed to disk.
    2) before returning to a client application after a commit request,
all log records associated with the transaction must be flushed to disk.

In derby this is always done by flushing all records up to and including
the interesting log record.
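
In pseudo-Java the two guarantees look roughly like this (a sketch of
the ordering rules only, not the real buffer manager):

    class WalSketch {
        long flushedUpTo;   // highest log instant known to be on disk

        // Guarantee 1: write-ahead -- flush the log through the page's
        // last log record before the page itself can be written.
        void writePage(Page p) {
            if (p.lastLogInstant > flushedUpTo)
                flushLogUpTo(p.lastLogInstant);
            writePageToDisk(p);
        }

        // Guarantee 2: commit -- flush everything the transaction logged
        // before telling the client the commit succeeded.
        void commit(Xact x) {
            flushLogUpTo(x.lastLogInstant);
        }

        void flushLogUpTo(long instant) {
            // sync the log file(s) up to and including 'instant'
            flushedUpTo = instant;
        }
        void writePageToDisk(Page p) { /* ... */ }

        static class Page { long lastLogInstant; }
        static class Xact { long lastLogInstant; }
    }
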
> 
> 5. When are log records flushed to disk? Are they buffered in memory until
> the buffer is full or until one of the underlying transactions commits?

see above
> 
> 6.  Mike suggested a possible change to copy the log files to another disk.
> Do the current contents of the log files contain all of the data needed to
> effect "change data capture"? For example, assume that at 8am a backup copy
> of database A is made to another disk. Changes are made to database A
> throughout the day but no changes are made to database B (the copy). At 5pm
> could the log file(s) from database A be used to update database B so that
> database B would again be identical to database A? Or would some log
> operations need to capture additional info (e.g. physical undo vs. logical
> undo) for this to be possible?
> 
> I would be interested in working on this type of 'log mining' for both
> 'point in time' recovery and 'change data capture' type functionality.

Yes, all the changes are there; this is how current roll forward recovery
works.  I don't think any log operation changes would be necessary other
than adding some way to determine "time" for point in time recovery.  As
has been suggested, adding some time field to the begin or end
transaction log record is probably what is needed.

> 
> 
> 

Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> "Dibyendu Majumdar" wrote:

> As this is a complex area to describe . . .

You got that right!

1. Where in the format does a log record indicate the type of record (redo,
undo, etc)?

2. Is the type of record at all related to the "Groups" defined in
iapi\store\raw\Loggable.java?

3. If not, how do these groups fit into things?

4. Are log records always flushed to disk? I have seen references in the
code to the possible use of an "in memory" database where log records might
not be flushed. Also, in the simple case of a change to a single field on a
single page being made by a transaction, and then the transaction being rolled
back, is it possible that 'no' log record will be flushed to disk, or will two
log records always be flushed even though no commit was ever performed?

5. When are log records flushed to disk? Are they buffered in memory until
the buffer is full or until one of the underlying transactions commits?

6.  Mike suggested a possible change to copy the log files to another disk.
Do the current contents of the log files contain all of the data needed to
effect "change data capture"? For example, assume that at 8am a backup copy
of database A is made to another disk. Changes are made to database A
throughout the day but no changes are made to database B (the copy). At 5pm
could the log file(s) from database A be used to update database B so that
database B would again be identical to database A? Or would some log
operations need to capture additional info (e.g. physical undo vs. logical
undo) for this to be possible?

I would be interested in working on this type of 'log mining' for both
'point in time' recovery and 'change data capture' type functionality.



Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> "Dibyendu Majumdar" wrote:

> LogCounter is an implementation of LogInstant.

Why doesn't LogCounter perform error checking (overflow/underflow) for
'fileNumber' and 'filePosition' in the constructors, next and prior methods?
Is this for performance reasons?

LogCounter, like several other 'formatable' classes, does not read/write its
own StoredFormatId, nor does it have an instance or static variable for it.
Could Mike explain this aspect of the architecture? Is this because the
StoredFormatId must be read first in order to know which class should read
the object, so by the time the class reads the object the id has already
been read?

Why were 'fileNumber' and 'filePosition' defined as longs when they are not
permitted to have values larger than ints?
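
For what it's worth, the two values get packed into one long "log
instant" (file number in the high half, byte position in the low half),
which may be why the fields are longs - it avoids casts when shifting
and combining.  A simplified sketch of the packing; Derby's actual masks
may differ:

    class LogInstantSketch {
        static long makeInstant(long fileNumber, long filePosition) {
            return (fileNumber << 32) | filePosition;
        }
        static long fileNumber(long instant)   { return instant >>> 32; }
        static long filePosition(long instant) { return instant & 0xFFFFFFFFL; }
    }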


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Jean,

Attached is an updated recovery.xml. Please note that when you build 
this in forrest you will get error messages about broken links. This is 
because I have inserted links to Javadoc documentation. I think that 
this is okay - when the Javadoc stuff is uploaded, the links can be 
fixed. In the meantime, I have added a warning about the broken links.

All,

As this is a complex area to describe, I would be grateful for your 
comments and feedback.

Thanks and Regards

Dibyendu

Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

> Jean,
>
> Attached is a new document on recovery.

It's committed and the web site updated -- thanks for the contribution!

 -jean

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Jean,

Attached is a new document on recovery.

All,

Please review and send me comments and corrections.

Thanks

Dibyendu

Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

> Jean,
>
> Attached is an updated logformats.xml.
>
> Thanks

The web site is updated.

 -jean

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Jean,

Attached is an updated logformats.xml.

Thanks

Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

>From: "Jean T. Anderson" <jt...@bristowhill.com>
>  
>
>>Dibyendu Majumdar's doc on "Derby On Disk Page Format" is now live at 
>>http://incubator.apache.org/derby/papers/pageformats.html.
>>    
>>
>Jean,
>
>Attached is an updated version and an additional aart file.
><snip />
>  
>
'Tis committed.

 -jean


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
From: "Jean T. Anderson" <jt...@bristowhill.com>
> Dibyendu Majumdar's doc on "Derby On Disk Page Format" is now live at 
> http://incubator.apache.org/derby/papers/pageformats.html.

Jean,

Attached is an updated version and an additional aart file.

All,

I am going to work on the following areas in the next few days:

a) Logging/recovery
b) BTree notes 

I will post stuff as I progress.

Regards

Dibyendu

Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar's doc on "Derby On Disk Page Format" is now live at 
http://incubator.apache.org/derby/papers/pageformats.html.

Dibyendu, for the moment I added it to the "Derby Engine" category on 
the Paper's tab.

In about an hour -- I hope :-) --  I'm going to post a mock-up of what 
the Derby site would look like with more tabs in response to two posts 
last week.

(1) Kevin Foster posted this to derby-user:

> I understand that with the newer version of Forrest being used for the 
> Derby pages, it would be possible to add more tabs across the top 
> without impacting the layout of the content below the tabs.
>
> If that's true, what do people think of the idea of changing the set 
> of tabs to the following:
>
>     Home    Community   Manuals   Papers   Integration
>
> I think that's there enough content and usefulness on Community and 
> Integration to make them tabs in their own right.


(2) Dan Debrunner posted this to derby-dev:

>Is it possible to get the > Community tab to be expanded on the home
>page as well as the > Derby tab?
>
So Dan was reacting to the forrest 0.6 collapsible menus. Forrest 0.6 
doesn't support turning that feature off, though we could modify the 
forrest skin to get the behavior we want. --I don't much like the idea 
of modifying the skin ourselves because I'd rather that it be easy for 
people to check the derby web site out of svn, download forrest, and 
build the site with the defaults. But if anyone feels strongly about 
modifying the skin, speak up!

At any rate, Kevin's suggestion would be a way to address Dan's point.

Forrest 0.6 also has an easy sub-tab feature.  I want to show what tabs 
and subtabs look like, then ask people for feedback on how they want the 
web site to be organized.

 -jean


Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

>From: "Jean T. Anderson" <jt...@bristowhill.com>
>  
>
>>How about if I put this at the top:
>>
>><note label="Review Draft">
>>This document is a work-in-progress derived from Javadoc comments and 
>>explanations Mike Matrigali posted to the Derby lists. Please post 
>>questions, comments, and corrections to derby-dev@db.apache.org.
>></note>
>>
>>How's that?
>>    
>>
>That would be fine.
>Maybe you can add it as a <notice> in the heading section.
>  
>
As it turns out, content in <notice> is omitted from the resulting 
page. For now, I'll add it to your <abstract> so it'll get output above 
the table of contents. --We can always change it later.

 -jean

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
From: "Jean T. Anderson" <jt...@bristowhill.com>
> How about if I put this at the top:
> 
> <note label="Review Draft">
> This document is a work-in-progress derived from Javadoc comments and 
> explanations Mike Matrigali posted to the Derby lists. Please post 
> questions, comments, and corrections to derby-dev@db.apache.org.
> </note>
> 
> How's that?

That would be fine.
Maybe you can add it as a <notice> in the heading section.

Thanks.


Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

>Jean,
>
>I think that if the document is going to visible to everyone then it is also
>worth mentioning that most of it is derived from Javadoc comments and
>explanations given by Mike Matrigali.
>  
>
How about if I put this at the top:

<note label="Review Draft">
This document is a work-in-progress derived from Javadoc comments and 
explanations Mike Matrigali posted to the Derby lists. Please post 
questions, comments, and corrections to derby-dev@db.apache.org.
</note>

How's that?

 -jean


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Jean,

I think that if the document is going to visible to everyone then it is also
worth mentioning that most of it is derived from Javadoc comments and
explanations given by Mike Matrigali.

Thanks



Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Jean wrote:

> I'll try to get this up today, but I need to understand what you mean by 
> "put this up without publishing it" because making it available on the 
> Derby web site so others can review it is in a sense "publishing" it.

I just meant that the document needs to be reviewed for accuracy etc.
A warning that this is WIP and under review will be fine.

Regards


Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

>...
>Is it possible to put this up without publishing it - so that others can
>review and comment on it?
>  
>
Hi, Dibyendu,

Thanks so much for the contribution!

I'll try to get this up today, but I need to understand what you mean by 
"put this up without publishing it" because making it available on the 
Derby web site so others can review it is in a sense "publishing" it.

I could create a "For Review" tab on the web site and put it under 
there.
Or, you could add a <warning>Review draft! Post comments to 
derby-dev@db.apache.org</warning> to the top of the page.

Please let me know what would work best.

 -jean




Re: Derby architecture/design documents

Posted by Mike Matrigali <mi...@sbcglobal.net>.
I think some sort of pointers would be nice, and updating the source
files with better comments even nicer.  I can definitely help with
either writing and/or committing better comments in the store code.

Do note that this bigger-than-a-page-size stuff is only for people
interested in the internal details of store.  No code above store knows
anything about these page size boundaries, or whether the row/collumn
is long or not.  The store interface hides the implementation from
the clients of store.

There are no limits on row or column size, other than that rows have to
fit in a single container, and we use 64 bit I/O to write those containers, so
I guess the limit is something like 2**64 (there are overheads so this
is not exact - it is big enough that I consider it unlimited).  More
real are JVM/OS/hard limits on how big a single file can be, and that
all rows in a single table have to share the space allowed in a single
container.

"Can't fit" does take into account the % reserved for updates and of
course the overhead in the page.  So a table with bigger reserved space
may have different size "long" rows than another table.  This all
happens in StoredPage.java at a very low level where we try to write
the row on the page, and then if it does not fit we "back up" and turn
it into a long row/column.  You won't see any code like if (row size >
constant), because store never knows at insert time how long the incoming
row is, because it calls out through interfaces to each of the
column datatypes to write themselves out.
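
A cartoon of that control flow, with hypothetical names (the real logic
lives in StoredPage.java and is far more involved):

    class LongRowSketch {
        void insertRow(Row row, PageBuffer out) {
            int mark = out.position();      // remember where this row started
            try {
                row.writeColumns(out);      // columns stream themselves out
            } catch (PageFullException e) {
                out.reset(mark);            // "back up" ...
                writeAsLongRow(row);        // ... and spill to overflow pages
            }
        }

        void writeAsLongRow(Row row) { /* chain the row across pages */ }

        static class Row {
            void writeColumns(PageBuffer out) throws PageFullException { }
        }
        static class PageBuffer {
            int position() { return 0; }
            void reset(int mark) { }
        }
        static class PageFullException extends Exception { }
    }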



RPost wrote:

>>"Dibyendu Majumdar" wrote:
>>
>>I think any questions are fine ... it will only help.
> 
> 
> Would it be useful to put refs to the source file that the comments were
> extracted from? For example, the page header comments are extracted from
> StoredPage.java. A reference to the source file would make it easier for
> users to find the original comments in their original context.
> 
> The section about long rows says 'A row is long if all of its columns can't
> fit on a single page'.
> 
> 1. Is a row allowed to be larger than the page size?
> 2. Does 'can't fit on a single page take into account the '% of the page to
> keep free for updates'?
> 
> By the way, are you the Dibyendu Majumdar at
> http://www.mazumdar.demon.co.uk/projects.html that is authoring the DM1
> database hosted at SourceForge?
> 
> 

Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
From: "RPost" <rp...@pacbell.net>

> Would it be useful to put refs to the source file that the comments were
> extracted from? For example, the page header comments are extracted from
> StoredPage.java. A reference to the source file would make it easier for
> users to find the original comments in their original context.

Yes, I will add that.

> The section about long rows says 'A row is long if all of its columns
can't
> fit on a single page'.
>
> 1. Is a row allowed to be larger than the page size?

Yes.

> 2. Does 'can't fit on a single page take into account the '% of the page
to
> keep free for updates'?

Don't know ... perhaps someone else can answer this.

> By the way, are you the Dibyendu Majumdar at
> http://www.mazumdar.demon.co.uk/projects.html that is authoring the DM1
> database hosted at SourceForge?

Yes, DM1 is one of my never-ending projects.

Regards



Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
> "Dibyendu Majumdar" wrote:
>
> I think any questions are fine ... it will only help.

Would it be useful to put refs to the source file that the comments were
extracted from? For example, the page header comments are extracted from
StoredPage.java. A reference to the source file would make it easier for
users to find the original comments in their original context.

The section about long rows says 'A row is long if all of its columns can't
fit on a single page'.

1. Is a row allowed to be larger than the page size?
2. Does 'can't fit on a single page take into account the '% of the page to
keep free for updates'?

By the way, are you the Dibyendu Majumdar at
http://www.mazumdar.demon.co.uk/projects.html that is authoring the DM1
database hosted at SourceForge?


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
"RPost" wrote:

> Also, it might be a good idea for you or Jean to make some kind of
statement
> as to what kind of review and comments you want us to make at this early
> stage. Do you want us to include questions that we would like to
ultimately
> see answers for? Or would you rather we hold any questions for a 'more
> questions' session later?

I think any questions are fine ... it will only help.

Thanks



Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
I took a look at the attached XML file itself and think it looks good. The
text is clear and concise and I think the format will work well.

Does the posted version need to have some sort of copyright notice in it?

Also, it might be a good idea for you or Jean to make some kind of statement
as to what kind of review and comments you want us to make at this early
stage. Do you want us to include questions that we would like to ultimately
see answers for? Or would you rather we hold any questions for a 'more
questions' session later?

You've done a good job with this piece!

----- Original Message -----
From: "Dibyendu Majumdar" <di...@mazumdar.demon.co.uk>
To: "Derby Development" <de...@db.apache.org>
Sent: Sunday, January 30, 2005 5:16 PM
Subject: Re: Derby architecture/design documents


> Jean,
>
> Attached is an initial document on Derby page formats. I used forrest 0.6,
> so it should be easy to integrate.
>
> Is it possible to put this up without publishing it - so that others can
> review and comment on it?
>
> BTW, it is not complete ....
>
> Regards
>
> Dibyendu
>


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Jean,

Attached is an initial document on Derby page formats. I used forrest 0.6,
so it should be easy to integrate.

Is it possible to put this up without publishing it - so that others can
review and comment on it?

BTW, it is not complete ....

Regards

Dibyendu

Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
I should really clarify something.

If you check out the derby web site using the instructions in 
http://incubator.apache.org/derby/papers/derby_web.html and you want to 
build that site, you must build it using forrest 0.5.1 until I upgrade 
it to 0.6. (Some things need to be moved to different locations.)

I often build/test new pages in a tiny seed site because it builds so 
much faster. And for that I've been using forrest 0.6.

If you use forrest's v20.dtd, you can easily move xml files between 
0.5.1 and 0.6.

 -jean

Jean T. Anderson wrote:

> Dibyendu Majumdar wrote:
>
>> Jean wrote:
>>
>>> Excellent! Did you download forrest 0.6? That would be good. The 
>>> derby site is currently at 0.5.1, which I'm planning to upgrade to 
>>> 0.6 as soon as I can.
>>> ...
>>
>> I downloaded 0.5.1 but will switch to 0.6. 
>
> If you find yourself using 0.5.1, no worries, but I suggest using the 
> v20.dtd instead of the v12.dtd:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" 
> "http://forrest.apache.org/dtd/document-v20.dtd">
> <document>
> ....
>
> Using the v20.dtd also apparently positions us better for the leap to 
> 0.7.
>
> -jean
>


Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

>Jean wrote:
>  
>
>>Excellent! Did you download forrest 0.6? That would be good. The derby 
>>site is currently at 0.5.1, which I'm planning to upgrade to 0.6 as soon 
>>as I can.
>>...
>>    
>>
>I downloaded 0.5.1 but will switch to 0.6.
>  
>

If you find yourself using 0.5.1, no worries, but I suggest using the 
v20.dtd instead of the v12.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" 
"http://forrest.apache.org/dtd/document-v20.dtd">
<document>
....

Using the v20.dtd also apparently positions us better for the leap to 0.7.

 -jean


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Jean wrote:

> Excellent! Did you download forrest 0.6? That would be good. The derby 
> site is currently at 0.5.1, which I'm planning to upgrade to 0.6 as soon 
> as I can. I'm just mentioning that because you might notice that built 
> pages might look slightly different. But don't worry, I'll catch the 
> derby site up to 0.6 real soon now and the work you're doing will be 
> extra motivation.
> 

I downloaded 0.5.1 but will switch to 0.6.

Thanks


Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

>Jean wrote:
>  
>
>>Posting the document to derby-dev is best. So far I'm the only one doing
>>a 'forrest site' to build the site, but I'm hoping that will change!  :-)
>>    
>>
>I have downloaded forrest and will create the documents in forrest xml.
>Hopefully, this will make it easier to integrate.
>  
>
Excellent! Did you download forrest 0.6? That would be good. The derby 
site is currently at 0.5.1, which I'm planning to upgrade to 0.6 as soon 
as I can. I'm just mentioning that because you might notice that built 
pages might look slightly different. But don't worry, I'll catch the 
derby site up to 0.6 real soon now and the work you're doing will be 
extra motivation.

>In terms of organising the documents, is it possible to have another tab
>called "Internals" where all the design documents can go? Existing documents
>under "Derby Engine" could also be moved to this tab.
>  
>
That would work. Another option that will work with 0.6 is subtabs. For 
example, that would allow a Papers tab with "Internals" and "Externals" 
subtabs. We can play around with it and see what folks prefer.

>Another point is that some of the stuff under "Derby Engine" would get
>incorporated into the design documentation. I assume that as and when that
>happens, these can be removed.
>  
>
I agree. It would be very confusing to leave the old stuff in place. If 
somebody felt strongly about keeping them available, they could be moved 
into an OLD spot.

>I hope to send my first attempt this weekend - how do I send the files?
>Is a zip file or tar archive okay?
>  
>
Either zip or tar will work fine.

thanks!

 -jean


Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Jean wrote:

> Posting the document to derby-dev is best. So far I'm the only one doing
> a 'forrest site' to build the site, but I'm hoping that will change!  :-)

I have downloaded forrest and will create the documents in forrest xml.
Hopefully, this will make it easier to integrate.

In terms of organising the documents, is it possible to have another tab
called "Internals" where all the design documents can go? Existing documents
under "Derby Engine" could also be moved to this tab.

Another point is that some of the stuff under "Derby Engine" would get
incorporated into the design documentation. I assume that as and when that
happens, these can be removed.

I hope to send my first attempt this weekend - how do I send the files?
Is a zip file or tar archive okay?

Thanks and Regards

Dibyendu



Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

>Jean wrote:
>  
>
>>If you'd like to store them on the derby web site, I'll be happy to make 
>>them available. Or other commiters can commit them and I'll do the 
>>'forrest site' to add them to the built site. 
>>    
>>
>Okay - should I send each document to this list as I go along?
>If I write them up in simple HTML, would that be acceptable? I will
>add diagrams to illustrate concepts as my own understanding improves.
>  
>
Posting the document to derby-dev is best. So far I'm the only one doing 
a 'forrest site' to build the site, but I'm hoping that will change!  :-)

This page suggests a few formats:
http://incubator.apache.org/derby/papers/index.html#How+to+Contribute+Papers

The main thing to decide is whether you want the paper to be a stand-alone 
doc on the web site or incorporated into the web site with 
forrest-generated menus.

Stand-alone documents are the least work and are the easiest to add to 
the site because they aren't integrated. Here's an example of what I 
mean by stand-alone:

    http://incubator.apache.org/derby/DOTS_Derby.html

Integrating html files into the web site is more work; here are two 
examples of what I mean:

   http://incubator.apache.org/derby/DerbyToDo.html
   http://incubator.apache.org/derby/papers/JDBCImplementation.html

If you want the paper to be integrated into the site, the easiest 
approach is a very, very simple html file. (Such a file actually gets 
added to the forrest src tree as an ihtml file.) Don't include a table 
of contents because Forrest generates a table of contents for you based 
on the <hN> tags. And apart from <em>...</em> and <strong>...</strong> 
tags, you really don't even need to specify any fonts -- forrest will 
produce something based on the site setup.
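
To make "very, very simple" concrete, the sort of file I have in mind 
is a bare skeleton like this (the title and headings are placeholders):

<html>
<head>
<title>A Placeholder Title</title>
</head>
<body>
<h1>First Section</h1>
<p>Some text, with <em>emphasis</em> or <strong>strong</strong> markup 
only where it's really needed.</p>
<h2>A Subsection</h2>
<p>More text.</p>
</body>
</html>

Forrest picks up the h1/h2 headings for the generated table of 
contents, so nothing else is needed.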

I have had no success incorporating html files generated by Microsoft 
Word -- so far the forrest result has been a blank page. I have had 
mixed results incorporating html files generated by Open Office. The 
DerbyToDo.html was easily integrated and I suspect that was because it 
started as an Open Office document. Header info in the 
JDBCImplementation.html file suggests it was originally created by 
Microsoft Word, then converted to Open Office ... and it was something 
of a format mess.

So, if you want to integrate an html file into the web site, simplest is 
best.

>BTW, I am doing this in my spare time, so you are most likely to hear from
>me between 22:00 and 01:00 hrs GMT.
>
No problem! And everyone appreciates whatever effort you can devote to it.

 -jean



Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Jean wrote:

> If you'd like to store them on the derby web site, I'll be happy to make 
> them available. Or other commiters can commit them and I'll do the 
> 'forrest site' to add them to the built site. 

Okay - should I send each document to this list as I go along?
If I write them up in simple HTML, would that be acceptable? I will
add diagrams to illustrate concepts as my own understanding improves.

> (And a million thanks for offering to do this!)

My pleasure, and thanks to all of you at IBM for releasing Derby to 
open source. The chance to understand how a proper DBMS works is 
wonderful. 

BTW, I am doing this in my spare time, so you are most likely to hear from
me between 22:00 and 01:00 hrs GMT.

Regards

Dibyendu


Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Dibyendu Majumdar wrote:

>Hello,
>
>If it is okay to do this, I'd like to volunteer to extract information out
>of the Java source files and create design documents.
>... Is there
>any place where work-in-progress documents can be stored ?
>...
>  
>
If you'd like to store them on the derby web site, I'll be happy to make 
them available. Or other commiters can commit them and I'll do the 
'forrest site' to add them to the built site. (And a million thanks for 
offering to do this!)

Also, I've been meaning to start a miscellaneous index page under the 
Derby Engine section on the papers tab 
(http://incubator.apache.org/derby/papers/index.html) that just lists 
pointers to email topics.  That might even help feed your work. --For 
example, Jeff Lichtman posted info about the sql parser a while back 
that I've been meaning to dig out, and there are no doubt other golden 
nuggets getting buried.

 -jean



Re: Derby architecture/design documents

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.
Hello,

If it is okay to do this, I'd like to volunteer to extract information out
of the Java source files and create design documents. Can I also borrow
stuff from the presentations? If IBM developers can answer questions
where I get stuck, that will be great.

I am interested in the "store" module so I can start from there. Is there
any place where work-in-progress documents can be stored ?

BTW I have been using Derby as one of the databases I test against in my
project SimpleJTA - http://www.simplejta.org. I have to say that the Derby
XA implementation seems more stable and correct than Oracle's. I am eagerly
looking forward to forthcoming support for XA in the Network Server.

Regards



Re: Derby architecture/design documents

Posted by Daniel John Debrunner <dj...@debrunners.com>.
RPost wrote:

> IBM doesn't build anything without creating extensive architecture and
> design documents.

That may be true, but IBM didn't create Cloudscape; they acquired the
technology.
http://incubator.apache.org/derby/#Derby+History

The Cloudscape development model has always been light on design
documents; the design was intended to be captured in the code itself,
its javadoc, and general comments. Functional specifications were
written, and they are all reflected in the user documentation.


> Did IBM donate their cloudscape architecture and design documents to
> Apache?
>
> If not, I suggest that we ask them to donate these documents. Since
> IBM may
> have an internal ongoing cloudscape project these documents may need to be
> vetted first.
>
> Who could make this request?

You just did :-)

I'll try to find some time to see if anything is available and
useful; however, we have been busy donating the code, then the tests,
and now more code for a new network client. My guess is that nothing
exists that would help people understand an overall view of the
system, though there may be some focussed papers on certain "edge" areas.

Of course any of the IBM developers on the list with Cloudscape
experience are always willing to answer design or code questions, and I
like the ideas that shahbaz chaudhary proposes in another e-mail.

Dan.



Re: Derby architecture/design documents

Posted by RPost <rp...@pacbell.net>.
IBM doesn't build anything without creating extensive architecture and
design documents.

Did IBM donate their cloudscape architecture and design documents to Apache?

If not, I suggest that we ask them to donate these documents. Since IBM may
have an internal ongoing cloudscape project, these documents may need to be
vetted first.

Who could make this request?

----- Original Message ----- 
From: "Jean T. Anderson" <jt...@bristowhill.com>
To: "Derby Development" <de...@db.apache.org>
Sent: Tuesday, January 18, 2005 10:23 AM
Subject: Re: Derby architecture/design documents


> Ashish Srivastava wrote:
>
> >...
> >Can any one help me in finding the direction where to enter in to
> >Derby as a developer and can any one provide me the link where I can
> >get the architecture and design document of Derby?
> >...
> >
> >
> A couple "derby engine" docs are here:
> http://incubator.apache.org/derby/papers/index.html
>
> Dan Debrunner's "Internals of Derby" presentation can be downloaded from
> here:
> http://incubator.apache.org/derby/papers/MiscPresentations.html
>
> And the "Get Involved!" page has some tips:
> http://incubator.apache.org/derby/derby_comm.html
>
> I hope this helps. --And please feel free to post your own tips for
> others as you figure things out.
>
> regards,
>
>  -jean
>


Re: Derby architecture/design documents

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.
Ashish Srivastava wrote:

>...
>Can any one help me in finding the direction where to enter in to
>Derby as a developer and can any one provide me the link where I can
>get the architecture and design document of Derby?
>...
>  
>
A couple "derby engine" docs are here:
http://incubator.apache.org/derby/papers/index.html

Dan Debrunner's "Internals of Derby" presentation can be downloaded from 
here:
http://incubator.apache.org/derby/papers/MiscPresentations.html

And the "Get Involved!" page has some tips:
http://incubator.apache.org/derby/derby_comm.html

I hope this helps. --And please feel free to post your own tips for 
others as you figure things out.

regards,

 -jean