Posted to derby-commits@db.apache.org by jt...@apache.org on 2005/02/02 04:29:25 UTC
svn commit: r149475 - in incubator/derby/site/trunk:
build/site/papers/container-format.png build/site/papers/pageformats.html
src/documentation/content/xdocs/papers/container-format.aart
src/documentation/content/xdocs/papers/pageformats.xml
Author: jta
Date: Tue Feb 1 19:29:24 2005
New Revision: 149475
URL: http://svn.apache.org/viewcvs?view=rev&rev=149475
Log:
Committed changes to papers/pageformats by Dibyendu Majumdar <di...@mazumdar.demon.co.uk> .
Added:
incubator/derby/site/trunk/build/site/papers/container-format.png (with props)
incubator/derby/site/trunk/src/documentation/content/xdocs/papers/container-format.aart (with props)
Modified:
incubator/derby/site/trunk/build/site/papers/pageformats.html
incubator/derby/site/trunk/src/documentation/content/xdocs/papers/pageformats.xml
Added: incubator/derby/site/trunk/build/site/papers/container-format.png
URL: http://svn.apache.org/viewcvs/incubator/derby/site/trunk/build/site/papers/container-format.png?view=auto&rev=149475
==============================================================================
Binary file - no diff available.
Propchange: incubator/derby/site/trunk/build/site/papers/container-format.png
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Modified: incubator/derby/site/trunk/build/site/papers/pageformats.html
URL: http://svn.apache.org/viewcvs/incubator/derby/site/trunk/build/site/papers/pageformats.html?view=diff&r1=149474&r2=149475
==============================================================================
--- incubator/derby/site/trunk/build/site/papers/pageformats.html (original)
+++ incubator/derby/site/trunk/build/site/papers/pageformats.html Tue Feb 1 19:29:24 2005
@@ -188,10 +188,15 @@
</div>
<h1>Derby On Disk Page Format</h1>
<div class="abstract">This document describes the storage format of Derby disk pages.
+
This is a work-in-progress derived from Javadoc comments and
+
from explanations Mike Matrigali posted to the Derby lists.
+
Please post questions, comments, and corrections to
+
derby-dev@db.apache.org.
+
</div>
<div id="minitoc-area">
<ul class="minitoc">
@@ -220,517 +225,1092 @@
</li>
<li>
<a href="#allocpage">Allocation Page</a>
+<ul class="minitoc">
+<li>
+<a href="#%0A%0A%09Alloc+Page+detailed+implementation+notes">
+
+ Alloc Page detailed implementation notes</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Allocation+Extent">Allocation Extent</a>
</li>
</ul>
</div>
-
-
<a name="N1000F"></a><a name="introduction"></a>
<h2 class="boxed"> Introduction </h2>
<div class="section">
<p>Derby stores table and index data in Containers, which currently map
- to files in the
- <span class="codefrag">seg0</span>
- directory of the database. Data is stored in pages within the container.</p>
-<div class="frame fixme">
-<div class="label">Fixme (Dibyendu Majumdar)</div>
-<div class="content"> Do all containers map to a single file, or does each container map
- to a file? </div>
-</div>
+
+ to files in the <span class="codefrag">seg0</span>
+
+ directory of the database. In the current Derby implementation there is a 1 to 1 mapping of
+
+ containers to files. Two containers never map to a single file and 1
+
+ container never maps to multiple files.</p>
+<p>
+
+ Data is stored in pages within the container.</p>
<p>A page contains a set of records, which can be accessed by "slot", which
+
defines the order of the records on the page, or by "id" which defines
+
the identity of the records on the page. Clients access records by both
+
slot and id, depending on their needs.</p>
-<p>There are two types of pages - Raw Stored Pages which hold data, and
- Raw Stored Alloc Pages which hold page allocation information.</p>
<p>A Table or a BTree index provides a row-based access mechanism (row-based
+
access interface is known as conglomerate). Rows are mapped to records
- in pages, in case of a table, a single row can span multiple records in
+
+ in data pages; in case of a table, a single row can span multiple records in
+
multiple pages.</p>
+<p>A container can have three types of pages:</p>
+<ul>
+
+<li>Header Page - which is just a specialized version of the Alloc Page.</li>
+
+<li>Data Pages which hold data, and</li>
+
+<li>Alloc Pages which hold page allocation information. An Alloc page is a specialized version of the Data page.</li>
+
+</ul>
+<p>The container can be visualised as:<br>
+<img alt="" src="container-format.png"></p>
+<p>
+
+Header Page is currently always page 0 of the container. It
+
+contains information that raw store needs to maintain about the
+
+container once per container, and is currently implemented as an Alloc
+
+Page which "borrows" space from the alloc page for its information.
+
+The original decision was that the designers did not want to waste a whole page for
+
+header information, so a part of the page was used and the first allocation
+
+map was put on the second half of it. See <span class="codefrag">AllocPage.java</span> for info about layout and
+
+borrowing.
+
+</p>
+<p>
+
+<a href="#allocpage"> Allocation Page</a> - After page 0, all subsequent Allocation pages only
+
+have allocation bit maps.
+
+</p>
</div>
-<a name="N10029"></a><a name="storedpage"></a>
+<a name="N10048"></a><a name="storedpage"></a>
<h2 class="boxed">Data Page Format</h2>
<div class="section">
<p>A data page is broken into five sections.
- <img alt="" src="page-format.png">
- </p>
-<a name="N10036"></a><a name="formatid"></a>
+
+ <img alt="" src="page-format.png"></p>
+<a name="N10054"></a><a name="formatid"></a>
<h3 class="boxed">Format Id </h3>
<p> The formatId is a 4-byte array; it contains the format Id of this
+
page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.</p>
-<a name="N10040"></a><a name="pageheader"></a>
+<a name="N1005E"></a><a name="pageheader"></a>
<h3 class="boxed"> Page Header </h3>
<p> The page header is a fixed size, 56 bytes. </p>
-<table class="ForrestTable" cellspacing="1" cellpadding="4">
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<tr>
-<tr>
-
<th colspan="1" rowspan="1">Size</th>
- <th colspan="1" rowspan="1">Type</th>
- <th colspan="1" rowspan="1">Description</th>
-
+ <th colspan="1" rowspan="1">Type</th>
+ <th colspan="1" rowspan="1">Description</th>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">1 byte</td>
- <td colspan="1" rowspan="1">boolean</td>
- <td colspan="1" rowspan="1">is page an overflow page</td>
-
+ <td colspan="1" rowspan="1">boolean</td>
+ <td colspan="1" rowspan="1">is page an overflow page</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">1 byte</td>
- <td colspan="1" rowspan="1">byte</td>
- <td colspan="1" rowspan="1">
+ <td colspan="1" rowspan="1">byte</td>
+ <td colspan="1" rowspan="1">
+
<p>page status is either VALID_PAGE or INVALID_PAGE(a field
+
maintained in base page)</p>
-
+
 <p>page goes through the following transition:
+
<br>
+
VALID_PAGE <-> deallocated page -> free page <->
+
VALID_PAGE</p>
-
+
<p>deallocated and free page are both INVALID_PAGE as far as BasePage
+
is concerned.
+
<br>
+
 When a page is deallocated, it transitions from VALID_PAGE
+
 to INVALID_PAGE.
+
 <br>
+
 When a page is allocated, it transitions from INVALID_PAGE
+
to VALID_PAGE.</p>
-</td>
+</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">8 bytes</td>
- <td colspan="1" rowspan="1">long</td>
- <td colspan="1" rowspan="1">pageVersion (a field maintained in base page)</td>
-
+ <td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">pageVersion (a field maintained in base page)</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">2 bytes</td>
- <td colspan="1" rowspan="1">unsigned short</td>
- <td colspan="1" rowspan="1">number of slots in slot offset table</td>
-
+ <td colspan="1" rowspan="1">unsigned short</td>
+ <td colspan="1" rowspan="1">number of slots in slot offset table</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">4 bytes</td>
- <td colspan="1" rowspan="1">integer</td>
- <td colspan="1" rowspan="1">next record identifier</td>
-
+ <td colspan="1" rowspan="1">integer</td>
+ <td colspan="1" rowspan="1">next record identifier</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">4 bytes</td>
- <td colspan="1" rowspan="1">integer</td>
- <td colspan="1" rowspan="1">generation number of this page (Future Use)</td>
-
+ <td colspan="1" rowspan="1">integer</td>
+ <td colspan="1" rowspan="1">generation number of this page (Future Use)</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">4 bytes</td>
- <td colspan="1" rowspan="1">integer</td>
- <td colspan="1" rowspan="1">previous generation of this page (Future Use)</td>
-
+ <td colspan="1" rowspan="1">integer</td>
+ <td colspan="1" rowspan="1">previous generation of this page (Future Use)</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">8 bytes</td>
- <td colspan="1" rowspan="1">bipLocation</td>
- <td colspan="1" rowspan="1">the location of the beforeimage page (Future Use)</td>
-
+ <td colspan="1" rowspan="1">bipLocation</td>
+ <td colspan="1" rowspan="1">the location of the beforeimage page (Future Use)</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">2 bytes</td>
- <td colspan="1" rowspan="1">unsigned short</td>
- <td colspan="1" rowspan="1">number of deleted rows on page. (new release 2.0)</td>
-
+ <td colspan="1" rowspan="1">unsigned short</td>
+ <td colspan="1" rowspan="1">number of deleted rows on page. (new release 2.0)</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">2 bytes</td>
- <td colspan="1" rowspan="1">unsigned short</td>
- <td colspan="1" rowspan="1">% of the page to keep free for updates</td>
-
+ <td colspan="1" rowspan="1">unsigned short</td>
+ <td colspan="1" rowspan="1">% of the page to keep free for updates</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">2 bytes</td>
- <td colspan="1" rowspan="1">short</td>
- <td colspan="1" rowspan="1">spare for future use</td>
-
+ <td colspan="1" rowspan="1">short</td>
+ <td colspan="1" rowspan="1">spare for future use</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">4 bytes</td>
- <td colspan="1" rowspan="1">long</td>
- <td colspan="1" rowspan="1">spare for future use (encryption uses to write random bytes
+ <td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">spare for future use (encryption uses to write random bytes
+
here).</td>
-
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">8 bytes</td>
- <td colspan="1" rowspan="1">long</td>
- <td colspan="1" rowspan="1">spare for future use</td>
-
+ <td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">spare for future use</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">8 bytes</td>
- <td colspan="1" rowspan="1">long</td>
- <td colspan="1" rowspan="1">spare for future use</td>
-
-</tr>
+ <td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">spare for future use</td>
+</tr>
+
</table>
<div class="frame note">
<div class="label">Note</div>
<div class="content">Spare space is guaranteed to be written with "0", so that future
+
 use of a field should either not use "0" as a valid data item or
+
 pick 0 as a valid default value so that on the fly upgrade can assume
+
 that 0 means the field was never assigned. </div>
</div>
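<p>The leading fields of the page header table above can be read with a few lines of Java. This is only a sketch: the class name <span class="codefrag">PageHeaderSketch</span> is hypothetical, big-endian byte order is an assumption, and only the first five fields (which follow the 4-byte format id) are decoded.</p>

```java
import java.nio.ByteBuffer;

/** Hypothetical reader for the leading page header fields; not a Derby class. */
final class PageHeaderSketch {
    // The 4-byte format id precedes the page header.
    static final int FORMAT_ID_SIZE = 4;

    final boolean isOverflowPage; // 1 byte, boolean
    final byte pageStatus;        // 1 byte, VALID_PAGE or INVALID_PAGE
    final long pageVersion;       // 8 bytes, long
    final int slotCount;          // 2 bytes, unsigned short widened to int
    final int nextRecordId;       // 4 bytes, integer

    PageHeaderSketch(byte[] page) {
        // ByteBuffer is big-endian by default; the on-disk byte order is an assumption.
        ByteBuffer buf = ByteBuffer.wrap(page);
        buf.position(FORMAT_ID_SIZE);
        isOverflowPage = buf.get() != 0;
        pageStatus = buf.get();
        pageVersion = buf.getLong();
        slotCount = buf.getShort() & 0xFFFF; // mask to treat the short as unsigned
        nextRecordId = buf.getInt();
    }
}
```

<p>The mask on the slot count matters because Java has no unsigned short type; without it, counts above 32767 would read as negative.</p>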
-<a name="N1016B"></a><a name="records"></a>
+<a name="N1018B"></a><a name="records"></a>
<h3 class="boxed"> Records </h3>
<p>The records section contains zero or more records. Each record starts
+
with a Record Header</p>
-<table class="ForrestTable" cellspacing="1" cellpadding="4">
-
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
<caption>Record Header</caption>
+
+<tr>
-<tr>
-
<th colspan="1" rowspan="1">Type</th>
- <th colspan="1" rowspan="1">Description</th>
-
+ <th colspan="1" rowspan="1">Description</th>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">1 byte</td>
- <td colspan="1" rowspan="1">
+ <td colspan="1" rowspan="1">
+
<p>Status bits for the record header</p>
+
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
-<table class="ForrestTable" cellspacing="1" cellpadding="4">
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">RECORD_INITIAL</td>
- <td colspan="1" rowspan="1">used when record header is first initialized</td>
-
+ <td colspan="1" rowspan="1">used when record header is first initialized</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">RECORD_DELETED</td>
- <td colspan="1" rowspan="1">used to indicate the record has been deleted</td>
-
+ <td colspan="1" rowspan="1">used to indicate the record has been deleted</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">RECORD_OVERFLOW</td>
- <td colspan="1" rowspan="1">used to indicate the record has been overflowed, it will
+ <td colspan="1" rowspan="1">used to indicate the record has been overflowed, it will
+
point to the overflow page and ID</td>
-
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">RECORD_HAS_FIRST_FIELD</td>
- <td colspan="1" rowspan="1">used to indicate that firstField is stored will be stored.
+ <td colspan="1" rowspan="1">used to indicate that firstField is stored.
+
 When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both are
+
 set, part of record is on the page, the record header also
+
 stores the overflow pointer to the next part of the record.</td>
-
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">RECORD_VALID_MASK</td>
- <td colspan="1" rowspan="1">A mask of valid bits that can be set currently, such that
+ <td colspan="1" rowspan="1">A mask of valid bits that can be set currently, such that
+
the following assert can be made: </td>
-
-</tr>
+</tr>
+
</table>
-</td>
+</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">compressed int</td>
- <td colspan="1" rowspan="1">record identifier</td>
-
+ <td colspan="1" rowspan="1">record identifier</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">compressed long</td>
- <td colspan="1" rowspan="1">overflow page only if RECORD_OVERFLOW is set</td>
-
+ <td colspan="1" rowspan="1">overflow page only if RECORD_OVERFLOW is set</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">compressed int</td>
- <td colspan="1" rowspan="1">overflow id only if RECORD_OVERFLOW is set</td>
-
+ <td colspan="1" rowspan="1">overflow id only if RECORD_OVERFLOW is set</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">compressed int</td>
- <td colspan="1" rowspan="1">first field only if RECORD_HAS_FIRST_FIELD is set - otherwise
+ <td colspan="1" rowspan="1">first field only if RECORD_HAS_FIRST_FIELD is set - otherwise
+
0</td>
-
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">compressed int</td>
- <td colspan="1" rowspan="1">number of fields in this portion - only if RECORD_OVERFLOW is
+ <td colspan="1" rowspan="1">number of fields in this portion - only if RECORD_OVERFLOW is
+
false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0</td>
-
-</tr>
+</tr>
+
</table>
<div class="frame note">
<div class="label">Long Rows</div>
<div class="content"> A row is long if all of its columns can't fit on a single page.
+
When storing a long row, the segment of the row which fits on the
+
page is left there, and a pointer column is added at the end of the
+
row. It points to another row in the same container on a different
+
page. That row will contain the next set of columns and a continuation
+
pointer if necessary. The overflow portion will be on an "overflow"
+
page, and that page may have overflow portions of other rows on it
+
(unlike overflow columns). </div>
</div>
<p>The Record Header is followed by one or more fields. Each field contains
+
a Field Header and optional Field Data.</p>
-<table class="ForrestTable" cellspacing="1" cellpadding="4">
-
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
<caption>Stored Field Header Format</caption>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">status</td>
- <td colspan="1" rowspan="1">
+ <td colspan="1" rowspan="1">
+
<p> The status is 1 byte, it indicates the state of the field.
+
A FieldHeader can be in the following states: </p>
+
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<tr>
-<table class="ForrestTable" cellspacing="1" cellpadding="4">
-
-<tr>
-
<td colspan="1" rowspan="1">NULL</td>
- <td colspan="1" rowspan="1">if the field is NULL, no field data length is stored</td>
-
+ <td colspan="1" rowspan="1">if the field is NULL, no field data length is stored</td>
+
</tr>
-
-<tr>
-
+
+<tr>
+
<td colspan="1" rowspan="1">OVERFLOW</td>
- <td colspan="1" rowspan="1">indicates the field has been overflowed to another page.
+ <td colspan="1" rowspan="1">indicates the field has been overflowed to another page.
+
overflow page and overflow ID is stored at the end of
+
 the user data. field data length must be a number greater than
+
or equal to 0, indicating the length of the field that
+
is stored on the current page. The format looks like this:
+
<img alt="" src="field-header-overflow.png">
+
overflowPage will be written as compressed long, overflowId
+
will be written as compressed Int</td>
-
+
</tr>
-
-<tr>
-
+
+<tr>
+
<td colspan="1" rowspan="1">NONEXISTENT</td>
- <td colspan="1" rowspan="1">the field no longer exists, e.g. column has been dropped
+ <td colspan="1" rowspan="1">the field no longer exists, e.g. column has been dropped
+
during an alter table</td>
-
+
</tr>
-
-<tr>
-
+
+<tr>
+
<td colspan="1" rowspan="1">EXTENSIBLE</td>
- <td colspan="1" rowspan="1">the field is of user defined data type. The field may
+ <td colspan="1" rowspan="1">the field is of user defined data type. The field may
+
be tagged.</td>
-
+
</tr>
-
-<tr>
-
+
+<tr>
+
<td colspan="1" rowspan="1">TAGGED</td>
- <td colspan="1" rowspan="1">the field is TAGGED if and only if it is EXTENSIBLE.</td>
-
+ <td colspan="1" rowspan="1">the field is TAGGED if and only if it is EXTENSIBLE.</td>
+
</tr>
-
-<tr>
-
+
+<tr>
+
<td colspan="1" rowspan="1">FIXED</td>
- <td colspan="1" rowspan="1">the field is FIXED if and only if it is used in the
+ <td colspan="1" rowspan="1">the field is FIXED if and only if it is used in the
+
log records for version 1.2 and higher.</td>
-
+
</tr>
-
+
</table>
-
-</td>
+</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">fieldDataLength</td>
- <td colspan="1" rowspan="1"> The fieldDataLength is only set if the field is not NULL. It
+ <td colspan="1" rowspan="1"> The fieldDataLength is only set if the field is not NULL. It
+
is the length of the field that is stored on the current page.
+
The fieldDataLength is a variable length CompressedInt. </td>
-
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">fieldData</td>
- <td colspan="1" rowspan="1">
+ <td colspan="1" rowspan="1">
+
<p> Overflow page and overflow id are stored as field data. If
+
the overflow bit in status is set, the field data is the overflow
+
information. When the overflow bit is not set in status, then,
+
 fieldData is the actual user data for the field. That means the
+
 field header consists only of the field status and the field data length.
+
<br>
+
A non-overflow field:
- <br>
-<img alt="" src="field-header-non-overflow.png"> <br>
+
+ <br>
+<img alt="" src="field-header-non-overflow.png"><br>
+
An overflow field:
- <br>
-<img alt="" src="field-header-overflow.png"> <br>
+
+ <br>
+<img alt="" src="field-header-overflow.png"><br>
<strong>overflowPage
- and overflowID</strong>
+
+ and overflowID</strong>
<br>
+
The overflowPage is a variable length CompressedLong, overflowID
+
is a variable Length CompressedInt. They are only stored when
+
the field state is OVERFLOW. And they are not stored in the field
+
header. Instead, they are stored at the end of the field data.
+
The reason we do that is to save a copy if the field has to overflow. </p>
-
-</td>
-</tr>
+</td>
+</tr>
+
</table>
<div class="frame note">
<div class="label">Long Columns</div>
<div class="content"> A column is long if it can't fit on a single page. A long column
+
 is marked as long in the base row, and its field contains a pointer
+
 to a chain of other rows in the same container which contain the data
+
 of the row. Each of the subsequent rows is on a page to itself. Each
+
 subsequent row, except for the last piece, has 2 columns: the first
+
 is the next segment of the row and the second is the pointer to the
+
 following segment. The last segment only has the data segment.
+
</div>
</div>
-<a name="N102C5"></a><a name="slottable"></a>
+<a name="N102E1"></a><a name="slottable"></a>
<h3 class="boxed">Slot Offset Table</h3>
<p>The slot offset table is a table of 6 or 12 bytes per record, depending
+
 on whether the pageSize is less than or greater than 64K: </p>
-<table class="ForrestTable" cellspacing="1" cellpadding="4">
-
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
<caption>Slot Table Record</caption>
+
+<tr>
-<tr>
-
<th colspan="1" rowspan="1">Size</th>
- <th colspan="1" rowspan="1">Content</th>
-
+ <th colspan="1" rowspan="1">Content</th>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">2 bytes (unsigned short) or 4 bytes (int)</td>
- <td colspan="1" rowspan="1">page offset for the record that is assigned to the slot</td>
-
+ <td colspan="1" rowspan="1">page offset for the record that is assigned to the slot</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">2 bytes (unsigned short) or 4 bytes (int)</td>
- <td colspan="1" rowspan="1">the length of the record on this page.</td>
-
+ <td colspan="1" rowspan="1">the length of the record on this page.</td>
+
</tr>
+
+<tr>
-<tr>
-
<td colspan="1" rowspan="1">2 bytes (unsigned short) or 4 bytes (int)</td>
- <td colspan="1" rowspan="1">the length of the reserved number of bytes for this record on
+ <td colspan="1" rowspan="1">the length of the reserved number of bytes for this record on
+
this page.</td>
-
-</tr>
+</tr>
+
</table>
<p>
+
First slot is slot 0. The slot table grows backwards. Slots are never
+
left empty. </p>
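<p>The slot entry sizes above can be expressed as a small calculation. This is a sketch under stated assumptions: the class <span class="codefrag">SlotTableSketch</span> is hypothetical, a page of exactly 64K is assumed to use the larger entries, and the offset helper assumes the slot table sits directly in front of the 8-byte checksum with slot 0 nearest to it.</p>

```java
/** Hypothetical helper for slot table arithmetic; not a Derby class. */
final class SlotTableSketch {
    static final int CHECKSUM_SIZE = 8;

    /** Width of one slot field: 2 bytes below 64K pages, 4 bytes otherwise (assumed boundary). */
    static int slotFieldSize(int pageSize) {
        return pageSize < 65536 ? 2 : 4;
    }

    /** Each slot entry holds three fields: record offset, record length, reserved length. */
    static int slotEntrySize(int pageSize) {
        return 3 * slotFieldSize(pageSize);
    }

    /**
     * Assumed layout: the table grows backwards from the checksum,
     * so higher slot numbers sit at lower page offsets.
     */
    static int slotEntryOffset(int pageSize, int slot) {
        return pageSize - CHECKSUM_SIZE - (slot + 1) * slotEntrySize(pageSize);
    }
}
```

<p>With a 4096-byte page this gives 6-byte entries, which matches the "6 or 12 bytes per record" figure stated above.</p>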
-<a name="N1030C"></a><a name="checksum"></a>
+<a name="N10328"></a><a name="checksum"></a>
<h3 class="boxed">Checksum</h3>
<p>8 bytes of a java.util.zip.CRC32 checksum of the entire page's contents
+
without the 8 bytes representing the checksum.</p>
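<p>A minimal sketch of the checksum rule follows. It assumes the checksum occupies the last 8 bytes of the page and is stored as a big-endian long; the class <span class="codefrag">PageChecksumSketch</span> and its method names are hypothetical, and only <span class="codefrag">java.util.zip.CRC32</span> comes from the description above.</p>

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical checksum helper; not a Derby class. */
final class PageChecksumSketch {
    static final int CHECKSUM_SIZE = 8;

    /** CRC32 over the whole page except the trailing checksum bytes. */
    static long compute(byte[] page) {
        CRC32 crc = new CRC32();
        crc.update(page, 0, page.length - CHECKSUM_SIZE);
        return crc.getValue();
    }

    /** Writes the checksum into the last 8 bytes (assumed position and byte order). */
    static void store(byte[] page) {
        ByteBuffer.wrap(page).putLong(page.length - CHECKSUM_SIZE, compute(page));
    }

    /** Recomputes the checksum and compares it with the stored value. */
    static boolean verify(byte[] page) {
        long stored = ByteBuffer.wrap(page).getLong(page.length - CHECKSUM_SIZE);
        return stored == compute(page);
    }
}
```

<p>Excluding the checksum bytes from the computation is what lets the value be written into the page without invalidating itself.</p>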
</div>
-<a name="N10317"></a><a name="allocpage"></a>
+<a name="N10333"></a><a name="allocpage"></a>
<h2 class="boxed">Allocation Page</h2>
<div class="section">
<p> An allocation page of the file container extends a normal Stored page,
+
with the exception that a hunk of space may be 'borrowed' by the file
+
container to store the file header.</p>
<p> The borrowed space is not visible to the alloc page even though it is
+
present in the page data array. It is accessed directly by the FileContainer.
+
Any change made to the borrowed space is not managed or seen by the allocation
+
page.</p>
<p> The reason for having this borrowed space is so that the container header
+
does not need to have a page of its own. </p>
-<p>
+<p>
<strong>Page Format</strong>
<br>
+
An allocation page extends a stored page, the on disk format is different
+
from a stored page in that N bytes are 'borrowed' by the container and
+
the page header of an allocation page will be slightly bigger than a normal
+
 stored page. These N bytes are stored between the page header and the record
+
space.</p>
<p> The reason these N bytes can't simply be a row is that they need
+
to be statically accessible by the container object to avoid a chicken
+
and egg problem of the container object needing to instantiate an alloc
+
page object before it can be objectified, and an alloc page object needing
+
 to instantiate a container object before it can be objectified. So these
+
N bytes must be stored outside of the normal record interface yet it must
+
be settable because only the first alloc page has this borrowed space.
+
 Other (non-first) alloc pages have N == 0.
+
<br>
+<img alt="" src="alloc-page.png"></p>
+<p>
+
+ N is a byte that indicates the size of the borrowed space. Once an alloc
+
+ page is initialized, the value of N cannot change.
+
+ </p>
+<p>
+
+ The maximum space that can be borrowed by the container is 256 bytes.
+
+ </p>
+<p>
+
+ The allocation pages are of the same page size as any other pages in the
+
+ container. The first allocation page of the FileContainer starts at the
+
+ first physical byte of the container. Subsequent allocation pages are
+
+ chained via the nextAllocPageOffset. Each allocation page is expected to
+
+ manage at least 1000 user pages (for 1K page size) so this chaining may not
+
+ be a severe performance hit. The logical -> physical mapping of an
+
+ allocation page is stored in the previous allocation page. The container
+
+ object will need to maintain this mapping.</p>
+<p>
+
+ The following fields are stored in the page header:
+
+ </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<caption>
+
+ Format of Alloc Page
+
+ </caption>
+
+<tr>
+
+<th colspan="1" rowspan="1">
+
+ Type
+
+ </th>
+ <th colspan="1" rowspan="1">
+
+ Description
+
+ </th>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">
+
+ int
+
+ </td>
+ <td colspan="1" rowspan="1">
+
+ FormatId
+
+ </td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">StoredPageHeader</td>
+ <td colspan="1" rowspan="1">see <a href="#storedpage">Stored Page Header</a></td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">nextAllocPageNumber - the next allocation page's number</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">nextAllocPageOffset - the file offset of the next allocation page</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">reserved1 - reserved for future usage</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">reserved2 - reserved for future usage</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">reserved3 - reserved for future usage</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">reserved4 - reserved for future usage</td>
+
+</tr>
-<img alt="" src="alloc-page.png">
+<tr>
+
+<td colspan="1" rowspan="1">byte</td>
+ <td colspan="1" rowspan="1">N - the size of the borrowed container info</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">byte[N]</td>
+ <td colspan="1" rowspan="1">containerInfo - the content of the borrowed container info</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">AllocExtent</td>
+ <td colspan="1" rowspan="1">The one and only extent on this alloc page.</td>
+
+</tr>
+
+</table>
+<p>
+
+ The allocation page contains allocation extent rows. In this first cut
+
+ implementation, there is only 1 allocation extent row per allocation page.
+
+ </p>
+<p>
+
+ The allocation extent row is an externalizable object and is directly
+
+ written on to the page by the alloc page. In other words, it will not be
+
+ converted into a storeableRow. This is to cut down overhead, enhance
+
+ performance and give more control of the size and layout of the allocation
+
+ extent row to the alloc page.
+
+ </p>
+<a name="N10408"></a><a name="%0A%0A%09Alloc+Page+detailed+implementation+notes"></a>
+<h3 class="boxed">
+
+ Alloc Page detailed implementation notes</h3>
+<p>
+
+ Create Container - an embryonic allocation page is formatted on disk by a
+
+ special static function to avoid instantiating a full AllocPage object.
+
+ This embryonic allocation page has enough information that it can find the
+
+ file header and not much else. Then the allocation page is properly
+
+ initialized by creating the first extent.
+
+ </p>
+<p>
+
+ Open Container - A static AllocPage method will be used to read off the
+
+ container information directly from disk. Even if
+
+ the first alloc page (page 0) is already in the page cache, it will not be
+
+ used because cleaning the alloc page will introduce a deadlock if the
+
+ container is not in the container cache. Long term, the first alloc page
+
+ should probably live in the container cache rather than in the page cache.
+
+ </p>
+<p>
+
+ Get Page - The first alloc page (page 0) will be read into the page cache.
+
+ Continue to follow the alloc page chain until the alloc page that manages
+
+ the specified page is found. From the alloc page, the physical offset of
+
+ the specified page is located.
+
+ </p>
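<p>The Get Page walk described above can be sketched as a loop over the alloc page chain. Everything here is illustrative: <span class="codefrag">AllocChainSketch</span>, <span class="codefrag">AllocInfo</span>, the <span class="codefrag">NO_NEXT</span> end marker, the in-memory map standing in for the page cache, and the formula for the physical offset are all assumptions, not Derby's implementation.</p>

```java
import java.util.Map;

/** Hypothetical walk along the alloc page chain; not a Derby class. */
final class AllocChainSketch {
    /** Hypothetical end-of-chain marker. */
    static final long NO_NEXT = -1L;

    /** Hypothetical in-memory view of one alloc page: its extent range plus the chain link. */
    static final class AllocInfo {
        final long extentOffset, extentStart, extentEnd, nextAllocPageNumber;

        AllocInfo(long extentOffset, long extentStart, long extentEnd, long nextAllocPageNumber) {
            this.extentOffset = extentOffset;
            this.extentStart = extentStart;
            this.extentEnd = extentEnd;
            this.nextAllocPageNumber = nextAllocPageNumber;
        }
    }

    /** Follows the chain from page 0 until an extent covers pageNumber. */
    static long physicalOffset(Map<Long, AllocInfo> allocPages, long pageNumber, int pageSize) {
        long allocPageNumber = 0L; // the first alloc page is page 0
        while (allocPageNumber != NO_NEXT) {
            AllocInfo info = allocPages.get(allocPageNumber);
            if (info == null) {
                break; // broken chain: fall through to the error below
            }
            if (pageNumber >= info.extentStart && pageNumber <= info.extentEnd) {
                // Assumed mapping: pages in an extent are laid out contiguously.
                return info.extentOffset + (pageNumber - info.extentStart) * pageSize;
            }
            allocPageNumber = info.nextAllocPageNumber;
        }
        throw new IllegalArgumentException("no alloc page manages page " + pageNumber);
    }
}
```

<p>The linear walk mirrors the remark above that chaining is acceptable because each alloc page manages many user pages, so the chain stays short.</p>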
+<p>
+
+ Cleaning alloc page - the alloc page is written out the same way any page
+
+ is written out. The container object will provide a call back to the alloc
+
+ page to write the current version of the container object back into the
+
+ borrowed space before the alloc page itself is written out.
+
+ </p>
+<p>
+
+ Cleaning the container object - get the first alloc page, dirty it and
+
+ clean it (which will cause it to call the container object to write itself
+
+ out into the borrowed space). The versioning of the container is
+
+ independent of the versioning of the alloc page. The container version is
+
+ stored inside the borrowed space and is opaque to the alloc page.
+
+ </p>
+<p>For the fields in an allocation extent row, see <a href="#Allocation+Extent">Allocation Extent</a>.</p>
+</div>
+
+<a name="N10422"></a><a name="Allocation+Extent"></a>
+<h2 class="boxed">Allocation Extent</h2>
+<div class="section">
+<p>
+
+ An allocation extent row manages the page status of pages in the extent.
+
+ AllocExtent is externalizable and is written to the AllocPage directly,
+
+ without being converted to a row first.
+
+ </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<caption>Format of Allocation Extent</caption>
+
+<tr>
+
+<th colspan="1" rowspan="1">Type</th>
+ <th colspan="1" rowspan="1">Description</th>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">extentOffset - the beginning physical byte offset of the first page of this extent</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">extentStart - the first logical page managed by this extent.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">extentEnd - the last page this extent can ever hope to manage.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">extentLength - the number of pages allocated in this extent</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">
+
+<p>extentStatus - status bits for the whole extent.
+
+ <br>HAS_DEALLOCATED - most likely, this extent has a deallocated
+
+ page somewhere. If !HAS_DEALLOCATED, the extent has no deallocated page.
+
+ <br>HAS_FREE - most likely, this extent has a free page somewhere.
+
+ If !HAS_FREE, there is no free page in the extent.
+
+ <br>ALL_FREE - most likely, this extent only has free pages, good
+
+ candidate for shrinking the file.
+
+ If !ALL_FREE, the extent is not all free.
+
+ <br>HAS_UNFILLED_PAGES - most likely, this extent has unfilled pages.
+
+ If !HAS_UNFILLED_PAGES, all pages are filled.
+
+ <br>KEEP_UNFILLED_PAGES - this extent keeps track of unfilled pages
+
+ (post v1.3). If not set, this extent has no notion of
+
+ unfilled page and has no unFilledPage bitmap.
+
+ <br>NO_DEALLOC_PAGE_MAP - this extent does not have both a dealloc and a
+
+ free page bit maps. Prior to 2.0, there are 2 bit
+
+ maps, a deallocate page bit map and a free page bit
+
+ map. Cloudscape 2.0 and later merged the dealloc page
+
+ bit map into the free page bit map.
+
+ <br>RETIRED - this extent contains only 'retired' pages, never use
+
+ any page from this extent. The pages don't actually
+
+ exist, i.e., it maps to nothing (physicalOffset is
+
+ garbage). The purpose of this extent is to blot out a
+
+ range of logical page numbers that no longer exists
+
+ for this container. Use this to reuse a physical page
+
+ when a logical page has exhausted all recordIds or for
+
+ logical pages that have been shrunk out.
+
+ </p>
+
+</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">preAllocLength - the number of pages that have been preallocated</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">reserved1</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">reserved2 - reserved for future use</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">reserved3 - reserved for future use</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">FreePages(bit)</td>
+ <td colspan="1" rowspan="1">Bitmap of free pages. Bit[i] is ON if page is free for immediate (re)use.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">unFilledPages(bit)</td>
+ <td colspan="1" rowspan="1">Bitmap of pages that have free space. Bit[i] is ON if page is likely to be &lt; 1/2 full.</td>
+
+</tr>
+
+</table>
+<p>
+
+ org.apache.derby.iapi.services.io.FormatableBitSet is used to store the bit map.
+
+ FormatableBitSet is an externalizable class.
+
+ </p>
+<p>
+
+ A page can have the following logical state:
+
+ <br>Free - a page that is free to be used
+
+ <br>Valid - a page that is currently in use
+
+ </p>
+<p>
+
+ There is also a transitional kind of page: pages that have been
+
+ allocated on disk but have not yet been used. These pages are Free.
+
+ </p>
+<p>
+
+ Bit[K] freePages
+
+ Bit[i] is ON iff page i may be free for reuse. User must get the
+
+ dealloc page lock on the free page to make sure the transaction.
+
+ </p>
+<p>
+
+ K is the size of the bit array; it must be >= length.
+
</p>
</div>
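
The allocation extent described above pairs a status word with a free-page bitmap. The following is a minimal Java sketch of that relationship, not Derby's AllocExtent: the class name `ExtentSketch`, its methods, and the numeric flag values are hypothetical and chosen only for illustration, and it uses `java.util.BitSet` in place of FormatableBitSet. Unlike the real status bits, which are described as "most likely" hints, this sketch recomputes them exactly on every change.

```java
import java.util.BitSet;

// Hypothetical illustration only: names and flag values are assumptions,
// not the constants or layout used by Derby's AllocExtent.
final class ExtentSketch {
    static final int HAS_FREE = 0x2;  // assumed value: at least one free page
    static final int ALL_FREE = 0x4;  // assumed value: every allocated page is free

    private final long extentStart;   // first logical page managed by this extent
    private final int extentLength;   // number of pages allocated in this extent
    private final BitSet freePages;   // bit i ON => page (extentStart + i) may be free
    private int extentStatus;         // status bits for the whole extent

    ExtentSketch(long extentStart, int extentLength) {
        this.extentStart = extentStart;
        this.extentLength = extentLength;
        this.freePages = new BitSet(extentLength); // all pages start as Valid (in use)
        this.extentStatus = 0;
    }

    // Mark a logical page as free for reuse and refresh the status bits.
    void markFree(long pageNumber) {
        freePages.set(index(pageNumber));
        recomputeStatus();
    }

    // Mark a logical page as in use again and refresh the status bits.
    void markValid(long pageNumber) {
        freePages.clear(index(pageNumber));
        recomputeStatus();
    }

    boolean isFree(long pageNumber) {
        return freePages.get(index(pageNumber));
    }

    int status() {
        return extentStatus;
    }

    // Translate a logical page number into a bit position inside this extent.
    private int index(long pageNumber) {
        long i = pageNumber - extentStart;
        if (i < 0 || i >= extentLength) {
            throw new IllegalArgumentException("page not managed by this extent: " + pageNumber);
        }
        return (int) i;
    }

    private void recomputeStatus() {
        int free = freePages.cardinality();
        int status = 0;
        if (free > 0) {
            status |= HAS_FREE;
        }
        if (extentLength > 0 && free == extentLength) {
            status |= ALL_FREE;
        }
        extentStatus = status;
    }
}
```

Keeping the summary bits next to the bitmap lets a caller skip an extent without scanning the bitmap, which is presumably why the status word exists.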
Added: incubator/derby/site/trunk/src/documentation/content/xdocs/papers/container-format.aart
URL: http://svn.apache.org/viewcvs/incubator/derby/site/trunk/src/documentation/content/xdocs/papers/container-format.aart?view=auto&rev=149475
==============================================================================
--- incubator/derby/site/trunk/src/documentation/content/xdocs/papers/container-format.aart (added)
+++ incubator/derby/site/trunk/src/documentation/content/xdocs/papers/container-format.aart Tue Feb 1 19:29:24 2005
@@ -0,0 +1,13 @@
++--------+
+| header |
++--------+
+| data |
++--------+
+| data |
++--------+
+| ... |
++--------+
+| alloc |
++--------+
+| data |
++--------+
Propchange: incubator/derby/site/trunk/src/documentation/content/xdocs/papers/container-format.aart
------------------------------------------------------------------------------
svn:eol-style = native
Modified: incubator/derby/site/trunk/src/documentation/content/xdocs/papers/pageformats.xml
URL: http://svn.apache.org/viewcvs/incubator/derby/site/trunk/src/documentation/content/xdocs/papers/pageformats.xml?view=diff&r1=149474&r2=149475
==============================================================================
--- incubator/derby/site/trunk/src/documentation/content/xdocs/papers/pageformats.xml (original)
+++ incubator/derby/site/trunk/src/documentation/content/xdocs/papers/pageformats.xml Tue Feb 1 19:29:24 2005
@@ -1,372 +1,855 @@
<?xml version="1.0"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
-<document>
- <header>
+<document>
+ <header>
<title>Derby On Disk Page Format</title>
<abstract>This document describes the storage format of Derby disk pages.
+
This is a work-in-progress derived from Javadoc comments and
+
from explanations Mike Matrigali posted to the Derby lists.
+
Please post questions, comments, and corrections to
+
derby-dev@db.apache.org.
+
</abstract>
</header>
<body>
-
-
- <section id="introduction">
+ <section id="introduction">
<title> Introduction </title>
<p>Derby stores table and index data in Containers, which currently map
- to files in the
- <code>seg0</code>
- directory of the database. Data is stored in pages within the container.</p>
- <fixme author="Dibyendu Majumdar"> Do all containers map to a single file, or does each container map
- to a file? </fixme>
+
+ to files in the <code>seg0</code>
+
+ directory of the database. In the current Derby implementation there is a 1 to 1 mapping of
+
+ containers to files. Two containers never map to a single file and one
+
+ container never maps to multiple files.</p>
+ <p>
+
+ Data is stored in pages within the container.</p>
<p>A page contains a set of records, which can be accessed by "slot", which
+
defines the order of the records on the page, or by "id" which defines
+
the identity of the records on the page. Clients access records by both
+
slot and id, depending on their needs.</p>
- <p>There are two types of pages - Raw Stored Pages which hold data, and
- Raw Stored Alloc Pages which hold page allocation information.</p>
<p>A Table or a BTree index provides a row-based access mechanism (row-based
+
access interface is known as conglomerate). Rows are mapped to records
- in pages, in case of a table, a single row can span multiple records in
+
+ in data pages; in case of a table, a single row can span multiple records in
+
multiple pages.</p>
+ <p>A container can have three types of pages:</p>
+ <ul>
+ <li>Header Page - which is just a specialized version of the Alloc Page.</li>
+ <li>Data Pages which hold data, and</li>
+ <li>Alloc Pages which hold page allocation information. An Alloc page is a specialized version of the Data page.</li>
+ </ul>
+ <p>The container can be visualised as:<br/><img alt="" src="container-format.png"/></p>
+ <p>
+
+Header Page is currently always page 0 of the container. It
+
+contains information that raw store needs to maintain about the
+
+container once per container, and is currently implemented as an Alloc
+
+Page which "borrows" space from the alloc page for its information.
+
+The original decision was that the designers did not want to waste a whole page for
+
+header information, so a part of the page was used and the first allocation
+
+map was put on the second half of it. See <code>AllocPage.java</code> for info about layout and
+
+borrowing.
+
+</p>
+ <p>
+ <a href="#allocpage"> Allocation Page</a> - After page 0, all subsequent Allocation pages only
+
+have allocation bit maps.
+
+</p>
</section>
- <section id="storedpage">
+ <section id="storedpage">
<title>Data Page Format</title>
<p>A data page is broken into five sections.
- <img src="page-format.png" alt=""/>
- </p>
- <section id="formatid">
+
+ <img alt="" src="page-format.png"/></p>
+ <section id="formatid">
<title>Format Id </title>
<p> The formatId is a 4-byte array; it contains the format Id of this
+
page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.</p>
</section>
- <section id="pageheader">
+ <section id="pageheader">
<title> Page Header </title>
<p> The page header is a fixed size, 56 bytes. </p>
- <table>
- <tr>
- <th>Size</th>
- <th>Type</th>
- <th>Description</th>
- </tr>
- <tr>
- <td>1 byte</td>
- <td>boolean</td>
- <td>is page an overflow page</td>
- </tr>
- <tr>
- <td>1 byte</td>
- <td>byte</td>
- <td><p>page status is either VALID_PAGE or INVALID_PAGE(a field
+ <table>
+ <tr>
+ <th>Size</th>
+ <th>Type</th>
+ <th>Description</th>
+ </tr>
+ <tr>
+ <td>1 byte</td>
+ <td>boolean</td>
+ <td>is page an overflow page</td>
+ </tr>
+ <tr>
+ <td>1 byte</td>
+ <td>byte</td>
+ <td>
+ <p>page status is either VALID_PAGE or INVALID_PAGE(a field
+
maintained in base page)</p>
- <p>page goes thru the following transition:
+ <p>page goes through the following transitions:
+
<br/>
+
VALID_PAGE <-> deallocated page -> free page <->
+
VALID_PAGE</p>
- <p>deallocated and free page are both INVALID_PAGE as far as BasePage
+ <p>deallocated and free page are both INVALID_PAGE as far as BasePage
+
is concerned.
+
<br/>
+
When a page is deallocated, it transitions from VALID_PAGE
+
to INVALID_PAGE.
+
<br/>
+
When a page is allocated, it transitions from INVALID_PAGE
- to VALID_PAGE.</p></td>
- </tr>
- <tr>
- <td>8 bytes</td>
- <td>long</td>
- <td>pageVersion (a field maintained in base page)</td>
- </tr>
- <tr>
- <td>2 bytes</td>
- <td>unsigned short</td>
- <td>number of slots in slot offset table</td>
- </tr>
- <tr>
- <td>4 bytes</td>
- <td>integer</td>
- <td>next record identifier</td>
- </tr>
- <tr>
- <td>4 bytes</td>
- <td>integer</td>
- <td>generation number of this page (Future Use)</td>
- </tr>
- <tr>
- <td>4 bytes</td>
- <td>integer</td>
- <td>previous generation of this page (Future Use)</td>
- </tr>
- <tr>
- <td>8 bytes</td>
- <td>bipLocation</td>
- <td>the location of the beforeimage page (Future Use)</td>
- </tr>
- <tr>
- <td>2 bytes</td>
- <td>unsigned short</td>
- <td>number of deleted rows on page. (new release 2.0)</td>
- </tr>
- <tr>
- <td>2 bytes</td>
- <td>unsigned short</td>
- <td>% of the page to keep free for updates</td>
- </tr>
- <tr>
- <td>2 bytes</td>
- <td>short</td>
- <td>spare for future use</td>
- </tr>
- <tr>
- <td>4 bytes</td>
- <td>long</td>
- <td>spare for future use (encryption uses to write random bytes
+
+ to VALID_PAGE.</p>
+ </td>
+ </tr>
+ <tr>
+ <td>8 bytes</td>
+ <td>long</td>
+ <td>pageVersion (a field maintained in base page)</td>
+ </tr>
+ <tr>
+ <td>2 bytes</td>
+ <td>unsigned short</td>
+ <td>number of slots in slot offset table</td>
+ </tr>
+ <tr>
+ <td>4 bytes</td>
+ <td>integer</td>
+ <td>next record identifier</td>
+ </tr>
+ <tr>
+ <td>4 bytes</td>
+ <td>integer</td>
+ <td>generation number of this page (Future Use)</td>
+ </tr>
+ <tr>
+ <td>4 bytes</td>
+ <td>integer</td>
+ <td>previous generation of this page (Future Use)</td>
+ </tr>
+ <tr>
+ <td>8 bytes</td>
+ <td>bipLocation</td>
+ <td>the location of the beforeimage page (Future Use)</td>
+ </tr>
+ <tr>
+ <td>2 bytes</td>
+ <td>unsigned short</td>
+ <td>number of deleted rows on page. (new release 2.0)</td>
+ </tr>
+ <tr>
+ <td>2 bytes</td>
+ <td>unsigned short</td>
+ <td>% of the page to keep free for updates</td>
+ </tr>
+ <tr>
+ <td>2 bytes</td>
+ <td>short</td>
+ <td>spare for future use</td>
+ </tr>
+ <tr>
+ <td>4 bytes</td>
+ <td>long</td>
+ <td>spare for future use (encryption uses to write random bytes
+
here).</td>
- </tr>
- <tr>
- <td>8 bytes</td>
- <td>long</td>
- <td>spare for future use</td>
- </tr>
- <tr>
- <td>8 bytes</td>
- <td>long</td>
- <td>spare for future use</td>
- </tr>
- </table>
- <note>Spare space is guaranteed to be writen with "0", so that future
+ </tr>
+ <tr>
+ <td>8 bytes</td>
+ <td>long</td>
+ <td>spare for future use</td>
+ </tr>
+ <tr>
+ <td>8 bytes</td>
+ <td>long</td>
+ <td>spare for future use</td>
+ </tr>
+ </table>
+ <note>Spare space is guaranteed to be written with "0", so that future
+
use of field should either not use "0" as a valid data item or
+
pick 0 as a valid default value so that on the fly upgrade can assume
+
that 0 means field was never assigned. </note>
-
</section>
- <section id="records">
+ <section id="records">
<title> Records </title>
-
- <p>The records section contains zero or more records. Each record starts
+ <p>The records section contains zero or more records. Each record starts
+
with a Record Header</p>
- <table>
- <caption>Record Header</caption>
- <tr>
- <th>Type</th>
- <th>Description</th>
- </tr>
- <tr>
- <td>1 byte</td>
- <td> <p>Status bits for the record header</p>
- <table>
- <tr>
- <td>RECORD_INITIAL</td>
- <td>used when record header is first initialized</td>
- </tr>
- <tr>
- <td>RECORD_DELETED</td>
- <td>used to indicate the record has been deleted</td>
- </tr>
- <tr>
- <td>RECORD_OVERFLOW</td>
- <td>used to indicate the record has been overflowed, it will
+ <table>
+ <caption>Record Header</caption>
+ <tr>
+ <th>Type</th>
+ <th>Description</th>
+ </tr>
+ <tr>
+ <td>1 byte</td>
+ <td>
+ <p>Status bits for the record header</p>
+ <table>
+ <tr>
+ <td>RECORD_INITIAL</td>
+ <td>used when record header is first initialized</td>
+ </tr>
+ <tr>
+ <td>RECORD_DELETED</td>
+ <td>used to indicate the record has been deleted</td>
+ </tr>
+ <tr>
+ <td>RECORD_OVERFLOW</td>
+ <td>used to indicate the record has been overflowed, it will
+
point to the overflow page and ID</td>
- </tr>
- <tr>
- <td>RECORD_HAS_FIRST_FIELD</td>
- <td>used to indicate that firstField is stored will be stored.
+ </tr>
+ <tr>
+ <td>RECORD_HAS_FIRST_FIELD</td>
+ <td>used to indicate that firstField is stored.
+
When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both are
+
set, part of record is on the page, the record header also
+
stores the overflow point to the next part of the record.</td>
- </tr>
- <tr>
- <td>RECORD_VALID_MASK</td>
- <td>A mask of valid bits that can be set currently, such that
+ </tr>
+ <tr>
+ <td>RECORD_VALID_MASK</td>
+ <td>A mask of valid bits that can be set currently, such that
+
the following assert can be made: </td>
- </tr>
- </table></td>
- </tr>
- <tr>
- <td>compressed int</td>
- <td>record identifier</td>
- </tr>
- <tr>
- <td>compressed long</td>
- <td>overflow page only if RECORD_OVERFLOW is set</td>
- </tr>
- <tr>
- <td>compressed int</td>
- <td>overflow id only if RECORD_OVERFLOW is set</td>
- </tr>
- <tr>
- <td>compressed int</td>
- <td>first field only if RECORD_HAS_FIRST_FIELD is set - otherwise
+ </tr>
+ </table>
+ </td>
+ </tr>
+ <tr>
+ <td>compressed int</td>
+ <td>record identifier</td>
+ </tr>
+ <tr>
+ <td>compressed long</td>
+ <td>overflow page only if RECORD_OVERFLOW is set</td>
+ </tr>
+ <tr>
+ <td>compressed int</td>
+ <td>overflow id only if RECORD_OVERFLOW is set</td>
+ </tr>
+ <tr>
+ <td>compressed int</td>
+ <td>first field only if RECORD_HAS_FIRST_FIELD is set - otherwise
+
0</td>
- </tr>
- <tr>
- <td>compressed int</td>
- <td>number of fields in this portion - only if RECORD_OVERFLOW is
+ </tr>
+ <tr>
+ <td>compressed int</td>
+ <td>number of fields in this portion - only if RECORD_OVERFLOW is
+
false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0</td>
- </tr>
- </table>
- <note label="Long Rows"> A row is long if all of it's columns can't fit on a single page.
+ </tr>
+ </table>
+ <note label="Long Rows"> A row is long if all of its columns can't fit on a single page.
+
When storing a long row, the segment of the row which fits on the
+
page is left there, and a pointer column is added at the end of the
+
row. It points to another row in the same container on a different
+
page. That row will contain the next set of columns and a continuation
+
pointer if necessary. The overflow portion will be on an "overflow"
+
page, and that page may have overflow portions of other rows on it
+
(unlike overflow columns). </note>
- <p>The Record Header is followed by one or more fields. Each field contains
+ <p>The Record Header is followed by one or more fields. Each field contains
+
a Field Header and optional Field Data.</p>
- <table>
- <caption>Stored Field Header Format</caption>
- <tr>
- <td>status</td>
- <td> <p> The status is 1 byte, it indicates the state of the field.
+ <table>
+ <caption>Stored Field Header Format</caption>
+ <tr>
+ <td>status</td>
+ <td>
+ <p> The status is 1 byte, it indicates the state of the field.
+
A FieldHeader can be in the following states: </p>
- <table>
- <tr>
- <td>NULL</td>
- <td>if the field is NULL, no field data length is stored</td>
- </tr>
- <tr>
- <td>OVERFLOW</td>
- <td>indicates the field has been overflowed to another page.
+ <table>
+ <tr>
+ <td>NULL</td>
+ <td>if the field is NULL, no field data length is stored</td>
+ </tr>
+ <tr>
+ <td>OVERFLOW</td>
+ <td>indicates the field has been overflowed to another page.
+
overflow page and overflow ID is stored at the end of
+
the user data. field data length must be a number greater
+
or equal to 0, indicating the length of the field that
+
is stored on the current page. The format looks like this:
- <img src="field-header-overflow.png" alt=""/>
+
+ <img alt="" src="field-header-overflow.png"/>
+
overflowPage will be written as compressed long, overflowId
+
will be written as compressed Int</td>
- </tr>
- <tr>
- <td>NONEXISTENT</td>
- <td>the field no longer exists, e.g. column has been dropped
+ </tr>
+ <tr>
+ <td>NONEXISTENT</td>
+ <td>the field no longer exists, e.g. column has been dropped
+
during an alter table</td>
- </tr>
- <tr>
- <td>EXTENSIBLE</td>
- <td>the field is of user defined data type. The field may
+ </tr>
+ <tr>
+ <td>EXTENSIBLE</td>
+ <td>the field is of user defined data type. The field may
+
be tagged.</td>
- </tr>
- <tr>
- <td>TAGGED</td>
- <td>the field is TAGGED if and only if it is EXTENSIBLE.</td>
- </tr>
- <tr>
- <td>FIXED</td>
- <td>the field is FIXED if and only if it is used in the
+ </tr>
+ <tr>
+ <td>TAGGED</td>
+ <td>the field is TAGGED if and only if it is EXTENSIBLE.</td>
+ </tr>
+ <tr>
+ <td>FIXED</td>
+ <td>the field is FIXED if and only if it is used in the
+
log records for version 1.2 and higher.</td>
- </tr>
- </table>
- </td>
- </tr>
- <tr>
- <td>fieldDataLength</td>
- <td> The fieldDataLength is only set if the field is not NULL. It
+ </tr>
+ </table>
+ </td>
+ </tr>
+ <tr>
+ <td>fieldDataLength</td>
+ <td> The fieldDataLength is only set if the field is not NULL. It
+
is the length of the field that is stored on the current page.
+
The fieldDataLength is a variable length CompressedInt. </td>
- </tr>
- <tr>
- <td>fieldData</td>
- <td><p> Overflow page and overflow id are stored as field data. If
+ </tr>
+ <tr>
+ <td>fieldData</td>
+ <td>
+ <p> Overflow page and overflow id are stored as field data. If
+
the overflow bit in status is set, the field data is the overflow
+
information. When the overflow bit is not set in status, then,
+
fieldData is the actual user data for the field. That means the
+
field header consists only of the field status and field data length.
+
<br/>
+
A non-overflow field:
- <br/> <img src="field-header-non-overflow.png" alt=""/> <br/>
+
+ <br/><img alt="" src="field-header-non-overflow.png"/><br/>
+
An overflow field:
- <br/> <img src="field-header-overflow.png" alt=""/> <br/> <strong>overflowPage
- and overflowID</strong> <br/>
+
+ <br/><img alt="" src="field-header-overflow.png"/><br/><strong>overflowPage
+
+ and overflowID</strong><br/>
+
The overflowPage is a variable length CompressedLong, overflowID
+
is a variable Length CompressedInt. They are only stored when
+
the field state is OVERFLOW. And they are not stored in the field
+
header. Instead, they are stored at the end of the field data.
+
The reason we do that is to save a copy if the field has to overflow. </p>
- </td>
- </tr>
- </table>
- <note label="Long Columns"> A column is long if it can't fit on a single page. A long column
+ </td>
+ </tr>
+ </table>
+ <note label="Long Columns"> A column is long if it can't fit on a single page. A long column
+
is marked as long in the base row, and its field contains a pointer
+
to a chain of other rows in the same container which contain the data
+
of the row. Each of the subsequent rows is on a page to itself. Each
+
subsequent row, except for the last piece, has 2 columns: the first
+
is the next segment of the row and the second is the pointer to
+
the following segment. The last segment only has the data segment.
+
</note>
-
</section>
- <section id="slottable">
+ <section id="slottable">
<title>Slot Offset Table</title>
<p>The slot offset table is a table of 6 or 12 bytes per record, depending
+
on the pageSize being less or greater than 64K: </p>
- <table>
- <caption>Slot Table Record</caption>
- <tr>
- <th>Size</th>
- <th>Content</th>
- </tr>
- <tr>
- <td>2 bytes (unsigned short) or 4 bytes (int)</td>
- <td>page offset for the record that is assigned to the slot</td>
- </tr>
- <tr>
- <td>2 bytes (unsigned short) or 4 bytes (int)</td>
- <td>the length of the record on this page.</td>
- </tr>
- <tr>
- <td>2 bytes (unsigned short) or 4 bytes (int)</td>
- <td>the length of the reserved number of bytes for this record on
+ <table>
+ <caption>Slot Table Record</caption>
+ <tr>
+ <th>Size</th>
+ <th>Content</th>
+ </tr>
+ <tr>
+ <td>2 bytes (unsigned short) or 4 bytes (int)</td>
+ <td>page offset for the record that is assigned to the slot</td>
+ </tr>
+ <tr>
+ <td>2 bytes (unsigned short) or 4 bytes (int)</td>
+ <td>the length of the record on this page.</td>
+ </tr>
+ <tr>
+ <td>2 bytes (unsigned short) or 4 bytes (int)</td>
+ <td>the length of the reserved number of bytes for this record on
+
this page.</td>
- </tr>
- </table>
- <p>
+ </tr>
+ </table>
+ <p>
+
First slot is slot 0. The slot table grows backwards. Slots are never
+
left empty. </p>
</section>
- <section id="checksum">
+ <section id="checksum">
<title>Checksum</title>
<p>8 bytes of a java.util.zip.CRC32 checksum of the entire page's contents
+
without the 8 bytes representing the checksum.</p>
</section>
- </section>
- <section id="allocpage">
+ </section>
+ <section id="allocpage">
<title>Allocation Page</title>
<p> An allocation page of the file container extends a normal Stored page,
+
with the exception that a hunk of space may be 'borrowed' by the file
+
container to store the file header.</p>
<p> The borrowed space is not visible to the alloc page even though it is
+
present in the page data array. It is accessed directly by the FileContainer.
+
Any change made to the borrowed space is not managed or seen by the allocation
+
page.</p>
<p> The reason for having this borrowed space is so that the container header
+
does not need to have a page of its own. </p>
- <p>
+ <p>
<strong>Page Format</strong>
<br/>
+
An allocation page extends a stored page, the on disk format is different
+
from a stored page in that N bytes are 'borrowed' by the container and
+
the page header of an allocation page will be slightly bigger than a normal
+
stored page. These N bytes are stored between the page header and the record
+
space.</p>
<p> The reason why these N bytes can't simply be a row is because they need
+
to be statically accessible by the container object to avoid a chicken
+
and egg problem of the container object needing to instantiate an alloc
+
page object before it can be objectified, and an alloc page object needing
+
to instantiate a container object before it can be objectified. So these
+
N bytes must be stored outside of the normal record interface yet they must
+
be settable because only the first alloc page has this borrowed space.
+
Other (non-first) alloc pages have N == 0.
- <br/>
- <img src="alloc-page.png" alt=""/>
+
+ <br/><img alt="" src="alloc-page.png"/></p>
+ <p>
+
+ N is a byte that indicates the size of the borrowed space. Once an alloc
+
+ page is initialized, the value of N cannot change.
+
+ </p>
+ <p>
+
+ The maximum space that can be borrowed by the container is 256 bytes.
+
+ </p>
+ <p>
+
+ The allocation pages are of the same page size as any other pages in the
+
+ container. The first allocation page of the FileContainer starts at the
+
+ first physical byte of the container. Subsequent allocation pages are
+
+ chained via the nextAllocPageOffset. Each allocation page is expected to
+
+ manage at least 1000 user pages (for 1K page size) so this chaining may not
+
+ be a severe performance hit. The logical -> physical mapping of an
+
+ allocation page is stored in the previous allocation page. The container
+
+ object will need to maintain this mapping.</p>
+ <p>
+
+ The following fields are stored in the page header:
+
+ </p>
+ <table>
+ <caption>
+
+ Format of Alloc Page
+
+ </caption>
+ <tr>
+ <th>
+
+ Type
+
+ </th>
+ <th>
+
+ Description
+
+ </th>
+ </tr>
+ <tr>
+ <td>
+
+ int
+
+ </td>
+ <td>
+
+ FormatId
+
+ </td>
+ </tr>
+ <tr>
+ <td>StoredPageHeader</td>
+ <td>see <a href="#pageheader">Stored Page Header</a></td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>nextAllocPageNumber - the next allocation page's number</td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>nextAllocPageOffset - the file offset of the next allocation page</td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>reserved1 - reserved for future usage</td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>reserved2 - reserved for future usage</td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>reserved3 - reserved for future usage</td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>reserved4 - reserved for future usage</td>
+ </tr>
+ <tr>
+ <td>byte</td>
+ <td>N - the size of the borrowed container info</td>
+ </tr>
+ <tr>
+ <td>byte[N]</td>
+ <td>containerInfo - the content of the borrowed container info</td>
+ </tr>
+ <tr>
+ <td>AllocExtent</td>
+ <td>The one and only extent on this alloc page.</td>
+ </tr>
+ </table>
+ <p>
+
+ The allocation page contains allocation extent rows. In this first cut
+
+ implementation, there is only 1 allocation extent row per allocation page.
+
+ </p>
+ <p>
+
+ The allocation extent row is an externalizable object and is directly
+
+ written on to the page by the alloc page. In other words, it will not be
+
+ converted into a storeableRow. This is to cut down overhead, enhance
+
+ performance and give more control of the size and layout of the allocation
+
+ extent row to the alloc page.
+
+ </p>
+ <section>
+ <title>
+
+ Alloc Page detailed implementation notes</title>
+ <p>
+
+ Create Container - an embryonic allocation page is formatted on disk by a
+
+ special static function to avoid instantiating a full AllocPage object.
+
+ This embryonic allocation has enough information that it can find the
+
+ file header and not much else. Then the allocation page is properly
+
+ initialized by creating the first extent.
+
+ </p>
+ <p>
+
+ Open Container - A static AllocPage method will be used to read off the
+
+ container information directly from disk. Even if
+
+ the first alloc page (page 0) is already in the page cache, it will not be
+
+ used because cleaning the alloc page will introduce a deadlock if the
+
+ container is not in the container cache. Long term, the first alloc page
+
+ should probably live in the container cache rather than in the page cache.
+
+ </p>
+ <p>
+
+ Get Page - The first alloc page (page 0) will be read into the page cache.
+
+ Continue to follow the alloc page chain until the alloc page that manages
+
+ the specified page is found. From the alloc page, the physical offset of
+
+ the specified page is located.
+
+ </p>
+ <p>
+
+ Cleaning alloc page - the alloc page is written out the same way any page
+
+ is written out. The container object will provide a call back to the alloc
+
+ page to write the current version of the container object back into the
+
+ borrowed space before the alloc page itself is written out.
+
+ </p>
+ <p>
+
+ Cleaning the container object - get the first alloc page, dirty it and
+
+ clean it (which will cause it to call the container object to write itself
+
+ out into the borrowed space). The versioning of the container is
+
+ independent of the versioning of the alloc page. The container version is
+
+ stored inside the borrowed space and is opaque to the alloc page.
+
+ </p>
+ <p>The fields of an allocation extent row are described in the Allocation Extent section that follows.</p>
+ </section>
+ </section>
+ <section>
+ <title>Allocation Extent</title>
+ <p>
+
+ An allocation extent row manages the page status of pages in the extent.
+
+ AllocExtent is externalizable and is written to the AllocPage directly,
+
+ without being converted to a row first.
+
+ </p>
+ <table>
+ <caption>Format of Allocation Extent</caption>
+ <tr>
+ <th>Type</th>
+ <th>Description</th>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>extentOffset - the begin physical byte offset of the first page of this extent</td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>extentStart - the first logical page managed by this extent.</td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>extentEnd - the last page this extent can ever hope to manage.</td>
+ </tr>
+ <tr>
+ <td>int</td>
+ <td>extentLength - the number of pages allocated in this extent</td>
+ </tr>
+ <tr>
+ <td>int</td>
+ <td>
+ <p>extentStatus - status bits for the whole extent.
+
+ <br/>HAS_DEALLOCATED - most likely, this extent has a deallocated
+
+ page somewhere. If !HAS_DEALLOCATED, the extent has no deallocated page.
+
+ <br/>HAS_FREE - most likely, this extent has a free page somewhere.
+
+ If !HAS_FREE, there is no free page in the extent.
+
+ <br/>ALL_FREE - most likely, this extent only has free pages, good
+
+ candidate for shrinking the file.
+
+ If !ALL_FREE, the extent is not all free.
+
+ <br/>HAS_UNFILLED_PAGES - most likely, this extent has unfilled pages.
+
+ If !HAS_UNFILLED_PAGES, all pages are filled.
+
+ <br/>KEEP_UNFILLED_PAGES - this extent keeps track of unfilled pages
+
+ (post v1.3). If not set, this extent has no notion of
+
+ unfilled page and has no unFilledPage bitmap.
+
+ <br/>NO_DEALLOC_PAGE_MAP - this extent does not have separate dealloc and
+
+ free page bit maps. Prior to 2.0, there were 2 bit
+
+ maps, a deallocate page bit map and a free page bit
+
+ map. Cloudscape 2.0 and later merged the dealloc page
+
+ bit map into the free page bit map.
+
+ <br/>RETIRED - this extent contains only 'retired' pages, never use
+
+ any page from this extent. The pages don't actually
+
+ exist, i.e., it maps to nothing (physicalOffset is
+
+ garbage). The purpose of this extent is to blot out a
+
+ range of logical page numbers that no longer exists
+
+ for this container. Use this to reuse a physical page
+
+ when a logical page has exhausted all recordIds or for
+
+ logical pages that have been shrunk out.
+
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td>int</td>
+ <td>preAllocLength - the number of pages that have been preallocated</td>
+ </tr>
+ <tr>
+ <td>int</td>
+ <td>reserved1</td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>reserved2 - reserved for future use</td>
+ </tr>
+ <tr>
+ <td>long</td>
+ <td>reserved3 - reserved for future use</td>
+ </tr>
+ <tr>
+ <td>FreePages(bit)</td>
+ <td>Bitmap of free pages. Bit[i] is ON if page is free for immediate (re)use.</td>
+ </tr>
+ <tr>
+ <td>unFilledPages(bit)</td>
+ <td>Bitmap of pages that have free space. Bit[i] is ON if page is likely to be &lt; 1/2 full.</td>
+ </tr>
+ </table>
+ <p>
+
+ org.apache.derby.iapi.services.io.FormatableBitSet is used to store the bit map.
+
+ FormatableBitSet is an externalizable class.
+
+ </p>
+ <p>
+
+ A page can have the following logical state:
+
+ <br/>Free - a page that is free to be used
+
+ <br/>Valid - a page that is currently in use
+
+ </p>
+ <p>
+
+ There is also a transitional kind of page: pages that have been
+
+ allocated on disk but have not yet been used. These pages are Free.
+
+ </p>
+ <p>
+
+ Bit[K] freePages
+
+ Bit[i] is ON iff page i may be free for reuse. User must get the
+
+ dealloc page lock on the free page to make sure the transaction.
+
+ </p>
+ <p>
+
+ K is the size of the bit array; it must be >= length.
+
</p>
</section>
</body>
- <footer>
- <legal>This is a legal notice, so it is
- <strong>important</strong>
- .</legal>
+ <footer>
+ <legal></legal>
</footer>
</document>
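
The Checksum section of the document above says each page carries 8 bytes of a java.util.zip.CRC32 checksum computed over the page contents without those 8 bytes. A minimal Java sketch of that rule follows. It is not Derby's code: the class name `PageChecksumSketch` and its methods are hypothetical, and it assumes the checksum occupies the last 8 bytes of the page and is stored as a big-endian long, neither of which the document states explicitly.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical illustration only. Assumptions: the checksum is the final
// 8 bytes of the page and is stored as a big-endian long.
final class PageChecksumSketch {
    static final int CHECKSUM_SIZE = 8;

    // CRC32 over everything except the trailing checksum bytes.
    static long compute(byte[] page) {
        CRC32 crc = new CRC32();
        crc.update(page, 0, page.length - CHECKSUM_SIZE);
        return crc.getValue();
    }

    // Store the checksum in the last 8 bytes of the page.
    static void write(byte[] page) {
        ByteBuffer.wrap(page).putLong(page.length - CHECKSUM_SIZE, compute(page));
    }

    // Recompute the checksum and compare it with the stored value.
    static boolean verify(byte[] page) {
        long stored = ByteBuffer.wrap(page).getLong(page.length - CHECKSUM_SIZE);
        return stored == compute(page);
    }
}
```

A caller would invoke `write` just before a page goes to disk and `verify` right after it is read back; a mismatch suggests a torn or corrupted page.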