You are viewing a plain text version of this content. The canonical link for it is here.
Posted to general@db.apache.org by rh...@apache.org on 2012/12/19 19:20:28 UTC
svn commit: r843115 [25/44] - in /websites/production/db/content/derby: ./ binaries/ blogs/ blogs/images/ dev/ docs/ images/ integrate/ integrate/plugin_help/ integrate/plugin_help/images/ logo/ manuals/ papers/ papers/DerbyTut/ releases/ skin/ skin/cs...

Added: websites/production/db/content/derby/papers/logformats.html
==============================================================================
--- websites/production/db/content/derby/papers/logformats.html (added)
+++ websites/production/db/content/derby/papers/logformats.html Wed Dec 19 18:20:21 2012
@@ -0,0 +1,967 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<meta content="Apache Forrest" name="Generator">
+<meta name="Forrest-version" content="0.8">
+<meta name="Forrest-skin-name" content="pelt">
+<title>Derby Write Ahead Log Format</title>
+<link type="text/css" href="../skin/basic.css" rel="stylesheet">
+<link media="screen" type="text/css" href="../skin/screen.css" rel="stylesheet">
+<link media="print" type="text/css" href="../skin/print.css" rel="stylesheet">
+<link type="text/css" href="../skin/profile.css" rel="stylesheet">
+<script src="../skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="../skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="../skin/fontsize.js" language="javascript" type="text/javascript"></script>
+<link rel="shortcut icon" href="../">
+</head>
+<body onload="init()">
+<script type="text/javascript">ndeSetTextSize();</script>
+<div id="top">
+<!--+
+    |breadtrail
+    +-->
+<div class="breadtrail">
+<a href="http://www.apache.org/">apache</a> &gt; <a href="http://db.apache.org/">db</a><script src="../skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
+</div>
+<!--+
+    |header
+    +-->
+<div class="header">
+<!--+
+    |start group logo
+    +-->
+<div class="grouplogo">
+<a href="http://db.apache.org/derby"><img class="logoImage" alt="Apache Derby" src="../images/derby-logo-web.png" title="Derby is a zero-admin Java RDBMS"></a>
+</div>
+<!--+
+    |end group logo
+    +-->
+<!--+
+    |start Project Logo
+    +-->
+<div class="projectlogoA1">
+<a href="http://db.apache.org"><img class="logoImage" alt="Apache DB Project" src="../images/db-logo-white.png" title="Apache DB creates and maintains database solutions."></a>
+</div>
+<!--+
+    |end Project Logo
+    +-->
+<!--+
+    |start Tabs
+    +-->
+<ul id="tabs">
+<li>
+<a class="unselected" href="../index.html">Home</a>
+</li>
+<li>
+<a class="unselected" href="../quick_start.html">Quick Start</a>
+</li>
+<li>
+<a class="unselected" href="../derby_downloads.html">Download</a>
+</li>
+<li>
+<a class="unselected" href="../derby_comm.html">Community</a>
+</li>
+<li>
+<a class="unselected" href="../manuals/index.html">Documentation</a>
+</li>
+<li class="current">
+<a class="selected" href="../blogs/index.html">Resources</a>
+</li>
+</ul>
+<!--+
+    |end Tabs
+    +-->
+</div>
+</div>
+<div id="main">
+<div id="publishedStrip">
+<!--+
+    |start Subtabs
+    +-->
+<div id="level2tabs"></div>
+<!--+
+    |end Endtabs
+    +-->
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+//  --></script>
+</div>
+<!--+
+    |breadtrail
+    +-->
+<div class="breadtrail">
+
+             &nbsp;
+           </div>
+<!--+
+    |start Menu, mainarea
+    +-->
+<!--+
+    |start Menu
+    +-->
+<div id="menu">
+<div onclick="SwitchMenu('menu_1.1', '../skin/')" id="menu_1.1Title" class="menutitle">Blogs and Articles About Derby</div>
+<div id="menu_1.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#blogs">Blogs</a>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3', '../skin/')" id="menu_1.1.3Title" class="menutitle">Articles</div>
+<div id="menu_1.1.3" class="menuitemgroup">
+<div onclick="SwitchMenu('menu_1.1.3.1', '../skin/')" id="menu_1.1.3.1Title" class="menutitle">Tutorials, Tips and Tuning</div>
+<div id="menu_1.1.3.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#getstarted">Getting Started</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#features">Features, Hints and Tips</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#security">Security</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#performance">Performance and Tuning</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3.2', '../skin/')" id="menu_1.1.3.2Title" class="menutitle">Tools and Migration</div>
+<div id="menu_1.1.3.2" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#tools">Tools</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#migration">Migration</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3.3', '../skin/')" id="menu_1.1.3.3Title" class="menutitle">Applications</div>
+<div id="menu_1.1.3.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#client">Client</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#middletier">Middle Tier</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#persistence">Persistence</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#scalability">Scalability and Failover</a>
+</div>
+</div>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.2', '../skin/')" id="menu_1.2Title" class="menutitle">Integration With Other Products</div>
+<div id="menu_1.2" class="menuitemgroup">
+<div class="menuitem">
+<a href="../integrate/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../integrate/index.html#uses">What works with Derby?</a>
+</div>
+<div class="menuitem">
+<a href="../integrate/index.html#products">Product Writeups</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.3', '../skin/')" id="menu_1.3Title" class="menutitle">Eclipse Plug-ins</div>
+<div id="menu_1.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../integrate/derby_plugin_info.html">Info</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_selected_1.4', '../skin/')" id="menu_selected_1.4Title" class="menutitle" style="background-image: url('../skin/images/chapter_open.gif');">Papers and Presentations</div>
+<div id="menu_selected_1.4" class="selectedmenuitemgroup" style="display: block;">
+<div class="menuitem">
+<a href="../papers/index.html">Overview</a>
+</div>
+<div onclick="SwitchMenu('menu_selected_1.4.2', '../skin/')" id="menu_selected_1.4.2Title" class="menutitle" style="background-image: url('../skin/images/chapter_open.gif');">Derby Engine</div>
+<div id="menu_selected_1.4.2" class="selectedmenuitemgroup" style="display: block;">
+<div onclick="SwitchMenu('menu_1.4.2.1', '../skin/')" id="menu_1.4.2.1Title" class="menutitle">Javadoc</div>
+<div id="menu_1.4.2.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/engine">Engine</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/language">Language</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/tools">Tools</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/publishedapi">API</a>
+</div>
+</div>
+<div class="menuitem">
+<a href="../papers/derby_arch.html">Architecture</a>
+</div>
+<div class="menuitem">
+<a href="../papers/btree_package.html">BTree</a>
+</div>
+<div class="menuitem">
+<a href="../papers/pageformats.html">Disk Page Format</a>
+</div>
+<div class="menuitem">
+<a href="../papers/derby_htw.html">How Things Work</a>
+</div>
+<div class="menuitem">
+<a href="../papers/Intersect-design.html">Intersect &amp; Except</a>
+</div>
+<div class="menuitem">
+<a href="../papers/JDBCImplementation.html">JDBC</a>
+</div>
+<div class="menupage">
+<div class="menupagetitle">Log Format</div>
+</div>
+<div class="menuitem">
+<a href="../papers/recovery.html">Logging &amp; Recovery</a>
+</div>
+<div class="menuitem">
+<a href="../papers/optimizer.html">Optimizer</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/engine/org/apache/derby/iapi/types/package-summary.html#package_description">Type System</a>
+</div>
+<div class="menuitem">
+<a href="../papers/versionupgrade.html">Versioning</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.3', '../skin/')" id="menu_1.4.3Title" class="menutitle">Derby Network Client</div>
+<div id="menu_1.4.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/DerbyClientSpec.html">Functional Spec</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.4', '../skin/')" id="menu_1.4.4Title" class="menutitle">Derby Tutorial</div>
+<div id="menu_1.4.4" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/DerbyTut/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/install_software.html">Step 1: Install Software</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/ij_intro.html">Step 2: ij Basics</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/embedded_intro.html">Step 3: Embedded Derby</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/ns_intro.html">Step 4: Derby Network Server</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.5', '../skin/')" id="menu_1.4.5Title" class="menutitle">Presentations</div>
+<div id="menu_1.4.5" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/ApacheCon.html">ApacheCon</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#Victorian+Java+User+Group">Victorian JUG 2008</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#OSCON+2005">OSCON 2005</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#Colorado+Software+Summit+2004">Colorado 2004</a>
+</div>
+</div>
+</div>
+<!--+
+    |start Search
+    +-->
+<div class="searchbox">
+<hr>
+<form action="http://www.google.com/search" method="get">
+<input value="db.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="18" name="q" id="query" type="text" value="Search the site with google">&nbsp; 
+                  <input name="Search" value="Search" type="submit">
+</form>
+</div>
+<!--+
+    |end search
+    +-->
+<div id="credit"></div>
+<div id="roundbottom">
+<img style="display: none" class="corner" height="15" width="15" alt="" src="../skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
+<!--+
+  |alternative credits
+  +-->
+<div id="credit2"></div>
+</div>
+<!--+
+    |end Menu
+    +-->
+<!--+
+    |start content
+    +-->
+<div id="content">
+<div class="trail">Font size: 
+	          &nbsp;<input value="Reset" class="resetfont" title="Reset text" onclick="ndeSetTextSize('reset'); return false;" type="button">      
+	          &nbsp;<input value="-a" class="smallerfont" title="Shrink text" onclick="ndeSetTextSize('decr'); return false;" type="button">
+	          &nbsp;<input value="+a" class="biggerfont" title="Enlarge text" onclick="ndeSetTextSize('incr'); return false;" type="button">
+</div>
+<h1>Derby Write Ahead Log Format</h1>
+<div class="abstract">This document describes the storage format of Derby Write Ahead 
+        Log. This is a work-in-progress derived from Javadoc comments and from 
+        explanations Mike Matrigali and others posted to the Derby lists. Please 
+        post questions, comments, and corrections to derby-dev@db.apache.org. 
+      </div>
+<div id="minitoc-area">
+<ul class="minitoc">
+<li>
+<a href="#introduction"> Introduction </a>
+</li>
+<li>
+<a href="#References"> References </a>
+</li>
+<li>
+<a href="#Derby+implementation+of+the+Write+Ahead+Log">Derby implementation of the Write Ahead Log</a>
+<ul class="minitoc">
+<li>
+<a href="#LogCounter">LogCounter</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Format+of+Write+Ahead+Log"> Format of Write Ahead Log </a>
+<ul class="minitoc">
+<li>
+<a href="#Format+of+Log+Control+File">Format of Log Control File</a>
+</li>
+<li>
+<a href="#Format+of+the+log+file">Format of the log file</a>
+</li>
+<li>
+<a href="#Format+of+the+log+record+wrapper">Format of the log record wrapper</a>
+</li>
+<li>
+<a href="#The+format+of+a+log+record">The format of a log record</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Pointers+to+relevant+classes">Pointers to relevant classes</a>
+</li>
+</ul>
+</div> 
+      
+<a name="N10010"></a><a name="introduction"></a>
+<h2 class="boxed"> Introduction </h2>
+<div class="section">
+<p> Derby uses a Write Ahead Log to record all changes to the database. 
+          The Write Ahead Log (WAL) protocol requires the following rules to be 
+          followed: </p>
+<ol> 
+          
+<li>A page must be latched exclusively before it can be updated.</li>
+          
+<li>While the latch is held, the update must be logged, and page must 
+            be tagged with the identity of the log record (often known as Log 
+            Sequence Number or LSN)</li>
+          
+<li>When the page is about to be written to persistent storage, all 
+            logs records up to and including the page's LSN, must be forced to 
+            disk.</li>
+          
+<li>Once the log records have been forced to disk, the cached page may 
+            be written to persistent storage, overwriting the previous version 
+            of the page.</li>
+        
+</ol>
+<p>The WAL protocol ensures that in the event of a system crash, databases 
+          pages can be restored to a consistent state using the information contained 
+          in the log records. How this is done will be the subject of another 
+          paper.</p>
+</div>
+      
+<a name="N1002C"></a><a name="References"></a>
+<h2 class="boxed"> References </h2>
+<div class="section">
+<p> A good description of Write Ahead Logging, and how a log is typically 
+          implemented, can be found in 
+          <em> 
+            <a class="external" href="http://portal.acm.org/citation.cfm?id=573304">Transaction 
+              Processing: Concepts and Techniques</a>
+            , by Jim Gray and Andreas Reuter, 1993, Morgan Kaufmann Publishers</em>
+          .</p>
+</div>
+      
+<a name="N1003D"></a><a name="Derby+implementation+of+the+Write+Ahead+Log"></a>
+<h2 class="boxed">Derby implementation of the Write Ahead Log</h2>
+<div class="section">
+<p> Derby implements the Write Ahead Log using a non-circular file system 
+          file. Here are some comments about current implementation of recovery:</p>
+<p class="quote">
+          
+<em>Suresh Thalamati</em>
+<br>
+          Derby supports simple media recovery. It has support for full backup/restore 
+          and very basic form of rollforward recovery (replay of logs using backup 
+          and archived log files). </p>
+<p class="quote">
+          
+<em>Mike Matrigali</em>
+<br>
+            1. Derby fully supports crash recovery, it uses java to correctly 
+              sync the log file to support this.<br>
+            2. I would say derby supports media recovery. One can make a backup 
+              of the data and store it off line. Logs can be stored on a separate 
+              disk from the data, and if you lose your data disk then you can 
+              use rollforward recovery on the existing logs and the copy of the 
+              backup to bring your database up to the current point in time.<br>
+            3. Derby does not support "point in time recovery". Someone may want 
+              to look at this in the future. Technically I don't think it would 
+              be very hard as the logging system has the stuff to solve the hard 
+              problems. It does not have an idea about "time" - it just knows 
+              log sequence numbers, so need to figure out what kind of interface 
+              a user really wants. A very user unfriendly interface would not 
+              be very hard to implement which would be recover to a specific log 
+              sequence number. Anyone interested in this feature should add it 
+              to jira - I'll be happy to add technical comments on what needs 
+              to be done.<br>
+            4. A reasonable next step in derby recovery progress would be to 
+              add a way to automatically move/copy log files offline as they are 
+              not needed by crash recovery and only needed for media recovery. 
+              Some sort of java stored procedure callout would seem most appropriate.
+        </p>
+<p> The 'log' is a stream of log records. The 'log' is implemented as 
+          a series of numbered log files. These numbered log files are logically 
+          continuous so a transaction can have log records that span multiple 
+          log files. A single log record cannot span more than one log file. The 
+          log file number is monotonically increasing. </p>
+<p> The log belongs to a log factory of a RawStore. In the current implementation, 
+          each RawStore only has one log factory, so each RawStore only has one 
+          log (which is composed of multiple log files). At any given time, a 
+          log factory only writes new log records to one log file, this log file 
+          is called the 'current log file'. </p>
+<p> A log file is named log 
+          <em>logNumber</em>
+          .dat </p>
+<p>With the default values, a new log file is created (this is known as 
+          log switch) when a log file grows beyond 1MB and a checkpoint happens 
+          when the amount of log written is 10MB or more from the last checkpoint.</p>
+<p> RawStore exposes a checkpoint method which clients can call, or a 
+          checkpoint is taken automatically by the RawStore when: </p>
+<ol> 
+          
+<li> The log file grows beyond a certain size (configurable, default 
+            1MB)</li>
+          
+<li> RawStore is shutdown and a checkpoint hasn't been done "for a while"</li>
+          
+<li> RawStore is recovered and a checkpoint hasn't been done "for a 
+            while"</li>
+        
+</ol>
+<a name="N1007C"></a><a name="LogCounter"></a>
+<h3 class="boxed">LogCounter</h3>
+<p>Log records are identified using LogCounter, which is an implementation 
+            of LogInstant, a Derby term for LSN. The LogCounter is made up of 
+            the log file number, and the byte offset of the log record within 
+            the log file. Within the stored log record a log counter is represented 
+            as a long. Outside the LogFactory the instant is passed around as 
+            a LogCounter (through its LogInstant interface).</p>
+<p> The way the long is encoded is such that &lt; == &gt; correctly 
+            tells if one log instant is lessThan, equals or greater than another.</p>
+</div>
+      
+<a name="N1008A"></a><a name="Format+of+Write+Ahead+Log"></a>
+<h2 class="boxed"> Format of Write Ahead Log </h2>
+<div class="section">
+<p> An implementation of file based log is in 
+          <span class="codefrag">org.apache.derby.impl.store.raw.log.LogToFile</span>. 
+          This LogFactory is responsible for the formats of 2 kinds of file: 
+          the log file and the log control file. And it is responsible for the 
+          format of the log record wrapper. </p>
+<a name="N10096"></a><a name="Format+of+Log+Control+File"></a>
+<h3 class="boxed">Format of Log Control File</h3>
+<p>The log control file contains information about which log files are 
+            present and where the last checkpoint log record is located.</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4"> 
+            
+<tr> 
+              
+<th colspan="1" rowspan="1">Type</th>
+              <th colspan="1" rowspan="1">Desciption</th>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">format id set to FILE_STREAM_LOG_FILE</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">obsolete log file version</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">long</td>
+              <td colspan="1" rowspan="1">the log instant (LogCounter) of the last completed checkpoint</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">JBMS (older name for Cloudscape/Derby) version</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">checkpoint interval</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">long</td>
+              <td colspan="1" rowspan="1">spare (value set to 0)</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">long</td>
+              <td colspan="1" rowspan="1">spare (value set to 0)</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">long</td>
+              <td colspan="1" rowspan="1">spare (value set to 0)</td>
+            
+</tr>
+          
+</table>
+<a name="N10118"></a><a name="Format+of+the+log+file"></a>
+<h3 class="boxed">Format of the log file</h3>
+<p>The log file contains log records which record all the changes to 
+            the database. The complete transaction log is composed of a series 
+            of log files.</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4"> 
+            
+<tr> 
+              
+<th colspan="1" rowspan="1">Type</th>
+              <th colspan="1" rowspan="1">Description</th>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">Format id of this log file, set to FILE_STREAM_LOG_FILE.</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">Obsolete log file version - not used</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">long</td>
+              <td colspan="1" rowspan="1">Log file number - this number orders the log files in a series 
+                to form the complete transaction log </td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">long</td>
+              <td colspan="1" rowspan="1">PrevLogRecord - log instant of the previous log record, in the 
+                previous log file.</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">[log record wrapper]*</td>
+              <td colspan="1" rowspan="1">one or more log records with wrapper</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">EndMarker - value of zero. The beginning of a log record wrapper 
+                is the length of the log record, therefore it is never zero </td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">[int fuzzy end]*</td>
+              <td colspan="1" rowspan="1">zero or more int's of value 0, in case this log file has been 
+                recovered and any incomplete log record set to zero. </td>
+            
+</tr>
+          
+</table>
+<a name="N1018D"></a><a name="Format+of+the+log+record+wrapper"></a>
+<h3 class="boxed">Format of the log record wrapper</h3>
+<p>The log record wrapper provides information for the log scan.</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4"> 
+            
+<tr> 
+              
+<th colspan="1" rowspan="1">Type</th>
+              <th colspan="1" rowspan="1">Description</th>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">length - length of the log record (for forward scan)</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">long</td>
+              <td colspan="1" rowspan="1">instant - LogInstant of the log record</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">byte[length]</td>
+              <td colspan="1" rowspan="1">logRecord - byte array that is written by the FileLogger</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">length - length of the log record (for backward scan)</td>
+            
+</tr>
+          
+</table>
+<a name="N101DB"></a><a name="The+format+of+a+log+record"></a>
+<h3 class="boxed">The format of a log record</h3>
+<p>The log record described every change to the persistent store</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4"> 
+            
+<tr> 
+              
+<th colspan="1" rowspan="1">Type</th>
+              <th colspan="1" rowspan="1">Description</th>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">int</td>
+              <td colspan="1" rowspan="1">format_id, set to LOG_RECORD. The formatId is written by FormatIdOutputStream 
+                when this object is written out by writeObject </td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">CompressedInt</td>
+              <td colspan="1" rowspan="1"> 
+<p>loggable group - the loggable's group value.</p> 
+<p> Each 
+                  loggable belongs to one or more groups of similar functionality. 
+                </p> 
+<p> Grouping is a way to quickly sort out log records that 
+                  are interesting to different modules or different implementations. 
+                </p> 
+<p> When a module makes loggable and sent it to the log file, 
+                  it must mark this loggable with one or more of the following 
+                  group. If none fit, or if the loggable encompasses functionality 
+                  that is not described in existing groups, then a new group should 
+                  be introduced. </p> 
+<p> Grouping has no effect on how the record 
+                  is logged or how it is treated in rollback or recovery. </p> 
+                
+<p> The following groups are defined. This list serves as the 
+                  registry of all loggable groups. </p> 
+<table class="ForrestTable" cellspacing="1" cellpadding="4"> 
+                  
+<caption>Loggable Groups</caption>
+                  
+<tr> 
+                    
+<th colspan="1" rowspan="1">Name</th>
+                    <th colspan="1" rowspan="1">Value</th>
+                    <th colspan="1" rowspan="1">Description</th>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">FIRST</td>
+                    <td colspan="1" rowspan="1">0x1</td>
+                    <td colspan="1" rowspan="1">The first operation of a transaction.</td>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">LAST</td>
+                    <td colspan="1" rowspan="1">0x2</td>
+                    <td colspan="1" rowspan="1">The last operation of a transaction.</td>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">COMPENSATION</td>
+                    <td colspan="1" rowspan="1">0x4</td>
+                    <td colspan="1" rowspan="1">A compensation log record.</td>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">BI_LOG</td>
+                    <td colspan="1" rowspan="1">0x8</td>
+                    <td colspan="1" rowspan="1">A BeforeImage log record.</td>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">COMMIT</td>
+                    <td colspan="1" rowspan="1">0x10</td>
+                    <td colspan="1" rowspan="1">The transaction committed.</td>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">ABORT</td>
+                    <td colspan="1" rowspan="1">0x20</td>
+                    <td colspan="1" rowspan="1">The transaction aborted.</td>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">PREPARE</td>
+                    <td colspan="1" rowspan="1">0x40</td>
+                    <td colspan="1" rowspan="1">The transaction prepared.</td>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">XA_NEEDLOCK</td>
+                    <td colspan="1" rowspan="1">0x80</td>
+                    <td colspan="1" rowspan="1">Need to reclaim locks associated with theis log record 
+                      during XA prepared xact recovery.</td>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">RAWSTORE</td>
+                    <td colspan="1" rowspan="1">0x100</td>
+                    <td colspan="1" rowspan="1">A log record generated by the raw store.</td>
+                  
+</tr>
+                  
+<tr> 
+                    
+<td colspan="1" rowspan="1">FILE_RESOURCE</td>
+                    <td colspan="1" rowspan="1">0x400</td>
+                    <td colspan="1" rowspan="1">related to "non-transactional" files.</td>
+                  
+</tr>
+                
+</table> 
+</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">TransactionId</td>
+              <td colspan="1" rowspan="1">xactId - The Transaction this log belongs to.</td>
+            
+</tr>
+            
+<tr> 
+              
+<td colspan="1" rowspan="1">Loggable</td>
+              <td colspan="1" rowspan="1">op - the log operation</td>
+            
+</tr>
+          
+</table>
+</div>
+      
+<a name="N10308"></a><a name="Pointers+to+relevant+classes"></a>
+<h2 class="boxed">Pointers to relevant classes</h2>
+<div class="section">
+<div class="fixme">
+<div class="label">Fixme (DM)</div>
+<div class="content">This section should link to appropriate Javadoc documentation</div>
+</div>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+         
+<tr>
+             
+<th colspan="1" rowspan="1">Package</th>
+             <th colspan="1" rowspan="1">Class</th>
+             <th colspan="1" rowspan="1">Description</th>
+         
+</tr>
+         
+<tr>
+             
+<td colspan="1" rowspan="1">org.apache.derby.iapi.store.raw.log</td>
+             <td colspan="1" rowspan="1">LogFactory.java</td>
+             <td colspan="1" rowspan="1">The java interface for logging system module.</td>
+         
+</tr>
+         
+<tr>
+             
+<td colspan="1" rowspan="1">org.apache.derby.impl.store.raw.log</td>
+             <td colspan="1" rowspan="1">LogToFile.java</td>
+             <td colspan="1" rowspan="1">The implmentation of the LogFactory.java, also implementing Module,
+                 this is the one with recovery code.</td>
+         
+</tr>
+         
+<tr>
+             
+<td colspan="1" rowspan="1"></td>
+             <td colspan="1" rowspan="1">CheckpointOperation.java</td>
+             <td colspan="1" rowspan="1">A Log Operation that represents a checkpoint.</td>
+         
+</tr>
+         
+<tr>
+             
+<td colspan="1" rowspan="1"></td>
+             <td colspan="1" rowspan="1">FileLogger.java</td>
+             <td colspan="1" rowspan="1">Deals with putting log records to disk. Writes log records to a log file as a stream
+                (ie. log records added to the end of the file, no concept of pages).</td>
+         
+</tr>
+         
+<tr>
+             
+<td colspan="1" rowspan="1"></td>
+             <td colspan="1" rowspan="1">FlushedScan.java</td>
+             <td colspan="1" rowspan="1">Deals with scanning the log file. Scan the the log which is implemented by a series of log files.
+                 This log scan knows how to move across log file if it is positioned at
+                 the boundary of a log file and needs to getNextRecord.</td>
+         
+</tr>
+         
+<tr>
+             
+<td colspan="1" rowspan="1"></td>
+             <td colspan="1" rowspan="1">FlushedScanHandle.java</td>
+             <td colspan="1" rowspan="1">More stuff dealing with scanning the log file.</td>
+         
+</tr>
+         
+<tr>
+              
+<td colspan="1" rowspan="1"></td>
+              <td colspan="1" rowspan="1">Scan.java</td>
+              <td colspan="1" rowspan="1">More scan log file stuff. Scan the the log which is implemented by a series of log files.
+                This log scan knows how to move across log file if it is positioned at
+                the boundary of a log file and needs to getNextRecord.</td>
+         
+</tr>
+         
+<tr>
+              
+<td colspan="1" rowspan="1"></td>
+              <td colspan="1" rowspan="1">StreamLogScan.java</td>
+              <td colspan="1" rowspan="1">More scan log file stuff. LogScan provides methods to read a log record and get its LogInstant
+                  in an already defined scan.</td>
+         
+</tr>
+         
+<tr>
+             
+<td colspan="1" rowspan="1"></td>
+             <td colspan="1" rowspan="1">LogAccessFile.java</td>
+             <td colspan="1" rowspan="1">Lowest level putting log records to disk. Wraps a RandomAccessFile file to provide buffering
+                 on log writes.</td>
+         
+</tr>
+         
+<tr>
+             
+<td colspan="1" rowspan="1"></td>
+             <td colspan="1" rowspan="1">LogAccessFileBuffer.java</td>
+             <td colspan="1" rowspan="1">Utility for LogAccessFile. A single buffer of data.</td>
+         
+</tr>
+         
+<tr>
+              
+<td colspan="1" rowspan="1"></td>
+              <td colspan="1" rowspan="1">LogCounter.java</td>
+              <td colspan="1" rowspan="1">Log sequence number (LSN) implementation </td>
+         
+</tr>
+         
+<tr>
+              
+<td colspan="1" rowspan="1"></td>
+              <td colspan="1" rowspan="1">LogRecord.java</td>
+              <td colspan="1" rowspan="1">The log record written out to disk.</td>
+         
+</tr>
+         
+<tr>
+              
+<td colspan="1" rowspan="1"></td>
+              <td colspan="1" rowspan="1">ReadOnly.java</td>
+              <td colspan="1" rowspan="1">an alternate read only implementation of LogFactory</td>
+         
+</tr>
+         
+</table>
+</div>
+    
+</div>
+<!--+
+    |end content
+    +-->
+<div class="clearboth">&nbsp;</div>
+</div>
+<div id="footer">
+<!--+
+    |start bottomstrip
+    +-->
+<div class="lastmodified">
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+//  --></script>
+</div>
+<div class="copyright">
+        Copyright &copy;
+         2004-2012 Apache Software Foundation</div>
+<div id="feedback">
+    Send feedback about the website to:
+  <a id="feedbackto" href="mailto:derby-user@db.apache.org?subject=Feedback%C2%A0papers/logformats.html">derby-user@db.apache.org</a>
+</div>
+<!--+
+    |end bottomstrip
+    +-->
+</div>
+</body>
+</html>

Added: websites/production/db/content/derby/papers/optimizer.html
==============================================================================
--- websites/production/db/content/derby/papers/optimizer.html (added)
+++ websites/production/db/content/derby/papers/optimizer.html Wed Dec 19 18:20:21 2012
@@ -0,0 +1,631 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<meta content="Apache Forrest" name="Generator">
+<meta name="Forrest-version" content="0.8">
+<meta name="Forrest-skin-name" content="pelt">
+<title>Derby Optimizer Design</title>
+<link type="text/css" href="../skin/basic.css" rel="stylesheet">
+<link media="screen" type="text/css" href="../skin/screen.css" rel="stylesheet">
+<link media="print" type="text/css" href="../skin/print.css" rel="stylesheet">
+<link type="text/css" href="../skin/profile.css" rel="stylesheet">
+<script src="../skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="../skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="../skin/fontsize.js" language="javascript" type="text/javascript"></script>
+<link rel="shortcut icon" href="../">
+</head>
+<body onload="init()">
+<script type="text/javascript">ndeSetTextSize();</script>
+<div id="top">
+<!--+
+    |breadtrail
+    +-->
+<div class="breadtrail">
+<a href="http://www.apache.org/">apache</a> &gt; <a href="http://db.apache.org/">db</a><script src="../skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
+</div>
+<!--+
+    |header
+    +-->
+<div class="header">
+<!--+
+    |start group logo
+    +-->
+<div class="grouplogo">
+<a href="http://db.apache.org/derby"><img class="logoImage" alt="Apache Derby" src="../images/derby-logo-web.png" title="Derby is a zero-admin Java RDBMS"></a>
+</div>
+<!--+
+    |end group logo
+    +-->
+<!--+
+    |start Project Logo
+    +-->
+<div class="projectlogoA1">
+<a href="http://db.apache.org"><img class="logoImage" alt="Apache DB Project" src="../images/db-logo-white.png" title="Apache DB creates and maintains database solutions."></a>
+</div>
+<!--+
+    |end Project Logo
+    +-->
+<!--+
+    |start Tabs
+    +-->
+<ul id="tabs">
+<li>
+<a class="unselected" href="../index.html">Home</a>
+</li>
+<li>
+<a class="unselected" href="../quick_start.html">Quick Start</a>
+</li>
+<li>
+<a class="unselected" href="../derby_downloads.html">Download</a>
+</li>
+<li>
+<a class="unselected" href="../derby_comm.html">Community</a>
+</li>
+<li>
+<a class="unselected" href="../manuals/index.html">Documentation</a>
+</li>
+<li class="current">
+<a class="selected" href="../blogs/index.html">Resources</a>
+</li>
+</ul>
+<!--+
+    |end Tabs
+    +-->
+</div>
+</div>
+<div id="main">
+<div id="publishedStrip">
+<!--+
+    |start Subtabs
+    +-->
+<div id="level2tabs"></div>
+<!--+
+    |end Endtabs
+    +-->
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+//  --></script>
+</div>
+<!--+
+    |breadtrail
+    +-->
+<div class="breadtrail">
+
+             &nbsp;
+           </div>
+<!--+
+    |start Menu, mainarea
+    +-->
+<!--+
+    |start Menu
+    +-->
+<div id="menu">
+<div onclick="SwitchMenu('menu_1.1', '../skin/')" id="menu_1.1Title" class="menutitle">Blogs and Articles About Derby</div>
+<div id="menu_1.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#blogs">Blogs</a>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3', '../skin/')" id="menu_1.1.3Title" class="menutitle">Articles</div>
+<div id="menu_1.1.3" class="menuitemgroup">
+<div onclick="SwitchMenu('menu_1.1.3.1', '../skin/')" id="menu_1.1.3.1Title" class="menutitle">Tutorials, Tips and Tuning</div>
+<div id="menu_1.1.3.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#getstarted">Getting Started</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#features">Features, Hints and Tips</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#security">Security</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#performance">Performance and Tuning</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3.2', '../skin/')" id="menu_1.1.3.2Title" class="menutitle">Tools and Migration</div>
+<div id="menu_1.1.3.2" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#tools">Tools</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#migration">Migration</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3.3', '../skin/')" id="menu_1.1.3.3Title" class="menutitle">Applications</div>
+<div id="menu_1.1.3.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#client">Client</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#middletier">Middle Tier</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#persistence">Persistence</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#scalability">Scalability and Failover</a>
+</div>
+</div>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.2', '../skin/')" id="menu_1.2Title" class="menutitle">Integration With Other Products</div>
+<div id="menu_1.2" class="menuitemgroup">
+<div class="menuitem">
+<a href="../integrate/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../integrate/index.html#uses">What works with Derby?</a>
+</div>
+<div class="menuitem">
+<a href="../integrate/index.html#products">Product Writeups</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.3', '../skin/')" id="menu_1.3Title" class="menutitle">Eclipse Plug-ins</div>
+<div id="menu_1.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../integrate/derby_plugin_info.html">Info</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_selected_1.4', '../skin/')" id="menu_selected_1.4Title" class="menutitle" style="background-image: url('../skin/images/chapter_open.gif');">Papers and Presentations</div>
+<div id="menu_selected_1.4" class="selectedmenuitemgroup" style="display: block;">
+<div class="menuitem">
+<a href="../papers/index.html">Overview</a>
+</div>
+<div onclick="SwitchMenu('menu_selected_1.4.2', '../skin/')" id="menu_selected_1.4.2Title" class="menutitle" style="background-image: url('../skin/images/chapter_open.gif');">Derby Engine</div>
+<div id="menu_selected_1.4.2" class="selectedmenuitemgroup" style="display: block;">
+<div onclick="SwitchMenu('menu_1.4.2.1', '../skin/')" id="menu_1.4.2.1Title" class="menutitle">Javadoc</div>
+<div id="menu_1.4.2.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/engine">Engine</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/language">Language</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/tools">Tools</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/publishedapi">API</a>
+</div>
+</div>
+<div class="menuitem">
+<a href="../papers/derby_arch.html">Architecture</a>
+</div>
+<div class="menuitem">
+<a href="../papers/btree_package.html">BTree</a>
+</div>
+<div class="menuitem">
+<a href="../papers/pageformats.html">Disk Page Format</a>
+</div>
+<div class="menuitem">
+<a href="../papers/derby_htw.html">How Things Work</a>
+</div>
+<div class="menuitem">
+<a href="../papers/Intersect-design.html">Intersect &amp; Except</a>
+</div>
+<div class="menuitem">
+<a href="../papers/JDBCImplementation.html">JDBC</a>
+</div>
+<div class="menuitem">
+<a href="../papers/logformats.html">Log Format</a>
+</div>
+<div class="menuitem">
+<a href="../papers/recovery.html">Logging &amp; Recovery</a>
+</div>
+<div class="menupage">
+<div class="menupagetitle">Optimizer</div>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/engine/org/apache/derby/iapi/types/package-summary.html#package_description">Type System</a>
+</div>
+<div class="menuitem">
+<a href="../papers/versionupgrade.html">Versioning</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.3', '../skin/')" id="menu_1.4.3Title" class="menutitle">Derby Network Client</div>
+<div id="menu_1.4.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/DerbyClientSpec.html">Functional Spec</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.4', '../skin/')" id="menu_1.4.4Title" class="menutitle">Derby Tutorial</div>
+<div id="menu_1.4.4" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/DerbyTut/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/install_software.html">Step 1: Install Software</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/ij_intro.html">Step 2: ij Basics</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/embedded_intro.html">Step 3: Embedded Derby</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/ns_intro.html">Step 4: Derby Network Server</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.5', '../skin/')" id="menu_1.4.5Title" class="menutitle">Presentations</div>
+<div id="menu_1.4.5" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/ApacheCon.html">ApacheCon</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#Victorian+Java+User+Group">Victorian JUG 2008</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#OSCON+2005">OSCON 2005</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#Colorado+Software+Summit+2004">Colorado 2004</a>
+</div>
+</div>
+</div>
+<!--+
+    |start Search
+    +-->
+<div class="searchbox">
+<hr>
+<form action="http://www.google.com/search" method="get">
+<input value="db.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="18" name="q" id="query" type="text" value="Search the site with google">&nbsp; 
+                  <input name="Search" value="Search" type="submit">
+</form>
+</div>
+<!--+
+    |end search
+    +-->
+<div id="credit"></div>
+<div id="roundbottom">
+<img style="display: none" class="corner" height="15" width="15" alt="" src="../skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
+<!--+
+  |alternative credits
+  +-->
+<div id="credit2"></div>
+</div>
+<!--+
+    |end Menu
+    +-->
+<!--+
+    |start content
+    +-->
+<div id="content">
+<div class="trail">Font size: 
+	          &nbsp;<input value="Reset" class="resetfont" title="Reset text" onclick="ndeSetTextSize('reset'); return false;" type="button">      
+	          &nbsp;<input value="-a" class="smallerfont" title="Shrink text" onclick="ndeSetTextSize('decr'); return false;" type="button">
+	          &nbsp;<input value="+a" class="biggerfont" title="Enlarge text" onclick="ndeSetTextSize('incr'); return false;" type="button">
+</div>
+<h1>Derby Optimizer Design</h1>
+<div class="abstract">This document describes the Derby Optimizer. This is a work-in-progress 
+        derived from Javadoc comments and from explanations Jeffrey Lichtman and 
+        others posted to the Derby lists. Please post questions, comments, and 
+        corrections to derby-dev@db.apache.org. </div>
+<div id="minitoc-area">
+<ul class="minitoc">
+<li>
+<a href="#overview">Overview</a>
+</li>
+<li>
+<a href="#Example+of+a+5-way+Join">Example of a 5-way Join</a>
+</li>
+<li>
+<a href="#Potential+Improvements+to+the+Optimizer">Potential Improvements to the Optimizer</a>
+</li>
+</ul>
+</div> 
+      
+<a name="N10010"></a><a name="overview"></a>
+<h2 class="boxed">Overview</h2>
+<div class="section">
+<div class="note">
+<div class="label">Note</div>
+<div class="content">Jeffrey Lichtman wrote the original implementation of the optimizer.
+          What follows below is a description of the optimizer he posted to the
+          Derby mailing list.
+        </div>
+</div>
+<p> The optimizer currently only considers left-deep trees. That is, when 
+          looking at joins, it doesn't consider join nodes to the right of other 
+          join nodes - the join trees it considers are skewed to the left. I thought 
+          this would be a good first implementation of the optimizer. Bushy trees 
+          (where the join tree can take any shape) are harder to implement but 
+          are often useful for big multi-way decision-support joins. </p>
+<p> During optimization the join order is represented as an array of Optimizables. 
+          The first element in the array is the leftmost table in the join tree, 
+          and the successive elements in the array are the joining tables in the 
+          left-deep tree. </p>
+<p> The optimizer does an exhaustive search of query plans, going through 
+          all the join orders and, for each join order, all of the access paths 
+          for that order. It remembers the cheapest complete plan it has found 
+          so far, and does cost-based search space pruning. That is, it doesn't 
+          consider plans that are more expensive that the best plan found so far. 
+        </p>
+<p> The optimizer goes through the search space depth-first. In other 
+          words, it adds a table to the join order, then goes through all the 
+          access paths for that table (choosing the cheapest one) before going 
+          on to the next position in the join order. When it gets all the way 
+          to the end of the join order, with all of the tables accounted for, 
+          it considers whether the plan it is looking at is the best one found 
+          so far. If so, it remembers it. Eventually, it exhausts all the tables 
+          to consider at a given depth in the join order, and it has to back up 
+          to the previous position in the join order and consider the next table 
+          there. </p>
+<p> Every time the optimizer adds a table to the prospective join order, 
+          it figures out which parts of the WHERE clause can be evaluated at that 
+          position in the order and pushes these expressions down to that position 
+          so they can be used there. For example, if you have a five-way join 
+          of tables t1, t2, t3, t4 and t5, and the current permutation being considered 
+          is t3 - t1 - t2 (with t3 as the outermost table and t2 as the innermost), 
+          it can evaluate the join "t3.c1 = t2.c2" at the third position in the 
+          join order, so when it adds table t2 it pushes the expression down to 
+          that level in the order. Later, when it removes t2 from the join order 
+          to consider some other table there, it pulls the expression back out 
+          of the join order. </p>
+<p> getNextPermutation() does the adding (and deletion) of tables to the 
+          join order under consideration by the optimizer. getNextDecoratedPermutation() 
+          goes through all the access paths for the current permutation (in the 
+          current implementation, it only has to consider the access paths for 
+          the innermost table in the join order, as the search is done depth-first). 
+          You can think of a decorated permutation as a join order with things 
+          like indexes and join strategies "decorating" the nodes. </p>
+<p> The Optimizable interface represents anything in the query that can 
+          have an access path. In practice an Optimizable is usually a base table, 
+          but it can be anything that appears in a FROM list (for example, standard 
+          SQL allows a UNION in the FROM list). Different Optimizables have different 
+          choices for access paths. The optimizer uses the nextAccessPath() method 
+          on Optimizable to step through the different access paths. These include 
+          different join strategies (such as nested loop and hash join) and different 
+          conglomerates (the base table and the different indexes). Sometimes 
+          the Optimizable has to decide whether a given access path is feasible 
+          (for example, hash join is only feasible for equijoins). </p>
+<p> I'm leaving out a whole lot of details, like how the optimizer does 
+          costing for sort avoidance and how it handles join order dependencies 
+         (e.g. when an "exists" or "in" subquery is flattened to an existence join, 
+         the join order musn't be inverted under the current implementation). </p>
+</div>
+      
+<a name="N10032"></a><a name="Example+of+a+5-way+Join"></a>
+<h2 class="boxed">Example of a 5-way Join</h2>
+<div class="section">
+<p> The optimizer looks at so many 
+          potential plans in a five-way join that it's not feasible to show all 
+          of them in an manually-written explanation. </p>
+<p> Let's take the following query: </p>
+<pre class="code"> 
+select *
+from t1, t2, t3, t4, t5
+where t1.c1 = t2.c2
+and    t1.c3 = t3.c4
+and    t3.c5 = t4.c6
+and    t4.c7 = t5.c8
+and    t1.c9 = 1
+and    t3.c10 = 2
+and    t5.c11 = 3          
+        </pre>
+<p> One possible way to execute this query is to take the tables in the 
+          order of the FROM clause. For each row in a table, join it with the 
+          matching rows from the next table to form a set of joined row. The plan 
+          would look something like this (I hope the formatting doesn't get screwed 
+          up): </p>
+<pre class="code"> 
+             JOIN
+             /      \
+          JOIN    t5
+           /      \
+        JOIN    t4
+        /     \
+    JOIN    t3
+   /       \
+t1       t2
+        </pre>
+<p> This is a left-deep tree. That is. it's skewed to the left. Let's 
+          assume for the sake of argument that each JOIN node is a nested-loop 
+          join. What this means is that each JOIN node gets a row from its left 
+          (outer) table and probes into its right (inner) table to find all the 
+          matching rows. For all but the leftmost JOIN node, the outer table is 
+          also a JOIN. So, at execution, this plan goes all the way down to the 
+          left, gets the first qualifying row from t1, then finds a matching row 
+          in t2. It then puts the matching rows from t1 and t2 together into a 
+          joined row and feeds it up to the JOIN node above it. This JOIN node 
+          uses its outer row to probe into t3 to find a matching row. When it 
+          finds such a row, it puts together its outer and inner rows into a joined 
+          row, which it feeds to the JOIN node above it. It keeps doing this all 
+          the way up the plan. When the top JOIN node finds a matching row in 
+          t5, it returns that row from the SELECT statement. </p>
+<p> More sophisticated optimizers consider "bushy" trees, which can take 
+          shapes other than the left-deep shape shown above. For example, it might 
+          consider a plan with the following join tree: </p>
+<pre class="code">
+          JOIN
+         /       \
+   JOIN       JOIN
+   /     \      /      \
+t1    t2    t3    JOIN
+                     /     \
+                    t4    t5
+        </pre>
+<p> As you can see, the tables are in the same order but the shape of 
+          the join tree is entirely different. As I mentioned in my original mail, 
+          bushy trees are harder to implement but they are good for some types 
+          of big decision-support queries. </p>
+<p> Because the Derby optimizer only models left-deep trees, it doesn't 
+          have to model the shape of the tree. All it has to model is the order 
+          of the tables in the tree (since the tree is always the same shape for 
+          a given number of tables). It does this the simple way: by using an 
+          array representing the assignment of tables to positions in the join 
+          order. </p>
+<p> The basic idea of a cost-based optimizer is to come up with an estimated 
+          cost for all the possible execution plans for a query and to choose 
+          the cheapest plan. The number of possible plans grows with the number 
+          of tables, indexes, join strategies, etc. Most optimizers do something 
+          to reduce the search space, so that for big queries the best plan (or 
+          a reasonable plan) is found in an acceptable length of time. One way 
+          the Derby optimizer prunes its search space is by skipping over plans 
+          that it knows will be more expensive than the best plan it's found so 
+          far. </p>
+<p> The optimizer does this by depth-first searching. That is, rather 
+          than coming up with a join order for all the tables in the query and 
+          then considering all the access paths for those tables, it adds one 
+          table at a time to the join order and figures out the best access path 
+          for that table (in its current spot in the join order) before going 
+          on to add another table to the join order. While doing this, it keeps 
+          track of the cost of the plan its considering so far. If, when it adds 
+          a table to the join order, it finds that this makes the current plan 
+          under consideration more costly than the best plan found so far, it 
+          abandons the consideration of that table in that position of the join 
+          order. By doing this, the optimizer can avoid considering many join 
+          orders. This is important when there are a lot of tables in the query, 
+          because the number of join orders is the factorial of the number of 
+          tables. </p>
+<p> For example, let's say that in the sample query given above, the optimizer 
+          has already found a complete plan with an estimated cost of 10000. Now 
+          suppose it is considering the following partial join order: </p>
+<p> (outer) t4 - t5 (inner) </p>
+<p> Let's say this partial plan has an estimated cost of 2000. Now suppose 
+          the optimizer considers placing the table t1 as the next table in the 
+          join order: </p>
+<p> (outer) t4 - t5 - t2 (inner) </p>
+<p> Note that the query has no direct join clause between t1 and either 
+          t4 or t5. The optimizer would go through all possible access paths for 
+          t2 in this context, and would see that with no useful qualification 
+          on the table it would have to do a full scan of t2 for every outer row 
+          resulting from the join of t4 and t5. If t2 is anything but a very small 
+          table, it could be expensive. Let's say the estimated total best cost 
+          for t2 in this position in the join order is 50000. That would make 
+          the total cost of the query equal to 52000, which is higher than the 
+          cost of the best plan found so far (10000). So it doesn't make sense 
+          to look at this join order any further. Rather than consider the following 
+          join orders: </p>
+<p> (outer) t4 - t5 - t2 - t1 - t3 (inner) </p>
+<p> (outer) t4 - t5 - t2 - t3 - t1 (inner) </p>
+<p> the optimizer abandons consideration of any plan starting with t4 
+          - t5 - t2. </p>
+</div>
+      
+<a name="N10078"></a><a name="Potential+Improvements+to+the+Optimizer"></a>
+<h2 class="boxed">Potential Improvements to the Optimizer</h2>
+<div class="section">
+<p> It's hard to consider the optimizer by itself. Many optimizer enhancements
+          would work with changes in other areas, especially execution. </p>
+<p> One area that I think needs work is hash joins. The current implementation 
+          uses an in-memory hash table. The optimizer avoids using the hash join 
+          strategy when it estimates that the hash table would use too much memory. 
+          There are a few problems with this: the optimizer doesn't really know 
+          how much memory is available, its estimates of memory usage are crude 
+          and possibly inaccurate, and it's possible for the query to fail if 
+          a hash table gets too big during execution. </p>
+<p> I would like to see hash tables that spill to disk. Ideally, the hash 
+          table should be an in-memory structure with a conglomerate as a backing 
+          store. I would want the backing store to be used only when necessary 
+          - that is, only when the hash table grows too big. The biggest problem 
+          with this idea is how to estimate the cost of building and maintaining 
+          the table. One approach could be to put a limit on the number of in-memory 
+          rows in the hash table and use a statistical formula for the cost of 
+          reading a row, using the number of in-memory rows versus the total number 
+          of rows to estimate the chances of finding a row in memory. </p>
+<p> Another approach could be to use weak references in managing the hash 
+          table (a weak reference is a Java construct that allows the garbage 
+          collector to nominate an object for garbage collection even when it 
+          has a reference to it). Weak references are useful for memory caches 
+          that adjust themselves to the memory available to the JVM. One of our 
+          original ideas for Derby (nee Cloudscape) is that it should be a low-maintenance 
+          DBMS, with little intervention required to keep a working system running. 
+          A self-managing cache could help with this - it would adjust itself 
+          to environments with different amounts of memory (although small-memory 
+          environments could hurt performance). I don't know how the optimizer 
+          would estimate the cost for building and maintaining a hash table in 
+          this case. </p>
+<p> I also think merge joins are worth considering, especially if nothing 
+          is done about hash joins. Merge joins are useful for many of the same 
+          types of queries as hash joins and, since they use the sorter (assuming 
+          the joined rows are not already ordered on the join colums) they can 
+          work even for large tables (because the sorter spills to disk if the 
+          data being sorted won't fit in memory). Merge joins can have a very 
+          low cost if the rows are already ordered (which they can be if there 
+          are indexes on the join columns). Merge joins can also work well with 
+          sort avoidance if the ordering for the merge is the same as for the 
+          ORDER BY clause. </p>
+<p> Switching gears, another problem is the cost of user-defined functions. 
+          What do you do with a query like this?: </p>
+<pre class="code"> 
+select *
+from t1, t2, t3
+where t1.c1 = user_defined_function(t2.c2)
+and t2.c3 = t3.c4
+        </pre>
+<p> If the user-defined function is cheap and there's an index on t1.c1, 
+          you may want to call the function for every row in t2 and use the result 
+          to probe into the index on t1. On the other hand, if the function is 
+          expensive, you may want to try to execute it as few times as possible, 
+          which could make it unfeasible to use it to probe into t1. Currently 
+          Derby has no modeling for the cost of user-defined functions and avoids 
+          pushing them down into the query plan (that is, it calculates user-defined 
+          functions as late as possible before returning the rows from the query). 
+        </p>
+<p> This may seem trivial, but keep in mind that a user-defined function 
+          can do anything, from something as simple as returning a constant to 
+          as complex as executing a query on another DBMS. It really can be important 
+          to know the cost of a user-defined function. </p>
+<p> One possible approach would be to have a way of telling the DBMS to 
+          execute the function and remember how long it takes, and then store 
+          this in a system catalog for the optimizer to use. Another approach 
+          would be to allow the user to register the cost of a function as low, 
+          medium or high. </p>
+<p> Switching gears again, another feature I think would be generically 
+          useful would be indexes on expressions (instead of just on columns). 
+          One potential use for this feature is case-independent searches (which 
+          can be done now, but which tend to be slow because the functions involved 
+          prevent the optimizer from using an index). The biggest problem here 
+          is the representation of index definitions in the SYSCONGLOMERATES system 
+          table (which assumes that an index column corresponds to a column in 
+          a base table). </p>
+<p> Another area for investigation is the flattening of subqueries to 
+          joins. Currently, the language compiler flattens IN and EXISTS subqueries 
+          to types of joins. This is good because it gives the optimizer more 
+          options in choosing query plans for these types of subqueries, and it 
+          also allows the optimizer to get better cost estimates (for complex 
+          technical reasons I won't go into here). There are other types of subqueries 
+          that could be flattened - for example, a NOT IN or NOT EXISTS subquery 
+          can often be flattened to an outer join. </p>
+<p> Another thing that could be done is to allow the optimizer to invert 
+          the join order of IN and EXISTS subqueries. As mentioned above, these 
+          subqueries are often flattened to joins. The joins are special, in that 
+          at execution it looks for only one match in the inner table per row 
+          from the outer table. This strategy requires that the table in the IN 
+          or EXISTS subquery remain inner to the joining table from outside the 
+          subquery. It would be possible to invert the join order if a sort were 
+          done on the subquery table to eliminate duplicate joining values. (Actually, 
+          I'm oversimplifying here, but I would have to write a treatise otherwise.) 
+        </p>
+</div>
+    
+</div>
+<!--+
+    |end content
+    +-->
+<div class="clearboth">&nbsp;</div>
+</div>
+<div id="footer">
+<!--+
+    |start bottomstrip
+    +-->
+<div class="lastmodified">
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+//  --></script>
+</div>
+<div class="copyright">
+        Copyright &copy;
+         2004-2012 Apache Software Foundation</div>
+<div id="feedback">
+    Send feedback about the website to:
+  <a id="feedbackto" href="mailto:derby-user@db.apache.org?subject=Feedback%C2%A0papers/optimizer.html">derby-user@db.apache.org</a>
+</div>
+<!--+
+    |end bottomstrip
+    +-->
+</div>
+</body>
+</html>

Added: websites/production/db/content/derby/papers/page-format.png
==============================================================================
Binary file - no diff available.

Propchange: websites/production/db/content/derby/papers/page-format.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream