You are viewing a plain text version of this content. The canonical link for it is here.
Posted to general@db.apache.org by rh...@apache.org on 2012/12/19 19:20:28 UTC
svn commit: r843115 [25/44] - in /websites/production/db/content/derby: ./
binaries/ blogs/ blogs/images/ dev/ docs/ images/ integrate/
integrate/plugin_help/ integrate/plugin_help/images/ logo/ manuals/ papers/
papers/DerbyTut/ releases/ skin/ skin/cs...
Added: websites/production/db/content/derby/papers/logformats.html
==============================================================================
--- websites/production/db/content/derby/papers/logformats.html (added)
+++ websites/production/db/content/derby/papers/logformats.html Wed Dec 19 18:20:21 2012
@@ -0,0 +1,967 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<meta content="Apache Forrest" name="Generator">
+<meta name="Forrest-version" content="0.8">
+<meta name="Forrest-skin-name" content="pelt">
+<title>Derby Write Ahead Log Format</title>
+<link type="text/css" href="../skin/basic.css" rel="stylesheet">
+<link media="screen" type="text/css" href="../skin/screen.css" rel="stylesheet">
+<link media="print" type="text/css" href="../skin/print.css" rel="stylesheet">
+<link type="text/css" href="../skin/profile.css" rel="stylesheet">
+<script src="../skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="../skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="../skin/fontsize.js" language="javascript" type="text/javascript"></script>
+<link rel="shortcut icon" href="../">
+</head>
+<body onload="init()">
+<script type="text/javascript">ndeSetTextSize();</script>
+<div id="top">
+<!--+
+ |breadtrail
+ +-->
+<div class="breadtrail">
+<a href="http://www.apache.org/">apache</a> > <a href="http://db.apache.org/">db</a><script src="../skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
+</div>
+<!--+
+ |header
+ +-->
+<div class="header">
+<!--+
+ |start group logo
+ +-->
+<div class="grouplogo">
+<a href="http://db.apache.org/derby"><img class="logoImage" alt="Apache Derby" src="../images/derby-logo-web.png" title="Derby is a zero-admin Java RDBMS"></a>
+</div>
+<!--+
+ |end group logo
+ +-->
+<!--+
+ |start Project Logo
+ +-->
+<div class="projectlogoA1">
+<a href="http://db.apache.org"><img class="logoImage" alt="Apache DB Project" src="../images/db-logo-white.png" title="Apache DB creates and maintains database solutions."></a>
+</div>
+<!--+
+ |end Project Logo
+ +-->
+<!--+
+ |start Tabs
+ +-->
+<ul id="tabs">
+<li>
+<a class="unselected" href="../index.html">Home</a>
+</li>
+<li>
+<a class="unselected" href="../quick_start.html">Quick Start</a>
+</li>
+<li>
+<a class="unselected" href="../derby_downloads.html">Download</a>
+</li>
+<li>
+<a class="unselected" href="../derby_comm.html">Community</a>
+</li>
+<li>
+<a class="unselected" href="../manuals/index.html">Documentation</a>
+</li>
+<li class="current">
+<a class="selected" href="../blogs/index.html">Resources</a>
+</li>
+</ul>
+<!--+
+ |end Tabs
+ +-->
+</div>
+</div>
+<div id="main">
+<div id="publishedStrip">
+<!--+
+ |start Subtabs
+ +-->
+<div id="level2tabs"></div>
+<!--+
+ |end Endtabs
+ +-->
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+// --></script>
+</div>
+<!--+
+ |breadtrail
+ +-->
+<div class="breadtrail">
+
+
+ </div>
+<!--+
+ |start Menu, mainarea
+ +-->
+<!--+
+ |start Menu
+ +-->
+<div id="menu">
+<div onclick="SwitchMenu('menu_1.1', '../skin/')" id="menu_1.1Title" class="menutitle">Blogs and Articles About Derby</div>
+<div id="menu_1.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#blogs">Blogs</a>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3', '../skin/')" id="menu_1.1.3Title" class="menutitle">Articles</div>
+<div id="menu_1.1.3" class="menuitemgroup">
+<div onclick="SwitchMenu('menu_1.1.3.1', '../skin/')" id="menu_1.1.3.1Title" class="menutitle">Tutorials, Tips and Tuning</div>
+<div id="menu_1.1.3.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#getstarted">Getting Started</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#features">Features, Hints and Tips</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#security">Security</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#performance">Performance and Tuning</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3.2', '../skin/')" id="menu_1.1.3.2Title" class="menutitle">Tools and Migration</div>
+<div id="menu_1.1.3.2" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#tools">Tools</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#migration">Migration</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3.3', '../skin/')" id="menu_1.1.3.3Title" class="menutitle">Applications</div>
+<div id="menu_1.1.3.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#client">Client</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#middletier">Middle Tier</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#persistence">Persistence</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#scalability">Scalability and Failover</a>
+</div>
+</div>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.2', '../skin/')" id="menu_1.2Title" class="menutitle">Integration With Other Products</div>
+<div id="menu_1.2" class="menuitemgroup">
+<div class="menuitem">
+<a href="../integrate/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../integrate/index.html#uses">What works with Derby?</a>
+</div>
+<div class="menuitem">
+<a href="../integrate/index.html#products">Product Writeups</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.3', '../skin/')" id="menu_1.3Title" class="menutitle">Eclipse Plug-ins</div>
+<div id="menu_1.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../integrate/derby_plugin_info.html">Info</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_selected_1.4', '../skin/')" id="menu_selected_1.4Title" class="menutitle" style="background-image: url('../skin/images/chapter_open.gif');">Papers and Presentations</div>
+<div id="menu_selected_1.4" class="selectedmenuitemgroup" style="display: block;">
+<div class="menuitem">
+<a href="../papers/index.html">Overview</a>
+</div>
+<div onclick="SwitchMenu('menu_selected_1.4.2', '../skin/')" id="menu_selected_1.4.2Title" class="menutitle" style="background-image: url('../skin/images/chapter_open.gif');">Derby Engine</div>
+<div id="menu_selected_1.4.2" class="selectedmenuitemgroup" style="display: block;">
+<div onclick="SwitchMenu('menu_1.4.2.1', '../skin/')" id="menu_1.4.2.1Title" class="menutitle">Javadoc</div>
+<div id="menu_1.4.2.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/engine">Engine</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/language">Language</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/tools">Tools</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/publishedapi">API</a>
+</div>
+</div>
+<div class="menuitem">
+<a href="../papers/derby_arch.html">Architecture</a>
+</div>
+<div class="menuitem">
+<a href="../papers/btree_package.html">BTree</a>
+</div>
+<div class="menuitem">
+<a href="../papers/pageformats.html">Disk Page Format</a>
+</div>
+<div class="menuitem">
+<a href="../papers/derby_htw.html">How Things Work</a>
+</div>
+<div class="menuitem">
+<a href="../papers/Intersect-design.html">Intersect & Except</a>
+</div>
+<div class="menuitem">
+<a href="../papers/JDBCImplementation.html">JDBC</a>
+</div>
+<div class="menupage">
+<div class="menupagetitle">Log Format</div>
+</div>
+<div class="menuitem">
+<a href="../papers/recovery.html">Logging & Recovery</a>
+</div>
+<div class="menuitem">
+<a href="../papers/optimizer.html">Optimizer</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/engine/org/apache/derby/iapi/types/package-summary.html#package_description">Type System</a>
+</div>
+<div class="menuitem">
+<a href="../papers/versionupgrade.html">Versioning</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.3', '../skin/')" id="menu_1.4.3Title" class="menutitle">Derby Network Client</div>
+<div id="menu_1.4.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/DerbyClientSpec.html">Functional Spec</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.4', '../skin/')" id="menu_1.4.4Title" class="menutitle">Derby Tutorial</div>
+<div id="menu_1.4.4" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/DerbyTut/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/install_software.html">Step 1: Install Software</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/ij_intro.html">Step 2: ij Basics</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/embedded_intro.html">Step 3: Embedded Derby</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/ns_intro.html">Step 4: Derby Network Server</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.5', '../skin/')" id="menu_1.4.5Title" class="menutitle">Presentations</div>
+<div id="menu_1.4.5" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/ApacheCon.html">ApacheCon</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#Victorian+Java+User+Group">Victorian JUG 2008</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#OSCON+2005">OSCON 2005</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#Colorado+Software+Summit+2004">Colorado 2004</a>
+</div>
+</div>
+</div>
+<!--+
+ |start Search
+ +-->
+<div class="searchbox">
+<hr>
+<form action="http://www.google.com/search" method="get">
+<input value="db.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="18" name="q" id="query" type="text" value="Search the site with google">
+ <input name="Search" value="Search" type="submit">
+</form>
+</div>
+<!--+
+ |end search
+ +-->
+<div id="credit"></div>
+<div id="roundbottom">
+<img style="display: none" class="corner" height="15" width="15" alt="" src="../skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
+<!--+
+ |alternative credits
+ +-->
+<div id="credit2"></div>
+</div>
+<!--+
+ |end Menu
+ +-->
+<!--+
+ |start content
+ +-->
+<div id="content">
+<div class="trail">Font size:
+ <input value="Reset" class="resetfont" title="Reset text" onclick="ndeSetTextSize('reset'); return false;" type="button">
+ <input value="-a" class="smallerfont" title="Shrink text" onclick="ndeSetTextSize('decr'); return false;" type="button">
+ <input value="+a" class="biggerfont" title="Enlarge text" onclick="ndeSetTextSize('incr'); return false;" type="button">
+</div>
+<h1>Derby Write Ahead Log Format</h1>
+<div class="abstract">This document describes the storage format of Derby Write Ahead
+ Log. This is a work-in-progress derived from Javadoc comments and from
+ explanations Mike Matrigali and others posted to the Derby lists. Please
+ post questions, comments, and corrections to derby-dev@db.apache.org.
+ </div>
+<div id="minitoc-area">
+<ul class="minitoc">
+<li>
+<a href="#introduction"> Introduction </a>
+</li>
+<li>
+<a href="#References"> References </a>
+</li>
+<li>
+<a href="#Derby+implementation+of+the+Write+Ahead+Log">Derby implementation of the Write Ahead Log</a>
+<ul class="minitoc">
+<li>
+<a href="#LogCounter">LogCounter</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Format+of+Write+Ahead+Log"> Format of Write Ahead Log </a>
+<ul class="minitoc">
+<li>
+<a href="#Format+of+Log+Control+File">Format of Log Control File</a>
+</li>
+<li>
+<a href="#Format+of+the+log+file">Format of the log file</a>
+</li>
+<li>
+<a href="#Format+of+the+log+record+wrapper">Format of the log record wrapper</a>
+</li>
+<li>
+<a href="#The+format+of+a+log+record">The format of a log record</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Pointers+to+relevant+classes">Pointers to relevant classes</a>
+</li>
+</ul>
+</div>
+
+<a name="N10010"></a><a name="introduction"></a>
+<h2 class="boxed"> Introduction </h2>
+<div class="section">
+<p> Derby uses a Write Ahead Log to record all changes to the database.
+ The Write Ahead Log (WAL) protocol requires the following rules to be
+ followed: </p>
+<ol>
+
+<li>A page must be latched exclusively before it can be updated.</li>
+
+<li>While the latch is held, the update must be logged, and page must
+ be tagged with the identity of the log record (often known as Log
+ Sequence Number or LSN)</li>
+
+<li>When the page is about to be written to persistent storage, all
+ logs records up to and including the page's LSN, must be forced to
+ disk.</li>
+
+<li>Once the log records have been forced to disk, the cached page may
+ be written to persistent storage, overwriting the previous version
+ of the page.</li>
+
+</ol>
+<p>The WAL protocol ensures that in the event of a system crash, databases
+ pages can be restored to a consistent state using the information contained
+ in the log records. How this is done will be the subject of another
+ paper.</p>
+</div>
+
+<a name="N1002C"></a><a name="References"></a>
+<h2 class="boxed"> References </h2>
+<div class="section">
+<p> A good description of Write Ahead Logging, and how a log is typically
+ implemented, can be found in
+ <em>
+ <a class="external" href="http://portal.acm.org/citation.cfm?id=573304">Transaction
+ Processing: Concepts and Techniques</a>
+ , by Jim Gray and Andreas Reuter, 1993, Morgan Kaufmann Publishers</em>
+ .</p>
+</div>
+
+<a name="N1003D"></a><a name="Derby+implementation+of+the+Write+Ahead+Log"></a>
+<h2 class="boxed">Derby implementation of the Write Ahead Log</h2>
+<div class="section">
+<p> Derby implements the Write Ahead Log using a non-circular file system
+ file. Here are some comments about current implementation of recovery:</p>
+<p class="quote">
+
+<em>Suresh Thalamati</em>
+<br>
+ Derby supports simple media recovery. It has support for full backup/restore
+ and very basic form of rollforward recovery (replay of logs using backup
+ and archived log files). </p>
+<p class="quote">
+
+<em>Mike Matrigali</em>
+<br>
+ 1. Derby fully supports crash recovery, it uses java to correctly
+ sync the log file to support this.<br>
+ 2. I would say derby supports media recovery. One can make a backup
+ of the data and store it off line. Logs can be stored on a separate
+ disk from the data, and if you lose your data disk then you can
+ use rollforward recovery on the existing logs and the copy of the
+ backup to bring your database up to the current point in time.<br>
+ 3. Derby does not support "point in time recovery". Someone may want
+ to look at this in the future. Technically I don't think it would
+ be very hard as the logging system has the stuff to solve the hard
+ problems. It does not have an idea about "time" - it just knows
+ log sequence numbers, so need to figure out what kind of interface
+ a user really wants. A very user unfriendly interface would not
+ be very hard to implement which would be recover to a specific log
+ sequence number. Anyone interested in this feature should add it
+ to jira - I'll be happy to add technical comments on what needs
+ to be done.<br>
+ 4. A reasonable next step in derby recovery progress would be to
+ add a way to automatically move/copy log files offline as they are
+ not needed by crash recovery and only needed for media recovery.
+ Some sort of java stored procedure callout would seem most appropriate.
+ </p>
+<p> The 'log' is a stream of log records. The 'log' is implemented as
+ a series of numbered log files. These numbered log files are logically
+ continuous so a transaction can have log records that span multiple
+ log files. A single log record cannot span more than one log file. The
+ log file number is monotonically increasing. </p>
+<p> The log belongs to a log factory of a RawStore. In the current implementation,
+ each RawStore only has one log factory, so each RawStore only has one
+ log (which is composed of multiple log files). At any given time, a
+ log factory only writes new log records to one log file, this log file
+ is called the 'current log file'. </p>
+<p> A log file is named log
+ <em>logNumber</em>
+ .dat </p>
+<p>With the default values, a new log file is created (this is known as
+ log switch) when a log file grows beyond 1MB and a checkpoint happens
+ when the amount of log written is 10MB or more from the last checkpoint.</p>
+<p> RawStore exposes a checkpoint method which clients can call, or a
+ checkpoint is taken automatically by the RawStore when: </p>
+<ol>
+
+<li> The log file grows beyond a certain size (configurable, default
+ 1MB)</li>
+
+<li> RawStore is shutdown and a checkpoint hasn't been done "for a while"</li>
+
+<li> RawStore is recovered and a checkpoint hasn't been done "for a
+ while"</li>
+
+</ol>
+<a name="N1007C"></a><a name="LogCounter"></a>
+<h3 class="boxed">LogCounter</h3>
+<p>Log records are identified using LogCounter, which is an implementation
+ of LogInstant, a Derby term for LSN. The LogCounter is made up of
+ the log file number, and the byte offset of the log record within
+ the log file. Within the stored log record a log counter is represented
+ as a long. Outside the LogFactory the instant is passed around as
+ a LogCounter (through its LogInstant interface).</p>
+<p> The way the long is encoded is such that < == > correctly
+ tells if one log instant is lessThan, equals or greater than another.</p>
+</div>
+
+<a name="N1008A"></a><a name="Format+of+Write+Ahead+Log"></a>
+<h2 class="boxed"> Format of Write Ahead Log </h2>
+<div class="section">
+<p> An implementation of file based log is in
+ <span class="codefrag">org.apache.derby.impl.store.raw.log.LogToFile</span>.
+ This LogFactory is responsible for the formats of 2 kinds of file:
+ the log file and the log control file. And it is responsible for the
+ format of the log record wrapper. </p>
+<a name="N10096"></a><a name="Format+of+Log+Control+File"></a>
+<h3 class="boxed">Format of Log Control File</h3>
+<p>The log control file contains information about which log files are
+ present and where the last checkpoint log record is located.</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<tr>
+
+<th colspan="1" rowspan="1">Type</th>
+ <th colspan="1" rowspan="1">Desciption</th>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">format id set to FILE_STREAM_LOG_FILE</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">obsolete log file version</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">the log instant (LogCounter) of the last completed checkpoint</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">JBMS (older name for Cloudscape/Derby) version</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">checkpoint interval</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">spare (value set to 0)</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">spare (value set to 0)</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">spare (value set to 0)</td>
+
+</tr>
+
+</table>
+<a name="N10118"></a><a name="Format+of+the+log+file"></a>
+<h3 class="boxed">Format of the log file</h3>
+<p>The log file contains log records which record all the changes to
+ the database. The complete transaction log is composed of a series
+ of log files.</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<tr>
+
+<th colspan="1" rowspan="1">Type</th>
+ <th colspan="1" rowspan="1">Description</th>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">Format id of this log file, set to FILE_STREAM_LOG_FILE.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">Obsolete log file version - not used</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">Log file number - this number orders the log files in a series
+ to form the complete transaction log </td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">PrevLogRecord - log instant of the previous log record, in the
+ previous log file.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">[log record wrapper]*</td>
+ <td colspan="1" rowspan="1">one or more log records with wrapper</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">EndMarker - value of zero. The beginning of a log record wrapper
+ is the length of the log record, therefore it is never zero </td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">[int fuzzy end]*</td>
+ <td colspan="1" rowspan="1">zero or more int's of value 0, in case this log file has been
+ recovered and any incomplete log record set to zero. </td>
+
+</tr>
+
+</table>
+<a name="N1018D"></a><a name="Format+of+the+log+record+wrapper"></a>
+<h3 class="boxed">Format of the log record wrapper</h3>
+<p>The log record wrapper provides information for the log scan.</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<tr>
+
+<th colspan="1" rowspan="1">Type</th>
+ <th colspan="1" rowspan="1">Description</th>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">length - length of the log record (for forward scan)</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">long</td>
+ <td colspan="1" rowspan="1">instant - LogInstant of the log record</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">byte[length]</td>
+ <td colspan="1" rowspan="1">logRecord - byte array that is written by the FileLogger</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">length - length of the log record (for backward scan)</td>
+
+</tr>
+
+</table>
+<a name="N101DB"></a><a name="The+format+of+a+log+record"></a>
+<h3 class="boxed">The format of a log record</h3>
+<p>The log record described every change to the persistent store</p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<tr>
+
+<th colspan="1" rowspan="1">Type</th>
+ <th colspan="1" rowspan="1">Description</th>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">int</td>
+ <td colspan="1" rowspan="1">format_id, set to LOG_RECORD. The formatId is written by FormatIdOutputStream
+ when this object is written out by writeObject </td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">CompressedInt</td>
+ <td colspan="1" rowspan="1">
+<p>loggable group - the loggable's group value.</p>
+<p> Each
+ loggable belongs to one or more groups of similar functionality.
+ </p>
+<p> Grouping is a way to quickly sort out log records that
+ are interesting to different modules or different implementations.
+ </p>
+<p> When a module makes loggable and sent it to the log file,
+ it must mark this loggable with one or more of the following
+ group. If none fit, or if the loggable encompasses functionality
+ that is not described in existing groups, then a new group should
+ be introduced. </p>
+<p> Grouping has no effect on how the record
+ is logged or how it is treated in rollback or recovery. </p>
+
+<p> The following groups are defined. This list serves as the
+ registry of all loggable groups. </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<caption>Loggable Groups</caption>
+
+<tr>
+
+<th colspan="1" rowspan="1">Name</th>
+ <th colspan="1" rowspan="1">Value</th>
+ <th colspan="1" rowspan="1">Description</th>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">FIRST</td>
+ <td colspan="1" rowspan="1">0x1</td>
+ <td colspan="1" rowspan="1">The first operation of a transaction.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">LAST</td>
+ <td colspan="1" rowspan="1">0x2</td>
+ <td colspan="1" rowspan="1">The last operation of a transaction.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">COMPENSATION</td>
+ <td colspan="1" rowspan="1">0x4</td>
+ <td colspan="1" rowspan="1">A compensation log record.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">BI_LOG</td>
+ <td colspan="1" rowspan="1">0x8</td>
+ <td colspan="1" rowspan="1">A BeforeImage log record.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">COMMIT</td>
+ <td colspan="1" rowspan="1">0x10</td>
+ <td colspan="1" rowspan="1">The transaction committed.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">ABORT</td>
+ <td colspan="1" rowspan="1">0x20</td>
+ <td colspan="1" rowspan="1">The transaction aborted.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">PREPARE</td>
+ <td colspan="1" rowspan="1">0x40</td>
+ <td colspan="1" rowspan="1">The transaction prepared.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">XA_NEEDLOCK</td>
+ <td colspan="1" rowspan="1">0x80</td>
+ <td colspan="1" rowspan="1">Need to reclaim locks associated with theis log record
+ during XA prepared xact recovery.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">RAWSTORE</td>
+ <td colspan="1" rowspan="1">0x100</td>
+ <td colspan="1" rowspan="1">A log record generated by the raw store.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">FILE_RESOURCE</td>
+ <td colspan="1" rowspan="1">0x400</td>
+ <td colspan="1" rowspan="1">related to "non-transactional" files.</td>
+
+</tr>
+
+</table>
+</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">TransactionId</td>
+ <td colspan="1" rowspan="1">xactId - The Transaction this log belongs to.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">Loggable</td>
+ <td colspan="1" rowspan="1">op - the log operation</td>
+
+</tr>
+
+</table>
+</div>
+
+<a name="N10308"></a><a name="Pointers+to+relevant+classes"></a>
+<h2 class="boxed">Pointers to relevant classes</h2>
+<div class="section">
+<div class="fixme">
+<div class="label">Fixme (DM)</div>
+<div class="content">This section should link to appropriate Javadoc documentation</div>
+</div>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+
+<tr>
+
+<th colspan="1" rowspan="1">Package</th>
+ <th colspan="1" rowspan="1">Class</th>
+ <th colspan="1" rowspan="1">Description</th>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">org.apache.derby.iapi.store.raw.log</td>
+ <td colspan="1" rowspan="1">LogFactory.java</td>
+ <td colspan="1" rowspan="1">The java interface for logging system module.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1">org.apache.derby.impl.store.raw.log</td>
+ <td colspan="1" rowspan="1">LogToFile.java</td>
+ <td colspan="1" rowspan="1">The implmentation of the LogFactory.java, also implementing Module,
+ this is the one with recovery code.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">CheckpointOperation.java</td>
+ <td colspan="1" rowspan="1">A Log Operation that represents a checkpoint.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">FileLogger.java</td>
+ <td colspan="1" rowspan="1">Deals with putting log records to disk. Writes log records to a log file as a stream
+ (ie. log records added to the end of the file, no concept of pages).</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">FlushedScan.java</td>
+ <td colspan="1" rowspan="1">Deals with scanning the log file. Scan the the log which is implemented by a series of log files.
+ This log scan knows how to move across log file if it is positioned at
+ the boundary of a log file and needs to getNextRecord.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">FlushedScanHandle.java</td>
+ <td colspan="1" rowspan="1">More stuff dealing with scanning the log file.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">Scan.java</td>
+ <td colspan="1" rowspan="1">More scan log file stuff. Scan the the log which is implemented by a series of log files.
+ This log scan knows how to move across log file if it is positioned at
+ the boundary of a log file and needs to getNextRecord.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">StreamLogScan.java</td>
+ <td colspan="1" rowspan="1">More scan log file stuff. LogScan provides methods to read a log record and get its LogInstant
+ in an already defined scan.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">LogAccessFile.java</td>
+ <td colspan="1" rowspan="1">Lowest level putting log records to disk. Wraps a RandomAccessFile file to provide buffering
+ on log writes.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">LogAccessFileBuffer.java</td>
+ <td colspan="1" rowspan="1">Utility for LogAccessFile. A single buffer of data.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">LogCounter.java</td>
+ <td colspan="1" rowspan="1">Log sequence number (LSN) implementation </td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">LogRecord.java</td>
+ <td colspan="1" rowspan="1">The log record written out to disk.</td>
+
+</tr>
+
+<tr>
+
+<td colspan="1" rowspan="1"></td>
+ <td colspan="1" rowspan="1">ReadOnly.java</td>
+ <td colspan="1" rowspan="1">an alternate read only implementation of LogFactory</td>
+
+</tr>
+
+</table>
+</div>
+
+</div>
+<!--+
+ |end content
+ +-->
+<div class="clearboth"> </div>
+</div>
+<div id="footer">
+<!--+
+ |start bottomstrip
+ +-->
+<div class="lastmodified">
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+// --></script>
+</div>
+<div class="copyright">
+ Copyright ©
+ 2004-2012 Apache Software Foundation</div>
+<div id="feedback">
+ Send feedback about the website to:
+ <a id="feedbackto" href="mailto:derby-user@db.apache.org?subject=Feedback%C2%A0papers/logformats.html">derby-user@db.apache.org</a>
+</div>
+<!--+
+ |end bottomstrip
+ +-->
+</div>
+</body>
+</html>
Added: websites/production/db/content/derby/papers/optimizer.html
==============================================================================
--- websites/production/db/content/derby/papers/optimizer.html (added)
+++ websites/production/db/content/derby/papers/optimizer.html Wed Dec 19 18:20:21 2012
@@ -0,0 +1,631 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<meta content="Apache Forrest" name="Generator">
+<meta name="Forrest-version" content="0.8">
+<meta name="Forrest-skin-name" content="pelt">
+<title>Derby Optimizer Design</title>
+<link type="text/css" href="../skin/basic.css" rel="stylesheet">
+<link media="screen" type="text/css" href="../skin/screen.css" rel="stylesheet">
+<link media="print" type="text/css" href="../skin/print.css" rel="stylesheet">
+<link type="text/css" href="../skin/profile.css" rel="stylesheet">
+<script src="../skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="../skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="../skin/fontsize.js" language="javascript" type="text/javascript"></script>
+<link rel="shortcut icon" href="../">
+</head>
+<body onload="init()">
+<script type="text/javascript">ndeSetTextSize();</script>
+<div id="top">
+<!--+
+ |breadtrail
+ +-->
+<div class="breadtrail">
+<a href="http://www.apache.org/">apache</a> > <a href="http://db.apache.org/">db</a><script src="../skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
+</div>
+<!--+
+ |header
+ +-->
+<div class="header">
+<!--+
+ |start group logo
+ +-->
+<div class="grouplogo">
+<a href="http://db.apache.org/derby"><img class="logoImage" alt="Apache Derby" src="../images/derby-logo-web.png" title="Derby is a zero-admin Java RDBMS"></a>
+</div>
+<!--+
+ |end group logo
+ +-->
+<!--+
+ |start Project Logo
+ +-->
+<div class="projectlogoA1">
+<a href="http://db.apache.org"><img class="logoImage" alt="Apache DB Project" src="../images/db-logo-white.png" title="Apache DB creates and maintains database solutions."></a>
+</div>
+<!--+
+ |end Project Logo
+ +-->
+<!--+
+ |start Tabs
+ +-->
+<ul id="tabs">
+<li>
+<a class="unselected" href="../index.html">Home</a>
+</li>
+<li>
+<a class="unselected" href="../quick_start.html">Quick Start</a>
+</li>
+<li>
+<a class="unselected" href="../derby_downloads.html">Download</a>
+</li>
+<li>
+<a class="unselected" href="../derby_comm.html">Community</a>
+</li>
+<li>
+<a class="unselected" href="../manuals/index.html">Documentation</a>
+</li>
+<li class="current">
+<a class="selected" href="../blogs/index.html">Resources</a>
+</li>
+</ul>
+<!--+
+ |end Tabs
+ +-->
+</div>
+</div>
+<div id="main">
+<div id="publishedStrip">
+<!--+
+ |start Subtabs
+ +-->
+<div id="level2tabs"></div>
+<!--+
+ |end Endtabs
+ +-->
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+// --></script>
+</div>
+<!--+
+ |breadtrail
+ +-->
+<div class="breadtrail">
+
+
+ </div>
+<!--+
+ |start Menu, mainarea
+ +-->
+<!--+
+ |start Menu
+ +-->
+<div id="menu">
+<div onclick="SwitchMenu('menu_1.1', '../skin/')" id="menu_1.1Title" class="menutitle">Blogs and Articles About Derby</div>
+<div id="menu_1.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#blogs">Blogs</a>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3', '../skin/')" id="menu_1.1.3Title" class="menutitle">Articles</div>
+<div id="menu_1.1.3" class="menuitemgroup">
+<div onclick="SwitchMenu('menu_1.1.3.1', '../skin/')" id="menu_1.1.3.1Title" class="menutitle">Tutorials, Tips and Tuning</div>
+<div id="menu_1.1.3.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#getstarted">Getting Started</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#features">Features, Hints and Tips</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#security">Security</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#performance">Performance and Tuning</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3.2', '../skin/')" id="menu_1.1.3.2Title" class="menutitle">Tools and Migration</div>
+<div id="menu_1.1.3.2" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#tools">Tools</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#migration">Migration</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.1.3.3', '../skin/')" id="menu_1.1.3.3Title" class="menutitle">Applications</div>
+<div id="menu_1.1.3.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../blogs/index.html#client">Client</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#middletier">Middle Tier</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#persistence">Persistence</a>
+</div>
+<div class="menuitem">
+<a href="../blogs/index.html#scalability">Scalability and Failover</a>
+</div>
+</div>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.2', '../skin/')" id="menu_1.2Title" class="menutitle">Integration With Other Products</div>
+<div id="menu_1.2" class="menuitemgroup">
+<div class="menuitem">
+<a href="../integrate/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../integrate/index.html#uses">What works with Derby?</a>
+</div>
+<div class="menuitem">
+<a href="../integrate/index.html#products">Product Writeups</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.3', '../skin/')" id="menu_1.3Title" class="menutitle">Eclipse Plug-ins</div>
+<div id="menu_1.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../integrate/derby_plugin_info.html">Info</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_selected_1.4', '../skin/')" id="menu_selected_1.4Title" class="menutitle" style="background-image: url('../skin/images/chapter_open.gif');">Papers and Presentations</div>
+<div id="menu_selected_1.4" class="selectedmenuitemgroup" style="display: block;">
+<div class="menuitem">
+<a href="../papers/index.html">Overview</a>
+</div>
+<div onclick="SwitchMenu('menu_selected_1.4.2', '../skin/')" id="menu_selected_1.4.2Title" class="menutitle" style="background-image: url('../skin/images/chapter_open.gif');">Derby Engine</div>
+<div id="menu_selected_1.4.2" class="selectedmenuitemgroup" style="display: block;">
+<div onclick="SwitchMenu('menu_1.4.2.1', '../skin/')" id="menu_1.4.2.1Title" class="menutitle">Javadoc</div>
+<div id="menu_1.4.2.1" class="menuitemgroup">
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/engine">Engine</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/language">Language</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/tools">Tools</a>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/publishedapi">API</a>
+</div>
+</div>
+<div class="menuitem">
+<a href="../papers/derby_arch.html">Architecture</a>
+</div>
+<div class="menuitem">
+<a href="../papers/btree_package.html">BTree</a>
+</div>
+<div class="menuitem">
+<a href="../papers/pageformats.html">Disk Page Format</a>
+</div>
+<div class="menuitem">
+<a href="../papers/derby_htw.html">How Things Work</a>
+</div>
+<div class="menuitem">
+<a href="../papers/Intersect-design.html">Intersect & Except</a>
+</div>
+<div class="menuitem">
+<a href="../papers/JDBCImplementation.html">JDBC</a>
+</div>
+<div class="menuitem">
+<a href="../papers/logformats.html">Log Format</a>
+</div>
+<div class="menuitem">
+<a href="../papers/recovery.html">Logging & Recovery</a>
+</div>
+<div class="menupage">
+<div class="menupagetitle">Optimizer</div>
+</div>
+<div class="menuitem">
+<a href="http://db.apache.org/derby/javadoc/engine/org/apache/derby/iapi/types/package-summary.html#package_description">Type System</a>
+</div>
+<div class="menuitem">
+<a href="../papers/versionupgrade.html">Versioning</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.3', '../skin/')" id="menu_1.4.3Title" class="menutitle">Derby Network Client</div>
+<div id="menu_1.4.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/DerbyClientSpec.html">Functional Spec</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.4', '../skin/')" id="menu_1.4.4Title" class="menutitle">Derby Tutorial</div>
+<div id="menu_1.4.4" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/DerbyTut/index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/install_software.html">Step 1: Install Software</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/ij_intro.html">Step 2: ij Basics</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/embedded_intro.html">Step 3: Embedded Derby</a>
+</div>
+<div class="menuitem">
+<a href="../papers/DerbyTut/ns_intro.html">Step 4: Derby Network Server</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4.5', '../skin/')" id="menu_1.4.5Title" class="menutitle">Presentations</div>
+<div id="menu_1.4.5" class="menuitemgroup">
+<div class="menuitem">
+<a href="../papers/ApacheCon.html">ApacheCon</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#Victorian+Java+User+Group">Victorian JUG 2008</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#OSCON+2005">OSCON 2005</a>
+</div>
+<div class="menuitem">
+<a href="../papers/MiscPresentations.html#Colorado+Software+Summit+2004">Colorado 2004</a>
+</div>
+</div>
+</div>
+<!--+
+ |start Search
+ +-->
+<div class="searchbox">
+<hr>
+<form action="http://www.google.com/search" method="get">
+<input value="db.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="18" name="q" id="query" type="text" value="Search the site with google">
+ <input name="Search" value="Search" type="submit">
+</form>
+</div>
+<!--+
+ |end search
+ +-->
+<div id="credit"></div>
+<div id="roundbottom">
+<img style="display: none" class="corner" height="15" width="15" alt="" src="../skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
+<!--+
+ |alternative credits
+ +-->
+<div id="credit2"></div>
+</div>
+<!--+
+ |end Menu
+ +-->
+<!--+
+ |start content
+ +-->
+<div id="content">
+<div class="trail">Font size:
+ <input value="Reset" class="resetfont" title="Reset text" onclick="ndeSetTextSize('reset'); return false;" type="button">
+ <input value="-a" class="smallerfont" title="Shrink text" onclick="ndeSetTextSize('decr'); return false;" type="button">
+ <input value="+a" class="biggerfont" title="Enlarge text" onclick="ndeSetTextSize('incr'); return false;" type="button">
+</div>
+<h1>Derby Optimizer Design</h1>
+<div class="abstract">This document describes the Derby Optimizer. This is a work-in-progress
+ derived from Javadoc comments and from explanations Jeffrey Lichtman and
+ others posted to the Derby lists. Please post questions, comments, and
+ corrections to derby-dev@db.apache.org. </div>
+<div id="minitoc-area">
+<ul class="minitoc">
+<li>
+<a href="#overview">Overview</a>
+</li>
+<li>
+<a href="#Example+of+a+5-way+Join">Example of a 5-way Join</a>
+</li>
+<li>
+<a href="#Potential+Improvements+to+the+Optimizer">Potential Improvements to the Optimizer</a>
+</li>
+</ul>
+</div>
+
+<a name="N10010"></a><a name="overview"></a>
+<h2 class="boxed">Overview</h2>
+<div class="section">
+<div class="note">
+<div class="label">Note</div>
+<div class="content">Jeffrey Lichtman wrote the original implementation of the optimizer.
+ What follows below is a description of the optimizer he posted to the
+ Derby mailing list.
+ </div>
+</div>
+<p> The optimizer currently only considers left-deep trees. That is, when
+ looking at joins, it doesn't consider join nodes to the right of other
+ join nodes - the join trees it considers are skewed to the left. I thought
+ this would be a good first implementation of the optimizer. Bushy trees
+ (where the join tree can take any shape) are harder to implement but
+ are often useful for big multi-way decision-support joins. </p>
+<p> During optimization the join order is represented as an array of Optimizables.
+ The first element in the array is the leftmost table in the join tree,
+ and the successive elements in the array are the joining tables in the
+ left-deep tree. </p>
+<p> The optimizer does an exhaustive search of query plans, going through
+ all the join orders and, for each join order, all of the access paths
+ for that order. It remembers the cheapest complete plan it has found
+ so far, and does cost-based search space pruning. That is, it doesn't
+ consider plans that are more expensive that the best plan found so far.
+ </p>
+<p> The optimizer goes through the search space depth-first. In other
+ words, it adds a table to the join order, then goes through all the
+ access paths for that table (choosing the cheapest one) before going
+ on to the next position in the join order. When it gets all the way
+ to the end of the join order, with all of the tables accounted for,
+ it considers whether the plan it is looking at is the best one found
+ so far. If so, it remembers it. Eventually, it exhausts all the tables
+ to consider at a given depth in the join order, and it has to back up
+ to the previous position in the join order and consider the next table
+ there. </p>
+<p> Every time the optimizer adds a table to the prospective join order,
+ it figures out which parts of the WHERE clause can be evaluated at that
+ position in the order and pushes these expressions down to that position
+ so they can be used there. For example, if you have a five-way join
+ of tables t1, t2, t3, t4 and t5, and the current permutation being considered
+ is t3 - t1 - t2 (with t3 as the outermost table and t2 as the innermost),
+ it can evaluate the join "t3.c1 = t2.c2" at the third position in the
+ join order, so when it adds table t2 it pushes the expression down to
+ that level in the order. Later, when it removes t2 from the join order
+ to consider some other table there, it pulls the expression back out
+ of the join order. </p>
+<p> getNextPermutation() does the adding (and deletion) of tables to the
+ join order under consideration by the optimizer. getNextDecoratedPermutation()
+ goes through all the access paths for the current permutation (in the
+ current implementation, it only has to consider the access paths for
+ the innermost table in the join order, as the search is done depth-first).
+ You can think of a decorated permutation as a join order with things
+ like indexes and join strategies "decorating" the nodes. </p>
+<p> The Optimizable interface represents anything in the query that can
+ have an access path. In practice an Optimizable is usually a base table,
+ but it can be anything that appears in a FROM list (for example, standard
+ SQL allows a UNION in the FROM list). Different Optimizables have different
+ choices for access paths. The optimizer uses the nextAccessPath() method
+ on Optimizable to step through the different access paths. These include
+ different join strategies (such as nested loop and hash join) and different
+ conglomerates (the base table and the different indexes). Sometimes
+ the Optimizable has to decide whether a given access path is feasible
+ (for example, hash join is only feasible for equijoins). </p>
+<p> I'm leaving out a whole lot of details, like how the optimizer does
+ costing for sort avoidance and how it handles join order dependencies
+ (e.g. when an "exists" or "in" subquery is flattened to an existence join,
+ the join order musn't be inverted under the current implementation). </p>
+</div>
+
+<a name="N10032"></a><a name="Example+of+a+5-way+Join"></a>
+<h2 class="boxed">Example of a 5-way Join</h2>
+<div class="section">
+<p> The optimizer looks at so many
+ potential plans in a five-way join that it's not feasible to show all
+ of them in an manually-written explanation. </p>
+<p> Let's take the following query: </p>
+<pre class="code">
+select *
+from t1, t2, t3, t4, t5
+where t1.c1 = t2.c2
+and t1.c3 = t3.c4
+and t3.c5 = t4.c6
+and t4.c7 = t5.c8
+and t1.c9 = 1
+and t3.c10 = 2
+and t5.c11 = 3
+ </pre>
+<p> One possible way to execute this query is to take the tables in the
+ order of the FROM clause. For each row in a table, join it with the
+ matching rows from the next table to form a set of joined row. The plan
+ would look something like this (I hope the formatting doesn't get screwed
+ up): </p>
+<pre class="code">
+ JOIN
+ / \
+ JOIN t5
+ / \
+ JOIN t4
+ / \
+ JOIN t3
+ / \
+t1 t2
+ </pre>
+<p> This is a left-deep tree. That is. it's skewed to the left. Let's
+ assume for the sake of argument that each JOIN node is a nested-loop
+ join. What this means is that each JOIN node gets a row from its left
+ (outer) table and probes into its right (inner) table to find all the
+ matching rows. For all but the leftmost JOIN node, the outer table is
+ also a JOIN. So, at execution, this plan goes all the way down to the
+ left, gets the first qualifying row from t1, then finds a matching row
+ in t2. It then puts the matching rows from t1 and t2 together into a
+ joined row and feeds it up to the JOIN node above it. This JOIN node
+ uses its outer row to probe into t3 to find a matching row. When it
+ finds such a row, it puts together its outer and inner rows into a joined
+ row, which it feeds to the JOIN node above it. It keeps doing this all
+ the way up the plan. When the top JOIN node finds a matching row in
+ t5, it returns that row from the SELECT statement. </p>
+<p> More sophisticated optimizers consider "bushy" trees, which can take
+ shapes other than the left-deep shape shown above. For example, it might
+ consider a plan with the following join tree: </p>
+<pre class="code">
+ JOIN
+ / \
+ JOIN JOIN
+ / \ / \
+t1 t2 t3 JOIN
+ / \
+ t4 t5
+ </pre>
+<p> As you can see, the tables are in the same order but the shape of
+ the join tree is entirely different. As I mentioned in my original mail,
+ bushy trees are harder to implement but they are good for some types
+ of big decision-support queries. </p>
+<p> Because the Derby optimizer only models left-deep trees, it doesn't
+ have to model the shape of the tree. All it has to model is the order
+ of the tables in the tree (since the tree is always the same shape for
+ a given number of tables). It does this the simple way: by using an
+ array representing the assignment of tables to positions in the join
+ order. </p>
+<p> The basic idea of a cost-based optimizer is to come up with an estimated
+ cost for all the possible execution plans for a query and to choose
+ the cheapest plan. The number of possible plans grows with the number
+ of tables, indexes, join strategies, etc. Most optimizers do something
+ to reduce the search space, so that for big queries the best plan (or
+ a reasonable plan) is found in an acceptable length of time. One way
+ the Derby optimizer prunes its search space is by skipping over plans
+ that it knows will be more expensive than the best plan it's found so
+ far. </p>
+<p> The optimizer does this by depth-first searching. That is, rather
+ than coming up with a join order for all the tables in the query and
+ then considering all the access paths for those tables, it adds one
+ table at a time to the join order and figures out the best access path
+ for that table (in its current spot in the join order) before going
+ on to add another table to the join order. While doing this, it keeps
+ track of the cost of the plan its considering so far. If, when it adds
+ a table to the join order, it finds that this makes the current plan
+ under consideration more costly than the best plan found so far, it
+ abandons the consideration of that table in that position of the join
+ order. By doing this, the optimizer can avoid considering many join
+ orders. This is important when there are a lot of tables in the query,
+ because the number of join orders is the factorial of the number of
+ tables. </p>
+<p> For example, let's say that in the sample query given above, the optimizer
+ has already found a complete plan with an estimated cost of 10000. Now
+ suppose it is considering the following partial join order: </p>
+<p> (outer) t4 - t5 (inner) </p>
+<p> Let's say this partial plan has an estimated cost of 2000. Now suppose
+ the optimizer considers placing the table t1 as the next table in the
+ join order: </p>
+<p> (outer) t4 - t5 - t2 (inner) </p>
+<p> Note that the query has no direct join clause between t1 and either
+ t4 or t5. The optimizer would go through all possible access paths for
+ t2 in this context, and would see that with no useful qualification
+ on the table it would have to do a full scan of t2 for every outer row
+ resulting from the join of t4 and t5. If t2 is anything but a very small
+ table, it could be expensive. Let's say the estimated total best cost
+ for t2 in this position in the join order is 50000. That would make
+ the total cost of the query equal to 52000, which is higher than the
+ cost of the best plan found so far (10000). So it doesn't make sense
+ to look at this join order any further. Rather than consider the following
+ join orders: </p>
+<p> (outer) t4 - t5 - t2 - t1 - t3 (inner) </p>
+<p> (outer) t4 - t5 - t2 - t3 - t1 (inner) </p>
+<p> the optimizer abandons consideration of any plan starting with t4
+ - t5 - t2. </p>
+</div>
+
+<a name="N10078"></a><a name="Potential+Improvements+to+the+Optimizer"></a>
+<h2 class="boxed">Potential Improvements to the Optimizer</h2>
+<div class="section">
+<p> It's hard to consider the optimizer by itself. Many optimizer enhancements
+ would work with changes in other areas, especially execution. </p>
+<p> One area that I think needs work is hash joins. The current implementation
+ uses an in-memory hash table. The optimizer avoids using the hash join
+ strategy when it estimates that the hash table would use too much memory.
+ There are a few problems with this: the optimizer doesn't really know
+ how much memory is available, its estimates of memory usage are crude
+ and possibly inaccurate, and it's possible for the query to fail if
+ a hash table gets too big during execution. </p>
+<p> I would like to see hash tables that spill to disk. Ideally, the hash
+ table should be an in-memory structure with a conglomerate as a backing
+ store. I would want the backing store to be used only when necessary
+ - that is, only when the hash table grows too big. The biggest problem
+ with this idea is how to estimate the cost of building and maintaining
+ the table. One approach could be to put a limit on the number of in-memory
+ rows in the hash table and use a statistical formula for the cost of
+ reading a row, using the number of in-memory rows versus the total number
+ of rows to estimate the chances of finding a row in memory. </p>
+<p> Another approach could be to use weak references in managing the hash
+ table (a weak reference is a Java construct that allows the garbage
+ collector to nominate an object for garbage collection even when it
+ has a reference to it). Weak references are useful for memory caches
+ that adjust themselves to the memory available to the JVM. One of our
+ original ideas for Derby (nee Cloudscape) is that it should be a low-maintenance
+ DBMS, with little intervention required to keep a working system running.
+ A self-managing cache could help with this - it would adjust itself
+ to environments with different amounts of memory (although small-memory
+ environments could hurt performance). I don't know how the optimizer
+ would estimate the cost for building and maintaining a hash table in
+ this case. </p>
+<p> I also think merge joins are worth considering, especially if nothing
+ is done about hash joins. Merge joins are useful for many of the same
+ types of queries as hash joins and, since they use the sorter (assuming
+ the joined rows are not already ordered on the join colums) they can
+ work even for large tables (because the sorter spills to disk if the
+ data being sorted won't fit in memory). Merge joins can have a very
+ low cost if the rows are already ordered (which they can be if there
+ are indexes on the join columns). Merge joins can also work well with
+ sort avoidance if the ordering for the merge is the same as for the
+ ORDER BY clause. </p>
+<p> Switching gears, another problem is the cost of user-defined functions.
+ What do you do with a query like this?: </p>
+<pre class="code">
+select *
+from t1, t2, t3
+where t1.c1 = user_defined_function(t2.c2)
+and t2.c3 = t3.c4
+ </pre>
+<p> If the user-defined function is cheap and there's an index on t1.c1,
+ you may want to call the function for every row in t2 and use the result
+ to probe into the index on t1. On the other hand, if the function is
+ expensive, you may want to try to execute it as few times as possible,
+ which could make it unfeasible to use it to probe into t1. Currently
+ Derby has no modeling for the cost of user-defined functions and avoids
+ pushing them down into the query plan (that is, it calculates user-defined
+ functions as late as possible before returning the rows from the query).
+ </p>
+<p> This may seem trivial, but keep in mind that a user-defined function
+ can do anything, from something as simple as returning a constant to
+ as complex as executing a query on another DBMS. It really can be important
+ to know the cost of a user-defined function. </p>
+<p> One possible approach would be to have a way of telling the DBMS to
+ execute the function and remember how long it takes, and then store
+ this in a system catalog for the optimizer to use. Another approach
+ would be to allow the user to register the cost of a function as low,
+ medium or high. </p>
+<p> Switching gears again, another feature I think would be generically
+ useful would be indexes on expressions (instead of just on columns).
+ One potential use for this feature is case-independent searches (which
+ can be done now, but which tend to be slow because the functions involved
+ prevent the optimizer from using an index). The biggest problem here
+ is the representation of index definitions in the SYSCONGLOMERATES system
+ table (which assumes that an index column corresponds to a column in
+ a base table). </p>
+<p> Another area for investigation is the flattening of subqueries to
+ joins. Currently, the language compiler flattens IN and EXISTS subqueries
+ to types of joins. This is good because it gives the optimizer more
+ options in choosing query plans for these types of subqueries, and it
+ also allows the optimizer to get better cost estimates (for complex
+ technical reasons I won't go into here). There are other types of subqueries
+ that could be flattened - for example, a NOT IN or NOT EXISTS subquery
+ can often be flattened to an outer join. </p>
+<p> Another thing that could be done is to allow the optimizer to invert
+ the join order of IN and EXISTS subqueries. As mentioned above, these
+ subqueries are often flattened to joins. The joins are special, in that
+ at execution it looks for only one match in the inner table per row
+ from the outer table. This strategy requires that the table in the IN
+ or EXISTS subquery remain inner to the joining table from outside the
+ subquery. It would be possible to invert the join order if a sort were
+ done on the subquery table to eliminate duplicate joining values. (Actually,
+ I'm oversimplifying here, but I would have to write a treatise otherwise.)
+ </p>
+</div>
+
+</div>
+<!--+
+ |end content
+ +-->
+<div class="clearboth"> </div>
+</div>
+<div id="footer">
+<!--+
+ |start bottomstrip
+ +-->
+<div class="lastmodified">
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+// --></script>
+</div>
+<div class="copyright">
+ Copyright ©
+ 2004-2012 Apache Software Foundation</div>
+<div id="feedback">
+ Send feedback about the website to:
+ <a id="feedbackto" href="mailto:derby-user@db.apache.org?subject=Feedback%C2%A0papers/optimizer.html">derby-user@db.apache.org</a>
+</div>
+<!--+
+ |end bottomstrip
+ +-->
+</div>
+</body>
+</html>
Added: websites/production/db/content/derby/papers/page-format.png
==============================================================================
Binary file - no diff available.
Propchange: websites/production/db/content/derby/papers/page-format.png
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream