You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-commits@db.apache.org by ba...@apache.org on 2006/04/28 01:18:49 UTC

svn commit: r397675 - in /db/derby/code/trunk/java/engine/org/apache/derby: iapi/store/access/ impl/sql/compile/

Author: bandaram
Date: Thu Apr 27 16:18:48 2006
New Revision: 397675

URL: http://svn.apache.org/viewcvs?rev=397675&view=rev
Log:
DERBY-1007: Follow up patch to earlier submitted patch.

In a word, the fix for this issue ensures that, in the case of subqueries, the optimizer will correctly propagate the estimated costs for subqueries up to the parent subquery(-ies), thus allowing the parent query to make a better decision about which join order is ultimately the best. As seen in the example scenario included above, the correct estimates are higher--sometimes much higher--than what the optimizer was returning prior to this change: in the example, the optimizer was returning an incorrect cost estimate of 10783 before the patch, and a correct estimate of 1 million after the patch (where "correct" means that it's the value calculated by the optimizer and thus the value that should be returned; I'm not saying anything about the accuracy of the estimate here).

One side effect of this is that, for very deeply nested queries and/or queries with a high number of FROM tables/expressions, the higher cost estimates can be multiplied--sometimes many times over--throughout the optimization process, which means that the overall query estimate can climb to a much larger number much more quickly. If the query is big enough, this can actually cause the optimizer to reach an estimated cost of INFINITY.

That said, the current optimizer logic for choosing a plan does not expect to see an estimate of infinity for its plans. As a result the optimizer does comparisons of, and arithmetic with, cost estimates and row counts that, when applied to Infinity, give unexpected results.

I have filed DERBY-1259 and DERBY-1260 to address the "infinity problem" in more detail, but am attaching here a follow-up patch that takes some basic steps toward making the optimizer more robust in the face of infinite cost estimates, which are now more likely to occur given the DERBY-1007 changes. In particular, the d1007_followup_v1.patch does the following:

1) Fixes a couple of small problems with the handling of estimates for FromBaseTables, to ensure that a FromBaseTable's estimate is correctly propagated to (and handled by) the ProjectRestrictNode that sits above it. This parallels the original DERBY-1007 work but is a much simpler "follow-up" task as it deals only with base tables instead of subqueries, and thus the changes are fairly minor.

2) There are several places in OptimizerImpl where the optimizer will only choose to accept a plan's cost if the cost is less than the current "bestCost". If no best cost has been found yet, bestCost is set to an uninitialized value of Double.MAX_VALUE with the assumption that the first valid plan will have a cost less than Double.MAX_VALUE and thus will be chosen as the best so far. However, since a plan's cost estimate can actually end up being Double.POSITIVE_INFINITY, which is greater than Double.MAX_VALUE, it's possible that the optimizer will reject a valid join order because its cost is infinity, and then end up completing without ever finding a valid plan--which is wrong. What we want is for the optimizer to accept the first valid plan that it finds, regardless of what the cost is. Then if it later finds a better plan, it can use that. So in several places the d1007_followup_v1.patch adds a check to see if bestCost is uninitialized and, if so, we'll always accept the 
 first valid join order we find, regardless of what its cost is--even if it's infinity--because that's better than no plan at all.

3) Modifies the "compare" method in CostEstimateImpl.java to try to account for comparisons between two plans that both have infinite costs. If this happens, we don't have much choice but to guess as to which plan is actually better. So the changes for followup_v1 make that guess based on a comparison of row counts for the two plans. And if the row counts themselves are infinity, then we'll guess based on the single scan row counts. And finally, if those values are both infinity, as well, then we're out of luck and we just say that the two costs are "equal" for lack of better alternative.

4) And finally, due to unexpected behavior that results from arithmetic using infinity (see DERBY-1259), it is currently possible (though rather rare) for the optimizer to decide to do a hash join that has a cost estimate of Infinity. An example of a query for which this could happen can be found in DERBY-1205, query #1. That said, the BackingStoreHashtable that is used for carrying out a hash join currently creates a Java Hashtable instance with a capacity that matches the optimizer's estimated row count. So if the row count is infinity we'll try to create a Hashtable with some impossibly large capacity and, as a result, we'll end up with an OutOfMemory error. So the d1007_followup_v1.patch adds some code to handle this kind of situation in a more graceful manner.

I ran derbyall with these changes on Linux Red Hat using ibm142 and saw no new failures.

Submitted by Army Brown (qozinx@gmail.com)

Modified:
    db/derby/code/trunk/java/engine/org/apache/derby/iapi/store/access/BackingStoreHashtable.java
    db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/CostEstimateImpl.java
    db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/FromBaseTable.java
    db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/OptimizerImpl.java
    db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/ProjectRestrictNode.java

Modified: db/derby/code/trunk/java/engine/org/apache/derby/iapi/store/access/BackingStoreHashtable.java
URL: http://svn.apache.org/viewcvs/db/derby/code/trunk/java/engine/org/apache/derby/iapi/store/access/BackingStoreHashtable.java?rev=397675&r1=397674&r2=397675&view=diff
==============================================================================
--- db/derby/code/trunk/java/engine/org/apache/derby/iapi/store/access/BackingStoreHashtable.java (original)
+++ db/derby/code/trunk/java/engine/org/apache/derby/iapi/store/access/BackingStoreHashtable.java Thu Apr 27 16:18:48 2006
@@ -224,9 +224,37 @@
         }
         else
         {
+            /* We want to create the hash table based on the estimated row
+             * count if a) we have an estimated row count (i.e. it's greater
+             * than zero) and b) we think we can create a hash table to
+             * hold the estimated row count without running out of memory.
+             * The check for "b" is required because, for deeply nested
+             * queries and/or queries with a high number of tables in
+             * their FROM lists, the optimizer can end up calculating
+             * some very high row count estimates--even up to the point of
+             * Double.POSITIVE_INFINITY (see DERBY-1259 for an explanation
+             * of how that can happen).  In that case any attempts to
+             * create a Hashtable of size estimated_rowcnt can cause
+             * OutOfMemory errors when we try to create the Hashtable.
+             * So as a "red flag" for that kind of situation, we check to
+             * see if the estimated row count is greater than the max
+             * in-memory size for this table.  Unit-wise this comparison
+             * is relatively meaningless: rows vs bytes.  But if our
+             * estimated row count is greater than the max number of
+             * in-memory bytes that we're allowed to consume, then
+             * it's very likely that creating a Hashtable with a capacity
+             * of estimated_rowcnt will lead to memory problems.  So in
+             * that particular case we leave hash_table null here and
+             * initialize it further below, using the estimated in-memory
+             * size of the first row to figure out what a reasonable size
+             * for the Hashtable might be.
+             */
             hash_table = 
-                ((estimated_rowcnt <= 0) ? 
-                     new Hashtable() : new Hashtable((int) estimated_rowcnt));
+                (((estimated_rowcnt <= 0) || (row_source == null)) ?
+                     new Hashtable() :
+                     (estimated_rowcnt < max_inmemory_size) ?
+                         new Hashtable((int) estimated_rowcnt) :
+                         null);
         }
 
         if (row_source != null)
@@ -235,6 +263,22 @@
 
             while ((row = getNextRowFromRowSource()) != null)
             {
+                // If we haven't initialized the hash_table yet then that's
+                // because a Hashtable with capacity estimated_rowcnt would
+                // probably cause memory problems.  So look at the first row
+                // that we found and use that to create the hash table with
+                // an initial capacity such that, if it was completely full,
+                // it would still satisfy the max_inmemory condition.  Note
+                // that this isn't a hard limit--the hash table can grow if
+                // needed.
+                if (hash_table == null)
+                {
+					// Check to see how much memory we think the first row
+                    // is going to take, and then use that to set the initial
+                    // capacity of the Hashtable.
+                    double rowUsage = getEstimatedMemUsage(row);
+                    hash_table = new Hashtable((int)(max_inmemory_size / rowUsage));
+                }
 
                 if (needsToClone)
                 {
@@ -387,13 +431,7 @@
         inmemory_rowcnt++;
         if( max_inmemory_rowcnt <= 0)
         {
-            for( int i = 0; i < row.length; i++)
-            {
-                if( row[i] instanceof DataValueDescriptor)
-                    max_inmemory_size -= ((DataValueDescriptor) row[i]).estimateMemoryUsage();
-                max_inmemory_size -= ClassSize.refSize;
-            }
-            max_inmemory_size -= ClassSize.refSize;
+            max_inmemory_size -= getEstimatedMemUsage(row);
             if( firstDuplicate)
                 max_inmemory_size -= vectorSize;
         }
@@ -464,6 +502,29 @@
         diskHashtable.put( key, row);
         return true;
     } // end of spillToDisk
+
+    /**
+     * Take a row and return an estimate as to how much memory that
+     * row will consume.
+     * 
+     * @param row The row for which we want to know the memory usage.
+     * @return A guess as to how much memory the current row will
+     *  use.
+     */
+    private long getEstimatedMemUsage(Object [] row)
+    {
+        long rowMem = 0;
+        for( int i = 0; i < row.length; i++)
+        {
+            if (row[i] instanceof DataValueDescriptor)
+                rowMem += ((DataValueDescriptor) row[i]).estimateMemoryUsage();
+            rowMem += ClassSize.refSize;
+        }
+
+        rowMem += ClassSize.refSize;
+        return rowMem;
+    }
+
     /**************************************************************************
      * Public Methods of This class:
      **************************************************************************

Modified: db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/CostEstimateImpl.java
URL: http://svn.apache.org/viewcvs/db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/CostEstimateImpl.java?rev=397675&r1=397674&r2=397675&view=diff
==============================================================================
--- db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/CostEstimateImpl.java (original)
+++ db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/CostEstimateImpl.java Thu Apr 27 16:18:48 2006
@@ -113,7 +113,53 @@
 			}
 		}
 
-		return this.cost - ((CostEstimateImpl) other).cost;
+		/* Note: if both CostEstimates are infinity, an attempt to
+		 * substract them will result in NaN, which tells us nothing
+		 * and thus makes it impossible to do a comparison.  So in
+		 * that case we fallback and check the row counts as a secondary
+		 * point of comparison, and the singleScanRowCounts as a
+		 * third comparison.  If all three values are infinity
+		 * for both CostEstimates then we just consider the two
+		 * costs to equal (equally as bad?) and so return 0.0d (instead
+		 * NaN).  RESOLVE: Ideally the optimizer could be updated
+		 * to give more reasonable estimates than infinity, but
+		 * until that happens we're forced to deal with such
+		 * comparisons.  Note that we're most likely to end up with
+		 * infinite cost estimates in situations where we have deeply
+		 * nested subqueries and/or FROM lists with a large number of
+		 * FromTables (such as 10 or more). The reason is that each
+		 * FromTable's cost estimate is (potentially) multiplied by
+		 * the row counts of all preceding FromTables, so if the
+		 * row counts for the preceding FromTables are large, we
+		 * can eventually end up going beyond Double.MAX_VALUE,
+		 * which then gives us infinity.
+		 */
+
+		// If at least one of costs is _not_ infinity, then just do
+		// a normal compare (the other side is less).
+		if ((this.cost != Double.POSITIVE_INFINITY) ||
+			(other.getEstimatedCost() != Double.POSITIVE_INFINITY))
+		{
+			return this.cost - ((CostEstimateImpl) other).cost;
+		}
+
+		// If both costs are infinity, then compare row counts.
+		if ((this.rowCount != Double.POSITIVE_INFINITY) ||
+			(other.rowCount() != Double.POSITIVE_INFINITY))
+		{
+			return this.rowCount - other.rowCount();
+		}
+
+		// If both row counts are infinity, try singleScan counts.
+		if ((this.singleScanRowCount != Double.POSITIVE_INFINITY) ||
+			(other.singleScanRowCount() != Double.POSITIVE_INFINITY))
+		{
+			return this.singleScanRowCount - other.singleScanRowCount();
+		}
+
+		// If we get here, all three parts of both cost estimates are
+		// Infinity; for lack of better choice, just say they're "equal".
+		return 0.0d;
 	}
 
 	/** @see CostEstimate#add */

Modified: db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/FromBaseTable.java
URL: http://svn.apache.org/viewcvs/db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/FromBaseTable.java?rev=397675&r1=397674&r2=397675&view=diff
==============================================================================
--- db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/FromBaseTable.java (original)
+++ db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/FromBaseTable.java Thu Apr 27 16:18:48 2006
@@ -533,7 +533,10 @@
 							predList,
 							outerCost);
 
-		return costEstimate;
+		// The cost that we found from the above call is now stored in the
+		// cost field of this FBT's current access path.  So that's the
+		// cost we want to return here.
+		return getCurrentAccessPath().getCostEstimate();
 	}
 
 	/** @see Optimizable#getTableDescriptor */

Modified: db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/OptimizerImpl.java
URL: http://svn.apache.org/viewcvs/db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/OptimizerImpl.java?rev=397675&r1=397674&r2=397675&view=diff
==============================================================================
--- db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/OptimizerImpl.java (original)
+++ db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/OptimizerImpl.java Thu Apr 27 16:18:48 2006
@@ -1368,8 +1368,24 @@
 				** NOTE: If the user has specified a join order, it will be the
 				** only join order the optimizer considers, so it is OK to use
 				** costing to decide that it is the "best" join order.
+				**
+				** For very deeply nested queries, it's possible that the optimizer
+				** will return an estimated cost of Double.INFINITY, which is
+				** greater than our uninitialized cost of Double.MAX_VALUE and
+				** thus the "compare" check below will return false.   So we have
+				** to check to see if bestCost is uninitialized and, if so, we
+				** save currentCost regardless of what value it is--because we
+				** haven't found anything better yet.
+				**
+				** That said, it's also possible for bestCost to be infinity
+				** AND for current cost to be infinity, as well.  In that case
+				** we can't really tell much by comparing the two, so for lack
+				** of better alternative we look at the row counts.  See
+				** CostEstimateImpl.compare() for more.
 				*/
-				if ((! foundABestPlan) || currentCost.compare(bestCost) < 0)
+				if ((! foundABestPlan) ||
+					(currentCost.compare(bestCost) < 0) ||
+					bestCost.isUninitialized())
 				{
 					rememberBestCost(currentCost, Optimizer.NORMAL_PLAN);
 
@@ -1414,7 +1430,8 @@
 							trace(CURRENT_PLAN_IS_SA_PLAN, 0, 0, 0.0, null);
 						}
 
-						if (currentSortAvoidanceCost.compare(bestCost) <= 0)
+						if ((currentSortAvoidanceCost.compare(bestCost) <= 0)
+							|| bestCost.isUninitialized())
 						{
 							rememberBestCost(currentSortAvoidanceCost,
 											Optimizer.SORT_AVOIDANCE_PLAN);
@@ -1776,6 +1793,15 @@
 														outerCost,
 														optimizable);
 
+		// Before considering the cost, make sure we set the optimizable's
+		// "current" cost to be the one that we found.  Doing this allows
+		// us to compare "current" with "best" later on to find out if
+		// the "current" plan is also the "best" one this round--if it's
+		// not then we'll have to revert back to whatever the best plan is.
+		// That check is performed in getNextDecoratedPermutation() of
+		// this class.
+		optimizable.getCurrentAccessPath().setCostEstimate(estimatedCost);
+
 		/*
 		** Skip this access path if it takes too much memory.
 		**
@@ -1783,6 +1809,9 @@
 		** a single scan is the total number of rows divided by the number
 		** of outer rows.  The optimizable may over-ride this assumption.
 		*/
+		// RESOLVE: The following call to memoryUsageOK does not behave
+		// correctly if outerCost.rowCount() is POSITIVE_INFINITY; see
+		// DERBY-1259.
 		if( ! optimizable.memoryUsageOK( estimatedCost.rowCount() / outerCost.rowCount(), maxMemoryPerTable))
 		{
 			if (optimizerTrace)
@@ -1797,6 +1826,7 @@
 		CostEstimate bestCostEstimate = ap.getCostEstimate();
 
 		if ((bestCostEstimate == null) ||
+			bestCostEstimate.isUninitialized() ||
 			(estimatedCost.compare(bestCostEstimate) < 0))
 		{
 			ap.setConglomerateDescriptor(cd);
@@ -1844,6 +1874,7 @@
 
 					/* Is this the cheapest sort-avoidance path? */
 					if ((bestCostEstimate == null) ||
+						bestCostEstimate.isUninitialized() ||
 						(estimatedCost.compare(bestCostEstimate) < 0))
 					{
 						ap.setConglomerateDescriptor(cd);
@@ -1912,6 +1943,10 @@
 		** a single scan is the total number of rows divided by the number
 		** of outer rows.  The optimizable may over-ride this assumption.
 		*/
+
+        // RESOLVE: The following call to memoryUsageOK does not behave
+        // correctly if outerCost.rowCount() is POSITIVE_INFINITY; see
+        // DERBY-1259.
         if( ! optimizable.memoryUsageOK( estimatedCost.rowCount() / outerCost.rowCount(),
                                          maxMemoryPerTable))
 		{
@@ -1935,6 +1970,7 @@
 		CostEstimate bestCostEstimate = ap.getCostEstimate();
 
 		if ((bestCostEstimate == null) ||
+			bestCostEstimate.isUninitialized() ||
 			(estimatedCost.compare(bestCostEstimate) <= 0))
 		{
 			ap.setCostEstimate(estimatedCost);
@@ -1969,6 +2005,7 @@
 
 					/* Is this the cheapest sort-avoidance path? */
 					if ((bestCostEstimate == null) ||
+						bestCostEstimate.isUninitialized() ||
 						(estimatedCost.compare(bestCostEstimate) < 0))
 					{
 						ap.setCostEstimate(estimatedCost);

Modified: db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/ProjectRestrictNode.java
URL: http://svn.apache.org/viewcvs/db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/ProjectRestrictNode.java?rev=397675&r1=397674&r2=397675&view=diff
==============================================================================
--- db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/ProjectRestrictNode.java (original)
+++ db/derby/code/trunk/java/engine/org/apache/derby/impl/sql/compile/ProjectRestrictNode.java Thu Apr 27 16:18:48 2006
@@ -324,7 +324,16 @@
 							childCost.rowCount(),
 							childCost.singleScanRowCount());
 
-			optimizer.considerCost(this, restrictionList, getCostEstimate(), outerCost);
+
+			// Note: we don't call "optimizer.considerCost()" here because
+			// a) the child will make that call as part of its own
+			// "optimizeIt()" work above, and b) the child might have
+			// different criteria for "considering" (i.e. rejecting or
+			// accepting) a plan's cost than this ProjectRestrictNode does--
+			// and we don't want to override the child's decision.  So as
+			// with most operations in this class, if the child is an
+			// Optimizable, we just let it do its own work and make its
+			// own decisions.
 		}
 		else if ( ! accessPathModified)
 		{