You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-dev@db.apache.org by Army <qo...@sbcglobal.net> on 2006/03/28 00:00:52 UTC

Multi-column index usage with Hash joins?

In a thread on derby-user a long while back there was a brief discussion of 
multi-column indexes and when Derby can use them.  One quote that came from that 
thread was the following:

http://article.gmane.org/gmane.comp.apache.db.derby.user/3175

<begin quote>

Derby can only use the index if all the leading columns in a multi-column index 
are part of the where clause.  Note that it can use it if only 2 leading of 3 
columns are available.  Often such an index is created so that it is a 
"covering" index so that the results of a select can be returned just from the 
index without going to the base table.

<end quote>

So given that Derby can only use an index if all the _leading_ columns in the 
index are part of the where clause, I'm wondering if that same restriction 
should be true when it comes to finding hash key columns for a hash join?

More specifically, there is code in Derby to look at a list of predicates to 
determine which columns of an index are referenced by those predicates, and 
every such column is then treated as a "hash key column".  The presence of at 
least one hash key column tells Derby that it is possible to do a hash join 
using the index--i.e. that the hash join is "feasible"--and thus the optimizer 
will figure out the cost of the hash join and factor that into its choice of 
"best access path".

The code to find the key columns is in HashJoinStrategy.findHashKeyColumns():

...

// Build a Vector of all the hash key columns
int colCtr;
Vector hashKeyVector = new Vector();
for (colCtr = 0; colCtr < columns.length; colCtr++)
{
	// Is there an equijoin condition on this column?
	if (predList.hasOptimizableEquijoin(innerTable, columns[colCtr]))
	{
		hashKeyVector.addElement(new Integer(colCtr));
	}
}

...

Just before this block of code the "columns" array is initialized according to 
the conglomerate descriptor for innerTable--so if the descriptor is an index, 
"columns" will hold all the columns that make up the index.

Notice that this code does _not_ take into consideration the notion of "leading" 
columns for an index.  If, for example, the index columns are [2, 1, 3] and the 
predicate list has a single predicate that references column 3, the "if" 
statement above will return false for "2" and "1", and then it will return true 
for "3".  Then, since we found at least one hash key column, Derby determines 
that a hash join is feasible and so will go on to cost the hash join using the 
index.

However, since column "3" is _not_ a leading column for the index, this doesn't 
seem correct to me--or at the very least, it seems sub-optimal.  Since we 
haven't matched the leading column of the index, is it really useful to use the 
index?  When costing the hash join the optimizer will take into account whether 
or not the index is a covering index, but it does not, so far as I can see, 
account for the fact that the hash key columns might *not* be leading columns. 
On the contrary, it looks like the costing code assumes that if the index is 
going to be used, we have matched a leading set of columns.  (see 
FromBaseTable.estimateCost()).

The reason I bring this up is because I have a rather large query that has 
several levels of nested subqueries and a whole slew of predicates on the tables 
and subqueries in question.  When run against Derby as it is, the query takes 
hours (yes, HOURS) to execute--I killed it after 2 hours and it had only 
returned a fraction of the total rows in the result set.  But if I add logic to 
the "findHashkeyColumns" method above so that it only counts _leading_ columns 
as "hash key columns", the same query executes in under 3 minutes.  That's still 
a tad long, but it's far better than hours...

The change I made was to add the following "else" block to the code snippet above:

	// Is there an equijoin condition on this column?
	if (predList.hasOptimizableEquijoin(innerTable, columns[colCtr]))
	{
		hashKeyVector.addElement(new Integer(colCtr));
	}
+	else if ((cd != null) && cd.isIndex() &&
+		!innerTable.isCoveringIndex(cd)) {
+	// if we're trying to do a hash join with an index conglomerate,
+	// then we either need to match a _leading_ set of columns from
+	// the conglomerate, or else the index has to be covering; otherwise
+	// we can't use the index to evaluate all of the predicates,
+	// which means we end up going back to the table, which is
+	// not optimal.  So here, if the index is not covering, we
+	// only consider the leading columns as hash keys.
+		break;
+	}

All of that said, can anyone out there provided feedback/comments on the 
following two questions:

1) Does it really make sense to use an index for a hash join when the hash key 
columns are _not_ leading columns for the index?

2) Does the change included above make sense?  I've run derbyall with this 
change applied and didn't see any unexpected failures, and I've convinced myself 
that this is the correct thing to do.  But of course, convincing myself is the 
easy part--if anyone else out there can offer input one way or the other, I'd 
appreciate it...

In the event that I hear no protests, I'll create a Jira entry and post this 
small change as an "enhancement" for 10.2...

Thanks much,
Army

Re: Multi-column index usage with Hash joins?

Posted by A B <qo...@gmail.com>.

Mike Matrigali wrote:
> This is a high level answer, with no specific knowledge about what the
> optimizer does actually.

Thanks, as always, for taking the time to reply, Mike.  I'll have to read over 
your comments and ponder this some more before moving forward...

Thanks again,
Army

Re: Multi-column index usage with Hash joins?

Posted by Mike Matrigali <mi...@sbcglobal.net>.

This is a high level answer, with no specific knowledge about what the
optimizer does actually.

I would say that it makes sense to cost a hash join with any set of
columns in an index as long at the cost appropriately accounts for how
many rows in the index it will have to look at to get those rows.  So
I would guess that the cost for using non leading columns would mostly
be the cost of reading ALL rows in the index.  This may be an 
appropriate plan just as it may be appropriate to scan an entire index
if all you are looking for is where 3rd key column = ? -- as the index
may be way smaller than the base table.

other comments below.

Army wrote:
> In a thread on derby-user a long while back there was a brief discussion 
> of multi-column indexes and when Derby can use them.  One quote that 
> came from that thread was the following:
> 
> http://article.gmane.org/gmane.comp.apache.db.derby.user/3175
> 
> <begin quote>
> 
> Derby can only use the index if all the leading columns in a 
> multi-column index are part of the where clause.  Note that it can use 
> it if only 2 leading of 3 columns are available.  Often such an index is 
> created so that it is a "covering" index so that the results of a select 
> can be returned just from the index without going to the base table.
> 
> <end quote>
> 
> So given that Derby can only use an index if all the _leading_ columns 
> in the index are part of the where clause, I'm wondering if that same 
> restriction should be true when it comes to finding hash key columns for 
> a hash join?
> 
> More specifically, there is code in Derby to look at a list of 
> predicates to determine which columns of an index are referenced by 
> those predicates, and every such column is then treated as a "hash key 
> column".  The presence of at least one hash key column tells Derby that 
> it is possible to do a hash join using the index--i.e. that the hash 
> join is "feasible"--and thus the optimizer will figure out the cost of 
> the hash join and factor that into its choice of "best access path".
> 
> The code to find the key columns is in 
> HashJoinStrategy.findHashKeyColumns():
> 
> ...
> 
> // Build a Vector of all the hash key columns
> int colCtr;
> Vector hashKeyVector = new Vector();
> for (colCtr = 0; colCtr < columns.length; colCtr++)
> {
>     // Is there an equijoin condition on this column?
>     if (predList.hasOptimizableEquijoin(innerTable, columns[colCtr]))
>     {
>         hashKeyVector.addElement(new Integer(colCtr));
>     }
> }
> 
> ...
> 
> Just before this block of code the "columns" array is initialized 
> according to the conglomerate descriptor for innerTable--so if the 
> descriptor is an index, "columns" will hold all the columns that make up 
> the index.
> 
> Notice that this code does _not_ take into consideration the notion of 
> "leading" columns for an index.  If, for example, the index columns are 
> [2, 1, 3] and the predicate list has a single predicate that references 
> column 3, the "if" statement above will return false for "2" and "1", 
> and then it will return true for "3".  Then, since we found at least one 
> hash key column, Derby determines that a hash join is feasible and so 
> will go on to cost the hash join using the index.
> 
> However, since column "3" is _not_ a leading column for the index, this 
> doesn't seem correct to me--or at the very least, it seems sub-optimal.  
> Since we haven't matched the leading column of the index, is it really 
> useful to use the index?  When costing the hash join the optimizer will 
> take into account whether or not the index is a covering index, but it 
> does not, so far as I can see, account for the fact that the hash key 
> columns might *not* be leading columns. On the contrary, it looks like 
> the costing code assumes that if the index is going to be used, we have 
> matched a leading set of columns.  (see FromBaseTable.estimateCost()).
> 
> The reason I bring this up is because I have a rather large query that 
> has several levels of nested subqueries and a whole slew of predicates 
> on the tables and subqueries in question.  When run against Derby as it 
> is, the query takes hours (yes, HOURS) to execute--I killed it after 2 
> hours and it had only returned a fraction of the total rows in the 
> result set.  But if I add logic to the "findHashkeyColumns" method above 
> so that it only counts _leading_ columns as "hash key columns", the same 
> query executes in under 3 minutes.  That's still a tad long, but it's 
> far better than hours...
> 
> The change I made was to add the following "else" block to the code 
> snippet above:
> 
>     // Is there an equijoin condition on this column?
>     if (predList.hasOptimizableEquijoin(innerTable, columns[colCtr]))
>     {
>         hashKeyVector.addElement(new Integer(colCtr));
>     }
> +    else if ((cd != null) && cd.isIndex() &&
> +        !innerTable.isCoveringIndex(cd)) {
> +    // if we're trying to do a hash join with an index conglomerate,
> +    // then we either need to match a _leading_ set of columns from
> +    // the conglomerate, or else the index has to be covering; otherwise
> +    // we can't use the index to evaluate all of the predicates,
> +    // which means we end up going back to the table, which is
> +    // not optimal.  So here, if the index is not covering, we
> +    // only consider the leading columns as hash keys.
> +        break;
> +    }
> 
> All of that said, can anyone out there provided feedback/comments on the 
> following two questions:
> 
> 1) Does it really make sense to use an index for a hash join when the 
> hash key columns are _not_ leading columns for the index?

I think yes, just not as useful.  If you have leading columns then
one can set specific start and stop ranges for a scan on the btree. 
Without a leading column you have to scan all rows and apply qualifiers,
which still may be better than scanning all rows in base table.  Depends
on number of rows you expect to qualify and relatively how many trips
to base table per scanned row you will have to take.

> 
> 2) Does the change included above make sense?  I've run derbyall with 
> this change applied and didn't see any unexpected failures, and I've 
> convinced myself that this is the correct thing to do.  But of course, 
> convincing myself is the easy part--if anyone else out there can offer 
> input one way or the other, I'd appreciate it...

again i am not sure about the whole process, but it does not seem like
you need to a covering index as one may be able to eliminate multiple
index rows without going to the base table so it may be useful to scan
an index, even if it is not covering.  Also I am not sure what
isCoveringIndex() does - the comments you included above was using
the definition of a covering index as one which has all the columns
in the select list, where for choosing the usefulness of an index it may
only need the columns in the where clause.

> 
> In the event that I hear no protests, I'll create a Jira entry and post 
> this small change as an "enhancement" for 10.2...
> 
> Thanks much,
> Army
> 
> 
>