Posted to derby-dev@db.apache.org by Jack Klebanoff <kl...@sbcglobal.net> on 2005/03/02 01:13:03 UTC

Re: [PATCH] BackingStoreHashtable

Mike Matrigali wrote:

> Any idea if many more queries will now spill to disk?  I am worried that
> many more will, and now queries will run slower.  Seems like there 
> should be a better connection between the optimizer estimate to use
> in memory hash and what actually is done in memory.  As I understand it
> now the Optimizer costs hash only as in memory; if its estimates say
> it won't fit into memory it chooses a different plan.  I assume the 
> other plan is more efficient than a hash plan which may have to go to 
> disk for every probe (even if pages are cached going to latched pages
> on every probe is a lot more expensive than in memory hash).
>
> My guess is no one did the work to set max_inmemory_rowcnt, because the
> backing store stuff wasn't implemented yet.
>
> I have not read the code yet.  Is the 1% of all memory, or 1% of free 
> memory?

It is 1% of Runtime.getRuntime().totalMemory().

Changing the costing is difficult. The FromBaseTable.estimateCost method 
uses the JoinStrategy.multiplyBaseCostByOuterRows method to compute 
the cost of the join as either the cost of constructing the inner table 
or the number of rows in the outer table times the cost of constructing 
the inner table. HashJoinStrategy.multiplyBaseCostByOuterRows returns 
false, so the cost of a hash join is estimated as the cost of 
constructing the hash table, and independent of the number of rows in 
the outer table. This is a reasonable approximation if the hash table is 
in memory, the outer table is not enormous, and the hash function does a 
good job.
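
To make that concrete, the current estimate amounts to roughly the 
following (an illustrative sketch with made-up names, not the actual 
Derby code):

    // Rough sketch of how the per-strategy flag drives the join cost
    // estimate; innerBaseCost and outerRowCount are illustrative names.
    static double estimateJoinCost(boolean multiplyBaseCostByOuterRows,
                                   double innerBaseCost,
                                   double outerRowCount) {
        if (multiplyBaseCostByOuterRows) {
            // e.g. nested loop: the inner table cost is paid once per outer row
            return outerRowCount * innerBaseCost;
        }
        // e.g. hash join: only the hash table build is charged; every probe
        // is assumed to hit an in-memory table and is treated as free
        return innerBaseCost;
    }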

If some of the hash table is on disk then neither join cost estimate 
method is very good. The cost of a hash join is then approximately 
hashtableBuildCost + outerTableRowCount*hashProbeCost, where 
hashProbeCost is the expected cost of one probe into the hash table. 
This will be a function of the fraction of inner table rows on disk, and 
the number of rows on disk.
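
A spill-aware estimate along those lines might look roughly like this 
(a sketch with illustrative parameters, not a proposed implementation):

    // Rough sketch of the formula above. diskFraction is the expected
    // share of inner rows that spill to disk; all parameters are
    // illustrative, not Derby APIs.
    static double spillAwareHashJoinCost(double hashtableBuildCost,
                                         double outerTableRowCount,
                                         double diskFraction,
                                         double inMemoryProbeCost,
                                         double diskProbeCost) {
        double hashProbeCost = (1.0 - diskFraction) * inMemoryProbeCost
                             + diskFraction * diskProbeCost;
        return hashtableBuildCost + outerTableRowCount * hashProbeCost;
    }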

The JoinStrategy interface has a method, estimateCost, that looks like 
it could be made to do the trick, but for some reason 
FromBaseTable.estimateCost does not use it.

The estimate of the inner table row count is used to estimate the size 
of the hash table. If it is large we do not use a hash join at all. If 
we try to include the disk access cost in our hash join cost estimate we 
have to use the inner table row count estimate to estimate the fraction 
of rows on disk. But if the inner table passes the maximum size test 
this fraction is zero if the row count estimate is correct, or very 
small if we assume some statistical variation. It doesn't make sense to 
include the disk hash table probe cost in the join cost estimate unless 
we also ditch the maximum size restriction on hash joins. This might be 
a good idea, but I would like to fix BackingStoreHashtable so that 
neither hash joins, nor DISTINCT, nor scroll insensitive cursors blow up 
before tackling join costing.

I think that the right thing to do is for HashJoinStrategy to pass the 
correct maximum in-memory row count to BackingStoreHashtable so that the 
two are following the same playbook with respect to spilling. I will 
work on this.

Jack

Re: [PATCH] BackingStoreHashtable

Posted by Jack Klebanoff <kl...@sbcglobal.net>.
Mike Matrigali wrote:

> Thanks for the reply, I'll work with you to get this committed.  I will
> wait on the change you are working on. I think that is the best short
> term solution, as you point out there is more work later on to improve
> the work you have done. I would appreciate it if at least one other 
> person with experience on the language side would take a look at this also.
>
> It has been a while since I looked at jvm memory stuff, but it used to be
> a problem that totalMemory() would return the memory that the jvm
> currently has, not the amount of memory that it is allowed to have.  So
> if you called it after just starting it might return a very small 
> number, say 1 meg, even if the jvm was started and told to grow to a max
> of 100 meg.  Worse was that the behavior was not consistent across 
> JVM/OS combinations.
>
> This memory issue is a real problem as there are a number of things
> that derby could do faster if it knew it could do the whole thing in
> memory, but once you run out of memory it is hard to recover without
> failing the current operation (and quite possibly other derby threads 
> and in a server environment other non derby threads).
>
> At one point sun was proposing some jvm interfaces so one could tell if
> you were getting "close" to running out of memory - so that applications
> could take action before errors happened.  If such a thing existed then
> something like BackingStoreHashTable could grow in memory more 
> aggressively and then if it noticed the impending problem it could spill
> everything to disk, and free up its current usage.
>
I have modified my patch so that the optimizer and BackingStoreHashtable 
use the same decision about when a hash table will spill to disk. The 
optimizer calls the JoinStrategy.maxCapacity method to find the maximum 
number of rows that the JoinStrategy can handle in a given number of 
bytes. It rejects the strategy if the estimated row count is larger. 
(Currently the optimizer limits each join to 1 MB of memory.) The 
HashJoinStrategy.maxCapacity method divides the maximum byte count by 
the sum of the size of one row plus the size of a Hashtable entry. The 
NestedLoopJoinStrategy.maxCapacity method always returns 
Integer.MAX_VALUE. The HashJoinStrategy.getScanArgs method passes the 
maximum capacity to the ResultSetFactory.getHashScanResultSet method, 
so that the actual BackingStoreHashtable will spill to disk when the 
optimizer thought that it would. This means that hash joins will not 
spill to disk unless the inner table has more rows than the optimizer 
estimated.
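
The intent of the maxCapacity contract can be sketched roughly as 
follows (simplified; the actual method signatures in the patch may 
differ):

    // Simplified sketch of the maxCapacity behaviour described above.
    // HashJoinStrategy: how many rows fit in the per-join memory budget.
    static int hashJoinMaxCapacity(int maxMemoryPerTable,
                                   int perRowUsage,
                                   int perEntryHashtableOverhead) {
        return maxMemoryPerTable / (perRowUsage + perEntryHashtableOverhead);
    }

    // NestedLoopJoinStrategy: no in-memory structure, so no row limit.
    static int nestedLoopMaxCapacity() {
        return Integer.MAX_VALUE;
    }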

I also changed the DiskHashtable implementation to pass its 
keepAfterCommit parameter on to the 
TransactionController.openConglomerate method. Previously DiskHashtable 
only used keepAfterCommit to construct the temporaryFlag argument of 
TransactionController.createConglomerate and always passed "false" as 
the hold argument of TransactionController.openConglomerate.
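
In effect the change is just a pass-through of that flag; sketched here 
with a deliberately simplified, hypothetical interface (the real 
TransactionController.openConglomerate takes more arguments than shown):

    // Hypothetical, simplified sketch of the DiskHashtable change;
    // SimpleTC stands in for TransactionController and is not a real
    // Derby type.
    interface SimpleTC {
        Object openConglomerate(long conglomId, boolean hold);
    }

    static Object openBackingConglomerate(SimpleTC tc,
                                          long conglomId,
                                          boolean keepAfterCommit) {
        // previously the hold flag was hard-coded to false here; now the
        // caller's keepAfterCommit value is passed through
        return tc.openConglomerate(conglomId, keepAfterCommit);
    }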

Since I made changes to the optimizer and hash join code generator I 
hope that a Derby language expert can review at least that part of my 
updated patch.

I have not changed the way that BackingStoreHashtable decides when to 
spill when its max_inmemory_rowcnt parameter is negative. (Only hash 
joins pass a non-negative max_inmemory_rowcnt.) As Mike pointed out, 
spilling when the in-memory hash table grows larger than 1% of 
Runtime.totalMemory() is not completely satisfactory. The JVM may be 
able to get more memory, and totalMemory() is likely to be small soon 
after the JVM starts up. However, I do not know of anything that is 
better. If totalMemory() grows, subsequent BackingStoreHashtables will 
be able to use more memory. Since BackingStoreHashtables are temporary, 
this does not seem so bad to me.
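
For clarity, the default policy being discussed amounts to roughly the 
following (an illustrative sketch, not the actual BackingStoreHashtable 
code):

    // Rough sketch of the default spill check used when the caller
    // passes a negative max_inmemory_rowcnt.
    static boolean shouldSpill(long estimatedInMemoryBytes) {
        // spill once the in-memory table exceeds 1% of the JVM's current
        // total memory; note totalMemory() can grow over the JVM's lifetime
        long limit = Runtime.getRuntime().totalMemory() / 100;
        return estimatedInMemoryBytes > limit;
    }
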
Regards

Jack Klebanoff

Re: [PATCH] BackingStoreHashtable

Posted by Mike Matrigali <mi...@sbcglobal.net>.
Thanks for the reply, I'll work with you to get this committed.  I will
wait on the change you are working on. I think that is the best short
term solution, as you point out there is more work later on to improve
the work you have done. I would appreciate it if at least one other 
person with experience on the language side would take a look at this also.

It has been a while since I looked at jvm memory stuff, but it used to be
a problem that totalMemory() would return the memory that the jvm
currently has, not the amount of memory that it is allowed to have.  So
if you called it after just starting it might return a very small 
number, say 1 meg, even if the jvm was started and told to grow to a max
of 100 meg.  Worse was that the behavior was not consistent across 
JVM/OS combinations.

This memory issue is a real problem as there are a number of things
that derby could do faster if it knew it could do the whole thing in
memory, but once you run out of memory it is hard to recover without
failing the current operation (and quite possibly other derby threads 
and in a server environment other non derby threads).

At one point sun was proposing some jvm interfaces so one could tell if
you were getting "close" to running out of memory - so that applications
could take action before errors happened.  If such a thing existed then
something like BackingStoreHashTable could grow in memory more 
aggressively and then if it noticed the impending problem it could spill
everything to disk, and free up its current usage.
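
Roughly, the distinction in question is the following (illustrative 
snippet only; maxMemory() has been available since J2SE 1.4):

    // totalMemory() is only what the JVM has claimed so far; maxMemory()
    // reports the configured ceiling (e.g. -Xmx); freeMemory() is the
    // unused portion of totalMemory().
    static void printJvmMemory() {
        Runtime rt = Runtime.getRuntime();
        long current = rt.totalMemory(); // can be small right after startup
        long ceiling = rt.maxMemory();   // closer to what the JVM may grow to
        long free    = rt.freeMemory();
        System.out.println(current + " / " + ceiling + " (free " + free + ")");
    }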



Jack Klebanoff wrote:
> Mike Matrigali wrote:
> 
>> Any idea if many more queries will now spill to disk?  I am worried that
>> many more will, and now queries will run slower.  Seems like there 
>> should be a better connection between the optimizer estimate to use
>> in memory hash and what actually is done in memory.  As I understand it
>> now the Optimizer costs hash only as in memory; if its estimates say
>> it won't fit into memory it chooses a different plan.  I assume the 
>> other plan is more efficient than a hash plan which may have to go to 
>> disk for every probe (even if pages are cached going to latched pages
>> on every probe is a lot more expensive than in memory hash).
>>
>> My guess is no one did the work to set max_inmemory_rowcnt, because the
>> backing store stuff wasn't implemented yet.
>>
>> I have not read the code yet.  Is the 1% of all memory, or 1% of free 
>> memory?
> 
> 
> It is 1% of Runtime.getRuntime().totalMemory().
> 
> Changing the costing is difficult. The FromBaseTable.estimateCost method 
> uses the JoinStrategy.multiplyBaseCostByOuterRows method to compute 
> the cost of the join as either the cost of constructing the inner table 
> or the number of rows in the outer table times the cost of constructing 
> the inner table. HashJoinStrategy.multiplyBaseCostByOuterRows returns 
> false, so the cost of a hash join is estimated as the cost of 
> constructing the hash table, and independent of the number of rows in 
> the outer table. This is a reasonable approximation if the hash table is 
> in memory, the outer table is not enormous, and the hash function does a 
> good job.
> 
> If some of the hash table is on disk then neither join cost estimate 
> method is very good. The cost of a hash join is then approximately 
> hashtableBuildCost + outerTableRowCount*hashProbeCost, where 
> hashProbeCost is the expected cost of one probe into the hash table. 
> This will be a function of the fraction of inner table rows on disk, and 
> the number of rows on disk.
> 
> The JoinStrategy interface has a method, estimateCost, that looks like 
> it could be made to do the trick, but for some reason 
> FromBaseTable.estimateCost does not use it.
> 
> The estimate of the inner table row count is used to estimate the size 
> of the hash table. If it is large we do not use a hash join at all. If 
> we try to include the disk access cost in our hash join cost estimate we 
> have to use the inner table row count estimate to estimate the fraction 
> of rows on disk. But if the inner table passes the maximum size test 
> this fraction is zero if the row count estimate is correct, or very 
> small if we assume some statistical variation. It doesn't make sense to 
> include the disk hash table probe cost in the join cost estimate unless 
> we also ditch the maximum size restriction on hash joins. This might be 
> a good idea, but I would like to fix BackingStoreHashtable so that 
> neither hash joins, nor DISTINCT, nor scroll insensitive cursors blow up 
> before tackling join costing.
> 
> I think that the right thing to do is for HashJoinStrategy to pass the 
> correct maximum in-memory row count to BackingStoreHashtable so that the 
> two are following the same playbook with respect to spilling. I will 
> work on this.
> 
> Jack
> 
>