Posted to derby-dev@db.apache.org by Jack Klebanoff <kl...@Mutagen.Net> on 2004/12/10 04:46:02 UTC

BackingStoreHashtable

Derby uses class BackingStoreHashtable to implement a hash table used in 
hash joins. Despite the class's name it does not spill to disk; the hash 
table is always in memory.
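
As a rough illustration of the mechanism (made-up example code, not
Derby's implementation), a hash join builds a hash table from the smaller
input and then probes it with each row of the larger input:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of a hash join, not Derby code: load the smaller
    // input into an in-memory hash table keyed on the join column, then
    // probe that table with each row of the larger input.
    public class HashJoinSketch {
        public static void main(String[] args) {
            String[][] buildSide = { {"1", "apple"}, {"2", "pear"} };  // smaller input
            String[][] probeSide = { {"2", "red"}, {"3", "green"} };   // larger input

            // Build phase: hash the smaller input on its join key (column 0).
            Map<String, List<String[]>> table = new HashMap<>();
            for (String[] row : buildSide) {
                table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
            }

            // Probe phase: for each row of the larger input, look up matches.
            for (String[] row : probeSide) {
                List<String[]> matches = table.get(row[0]);
                if (matches != null) {
                    for (String[] match : matches) {
                        System.out.println(match[0] + " " + match[1] + " " + row[1]);
                        // prints: 2 pear red
                    }
                }
            }
        }
    }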

The optimizer tries to avoid hash joins when the input is large because 
BackingStoreHashtable would use too much memory. The optimizer just 
works with row count estimates, so there is no guarantee that 
BackingStoreHashtable will not blow up. No customer has complained about 
this yet.

I would like to work on this, changing BackingStoreHashtable to spill to 
disk when the hash table gets large, and changing the optimizer to 
understand that large hash tables are allowed, but more costly.

The advantages of doing this are that
1. hash joins will not blow up when the compiler's row count estimate 
was low,
2. the optimizer can choose hash joins on large inputs when that is the 
least costly join implementation, and
3. we can use BackingStoreHashtable to implement features such as 
INTERSECT and GROUP BY, though I am not proposing to do so now.

I am not proposing to implement a hash table that can be used as an 
alternative to Btrees for a secondary index. BackingStoreHashtable is 
used for transient data structures that are only accessed by a single 
thread. A secondary index implementation must deal with locking and must 
implement hashCode methods that are JVM independent. This is much more 
work and would yield a slower implementation.

I propose that BackingStoreHashtable should start off using an in-memory 
hash table even if the estimated row count is large. That way it will 
use an in-memory hash table when the actual row count is small enough 
for the table to fit in memory. When it finds that spilling to disk is 
necessary BackingStoreHashtable will use the estimated row count to 
determine the initial number of buckets and move the in-memory entries 
to disk. The disk based hash table will use a linear hashing algorithm, 
see "External Memory Algorithms and Data Structures: Dealing withMassive 
Data", Jeffrey Scott Vitter, ACM Computing Surveys, Vol. 33, No. 2, June 
2001, pp. 209–271. It grows the hash table one bucket at a time when the 
average number of entries per bucket grows beyond a threshold. The 
disadvantage of growing by a larger number of buckets is that the last 
expansion may be unnecessarily large, wasting time and space.
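
The sketch below shows that growth rule in isolation (hypothetical code,
not the proposed BackingStoreHashtable change; the load-factor threshold
and the use of plain int keys are assumptions made for brevity):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a linear hash table that grows one bucket at a time when the
    // average number of entries per bucket exceeds a threshold.  Hypothetical
    // illustration only; the threshold value is an assumption.
    public class LinearHashSketch {
        private static final double LOAD_THRESHOLD = 4.0;
        private final List<List<Integer>> buckets = new ArrayList<>();
        private int roundSize = 1; // bucket count at the start of this round
        private int next = 0;      // next bucket to be split
        private int size = 0;

        public LinearHashSketch() {
            buckets.add(new ArrayList<>());
        }

        // The low-order bits of the hash code pick the bucket; if that bucket
        // has not been created yet, drop back to one fewer bit.
        private int bucketFor(int hash) {
            int b = hash & (2 * roundSize - 1);
            if (b >= buckets.size()) {
                b = hash & (roundSize - 1);
            }
            return b;
        }

        public void insert(int key) {
            buckets.get(bucketFor(key)).add(key);
            size++;
            if ((double) size / buckets.size() > LOAD_THRESHOLD) {
                splitOneBucket();
            }
        }

        // Grow by exactly one bucket: redistribute the entries of bucket
        // 'next' between 'next' and the new bucket, using one more hash bit.
        private void splitOneBucket() {
            buckets.add(new ArrayList<>());
            List<Integer> old = new ArrayList<>(buckets.get(next));
            buckets.get(next).clear();
            for (int key : old) {
                buckets.get(key & (2 * roundSize - 1)).add(key);
            }
            next++;
            if (next == roundSize) { // every bucket in this round has split
                roundSize *= 2;
                next = 0;
            }
        }
    }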

I would appreciate it if anyone can point me to a better external hash 
table algorithm.

The disk hash table will use overflow pages because an imperfect hash 
function will cause some buckets to get more than their share of 
entries, and because we may have duplicate keys.

I have not yet looked at how the optimizer handles hash joins, so I do 
not yet have a proposal for how to change that.

Can anyone explain how to generate a cost for the Derby optimizer? How 
do you compute it from an estimated number of disk accesses? Does an 
estimate of 2 disk accesses mean a cost of 2.0? Should the cost estimate 
include CPU time? If so, how do you estimate it relative to disk access 
cost? Putting it another way, what does a cost estimate of 1.0 (or x) mean?

Comments, criticisms, questions?

Jack Klebanoff

Re: BackingStoreHashtable

Posted by Daniel John Debrunner <dj...@debrunners.com>.

Mike Matrigali wrote:

> And some more history: originally, when the overflow portion of
> BackingStoreHashTable was considered more of an error path, a suggested
> implementation was to just use the existing Derby temporary btree table
> implementation to implement the overflow portion of the class.  The
> key would be the hash key, and the data would be the rest of the row of
> the result set.
>
> The benefits are that it would be easy to implement, be a lot less code
> to maintain, and reuse existing technology.  It already has full support
> for bounded access to duplicates.
>
> Access performance would likely be worse than a hash table assuming
> single I/O access to a given key.  Searches would have to traverse the
> in-memory cached index nodes.

I think the potential performance of each should be carefully considered
before a new on-disk structure is created. It seems like the btree would
work well, but it's not clear to me what Jack is proposing yet.

Jack, what happens when the in-memory hash table overflows to disk?
Is it a complete switch to the disk version, or is the in-memory table
in front of the disk version with a sub-set of the data?

Dan.


Re: BackingStoreHashtable

Posted by Mike Matrigali <mi...@sbcglobal.net>.

And some more history: originally, when the overflow portion of
BackingStoreHashTable was considered more of an error path, a suggested
implementation was to just use the existing Derby temporary btree table
implementation to implement the overflow portion of the class.  The
key would be the hash key, and the data would be the rest of the row of
the result set.

The benefits are that it would be easy to implement, be a lot less code
to maintain, and reuse existing technology.  It already has full support
for bounded access to duplicates.

Access performance would likely be worse than a hash table assuming
single I/O access to a given key.  Searches would have to traverse the
in-memory cached index nodes.

Does anyone know if other DBMSs provide a hash index for SQL indexes?
I know CA-Ingres did when I looked at it many years ago - but it was
static hashing, so the table had to be modified if too much overflow
happened.


Re: BackingStoreHashtable

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Some notes on costing information:

Costing information was developed when most Cloudscape databases were
relatively small and query plan options were not very extensive.  The
cost numbers currently are just relative numbers and don't represent
any real thing like elapsed time or CPU time - they are just meant to be
compared against each other.  The actual numbers are from running a set
of internal performance tests on a single machine a number of years ago,
comparing the relative performance of the operations for which the store
returns cost information back to its clients through the
SortCostController, StoreCostController, and StoreCostResult interfaces
in the opensource/java/engine/org/apache/derby/iapi/store/access directory.

The actual tests have not been open sourced yet as they were implemented
as part of the "unit test" structure.  These are the next set of tests
being open sourced, and Myrna is working as we speak on getting these
through legal so they can be posted.  The tests basically just create a
heap/btree table and measure probes and scans.  They likely underreport
I/O cost, which was not such a problem in the past with small databases,
and they seemed to give correct relative weights such that good plans
were picked.

Your proposal would make it important to cost I/O's more realistically,
as the optimizer now would be choosing between plans which might
generate drastically more I/O.  So it would seem it is time to update
the costing information.  This was due anyway as the numbers have not
been run for recent releases and performance dynamics may have changed.

I have worked on systems which instead just assumed an elapsed time for
an I/O and then added that to an estimated CPU cost for the operation.
Another idea that was discussed at Cloudscape was to somehow come up
with estimates that matched the machine you were running on, such that
the estimates were actually attempting to be real time numbers on
the machine that the server was running on at that time.  We never came
up with the best way to do this.  If the units of the costing are
changed, we should come up with some way such that, in the future, if
Derby supports some sort of external data source, that data source can
provide the optimizer with its own cost estimates.
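
To make the first of those alternatives concrete, here is a tiny sketch
of costing an operation as an assumed per-I/O elapsed time plus an
estimated CPU charge.  The constants are invented for illustration; this
is not how the current cost numbers were produced.

    // Hypothetical cost model: cost is an estimated elapsed time in
    // milliseconds, an assumed per-I/O time plus a per-row CPU charge.
    // The constants below are illustrative assumptions, not measurements.
    public class SimpleCostModel {
        private static final double MILLIS_PER_IO = 10.0;        // assumed random I/O
        private static final double MILLIS_PER_ROW_CPU = 0.005;  // assumed CPU per row

        public static double estimateCost(double estimatedIOs, double estimatedRows) {
            return estimatedIOs * MILLIS_PER_IO + estimatedRows * MILLIS_PER_ROW_CPU;
        }

        public static void main(String[] args) {
            // e.g. a probe expected to touch 2 pages and examine 40 rows
            System.out.println(estimateCost(2, 40)); // prints 20.2
        }
    }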

Re: BackingStoreHashtable

Posted by Jack Klebanoff <kl...@sbcglobal.net>.
My mail server dropped RPost's reply so I had to dig it out of the 
archive. I apologize for the delay and improper threading.

RPost wrote:

>Does either 'spill to disk' approach have any other possible future use?
>Perhaps for implementing any other features or functionality on anyone's
>'nice to have' list? Or would either the btree or hash implementation be useful
>for only this one purpose?
>
BackingStoreHashtable could be used to implement other operations such 
as INTERSECT and EXCEPT, and to remove duplicates. I originally 
noticed that BackingStoreHashtable did not spill to disk when 
researching the implementation of INTERSECT.
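
As a sketch of that kind of use (illustrative code only, not a proposed
Derby implementation), an INTERSECT with duplicate elimination can be
computed by hashing one input and probing it with the other:

    import java.util.HashSet;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical illustration of INTERSECT DISTINCT with a hash table:
    // hash one input, probe with the other, and emit each match once.
    public class IntersectSketch {
        public static Set<String> intersect(List<String> left, List<String> right) {
            Set<String> table = new HashSet<>(left);    // build side
            Set<String> result = new LinkedHashSet<>();
            for (String row : right) {                  // probe side
                if (table.contains(row)) {
                    result.add(row);                    // duplicates collapse here
                }
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(intersect(
                    List.of("a", "b", "c"),
                    List.of("b", "c", "c", "d")));      // prints [b, c]
        }
    }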

A dynamic disk hash table could be used for indexes. However, 
BackingStoreHashtables are transient, single-threaded data structures. A 
hash index must deal with multi-threading issues, while 
BackingStoreHashtable would not want to spend time on locking. 
Furthermore, BackingStoreHashtables only grow until they are discarded 
entirely; a hash index must also deal with shrinking. So, with respect to 
BackingStoreHashtable and hash indexes, I would say close, but no cigar.

Jack



Re: BackingStoreHashtable

Posted by RPost <rp...@pacbell.net>.
Not disagreeing with your conclusion per se, but one question.

Does either 'spill to disk' approach have any other possible future use?
Perhaps for implementing any other features or functionality on anyone's
'nice to have' list? Or would either the btree or hash implementation be useful
for only this one purpose?




Re: BackingStoreHashtable

Posted by Jack Klebanoff <kl...@Mutagen.Net>.
The optimizer decides when to implement a join using hash joins. If it 
estimates that one side of the join is small enough to fit in memory it 
may choose a hash join.

The nature of estimates is that sometimes they are wrong. This is true 
for the optimizer's size estimates. Therefore, sometimes the hash table 
created for a hash join does not fit in memory. Currently the result is 
that the JVM terminates with an OutOfMemoryError, or it thrashes.  See 
http://nagoya.apache.org/jira/browse/DERBY-106.

I think that Derby should handle underestimates more gracefully.

When a hash join is executed a BackingStoreHashtable object is 
constructed to implement the join. One of its constructor parameters is 
"max_inmemory_rowcnt", the maximum number of rows to insert into the 
in-memory hash table before overflowing to disk. Despite this 
constructor parameter (and the class name) the current implementation of 
BackingStoreHashtable never spills to disk. Hence Jira 106.

I propose that we change BackingStoreHashtable to spill to disk.

The goals of the change are as follows:

1. Allow hash joins to work on any input size, up to the amount of 
available disk.

2. The execution time for hash joins should be reasonable even if the 
size estimates prove to be low.

3. Have at most a moderate impact on Derby's size.

4. Use a low amount of development effort.


The basic idea is as follows. When the number of rows in a 
BackingStoreHashtable exceeds max_inmemory_rowcnt a disk based 
associative data structure will be created and all new rows will be 
inserted into the disk structure. Existing rows will be left in memory. 
When BackingStoreHashtable is asked to find rows with a given key it 
will first look in the in memory Hashtable, then in the disk structure. 
If duplicate keys are allowed in the BackingStoreHashtable then it must 
look in both places.
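
A rough sketch of that lookup rule follows; the class and interface names
are hypothetical stand-ins, not the real BackingStoreHashtable API:

    import java.util.ArrayList;
    import java.util.Hashtable;
    import java.util.List;
    import java.util.Vector;

    // Hypothetical sketch of the proposed lookup: check the in-memory table
    // first, then the disk structure, and merge the results when duplicate
    // keys are allowed.  'DiskStructure' stands in for the btree or hash
    // overflow structure described below.
    public class SpillingLookupSketch<K, R> {
        public interface DiskStructure<K, R> {
            List<R> fetch(K key);   // rows on disk with this key (may be empty)
        }

        private final Hashtable<K, Vector<R>> inMemory = new Hashtable<>();
        private final DiskStructure<K, R> disk;
        private final boolean allowDuplicates;

        public SpillingLookupSketch(DiskStructure<K, R> disk, boolean allowDuplicates) {
            this.disk = disk;
            this.allowDuplicates = allowDuplicates;
        }

        public List<R> get(K key) {
            List<R> result = new ArrayList<>();
            Vector<R> inMem = inMemory.get(key);
            if (inMem != null) {
                result.addAll(inMem);
                if (!allowDuplicates) {
                    return result;  // keys are unique: no need to touch disk
                }
            }
            result.addAll(disk.fetch(key));
            return result;
        }
    }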

(There are several reasonable variations on this scheme dealing with the 
movement of rows between the in-memory hash table and the disk 
structure. They will be discussed later. However, I think that the 
simplest scheme is best).

There are two main alternatives for implementing the disk based 
associative data structure. One is to use Btrees, the other is to 
implement a dynamic hash table.

Btree Implementation
--------------------

In the Btree implementation we essentially create a temporary table, a 
HeapConglomerate, to hold the overflow rows and create a Btree index on 
the key columns. The only hitch is that our Btree implementation has a 
limit on the total size of a key. Therefore, instead of indexing on the 
key columns we will index on the key hash code.

In order to search for all the rows with a given key we compute the key 
hash code and scan the Btree for the locations of all the rows with that 
hash code. We then fetch the candidate rows from the HeapConglomerate 
and return those with matching keys.

This implementation can use the existing heap and Btree conglomerate 
implementations, so it should not increase the Derby footprint by much, 
nor should it take much development time.
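
A sketch of that probe, with placeholder interfaces standing in for the
real heap and Btree conglomerate APIs (which are not shown here), might
look like this:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of probing the btree-backed overflow: the btree
    // maps hashCode(key) to row locations, candidate rows are fetched from
    // the heap, and the actual key columns are re-checked.  The interfaces
    // are placeholders, not Derby's real conglomerate APIs.
    public class BtreeOverflowProbeSketch {
        public interface HashIndex {               // btree keyed on the hash code
            List<Long> rowLocationsFor(int hashCode);
        }
        public interface Heap {                    // heap of overflow rows
            Object[] fetch(long rowLocation);
        }

        public static List<Object[]> probe(HashIndex index, Heap heap, Object key) {
            List<Object[]> matches = new ArrayList<>();
            for (long location : index.rowLocationsFor(key.hashCode())) {
                Object[] row = heap.fetch(location);
                // Hash codes can collide, so re-check the real key column(s);
                // here column 0 is assumed to hold the key.
                if (key.equals(row[0])) {
                    matches.add(row);
                }
            }
            return matches;
        }
    }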

Hash Table Implementation
-------------------------

In the hash table implementation we write a new conglomerate type that 
implements a dynamic hash table on disk. I would use the dynamic hash 
algorithm of P.-A. Larson, "Dynamic Hash Tables", Communications of the 
ACM, 31(1988). With this algorithm a hash table of N rows is built in 
time O(N*log(N)) and only one disk access is needed to find a row if all 
the rows of its hash bucket fit on one page. Furthermore it grows the 
hash table by one bucket at a time, as needed, so it does not waste very 
much space.

The hash conglomerate would be implemented using two containers: one for 
the first page of each bucket and the second for overflow pages. Two 
containers are necessary so that we can use the hash code as a page 
number to find the first page of a bucket. We cannot put overflow pages 
in the primary container because this mapping would break if we had to 
increase the number of buckets after creating an overflow page.

The overflow container is used when a bucket has too many rows to fit on 
one page and when a row is too large to fit comfortably in a page. A 
long row is represented as a (big) hash code and a pointer to the 
overflow page holding the long row. Access to a long row will always 
take at least two disk accesses, unless one of the pages is in the cache.

Note that there are two related hash functions in use, a big one and a 
small one. The big hash function produces a 32 bit hash code, the one 
used by the standard in memory hash table. The small hash code is the 
low order n bits of the big hash code. The small hash code is used to 
index the disk buckets. Under the Larson algorithm there are between 
2**(n-1)+1 and 2**n buckets.
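
For concreteness, a small worked example of that mapping, with n = 3 and
6 buckets currently allocated (the numbers are arbitrary):

    // Hypothetical illustration of the big/small hash mapping with n = 3
    // and 6 buckets currently allocated (indices 0..5).
    public class SmallHashExample {
        static int bucketFor(int bigHash, int n, int numBuckets) {
            int small = bigHash & ((1 << n) - 1);        // low n bits
            if (small >= numBuckets) {
                small = bigHash & ((1 << (n - 1)) - 1);  // bucket not created yet
            }
            return small;
        }

        public static void main(String[] args) {
            // big hash 0x2D = ...101101: low 3 bits = 5, and bucket 5 exists
            System.out.println(bucketFor(0x2D, 3, 6));   // prints 5
            // big hash 0x2F = ...101111: low 3 bits = 7 >= 6 buckets,
            // so fall back to the low 2 bits = 3
            System.out.println(bucketFor(0x2F, 3, 6));   // prints 3
        }
    }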


Comparison
----------

We are interested in 4 measures:
1. the time required to get all the rows with a given key,
2. the time required to build the data structure,
3. the disk space used by the data structure, and
4. the amount of new code (this impacts both development time and Derby 
footprint).

Access Time
----------

In order to find rows using the Btree implementation we first traverse 
the Btree to get a list of all rows with the same big hash code. Then we 
access the heap pages referenced by the Btree and discard the rows whose 
keys do not match. The rows will probably not be clustered in the heap, 
so there is likely to be one disk access per reference. Some number of 
Btree nodes will be accessed, though the top nodes are likely to be in 
the page cache. So the number of disk accesses is likely to be about 1 + 
N(K), where N(K) is the number of rows with the same big hash code value 
as key K.

In order to find rows using the disk hash implementation we just 
traverse the pages in the bucket corresponding to key K. The number of 
page accesses is ceil(n(K)/b), where n(K) is the number of rows with the 
same small hash code as key K and b is the number of rows per page.

So the disk hash implementation will access rows more quickly when 
n(K)/N(K) < b. That is, the disk hash implementation does better when 
the row size is small relative to the page size, or when the small hash 
function doesn't give up too much selectivity over the big hash function.
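
A worked example with assumed numbers (4K pages and roughly 100-byte
rows, so b is about 40) may make the comparison clearer:

    // Worked example with assumed numbers; nothing here is measured.
    public class AccessCostExample {
        public static void main(String[] args) {
            int b = 40;                   // rows per page (assumed)
            int bigHashMatches = 3;       // N(K): rows sharing the big hash code
            int smallHashBucketRows = 60; // n(K): rows in the small hash bucket

            int btreeAccesses = 1 + bigHashMatches;               // ~1 + N(K)
            int hashAccesses = (smallHashBucketRows + b - 1) / b; // ceil(n(K)/b)

            System.out.println("btree: " + btreeAccesses + ", hash: " + hashAccesses);
            // prints "btree: 4, hash: 2"; here n(K)/N(K) = 20 < b = 40, so the
            // hash layout needs fewer page reads, as argued above.
        }
    }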

Build Time
----------

Both implementations take O(N*log(N)) time to build in the worst case. 
However there are some differences in the build times.

The Btree implementation will split Btree nodes as the structure is 
built up. However, since the Btree is keyed on an integer hash code this 
is not as expensive as shuffling the full rows. The Btree implementation 
does not move the actual rows as the data structure is populated.

If the hash implementation's initial estimate of the number of buckets 
is good or high then it does not shuffle rows around as the data 
structure is populated and the build time is O(N), which is quite good. 
If the initial bucket count estimate is low then the hash implementation 
shuffles rows around as buckets are split. I expect that most rows will 
be long enough that this is more expensive than splitting Btree nodes.

So the build time of the hash implementation is better if the initial 
size estimate is good, but it is probably worse than the Btree 
implementation when the size estimate is bad.

Data Structure Size
-------------------

In the Btree implementation all the "waste" space is in the Btree 
conglomerate. Its heap conglomerate is about as compact as it can be. 
Because the Btree keys on the hash code it is smaller than the heap 
conglomerate, unless the rows are quite small.

The hash implementation wastes space when some of the buckets and/or 
overflow pages are not full. This happens because we initially 
overestimate the number of buckets required, because the small hash 
function distributes the rows poorly, and/or because we choose a load 
factor less than one.

So, sometimes the hash implementation uses less disk space for the data 
structure and sometimes the Btree implementation uses less space. I 
suspect that the Btree implementation is better in most cases.

Code Size
---------

The Btree implementation can use the existing Btree and heap 
conglomerate code while the hash implementation must work at a lower 
level to implement a new conglomerate class. So I suspect that the hash 
implementation will require more new code. I don't think that either 
implementation requires an enormous amount of new code.


Moving Rows Between Memory and Disk
-----------------------------------

BackingStoreHashtable starts off with a purely RAM based structure and, 
under this proposal, spills to disk when it exceeds a size threshold. I 
have proposed that once a row is placed in the RAM portion of the 
structure it never be moved to disk; only new rows be placed on disk. 
This has the virtue of simplicity. However, there are several reasonable 
options that I will discuss for the sake of completeness.

When the disk structure is created we could move some or all of the rows 
to disk. Mike Matrigali suggested moving all the rows to disk to reduce 
Derby's RAM usage, reducing the likelihood of thrashing or an 
OutOfMemoryError particularly in a multi-thread environment. However it 
would also increase the hash table build time and the row fetch time. 
Since Derby's BackingStoreHashtables are transient the data structure 
build time is important. It may sometimes be better in terms of memory usage 
to build and use the hash table quickly so that it can be deleted quickly.

Keeping or eliminating duplicates is a BackingStoreHashtable constructor 
option. When there are multiple rows with the same key 
BackingStoreHashtable.get returns a Vector of all the rows. If 
BackingStoreHashtable has started to spill to disk and we insert a row 
with a key that duplicates the key of one or more rows in RAM we could 
move all of the rows to disk. Then the BackingStoreHashtable.get method 
would not have to look on disk when it finds a row in RAM. This would 
improve the average get() time, at the cost of slowing down the 
BackingStoreHashtable build time.


Conclusion
-----------

I recommend that we implement spilling using the Btree implementation 
because it offers a clear advantage over the disk hash implementation in 
footprint and development time while there is no clear advantage for 
either implementation in terms of access time, build time, or data 
structure size. The hash implementation has some advantage in access 
time, build time, and space when the rows have a moderate size and our 
initial size estimate is good. However we only spill to disk when our 
size estimate is bad.


Re: BackingStoreHashtable

Posted by Mike Matrigali <mi...@sbcglobal.net>.

On using large amounts of memory for queries:

I actually believe that the way the Derby optimizer/language uses hash joins
is optimal until we can get a better handle on memory allocation to
individual dbms clients within the dbms server.  The optimizer tries to
limit the memory allocation for a single query to a "reasonable" amount
and estimates whether it can perform a hash join with all the rows in memory.
If it can, it chooses hash join and currently always puts all the rows
in memory for the hash join.  If it runs out of memory then the query
fails, just like if it runs out of memory while allocating any other
structure for any other query execution.  The system does a similar
thing for sorting: it has an in-memory limit, and if it estimates it
can't fit in memory it picks an alternative which uses less memory.  In
memory is obviously faster.

The problem with using unbounded memory is that one client in a 100
client system can use up most of the memory and cause the other 99
clients to fail even though their memory requirements are small.

For a zero-admin database, Cloudscape chose a conservative approach to
memory usage in most decisions, including hash join and sorting.

I do agree that it would be better if we could recover from out of
memory conditions.  Unfortunately Java interfaces are fairly lacking
in this area.  Once you run out of memory it is too late to do much, as
it will wreak havoc in many execution paths of a multi-threaded server.

If you are interested in working on completing the implementation of
BackingStoreHashTable, of course it would be better for it to work as
documented - Cloudscape just never had the resources to complete it, as
it was always too low a priority, given that the optimizer never called
it expecting the overflow to be important.  It was one of those cases
where an interface was provided for a future need, but never implemented.

I do think more discussion is needed before changes are made to increase
the size of hash joins to the point where overflow is necessary.
Currently I see overflow as an edge case where the optimizer was wrong;
it would be better to execute the query slowly than to fail.  Again, as you
noted, I can't remember an application problem being reported because of
the BackingStoreHashTable issue.

Have you considered implementing intersect similarly to how hash join is
used currently in Derby?  Have the optimizer pick hash join to implement
intersect if the number of rows is small and your new sort merge if the
number of rows is large?


