Posted to user@drill.apache.org by Boaz Ben-Zvi <bb...@mapr.com> on 2017/01/14 03:46:40 UTC

Drill: Memory Spilling for the Hash Aggregate Operator

 Hi Drill developers,


     Attached is a document describing the design for memory spilling implementation for the Hash Aggregate operator.


     Please send me any comments or questions,


        -- Boaz


Re: Drill: Memory Spilling for the Hash Aggregate Operator

Posted by Boaz Ben-Zvi <bb...@mapr.com>.
If the number of (sub-?) partitions equals the (fixed!) available memory size divided by the block size, then spilling has to occur one block at a time, whenever any block becomes full.
This has a drawback when spilling to a hard drive - each block would likely be placed on disk far from the previously written block for that partition, hence when the whole partition is read back into memory the read incurs lots of disk SEEKs!
Spilling multiple blocks at a time significantly mitigates this problem (as would future SSD storage).

If all the sub-partitions are written into the same file, then how can a single sub-partition (or a few) be read without going through the whole file?  Do the stats hold the offsets of each block per partition, so that the reader “hops” to those offsets?
This assumes that the size of the memory block matches the disk block (e.g., if the latter is much bigger, then the disk “over-reads” and does not get high utilization).
Another partial solution would be to initially write the sub-partitions mixed into a single file, and once some size limit is reached for one of the sub-partitions, continue spilling that particular sub-partition into a separate file.
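
A minimal sketch of the offset-index idea raised above: one shared spill file, plus a small in-memory index of (offset, length) pairs per partition that the reader "hops" through. All class and method names here are illustrative, not Drill's actual spill code.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Blocks from many partitions are appended to one file in arrival order;
    // the index remembers where each partition's blocks landed.
    class SharedSpillFile {
      private final RandomAccessFile file;
      private final Map<Integer, List<long[]>> blockIndex = new HashMap<>();

      SharedSpillFile(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
      }

      // Append one block of a partition and record its (offset, length).
      void spillBlock(int partitionId, byte[] block) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.write(block);
        blockIndex.computeIfAbsent(partitionId, k -> new ArrayList<>())
                  .add(new long[] {offset, block.length});
      }

      // Read back a single partition by seeking to its recorded offsets; every
      // "hop" is a potential disk seek, which is why writing several blocks of
      // a partition contiguously makes the later read cheaper.
      List<byte[]> readPartition(int partitionId) throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        for (long[] entry : blockIndex.getOrDefault(partitionId, List.of())) {
          byte[] block = new byte[(int) entry[1]];
          file.seek(entry[0]);
          file.readFully(block);
          blocks.add(block);
        }
        return blocks;
      }
    }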

About "some partition blocks never completed and written to disk”: If those belong to partitions that were spilled, then likely they would have to be written to disk once all the original input was read.
That’s because once reading back a partition from disk, all of the available memory is likely needed; having blocks of the following partitions just “hanging” there is a waste of memory.
Maybe with stats, we could pick the “smaller” partitions first, which would “clean up” some of those partially filled blocks, thus freeing memory for the bigger ones. 
This seems like a big code overhead. In the first Drill implementation we plan to just spill all the partial blocks (that belonging to spilled partitions) once all the input was read.

Another problem I just realized (with the Drill architecture in general) is that execution is driven only by the next() calls. That makes it harder to benefit from a fast disk scan of many blocks.
The current plan for the Hash Aggregate spill is to read a spilled partition the same way “incoming” is read - read one batch, do the aggregations, then the next batch, etc.
Any improvement would add complexity (e.g., to memory management), so it may be done at some later time.
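
A minimal sketch of that read-back loop, with invented interface names (this is not Drill's actual operator API, just the shape of the plan described above):

    // One spilled partition is consumed a batch at a time through the same
    // aggregation path that handles "incoming" batches.
    class SpillReplay {

      interface Batch { }                    // stands in for one spilled block/batch

      interface SpilledPartitionReader {     // assumed reader over one spill stream
        Batch nextBatch();                   // null when the partition is exhausted
      }

      interface HashAggregator {             // stands in for the hash-agg operator
        void aggregateBatch(Batch batch);    // duplicate groups created across
      }                                      // spill iterations collapse here

      void reaggregate(SpilledPartitionReader reader, HashAggregator aggregator) {
        Batch batch;
        while ((batch = reader.nextBatch()) != null) {
          aggregator.aggregateBatch(batch);  // memory stays bounded by one batch
        }
      }
    }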

  — Boaz 

> On Jan 17, 2017, at 10:20 PM, Julian Hyde <jh...@apache.org> wrote:
> 
> I agree that an aggregate tends to compress data more than a join. Joins do compress data somewhat — when they have a filtering effect — so for both hash-aggregation and hash-join the size estimate is just an upper bound.
> 
> I also agree that the hash aggregate will fit data into memory in either the first or second phase. But the “much larger than memory” case is still very important. Think of what you would do to make an almost-unique list of customer ids into a fully unique list.
> 
> HHJ does not “mix up” sub-partitions, not in a bad way, anyway. By design HHJ uses as many output partitions as possible (available memory divided by block size, because each partition needs one block of buffer space to prepare the data to be written), so it would not be practical to have one output file per sub-partition. There are far more sub-partitions than could be efficiently written to disk, so HHJ merely collects stats for them. All of the data in a partition will be read back at once, so the “mixing up” is not harmful.
> 
> HHJ’s I/O is extremely efficient; I don’t believe that it ever does random access or reads a byte more than once. And due to its “hybrid” nature, some partition blocks are never completed and written to disk.
> 
> Anyway. I’ve said my piece. Histograms of sub-partition stats are not an essential part of this algorithm but I wanted to make sure that you were aware of them, because they are as elegant as bloom filters and b-trees.
> 
> Julian
> 
> 
> 
> 
>> On Jan 16, 2017, at 6:55 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>> 
>> The design in the document has an undocumented assumption — allow the memory available for an operator to fluctuate during the operator’s lifetime.
>> For example, the Hash Agregate operator (HAG) may hit the memory limit at 500MB, than later hit it again at 600MB, etc, and when reading a spilled partition back from disk the limit may be 400MB, etc.
>> (A more gentle scheme may allow memory to increase, never go down; e.g., other operators may "give back” some of their allocations during execution; 
>> e.g. HHJ after finishing the build phase — can yield the extra memory to another operator). 
>> This is a more sophisticated memory management design than what we have now — a simple fixed pre-allocation. 
>> 
>> A second point is Hash Aggregation — which is a little different from HHJ. For example, spilling a partition X (in multiple spill iterations) ends up with 500MB on disk, but only (a fixed) 400MB memory is available.
>> Does this mean that partition X would not fit into the memory ?  No, as X may contain many duplicate groups (groups were recreated after every spill), so as we read and aggregate X, its size may “shrink” enough for the 400MB.
>> 
>> Indeed the design was not trying to avoid “re-writing to disk”; however the assumption was that this would be a rare situation, and may only apply to a part of the data.
>> If the initial number of partitions chosen N is large enough, then spilling would happen only once per (some of the) data.
>> 
>> For HAG, the partitions should all be of similar sizes, so if N was chosen too small, they all would need a “secondary spilling” which would include re-writing all the data (that’s the analogous to “phase 3” ….)
>> HHJ is a little different (due to possible key duplicate rows) — some partitions may be bigger than others. So maybe only few HHJ partitions would be written twice.
>> 
>> The downside of the “sub-partition” model described in Goetz’ paper is that indeed a second write of all the data is saved, but all the data (mixed together) needs to be read (up to) n times (plus some CPU overhead to filter out the needed parts).
>> Read may be costly, due to seeks (e.g., when a partition was spilled in many iterations, and ended scattered across the disk.) 
>> For the (future) HHJ design — we could handle a “too big” partition using a “hash loop”; that is, read only a part of the inner partition, then scan the whole outer matching partition, then read the next inner part, and scan the whole outer partition again.
>> This is indeed costly, but not much different from the “sub-partitioning” scheme — we just read part of the inner based on size (instead of several parts, like 0.0, 0.1, and 0.2), and then the whole outer without filtering - the hash table probing would do that.
>> 
>>    Thanks for the suggestions and the link; I’ll go over Goetz’ paper again and look for more ideas.
>> 
>>         — Boaz
>> 
>> 
>>> On Jan 16, 2017, at 4:09 PM, Julian Hyde <jh...@apache.org> wrote:
>>> 
>>> Does the data need to be written into a disk-friendly format when a partition is selected to be written to disk? If you are careful in your choice of format then it doesn’t need to be re-written. And in fact you can start with the assumption that everything is going to disk.
>>> 
>>> One of the most beautiful innovations of the HHJ paper [1] is the histogram-driven partitioning. Basically, during phase 1 you apply the phase 2 hash function to assign rows to “sub-partitions”. Partition 0 would contain sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain sub-partitions 1.1, …, 1.n. The rows are all mixed together in each partition, but you know how many rows (and bytes) are in each sub-partition. If partition 0 (or any partition) ends up larger than memory then you are going to need a phase 3. But you can enter phase 2 armed with some very useful knowledge. You know the sizes of the sub-partitions and you can choose a hash function in phase 2 such that many of the partitions end up *just* smaller than memory.
>>> 
>>> The big problem with external sort and hash algorithms is the huge performance hit when you require an extra phase. If you need 2 phases, HHJ can convert that to 1.5 phases (by pulling smaller partitions back into memory) and by optimizing the assignment of rows to partitions it can turn a 3.1 phase query into a 2.9 phase query - a big win.
>>> 
>>> Julian
>>> 
>>> [1] https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf <https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf>
>>> 
>>> 
>>> 
>>>> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>>>> 
>>>> Sorry for no attachment (Apache mail rules) -- Here is a link to the document:
>>>> 
>>>> 
>>>> DrillSpillmemoryforHashAggregation.pdf - https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- Boaz
>>>> 
>>>> ________________________________
>>>> From: Julian Hyde <jh...@apache.org>
>>>> Sent: Friday, January 13, 2017 11:00 PM
>>>> To: dev@drill.apache.org
>>>> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator
>>>> 
>>>> The attachment didn't come through. I'm hoping that you settled on a "hybrid" hash algorithm that can write to disk, or write to memory, and the cost of discovering that is wrong is not too great. With Goetz Graefe's hybrid hash join (which can be easily adapted to hybrid hash aggregate) if the input ALMOST fits in memory you could process most of it in memory, then revisit the stuff you spilled to disk.
>>>> 
>>>>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>>>>> 
>>>>> Hi Drill developers,
>>>>> 
>>>>>  Attached is a document describing the design for memory spilling implementation for the Hash Aggregate operator.
>>>>> 
>>>>>  Please send me any comments or questions,
>>>>> 
>>>>>     -- Boaz
>>>> 
>>> 
>> 
> 


Re: Drill: Memory Spilling for the Hash Aggregate Operator

Posted by Julian Hyde <jh...@apache.org>.
I agree that an aggregate tends to compress data more than a join. Joins do compress data somewhat — when they have a filtering effect — so for both hash-aggregation and hash-join the size estimate is just an upper bound.

I also agree that the hash aggregate will fit data into memory in either the first or second phase. But the “much larger than memory” case is still very important. Think of what you would do to make an almost-unique list of customer ids into a fully unique list.

HHJ does not “mix up” sub-partitions, not in a bad way, anyway. By design HHJ uses as many output partitions as possible (available memory divided by block size, because each partition needs one block of buffer space to prepare the data to be written), so it would not be practical to have one output file per sub-partition. There are far more sub-partitions than could be efficiently written to disk, so HHJ merely collects stats for them. All of the data in a partition will be read back at once, so the “mixing up” is not harmful.
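
As a rough illustration of the "available memory divided by block size" arithmetic above (all numbers below are invented, purely to show the orders of magnitude):

    // One block of write buffer per output partition bounds the partition count;
    // sub-partitions only get histogram counters, not files of their own.
    public class PartitionBudget {
      public static void main(String[] args) {
        long availableMemory = 256L << 20;     // say, 256 MB for the operator
        long blockSize = 1L << 20;             // say, a 1 MB buffer per partition
        long maxPartitions = availableMemory / blockSize;      // -> 256 partitions

        int subPartitionsPerPartition = 16;    // assumed fan-out, stats only
        long statCounters = maxPartitions * subPartitionsPerPartition;   // 4096

        System.out.println("output partitions: " + maxPartitions);
        System.out.println("sub-partition stat counters: " + statCounters);
      }
    }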

HHJ’s I/O is extremely efficient; I don’t believe that it ever does random access or reads a byte more than once. And due to its “hybrid” nature, some partition blocks are never completed and written to disk.

Anyway. I’ve said my piece. Histograms of sub-partition stats are not an essential part of this algorithm but I wanted to make sure that you were aware of them, because they are as elegant as bloom filters and b-trees.

Julian




> On Jan 16, 2017, at 6:55 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
> 
> The design in the document has an undocumented assumption — allow the memory available for an operator to fluctuate during the operator’s lifetime.
> For example, the Hash Agregate operator (HAG) may hit the memory limit at 500MB, than later hit it again at 600MB, etc, and when reading a spilled partition back from disk the limit may be 400MB, etc.
> (A more gentle scheme may allow memory to increase, never go down; e.g., other operators may "give back” some of their allocations during execution; 
> e.g. HHJ after finishing the build phase — can yield the extra memory to another operator). 
> This is a more sophisticated memory management design than what we have now — a simple fixed pre-allocation. 
> 
> A second point is Hash Aggregation — which is a little different from HHJ. For example, spilling a partition X (in multiple spill iterations) ends up with 500MB on disk, but only (a fixed) 400MB memory is available.
> Does this mean that partition X would not fit into the memory ?  No, as X may contain many duplicate groups (groups were recreated after every spill), so as we read and aggregate X, its size may “shrink” enough for the 400MB.
> 
> Indeed the design was not trying to avoid “re-writing to disk”; however the assumption was that this would be a rare situation, and may only apply to a part of the data.
> If the initial number of partitions chosen N is large enough, then spilling would happen only once per (some of the) data.
> 
> For HAG, the partitions should all be of similar sizes, so if N was chosen too small, they all would need a “secondary spilling” which would include re-writing all the data (that’s the analogous to “phase 3” ….)
> HHJ is a little different (due to possible key duplicate rows) — some partitions may be bigger than others. So maybe only few HHJ partitions would be written twice.
> 
> The downside of the “sub-partition” model described in Goetz’ paper is that indeed a second write of all the data is saved, but all the data (mixed together) needs to be read (up to) n times (plus some CPU overhead to filter out the needed parts).
> Read may be costly, due to seeks (e.g., when a partition was spilled in many iterations, and ended scattered across the disk.) 
> For the (future) HHJ design — we could handle a “too big” partition using a “hash loop”; that is, read only a part of the inner partition, then scan the whole outer matching partition, then read the next inner part, and scan the whole outer partition again.
> This is indeed costly, but not much different from the “sub-partitioning” scheme — we just read part of the inner based on size (instead of several parts, like 0.0, 0.1, and 0.2), and then the whole outer without filtering - the hash table probing would do that.
> 
>     Thanks for the suggestions and the link; I’ll go over Goetz’ paper again and look for more ideas.
> 
>          — Boaz
> 
> 
>> On Jan 16, 2017, at 4:09 PM, Julian Hyde <jh...@apache.org> wrote:
>> 
>> Does the data need to be written into a disk-friendly format when a partition is selected to be written to disk? If you are careful in your choice of format then it doesn’t need to be re-written. And in fact you can start with the assumption that everything is going to disk.
>> 
>> One of the most beautiful innovations of the HHJ paper [1] is the histogram-driven partitioning. Basically, during phase 1 you apply the phase 2 hash function to assign rows to “sub-partitions”. Partition 0 would contain sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain sub-partitions 1.1, …, 1.n. The rows are all mixed together in each partition, but you know how many rows (and bytes) are in each sub-partition. If partition 0 (or any partition) ends up larger than memory then you are going to need a phase 3. But you can enter phase 2 armed with some very useful knowledge. You know the sizes of the sub-partitions and you can choose a hash function in phase 2 such that many of the partitions end up *just* smaller than memory.
>> 
>> The big problem with external sort and hash algorithms is the huge performance hit when you require an extra phase. If you need 2 phases, HHJ can convert that to 1.5 phases (by pulling smaller partitions back into memory) and by optimizing the assignment of rows to partitions it can turn a 3.1 phase query into a 2.9 phase query - a big win.
>> 
>> Julian
>> 
>> [1] https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf <https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf>
>> 
>> 
>> 
>>> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>>> 
>>> Sorry for no attachment (Apache mail rules) -- Here is a link to the document:
>>> 
>>> 
>>> DrillSpillmemoryforHashAggregation.pdf - https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing
>>> 
>>> 
>>> 
>>> 
>>>  -- Boaz
>>> 
>>> ________________________________
>>> From: Julian Hyde <jh...@apache.org>
>>> Sent: Friday, January 13, 2017 11:00 PM
>>> To: dev@drill.apache.org
>>> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator
>>> 
>>> The attachment didn't come through. I'm hoping that you settled on a "hybrid" hash algorithm that can write to disk, or write to memory, and the cost of discovering that is wrong is not too great. With Goetz Graefe's hybrid hash join (which can be easily adapted to hybrid hash aggregate) if the input ALMOST fits in memory you could process most of it in memory, then revisit the stuff you spilled to disk.
>>> 
>>>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>>>> 
>>>> Hi Drill developers,
>>>> 
>>>>   Attached is a document describing the design for memory spilling implementation for the Hash Aggregate operator.
>>>> 
>>>>   Please send me any comments or questions,
>>>> 
>>>>      -- Boaz
>>> 
>> 
> 


Re: Drill: Memory Spilling for the Hash Aggregate Operator

Posted by Boaz Ben-Zvi <bb...@mapr.com>.
The design in the document has an undocumented assumption — it allows the memory available to an operator to fluctuate during the operator’s lifetime.
For example, the Hash Aggregate operator (HAG) may hit the memory limit at 500MB, then later hit it again at 600MB, etc., and when reading a spilled partition back from disk the limit may be 400MB, etc.
(A gentler scheme may allow memory only to increase, never to go down; e.g., other operators may “give back” some of their allocations during execution;
for example, an HHJ that has finished its build phase can yield the extra memory to another operator.)
This is a more sophisticated memory management design than what we have now — a simple fixed pre-allocation.

A second point is that Hash Aggregation is a little different from HHJ. For example, spilling a partition X (over multiple spill iterations) may end up with 500MB on disk while only (a fixed) 400MB of memory is available.
Does this mean that partition X would not fit into memory?  No: X may contain many duplicate groups (groups were recreated after every spill), so as we read back and aggregate X, its size may “shrink” enough to fit in the 400MB.

Indeed, the design was not trying to avoid “re-writing to disk”; however, the assumption was that this would be a rare situation, and would only apply to part of the data.
If the initial number of partitions N is chosen large enough, then spilling would happen only once for (some of) the data.

For HAG, the partitions should all be of similar sizes, so if N was chosen too small, they would all need a “secondary spilling”, which would include re-writing all the data (that is analogous to “phase 3” ….)
HHJ is a little different (due to possible duplicate key rows) — some partitions may be bigger than others. So maybe only a few HHJ partitions would be written twice.

The downside of the “sub-partition” model described in Goetz’ paper is that, while a second write of all the data is indeed saved, all the data (mixed together) needs to be read (up to) n times (plus some CPU overhead to filter out the needed parts).
Reads may be costly due to seeks (e.g., when a partition was spilled in many iterations and ended up scattered across the disk).
For the (future) HHJ design — we could handle a “too big” partition using a “hash loop”; that is, read only a part of the inner partition, then scan the whole matching outer partition, then read the next inner part, and scan the whole outer partition again.
This is indeed costly, but not much different from the “sub-partitioning” scheme — we just read a part of the inner based on size (instead of several parts, like 0.0, 0.1, and 0.2), and then the whole outer without filtering - the hash table probing would do that.
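
A rough sketch of such a "hash loop", using a simple invented row/reader abstraction (these names are placeholders, not Drill's hash-join classes):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class HashLoopJoin {

      record Row(long key, Object payload) { }     // illustrative row shape

      interface RowSource {                        // assumed reader over a spilled partition
        List<Row> nextSlice(long memoryBudget);    // next memory-sized chunk; empty at end
        void rewind();                             // restart a full scan
        Row next();                                // null at end of the scan
      }

      void join(RowSource inner, RowSource outer, long memoryBudget) {
        List<Row> slice;
        while (!(slice = inner.nextSlice(memoryBudget)).isEmpty()) {
          // Build a hash table on just this memory-sized part of the inner side.
          Map<Long, List<Row>> table = new HashMap<>();
          for (Row r : slice) {
            table.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
          }
          // Re-scan the whole matching outer partition for this inner slice;
          // the hash-table probe does the filtering, as noted above.
          outer.rewind();
          Row probe;
          while ((probe = outer.next()) != null) {
            for (Row match : table.getOrDefault(probe.key(), List.of())) {
              emit(probe, match);
            }
          }
        }
      }

      void emit(Row outerRow, Row innerRow) { /* hand the joined row downstream */ }
    }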

     Thanks for the suggestions and the link; I’ll go over Goetz’ paper again and look for more ideas.

          — Boaz


> On Jan 16, 2017, at 4:09 PM, Julian Hyde <jh...@apache.org> wrote:
> 
> Does the data need to be written into a disk-friendly format when a partition is selected to be written to disk? If you are careful in your choice of format then it doesn’t need to be re-written. And in fact you can start with the assumption that everything is going to disk.
> 
> One of the most beautiful innovations of the HHJ paper [1] is the histogram-driven partitioning. Basically, during phase 1 you apply the phase 2 hash function to assign rows to “sub-partitions”. Partition 0 would contain sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain sub-partitions 1.1, …, 1.n. The rows are all mixed together in each partition, but you know how many rows (and bytes) are in each sub-partition. If partition 0 (or any partition) ends up larger than memory then you are going to need a phase 3. But you can enter phase 2 armed with some very useful knowledge. You know the sizes of the sub-partitions and you can choose a hash function in phase 2 such that many of the partitions end up *just* smaller than memory.
> 
> The big problem with external sort and hash algorithms is the huge performance hit when you require an extra phase. If you need 2 phases, HHJ can convert that to 1.5 phases (by pulling smaller partitions back into memory) and by optimizing the assignment of rows to partitions it can turn a 3.1 phase query into a 2.9 phase query - a big win.
> 
> Julian
> 
> [1] https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf <https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf>
> 
> 
> 
>> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>> 
>> Sorry for no attachment (Apache mail rules) -- Here is a link to the document:
>> 
>> 
>> DrillSpillmemoryforHashAggregation.pdf - https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing
>> 
>> 
>> 
>> 
>>   -- Boaz
>> 
>> ________________________________
>> From: Julian Hyde <jh...@apache.org>
>> Sent: Friday, January 13, 2017 11:00 PM
>> To: dev@drill.apache.org
>> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator
>> 
>> The attachment didn't come through. I'm hoping that you settled on a "hybrid" hash algorithm that can write to disk, or write to memory, and the cost of discovering that is wrong is not too great. With Goetz Graefe's hybrid hash join (which can be easily adapted to hybrid hash aggregate) if the input ALMOST fits in memory you could process most of it in memory, then revisit the stuff you spilled to disk.
>> 
>>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>>> 
>>> Hi Drill developers,
>>> 
>>>    Attached is a document describing the design for memory spilling implementation for the Hash Aggregate operator.
>>> 
>>>    Please send me any comments or questions,
>>> 
>>>       -- Boaz
>> 
> 


Re: Drill: Memory Spilling for the Hash Aggregate Operator

Posted by Julian Hyde <jh...@apache.org>.
Does the data need to be written into a disk-friendly format when a partition is selected to be written to disk? If you are careful in your choice of format then it doesn’t need to be re-written. And in fact you can start with the assumption that everything is going to disk.

One of the most beautiful innovations of the HHJ paper [1] is the histogram-driven partitioning. Basically, during phase 1 you apply the phase 2 hash function to assign rows to “sub-partitions”. Partition 0 would contain sub-partitions 0.0, 0.1, … 0.n; partition 1 would contain sub-partitions 1.1, …, 1.n. The rows are all mixed together in each partition, but you know how many rows (and bytes) are in each sub-partition. If partition 0 (or any partition) ends up larger than memory then you are going to need a phase 3. But you can enter phase 2 armed with some very useful knowledge. You know the sizes of the sub-partitions and you can choose a hash function in phase 2 such that many of the partitions end up *just* smaller than memory.
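
A toy sketch of that histogram bookkeeping, assuming a simple greedy grouping of sub-partitions for phase 2 (an illustration only, not code from the paper):

    import java.util.ArrayList;
    import java.util.List;

    public class SubPartitionHistogram {
      private final long[] bytesPerSubPartition;

      public SubPartitionHistogram(int numSubPartitions) {
        this.bytesPerSubPartition = new long[numSubPartitions];
      }

      // Phase 1: apply the phase-2 hash to every row and record its size.
      public void recordRow(long phase2Hash, int rowBytes) {
        int sub = (int) Math.floorMod(phase2Hash, (long) bytesPerSubPartition.length);
        bytesPerSubPartition[sub] += rowBytes;
      }

      // Before phase 2: greedily pack sub-partitions into groups (the phase-2
      // partitions) so each group lands just under the memory budget.
      public List<List<Integer>> planPhase2(long memoryBudget) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int sub = 0; sub < bytesPerSubPartition.length; sub++) {
          if (currentBytes + bytesPerSubPartition[sub] > memoryBudget
              && !current.isEmpty()) {
            groups.add(current);
            current = new ArrayList<>();
            currentBytes = 0;
          }
          current.add(sub);
          currentBytes += bytesPerSubPartition[sub];
        }
        if (!current.isEmpty()) {
          groups.add(current);
        }
        return groups;
      }
    }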

The big problem with external sort and hash algorithms is the huge performance hit when you require an extra phase. If you need 2 phases, HHJ can convert that to 1.5 phases (by pulling smaller partitions back into memory) and by optimizing the assignment of rows to partitions it can turn a 3.1 phase query into a 2.9 phase query - a big win.

Julian

[1] https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf <https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf>



> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
> 
>  Sorry for no attachment (Apache mail rules) -- Here is a link to the document:
> 
> 
> DrillSpillmemoryforHashAggregation.pdf - https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing
> 
> 
> 
> 
>    -- Boaz
> 
> ________________________________
> From: Julian Hyde <jh...@apache.org>
> Sent: Friday, January 13, 2017 11:00 PM
> To: dev@drill.apache.org
> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator
> 
> The attachment didn't come through. I'm hoping that you settled on a "hybrid" hash algorithm that can write to disk, or write to memory, and the cost of discovering that is wrong is not too great. With Goetz Graefe's hybrid hash join (which can be easily adapted to hybrid hash aggregate) if the input ALMOST fits in memory you could process most of it in memory, then revisit the stuff you spilled to disk.
> 
>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>> 
>> Hi Drill developers,
>> 
>>     Attached is a document describing the design for memory spilling implementation for the Hash Aggregate operator.
>> 
>>     Please send me any comments or questions,
>> 
>>        -- Boaz
> 


Re: Drill: Memory Spilling for the Hash Aggregate Operator

Posted by Boaz Ben-Zvi <bb...@mapr.com>.
  Sorry for no attachment (Apache mail rules) -- Here is a link to the document:


DrillSpillmemoryforHashAggregation.pdf - https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing




    -- Boaz

________________________________
From: Julian Hyde <jh...@apache.org>
Sent: Friday, January 13, 2017 11:00 PM
To: dev@drill.apache.org
Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator

The attachment didn't come through. I'm hoping that you settled on a "hybrid" hash algorithm that can write to disk, or write to memory, and the cost of discovering that is wrong is not too great. With Goetz Graefe's hybrid hash join (which can be easily adapted to hybrid hash aggregate) if the input ALMOST fits in memory you could process most of it in memory, then revisit the stuff you spilled to disk.

> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>
>  Hi Drill developers,
>
>      Attached is a document describing the design for memory spilling implementation for the Hash Aggregate operator.
>
>      Please send me any comments or questions,
>
>         -- Boaz


Re: Drill: Memory Spilling for the Hash Aggregate Operator

Posted by Boaz Ben-Zvi <bb...@mapr.com>.
  Hi Julian,

      Yes, the model used for the memory spilling is the “hybrid” one — the memory space is pre-divided into “partitions” (based on the hash values of the key columns), and then, as needed, only some of the partitions end up spilled to secondary storage (e.g., HDD). Hence, after all the input is read, some of the partitions are still fully in memory (the term chosen for those was “pristine”), and the data in those partitions incurs no extra overhead. The remaining partitions incur the needed IO overhead.
   In the worst case all the partitions end up spilled; in a typical case only some of them do. Whenever the memory limit is reached, the decision on which partition to spill next is a delicate one — choosing a “pristine” partition would increase the total overhead, while choosing a previously spilled partition may yield little memory (and increase disk seeks later).
   All this is explained in the document; hope it is accessible now.
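
One possible way to code the victim-selection trade-off mentioned above, shown only as an illustration (this is not the heuristic chosen in the design document): prefer re-spilling an already-spilled partition when it still holds enough memory to be worth another write, otherwise give up the largest pristine partition.

    import java.util.List;

    class SpillVictimChooser {

      static class Partition {
        final int id;
        final long inMemoryBytes;
        final boolean alreadySpilled;       // false => "pristine"
        Partition(int id, long bytes, boolean spilled) {
          this.id = id; this.inMemoryBytes = bytes; this.alreadySpilled = spilled;
        }
      }

      // minUsefulSpill: below this size, re-spilling frees too little memory and
      // mostly adds disk seeks for the later read-back.
      Partition choose(List<Partition> partitions, long minUsefulSpill) {
        Partition bestSpilled = null, bestPristine = null;
        for (Partition p : partitions) {
          if (p.alreadySpilled) {
            if (bestSpilled == null || p.inMemoryBytes > bestSpilled.inMemoryBytes) {
              bestSpilled = p;
            }
          } else if (bestPristine == null
                     || p.inMemoryBytes > bestPristine.inMemoryBytes) {
            bestPristine = p;
          }
        }
        if (bestSpilled != null && bestSpilled.inMemoryBytes >= minUsefulSpill) {
          return bestSpilled;               // keep the pristine ones pristine
        }
        return bestPristine != null ? bestPristine : bestSpilled;
      }
    }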

       And the same model should be used for the Hash Join later on.

               — Boaz

> On Jan 13, 2017, at 11:00 PM, Julian Hyde <jh...@apache.org> wrote:
> 
> The attachment didn’t come through. I’m hoping that you settled on a “hybrid” hash algorithm that can write to disk, or write to memory, and the cost of discovering that is wrong is not too great. With Goetz Graefe’s hybrid hash join (which can be easily adapted to hybrid hash aggregate) if the input ALMOST fits in memory you could process most of it in memory, then revisit the stuff you spilled to disk.
> 
>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>> 
>> Hi Drill developers,
>> 
>>     Attached is a document describing the design for memory spilling implementation for the Hash Aggregate operator.
>> 
>>     Please send me any comments or questions,
>> 
>>        -- Boaz
> 


Re: Drill: Memory Spilling for the Hash Aggregate Operator

Posted by Julian Hyde <jh...@apache.org>.
The attachment didn’t come through. I’m hoping that you settled on a “hybrid” hash algorithm that can write to disk, or write to memory, and the cost of discovering that is wrong is not too great. With Goetz Graefe’s hybrid hash join (which can be easily adapted to hybrid hash aggregate) if the input ALMOST fits in memory you could process most of it in memory, then revisit the stuff you spilled to disk.

> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
> 
>  Hi Drill developers,
> 
>      Attached is a document describing the design for memory spilling implementation for the Hash Aggregate operator.
> 
>      Please send me any comments or questions,
> 
>         -- Boaz