Posted to dev@drill.apache.org by ilooner <gi...@git.apache.org> on 2018/02/03 01:29:01 UTC

[GitHub] drill issue #1101: DRILL-6032: Made the batch sizing for HashAgg more accura...

Github user ilooner commented on the issue:

    https://github.com/apache/drill/pull/1101
  
    @Ben-Zvi I have responded to the comments and implemented the requested changes; please see the latest commits. After noticing that some of the batch sizing calculations were incorrect, I made some additional changes. I have also made various documentation and naming changes to improve the readability of the code and to document what needs to be improved in the future:
    
    - The sizes of varchar columns were not properly accounted for in the outgoing batch in the case where varchars are aggregated. I have added logic to the updateEst... method to account for this.
    - I have added docs to the updateEst method explaining what the various batch sizing estimates do.
    - I have changed the variable names for the batch size estimates to more accurately reflect what they do.
    - Previously, if a batch size estimate was smaller than the actual amount of memory allocated, the estimate was bumped up to the larger size. I think this was actually the wrong behavior, since it makes the HashAgg operator's memory estimates extremely aggressive and, in the worst case, can cause the operator to run out of memory prematurely. Ideally a good estimate will overestimate about 50% of the time and underestimate about 50% of the time (see the sketch after this list).
    - I have changed the HashAgg defaults. Although tests passed with the previous defaults, I felt they were too aggressive. I have changed the default number of partitions to 16 and the minimum number of batches to 1.
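
    To make the estimate behavior concrete, here is a minimal, hypothetical sketch of the two ideas above. The class, field, and method names are made up for illustration and are not the actual HashAgg code: it folds variable-width columns into the outgoing batch estimate and no longer bumps the estimate up whenever the actual allocation exceeds it.

        // Hypothetical illustration only; names do not match the real Drill code.
        public class BatchSizeEstimateSketch {

          // Assumed per-value overhead for a nullable varchar: 4-byte offset + 1-byte validity.
          private static final int VARCHAR_OVERHEAD_PER_VALUE = 5;

          private long estOutgoingBatchSize;

          // Fold variable-width (varchar) columns into the outgoing batch size estimate.
          void updateEstimate(int recordsPerBatch, int fixedWidthBytesPerRecord,
                              int numVarcharColumns, int avgVarcharWidth) {
            long bytesPerRecord = fixedWidthBytesPerRecord
                + (long) numVarcharColumns * (avgVarcharWidth + VARCHAR_OVERHEAD_PER_VALUE);
            estOutgoingBatchSize = bytesPerRecord * recordsPerBatch;
          }

          // Old behavior bumped the estimate up to the actual allocation, so estimates only
          // ever grew and became overly aggressive. New behavior keeps the estimate, so it
          // should land above the actual size about as often as below it.
          void onBatchAllocated(long actualAllocatedBytes) {
            // Previously: estOutgoingBatchSize = Math.max(estOutgoingBatchSize, actualAllocatedBytes);
            // Now: intentionally left unchanged.
          }

          public static void main(String[] args) {
            BatchSizeEstimateSketch sketch = new BatchSizeEstimateSketch();
            // 1024 records, 32 fixed-width bytes per record, 2 varchar columns averaging 20 bytes.
            sketch.updateEstimate(1024, 32, 2, 20);
            System.out.println("Estimated outgoing batch bytes: " + sketch.estOutgoingBatchSize);
          }
        }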
    
    Let me know if you have any more comments. Thanks.


---