You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Tim Armstrong (Code Review)" <ge...@cloudera.org> on 2018/07/21 01:13:08 UTC
[Impala-ASF-CR] PREVIEW: IMPALA-110: Support for multiple DISTINCT

Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10771 )

Change subject: PREVIEW: IMPALA-110: Support for multiple DISTINCT
......................................................................


Patch Set 1:

(7 comments)

Did a quick initial pass over the backend part of this.

http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG@60
PS1, Line 60: - Ran hdfs/core tests
Can we add some spilling tests too?


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc
File be/src/exec/aggregation-node.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc@118
PS1, Line 118:     // Separate input batch into mini batches destined for the different aggs.
I feel like there's a lot of unnecessary overhead in the pre-partitioning into mini-batches (the approach in general and maybe some of the specifics of how this code uses the RowBatch APIs).

I'd be interested to see how much time is spent here for a high-NDV aggregation to see if it's worth optimising.

We could avoid the construction of the intermediate batches if we pushed down logic into AddBatch so that it consumed rows from 'batch' up until it saw one that didn't match its index.


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc
File be/src/exec/grouping-aggregator-ir.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@171
PS1, Line 171:       ++streaming_idx_;
Let's hoist the store to 'streaming_idx_' out of the loop.


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@190
PS1, Line 190:         out_batch->ClearRow(row);
This seems expensive to include in the inner loop. Can we avoid doing this entirely in the non-multiagg case? And do it efficiently with a memset() before calling this function in the multiagg case.


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@191
PS1, Line 191: agg_idx_
Let's hoist this load out of the loop too. Can we codegen this out for the non-multi-agg case?


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h
File be/src/exec/grouping-aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h@296
PS1, Line 296:   int32_t streaming_idx_;
Comments?


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc
File be/src/exec/grouping-aggregator.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc@285
PS1, Line 285:     row_batch->ClearRow(row);
This seems like unnecessary overhead.



-- 
To view, visit http://gerrit.cloudera.org:8080/10771
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Gerrit-Change-Number: 10771
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Marshall <th...@cmu.edu>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Sat, 21 Jul 2018 01:13:08 +0000
Gerrit-HasComments: Yes