Posted to issues@hive.apache.org by "Gopal Vijayaraghavan (Jira)" <ji...@apache.org> on 2019/11/08 06:53:00 UTC

[jira] [Comment Edited] (HIVE-22448) CBO: Expand the multiple count distinct with a group-by key

    [ https://issues.apache.org/jira/browse/HIVE-22448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969880#comment-16969880 ] 

Gopal Vijayaraghavan edited comment on HIVE-22448 at 11/8/19 6:52 AM:
----------------------------------------------------------------------

bq.  I am trying to understand why this improvement is always a good idea

Unless we change other parts of the shuffle implementation, this is a good idea right now (specifically, if we implemented a rack-level combiner, we could do better than a grouping set + hash aggregate).
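To make the grouping-set rewrite concrete, here is a minimal Python simulation (an illustration, not Hive code; the sample rows are made up) of how the two count distincts from the example query expand into one row per grouping set and then aggregate in two stages:

```python
from collections import defaultdict

rows = [  # (x, y, z, a) rows for mytable1 from the issue description
    (1, 10, 1, 100),
    (1, 10, 1, 200),
    (2, 20, 1, 100),
    (2, 30, 1, 100),
]

# Stage 1: expand each row into one row per grouping set, tagged with a
# grouping id, then group by ALL the keys to deduplicate.
stage1 = set()
for (x, y, z, a) in rows:
    stage1.add((z, x, y, None, 1))   # grouping set (z, x, y)
    stage1.add((z, x, None, a, 2))   # grouping set (z, x, a)

# Stage 2: group by (z, x) and count surviving rows per grouping id,
# which yields count(distinct y) and count(distinct a).
result = defaultdict(lambda: [0, 0])
for (z, x, _y, _a, gid) in stage1:
    result[(z, x)][gid - 1] += 1

print(dict(result))
```

In the real plan, stage 1 shuffles on all of the grouping keys at once, which is the high-nDV distribution discussed below.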

bq. along with grouping set (which increases the data size)

We're moving around where the data-size increase is actually happening.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L374

bq. adding an extra reducer

That is true, but the distribution function differs between those two end points, which is why this implementation yields performance improvements when the group-by key has a low nDV (like a date) and the count distinct key has a high nDV (like a user_id).

That is a distributed-SQL-engine specific improvement: shuffling on a high-nDV column spreads the work across more CPU cores than a shuffle on a low-nDV column (in the query example, we send the z & x keys across all reducers in the first pass, which prevents a skew from forming there - without that, if z=[1] and x=[1,2], only 2 of the 1009 estimated reducers would do any work).
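A small simulation of that effect (assumptions: simple hash partitioning and the 1009-reducer estimate quoted above; not Hive's actual partitioner) shows how many reducers receive any data under each shuffle key:

```python
REDUCERS = 1009  # the estimate quoted in the comment

def busy_reducers(keys):
    """Number of reducers that receive at least one key under hash partitioning."""
    return len({hash(k) % REDUCERS for k in keys})

# Low-nDV shuffle key: z has 1 value, x has 2 -> at most 2 busy reducers.
low_ndv_keys = [(z, x) for z in [1] for x in [1, 2]]

# Adding a high-nDV user_id-like key spreads the same data across
# essentially all reducers.
high_ndv_keys = [(z, x, y) for z in [1] for x in [1, 2] for y in range(10_000)]

print(busy_reducers(low_ndv_keys), busy_reducers(high_ndv_keys))
```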

The no-key group-by is the extreme case of that, which we fixed before: effectively a single reducer received every single key of the count distinct.
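As a sketch of that earlier fix (a simplification, not the actual CountDistinctRewriteProc logic): stage 1 shuffles on the high-nDV column itself so deduplication runs in parallel, and the final reducer only sums partial counts instead of seeing every key:

```python
ys = [10, 10, 20, 30, 30, 30]  # made-up column values

# Stage 1: group by y across many reducers; each reducer emits a partial
# count of 1 per distinct value it owns.
partials = [1 for _ in set(ys)]

# Stage 2: a single reducer only sums a handful of partial counts.
print(sum(partials))  # count(distinct y)
```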

That was originally fixed up in the physical optimizer, but I prefer the Calcite approach as it is a better place for this.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java#L64

bq. If so, isn't it better to make this optimization statistics based?

Everything is better that way - but the statistics aren't absolute and are estimate driven, and this optimization improves the per-row cost, which is not accounted for in CBO right now (as in, the constant multiplier against each row doesn't change).

As I mentioned before, this change indirectly corrects for a potential skew in the group-by key which can only be estimated if we have histograms.
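To illustrate why histograms are needed (a hedged sketch with made-up data, not Hive's statistics code): two columns can have the identical nDV, so nDV-based estimates alone can't see the skew, while a value histogram exposes the heavy hitter that would overload one reducer:

```python
from collections import Counter

uniform = [d % 30 for d in range(3000)]              # date-like key, even spread
skewed = [0] * 2900 + [d % 29 + 1 for d in range(100)]  # same nDV (30), one heavy value

def skew_ratio(values):
    """Largest histogram bucket's share of the rows; ~1/nDV when uniform, ->1 when skewed."""
    hist = Counter(values)
    return max(hist.values()) / len(values)

# Both columns have nDV = 30, but only the histogram reveals the skew.
print(round(skew_ratio(uniform), 3), round(skew_ratio(skewed), 3))
```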



> CBO: Expand the multiple count distinct with a group-by key
> -----------------------------------------------------------
>
>                 Key: HIVE-22448
>                 URL: https://issues.apache.org/jira/browse/HIVE-22448
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Gopal Vijayaraghavan
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-22448.01.patch, HIVE-22448.02.patch, HIVE-22448.02.patch, HIVE-22448.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> create temporary table mytable1 (x integer, y integer, z integer, a integer);
> explain cbo
> select z, x, count(distinct y), count(distinct a)
> from mytable1
> group by z, x;
> explain cbo
> select count(distinct y), count(distinct a)
> from mytable1
> {code}
> The first is not vectorized; the second one is, because of the grouping-set based rewrite for count distinct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)