Posted to issues@hive.apache.org by "Zoltan Haindrich (JIRA)" <ji...@apache.org> on 2017/02/25 14:16:44 UTC
[jira] [Updated] (HIVE-15848) count or sum distinct incorrect when
hive.optimize.reducededuplication set to true
[ https://issues.apache.org/jira/browse/HIVE-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltan Haindrich updated HIVE-15848:
------------------------------------
Attachment: HIVE-15848.1.patch
All my efforts to leave the optimization enabled in this case have failed - however I still feel that it might be saved somehow.
I've encountered a very interesting thing while working on this: for the second case the optimization didn't kick in because the key order is permuted with respect to the other RS. I'm not sure what's causing this, but it may prevent this optimization from happening in other cases as well (see {{ReduceSinkDeDuplication#sameKeys}}).
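One way to look at the key order is to compare the plans of the two test queries from the issue description. This is only a sketch: it assumes the {{count_distinct_test}} table from the report exists, and that the key order is visible in the "key expressions" entries of the Reduce Output Operator sections of the EXPLAIN output.
{code:sql}
-- Keep the optimization enabled so the plan shows whether it was applied.
set hive.optimize.reducededuplication=true;

-- Test SQL1: inner group by id,key,name
explain
select id,count(distinct key),count(distinct name)
from (select id,key,name from count_distinct_test group by id,key,name)m
group by id;

-- Test SQL2: inner group by id,name,key
explain
select id,count(distinct name),count(distinct key)
from (select id,key,name from count_distinct_test group by id,name,key)m
group by id;
{code}
If the optimization was applied, the plan should contain fewer reduce stages than the same query planned with the setting turned off.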
I propose to disable reduce deduplication in cases like this:
patch#1: when the reduce sink that is about to be removed is also doing distinct-related work (DistinctColumnIndices), the optimization is disabled.
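Independently of the patch, the optimization can be switched off for a whole session, which is a much coarser version of the same idea. A minimal sketch, assuming the table and data from the issue description; the issue title implies the wrong count only appears with the setting enabled, and the report expects |1|5|2| for this query.
{code:sql}
-- Disable reduce sink deduplication for the current session only.
set hive.optimize.reducededuplication=false;

-- Test SQL1 from the issue description
select id,count(distinct key),count(distinct name)
from (select id,key,name from count_distinct_test group by id,key,name)m
group by id;
{code}
This affects every query in the session, not only those with distinct aggregates, so it is a workaround rather than a fix.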
> count or sum distinct incorrect when hive.optimize.reducededuplication set to true
> ----------------------------------------------------------------------------------
>
> Key: HIVE-15848
> URL: https://issues.apache.org/jira/browse/HIVE-15848
> Project: Hive
> Issue Type: Bug
> Affects Versions: 0.13.0
> Reporter: Biao Wu
> Assignee: Zoltan Haindrich
> Priority: Critical
> Attachments: HIVE-15848.1.patch
>
>
> Test Table:
> {code:sql}
> create table count_distinct_test(id int,key int,name int);
> {code}
> Data:
> ||id||key||name||
> |1|1|2|
> |1|2|3|
> |1|3|2|
> |1|4|2|
> |1|5|3|
> Test SQL1:
> {code:sql}
> select id,count(Distinct key),count(Distinct name)
> from (select id,key,name from count_distinct_test group by id,key,name)m
> group by id;
> {code}
> result:
> |1|5|4|
> expected:
> |1|5|2|
> Test SQL2:
> {code:sql}
> select id,count(Distinct name),count(Distinct key)
> from (select id,key,name from count_distinct_test group by id,name,key)m
> group by id;
> {code}
> result:
> |1|2|5|