You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Igor (Jira)" <ji...@apache.org> on 2019/09/02 09:10:00 UTC
[jira] [Commented] (HIVE-21930) WINDOW COUNT DISTINCT return wrong
value with PARTITION BY
[ https://issues.apache.org/jira/browse/HIVE-21930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920717#comment-16920717 ]
Igor commented on HIVE-21930:
-----------------------------
Any updates on this?
> WINDOW COUNT DISTINCT return wrong value with PARTITION BY
> ----------------------------------------------------------
>
> Key: HIVE-21930
> URL: https://issues.apache.org/jira/browse/HIVE-21930
> Project: Hive
> Issue Type: Bug
> Components: PTF-Windowing
> Affects Versions: 3.1.0
> Environment: Beeline version 3.1.0.3.0.1.0-187 by Apache Hive
> Reporter: Igor
> Priority: Major
> Labels: distinct, window_funcion
>
> count(distinct a) over (partiton by b) return wring result. For example (T is CTE here):
> {code:java}
> select p, day, ts
> , row_number() OVER (PARTITION BY phone ORDER BY ts ASC) as line_number
> , count(1) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lines
> , count(distinct day) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as days
> FROM T{code}
> WINDOW specification doesn't affect on results: same wrong with and without window.
> count(1) and count(distinct day) return the same result. Count distinct is wrong.
>
> I've add size(collect_set(day) OVER (PARTITION BY phone)) as days2 and count(distinct return correct result.
> Following query return non-empty result:
> {code:java}
> select A.*, B.days, B. from (
> select p, day, ts
> , row_number() OVER (PARTITION BY phone ORDER BY ts ASC) as line_number
> , count(1) OVER (PARTITION BY p ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lines
> , count(distinct day) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as days
> , size(collect_set(day) OVER (PARTITION BY phone)) as days2
> , dense_rank() over (partition by phone order by day) + dense_rank() over (partition by phone order by day desc) - 1 as days3
> FROM T ) as A
> join (
> select p, day, ts
> , row_number() OVER (PARTITION BY phone ORDER BY ts ASC) as line_number
> , count(1) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lines
> , count(distinct day) OVER (PARTITION BY phone ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as days
> FROM T
> ) as B on A.p=B.p and A.line_number=B.line_number
> where A.days!=B.days
> order by A.p, A.line_number
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.2#803003)