You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:44:22 UTC
[jira] [Resolved] (SPARK-24440) When use constant as column we may
get wrong answer versus impala
[ https://issues.apache.org/jira/browse/SPARK-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24440.
----------------------------------
Resolution: Incomplete
> When use constant as column we may get wrong answer versus impala
> -----------------------------------------------------------------
>
> Key: SPARK-24440
> URL: https://issues.apache.org/jira/browse/SPARK-24440
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.3.0
> Reporter: zhoukang
> Priority: Major
> Labels: bulk-closed
>
> For query below:
> {code:java}
> select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform
> {code}
> We intended to group by 100 and get distinct deviceid number.
> By spark sql,we get:
> {code}
> +-----------+-----------+-----------+--+
> | date | platform | new_user |
> +-----------+-----------+-----------+--+
> | 20180528 | 100 | 521 |
> | 20180528 | 100 | 82 |
> | 20180528 | 100 | 3 |
> | 20180528 | 100 | 2 |
> | 20180528 | 100 | 7 |
> | 20180528 | 100 | 870 |
> | 20180528 | 100 | 3 |
> | 20180528 | 100 | 8 |
> | 20180528 | 100 | 3 |
> | 20180528 | 100 | 2204 |
> | 20180528 | 100 | 1123 |
> | 20180528 | 100 | 1 |
> | 20180528 | 100 | 54 |
> | 20180528 | 100 | 440 |
> | 20180528 | 100 | 4 |
> | 20180528 | 100 | 478 |
> | 20180528 | 100 | 34 |
> | 20180528 | 100 | 195 |
> | 20180528 | 100 | 17 |
> | 20180528 | 100 | 18 |
> | 20180528 | 100 | 2 |
> | 20180528 | 100 | 2 |
> | 20180528 | 100 | 84 |
> | 20180528 | 100 | 1616 |
> | 20180528 | 100 | 15 |
> | 20180528 | 100 | 7 |
> | 20180528 | 100 | 479 |
> | 20180528 | 100 | 50 |
> | 20180528 | 100 | 376 |
> | 20180528 | 100 | 21 |
> | 20180528 | 100 | 842 |
> | 20180528 | 100 | 444 |
> | 20180528 | 100 | 538 |
> | 20180528 | 100 | 1 |
> | 20180528 | 100 | 2 |
> | 20180528 | 100 | 7 |
> | 20180528 | 100 | 17 |
> | 20180528 | 100 | 133 |
> | 20180528 | 100 | 7 |
> | 20180528 | 100 | 415 |
> | 20180528 | 100 | 2 |
> | 20180528 | 100 | 318 |
> | 20180528 | 100 | 5 |
> | 20180528 | 100 | 1 |
> | 20180528 | 100 | 2060 |
> | 20180528 | 100 | 1217 |
> | 20180528 | 100 | 2 |
> | 20180528 | 100 | 60 |
> | 20180528 | 100 | 22 |
> | 20180528 | 100 | 4 |
> +-----------+-----------+-----------+--+
> {code}
> Actually sum of the deviceid is below:
> {code}
> 0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform)t1;
> +----------------+--+
> | sum(new_user) |
> +----------------+--+
> | 14816 |
> +----------------+--+
> 1 row selected (4.934 seconds)
> {code}
> And the real distinct deviceid value is below:
> {code}
> 0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528;
> +-----------+-----------+--+
> | platform | new_user |
> +-----------+-----------+--+
> | 100 | 14773 |
> +-----------+-----------+--+
> 1 row selected (2.846 seconds)
> {code}
> In impala,with the first query we can get result below:
> {code}
> [xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform
> +----------+----------+----------+
> | date | platform | new_user |
> +----------+----------+----------+
> | 20180528 | 100 | 14773 |
> +----------+----------+----------+
> Fetched 1 row(s) in 1.00s
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org