You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/08/24 21:19:21 UTC
[jira] [Assigned] (SPARK-17099) Incorrect result when HAVING clause
is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-17099:
------------------------------------
Assignee: (was: Apache Spark)
> Incorrect result when HAVING clause is added to group by query
> --------------------------------------------------------------
>
> Key: SPARK-17099
> URL: https://issues.apache.org/jira/browse/SPARK-17099
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.1.0
> Reporter: Josh Rosen
> Priority: Blocker
>
> Random query generation uncovered the following query which returns incorrect results when run on Spark SQL. This wasn't the original query uncovered by the generator, since I performed a bit of minimization to try to make it more understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
> Seq(
> (-769, -244),
> (-800, -409),
> (940, 86),
> (-507, 304),
> (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
> (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
> ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
> ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
> COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four rows. However, if I omit the {{HAVING}} clause I see that the group's rows are being incorrectly filtered by the {{HAVING}} clause:
> {code}
> +--------------------------------------+---------------------------------------+--+
> | sum(coalesce(int_col_5, int_col_2)) | (coalesce(int_col_5, int_col_2) * 2) |
> +--------------------------------------+---------------------------------------+--+
> | -507 | -1014 |
> | 940 | 1880 |
> | -769 | -1538 |
> | -367 | -734 |
> | -800 | -1600 |
> +--------------------------------------+---------------------------------------+--+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm opening this bug to get help in triaging further.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org