Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2016/08/17 02:18:20 UTC
[jira] [Created] (SPARK-17099) Incorrect result when complex HAVING clause is added to query

Josh Rosen created SPARK-17099:
----------------------------------

             Summary: Incorrect result when complex HAVING clause is added to query
                 Key: SPARK-17099
                 URL: https://issues.apache.org/jira/browse/SPARK-17099
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2.1.0
            Reporter: Josh Rosen
            Priority: Critical
             Fix For: 2.1.0


Random query generation uncovered the following query, which returns incorrect results when run on Spark SQL. It is not the exact query the generator produced: I minimized it somewhat to make it easier to understand.

With the following tables:

{code}
val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
val t2 = sc.parallelize(
  Seq(
    (-769, -244),
    (-800, -409),
    (940, 86),
    (-507, 304),
    (-367, 158))
).toDF("int_col_2", "int_col_5")

t1.registerTempTable("t1")
t2.registerTempTable("t2")
{code}

Then run the following query:

{code}
SELECT
  (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
     ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
FROM t1
RIGHT JOIN t2
  ON (t2.int_col_2) = (t1.int_col_5)
GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
         COALESCE(t1.int_col_5, t2.int_col_2)
HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
{code}

In Spark SQL, this returns an empty result set, whereas Postgres returns four rows. If I omit the {{HAVING}} clause, Spark SQL returns the following five rows, which shows that the {{HAVING}} clause is filtering the groups incorrectly:

{code}
+--------------------------------------+---------------------------------------+--+
| sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2)  |
+--------------------------------------+---------------------------------------+--+
| -507                                 | -1014                                 |
| 940                                  | 1880                                  |
| -769                                 | -1538                                 |
| -367                                 | -734                                  |
| -800                                 | -1600                                 |
+--------------------------------------+---------------------------------------+--+
{code}

Every row above except the {{940}} row satisfies the {{HAVING}} predicate (for example, -507 > -1014), so the output after adding the {{HAVING}} clause should contain four rows, not zero.
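
As a cross-check that does not depend on Spark, the query's semantics can be sketched with plain Scala collections. This is only an illustration of the expected result, not of how Spark evaluates the query. The object name {{HavingCheck}} and its members are made up for this sketch, and it assumes (as is true for this data) that {{t2.int_col_5}} is never null.

```scala
object HavingCheck {
  // Same data as the t1 and t2 tables in the report.
  val t1: Seq[Int] = Seq(-234, 145, 367, 975, 298) // int_col_5
  val t2: Seq[(Int, Int)] = Seq(                    // (int_col_2, int_col_5)
    (-769, -244), (-800, -409), (940, 86), (-507, 304), (-367, 158))

  // t1 RIGHT JOIN t2 ON t2.int_col_2 = t1.int_col_5:
  // every t2 row survives; the t1 side is None when nothing matches.
  val joined: Seq[(Option[Int], (Int, Int))] = t2.flatMap { row =>
    val matches = t1.filter(_ == row._1)
    if (matches.isEmpty) Seq((Option.empty[Int], row))
    else matches.map(m => (Option(m), row))
  }

  // COALESCE(t1.int_col_5, t2.int_col_2)
  def value(r: (Option[Int], (Int, Int))): Int = r._1.getOrElse(r._2._1)

  // GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)), value.
  // Assumption: t2.int_col_5 is never null here, so its COALESCE is the column itself.
  val groups: Map[(Int, Int), Seq[(Option[Int], (Int, Int))]] =
    joined.groupBy(r => (math.max(r._2._2, r._1.getOrElse(-449)), value(r)))

  // SELECT SUM(value), value * 2 for each group.
  val unfiltered: Seq[(Int, Int)] =
    groups.toSeq.map { case ((_, v), rows) => (rows.map(value).sum, v * 2) }

  // HAVING SUM(value) > value * 2
  val kept: Seq[(Int, Int)] =
    unfiltered.filter { case (sum, doubled) => sum > doubled }

  def main(args: Array[String]): Unit =
    kept.sortBy(_._1).foreach(println)
}
```

No {{t1}} value matches any {{t2.int_col_2}}, so each {{t2}} row forms its own group. Only the {{940}} group fails the predicate, which leaves four rows.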

I'm not sure how to further shrink this in a straightforward way, so I'm opening this bug to get help in triaging further.


