You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Jack Chen (Jira)" <ji...@apache.org> on 2023/04/16 20:30:00 UTC
[jira] [Updated] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

     [ https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jack Chen updated SPARK-43156:
------------------------------
    Description: 
Example query:
{code:java}
spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  {code}
In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false.

We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false] - this rewrite would be correct normally, but in the context of a scalar subquery it breaks our count bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins . By the time we get there, the query appears to not have the count bug - it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.

 

Postgres comparison show correct always-false result: [http://sqlfiddle.com/#!17/67822/5]

DDL:
{code:java}
create or replace temp view t0 (a, b)
as values
    (1, 1.0),
    (2, 2.0);
create or replace temp view t1 (c, d)
as values
    (2, 3.0); {code}

  was:
Example query:
{code:java}
spark.sql("select *, (select (count(*)) is null from t1 where t0.a = t1.c) from t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  {code}
In this subquery, count(*) always evaluates to a non-null integer value, so count(*) is null is always false. The correct evaluation of the subquery is always false.

We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false] - this rewrite would be correct normally, but in the context of a scalar subquery it breaks our count bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins . By the time we get there, the query appears to not have the count bug - it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.

 

Postgres comparison show correct always-false result: [http://sqlfiddle.com/#!17/67822/5]

DDL:
{code:java}
create or replace temp view t0 (a, b)
as values
    (1, 1.0),
    (2, 2.0);
create or replace temp view t1 (c, d)
as values
    (2, 3.0); {code}


> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-43156
>                 URL: https://issues.apache.org/jira/browse/SPARK-43156
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Jack Chen
>            Priority: Major
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false.
> We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false] - this rewrite would be correct normally, but in the context of a scalar subquery it breaks our count bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins . By the time we get there, the query appears to not have the count bug - it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.
>  
> Postgres comparison show correct always-false result: [http://sqlfiddle.com/#!17/67822/5]
> DDL:
> {code:java}
> create or replace temp view t0 (a, b)
> as values
>     (1, 1.0),
>     (2, 2.0);
> create or replace temp view t1 (c, d)
> as values
>     (2, 3.0); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org