Posted to reviews@spark.apache.org by "agubichev (via GitHub)" <gi...@apache.org> on 2023/05/24 17:27:41 UTC
[GitHub] [spark] agubichev commented on pull request #41287: [SPARK-43760][SQL] Nullability of scalar subquery results

agubichev commented on PR #41287:
URL: https://github.com/apache/spark/pull/41287#issuecomment-1561664367

   > Is there some setting needed to materialize the issue? Using the queries you added as tests (in `scalar-subquery-predicate.sql` and `scalar-subquery-select.sql`), I am getting your expected results on the master branch:
   > 
   > ```
   > spark-sql (default)> select version();
   > 3.5.0 5f325ec917ced819a19911b472ebf7eb52010203
   > Time taken: 2.69 seconds, Fetched 1 row(s)
   > spark-sql (default)> select *
   >                    > from range(1, 3) t1
   >                    > where (select sum(c) from (
   >                    >         select t2.id * t2.id c
   >                    >         from range (1, 2) t2 where t1.id = t2.id
   >                    >         group by t2.id
   >                    >        )
   >                    > ) is not null;
   > 1
   > Time taken: 1.141 seconds, Fetched 1 row(s)
   > spark-sql (default)> select *
   >                    > from
   >                    > (
   >                    >  select t1.id c1, (
   >                    >                     select sum(c)
   >                    >                     from (
   >                    >                       select t2.id * t2.id c
   >                    >                       from range (1, 2) t2 where t1.id = t2.id
   >                    >                       group by t2.id
   >                    >                     )
   >                    >                    ) c2
   >                    >  from range (1, 3) t1
   >                    > ) t
   >                    > where t.c2 is not null;
   > 1	1
   > Time taken: 0.404 seconds, Fetched 1 row(s)
   > spark-sql (default)> 
   > ```
   
   Great catch! I had used the wrong queries in the tests and have updated them now. With the updated queries, the results on the master branch are wrong in both cases:
   ```
   select * from (
    select t1.id c1, (
     select t2.id c from range (1, 2) t2
     where t1.id = t2.id  ) c2
    from range (1, 3) t1 ) t
   where t.c2 is not null
   -- !query schema
   struct<c1:bigint,c2:bigint>
   -- !query output
   1	1
   2	NULL
   
   ```
   and 
   
   ```
   -- !query
   select *
   from range(1, 3) t1
   where (select t2.id c
          from range (1, 2) t2 where t1.id = t2.id
         ) is not null
   -- !query schema
   struct<id:bigint>
   -- !query output
   1
   2
   ```
   
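   In both queries the second row should not be present: for `t1.id = 2` the subquery matches no row in `range (1, 2)`, so it evaluates to NULL and the `is not null` filter should drop that row.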
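
The expected semantics can be cross-checked outside Spark. What follows is only a minimal sketch, assuming SQLite (through Python's standard `sqlite3` module) as a stand-in engine and two small tables `t1` and `t2` in place of `range (1, 3)` and `range (1, 2)`; the helper name `run_queries` is hypothetical. It does not exercise Spark's optimizer, so it illustrates the expected output rather than reproducing the bug.

```python
import sqlite3


def run_queries():
    """Run both queries on SQLite (an assumed stand-in engine, not Spark)."""
    conn = sqlite3.connect(":memory:")
    # t1 and t2 stand in for range(1, 3) and range(1, 2) respectively.
    conn.executescript(
        """
        create table t1 (id integer);
        create table t2 (id integer);
        insert into t1 values (1), (2);
        insert into t2 values (1);
        """
    )
    # Scalar subquery used directly in the predicate.
    predicate = conn.execute(
        """
        select *
        from t1
        where (select t2.id c
               from t2 where t1.id = t2.id
              ) is not null
        """
    ).fetchall()
    # Scalar subquery in the select list, filtered by the outer query.
    select_list = conn.execute(
        """
        select * from (
          select t1.id c1, (
            select t2.id c from t2
            where t1.id = t2.id) c2
          from t1) t
        where t.c2 is not null
        """
    ).fetchall()
    conn.close()
    return predicate, select_list


if __name__ == "__main__":
    predicate_rows, select_list_rows = run_queries()
    print(predicate_rows)
    print(select_list_rows)
```

Under these assumptions, both queries return only the row for id 1, because the subquery is empty for id 2 and therefore yields NULL.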


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

