Posted to issues@spark.apache.org by "Bruce Robbins (Jira)" <ji...@apache.org> on 2023/10/21 16:58:00 UTC
[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce Robbins updated SPARK-45580:
----------------------------------
Summary: Subquery changes the output schema of the outer query (was: Subquery changes the output schema of outer query)
> Subquery changes the output schema of the outer query
> -----------------------------------------------------
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.3, 3.4.1, 3.5.0
> Reporter: Bruce Robbins
> Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query returns a superfluous boolean column:
> {noformat}
> select *
> from t1
> where exists (
>     select c1
>     from t2
>     where a = c1
>        or a in (select col1 from t3)
> );
> 1 false
> 2 false
> 3 true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When the query is executed via the {{Dataset}} API, the incorrect result is not visible, because the {{Dataset}} API truncates the right side of each row to match the analyzed plan's schema (it is the optimized plan's schema that is wrong).
> However, even with the {{Dataset}} API, this query fails:
> {noformat}
> select (
>     select *
>     from t1
>     where exists (
>         select c1
>         from t2
>         where a = c1
>            or a in (select col1 from t3)
>     )
>     limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
> at scala.Predef$.assert(Predef.scala:279)
> at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
> at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
> at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
> at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
> at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
> ...
> {noformat}
> Other queries that produce the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>     select c1
>     from t2
>     where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>     select c1
>     from t2
>     where a = c1
>        or a in (select col1 from t3)
> );
> {noformat}
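
The difference between the two execution paths described above can be pictured with a small toy model. This is only a sketch in plain Python, not Spark code: the function names, the column name {{exists}} for the leaked boolean, and the list-of-tuples row representation are all assumptions made for illustration. It shows why trimming each row to the analyzed schema hides the extra column, while a consumer that checks the row width (as the scalar subquery does in the stack trace) fails.

```python
# Toy model only: none of these names are Spark APIs.

# Schema promised by analysis versus schema produced after optimization.
analyzed_schema = ["a"]
optimized_schema = ["a", "exists"]  # "exists" is a hypothetical name for the leaked column

# Rows as printed by spark-sql in the report.
optimized_rows = [(1, False), (2, False), (3, True)]


def truncate_to_schema(rows, schema):
    """Keep only the leading len(schema) fields of each row."""
    width = len(schema)
    return [row[:width] for row in rows]


def expect_single_field(row):
    """Mimic a consumer that insists on exactly one field per row."""
    if len(row) != 1:
        raise AssertionError(f"Expects 1 field, but got {len(row)}")
    return row[0]


# Trimming to the analyzed schema hides the superfluous column.
visible_rows = truncate_to_schema(optimized_rows, analyzed_schema)
print(visible_rows)

# A consumer that sees the untrimmed row rejects it.
try:
    expect_single_field(optimized_rows[0])
except AssertionError as err:
    print(err)
```

In this model the trimmed rows match the expected result from the report, and the width check fails on the untrimmed row in the same way as the assertion shown in the stack trace.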