Posted to issues@spark.apache.org by "Bruce Robbins (Jira)" <ji...@apache.org> on 2023/10/17 21:14:00 UTC
[jira] [Created] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries
Bruce Robbins created SPARK-45580:
-------------------------------------
Summary: RewritePredicateSubquery unexpectedly changes the output schema of certain queries
Key: SPARK-45580
URL: https://issues.apache.org/jira/browse/SPARK-45580
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.0, 3.4.1, 3.3.3
Reporter: Bruce Robbins
A query can have an incorrect output schema when a predicate subquery ({{EXISTS}} or {{IN}}) itself contains another {{IN}} subquery in its condition.
Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
1 false
2 false
3 true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When the same query is executed via the {{Dataset}} API, you don't see the extra column, because the {{Dataset}} API truncates the right side of each row based on the analyzed plan's schema. It is the optimized plan's schema that is wrong.
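The masking effect described above can be sketched with a toy model in plain Python. This is not Spark code: the names {{analyzed_schema}}, {{optimized_rows}}, and {{truncate_to_schema}} are hypothetical, and the rows are simply the {{spark-sql}} output shown earlier. It only illustrates why dropping trailing fields hides the extra column.

```python
# Toy model in plain Python, not Spark code. All names are hypothetical.

# The query selects * from t1, so the analyzed plan has the single column "a".
analyzed_schema = ["a"]

# Rows as printed by spark-sql above: each one carries an extra boolean field.
optimized_rows = [(1, False), (2, False), (3, True)]


def truncate_to_schema(rows, schema):
    """Keep only the leading fields that the schema names, dropping the rest."""
    width = len(schema)
    return [row[:width] for row in rows]


visible_rows = truncate_to_schema(optimized_rows, analyzed_schema)
print(visible_rows)  # [(1,), (2,), (3,)]
```

In this model the extra field is still produced; it is only hidden from the caller, which matches the problem being visible in {{spark-sql}} but not through the {{Dataset}} API.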
However, even with the {{Dataset}} API, this query fails with an assertion error:
{noformat}
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);
java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
at scala.Predef$.assert(Predef.scala:279)
at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
...
{noformat}
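The assertion message suggests that a scalar subquery expects its result row to have exactly one field, so the extra boolean column presumably trips the check regardless of how the final rows are presented. A toy model in plain Python, assuming only what the message states (the function name {{scalar_subquery_value}} is hypothetical, the input rows are illustrative, and this is not Spark's implementation):

```python
# Toy model in plain Python, not Spark's implementation. Names are hypothetical.

def scalar_subquery_value(rows):
    """Return the single value of a scalar subquery result, or None if it is empty."""
    if not rows:
        return None
    first = rows[0]
    # Mirrors the wording of the reported assertion message.
    if len(first) != 1:
        raise AssertionError(
            f"Expects 1 field, but got {len(first)}; something went wrong in analysis"
        )
    return first[0]


# With a correct one-column row, the subquery yields a plain value.
print(scalar_subquery_value([(1,)]))  # 1

# With a stray boolean column, the row has two fields and the check fails.
try:
    scalar_subquery_value([(1, False)])
except AssertionError as err:
    print(err)  # Expects 1 field, but got 2; something went wrong in analysis
```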
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}